Ensembl: An open source project for genome annotation


Stabenau A, Clamp M, Curwen V, Birney E, Cox T, Cuff J, Durbin R,
Gilbert J, Huminiecki L, Hubbard T, Lijnzaad P, Kasprzyk A, Mongin E,
Pettett R, Potter S, Slater G, Stupka E, Stalker J, Vastrik I.


INTRODUCTION 

The Ensembl project is an automatic annotation system for eukaryotic
genomes. The project currently provides data, software and support
mainly for annotation of the human genome, but the mouse genome is
supported as well, and we expect to annotate a number of other
eukaryotic genomes this year. All data and source code are freely
available.

DESIGN  

Ensembl data is stored in a federation of relational
databases. Development is done mainly against the MySQL database
engine, but there is also a successful Oracle port. The core database
contains the sequence data and annotations such as predicted genes and
repetitive regions. Additional databases provide disease, SNP and
expression data.
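
As a rough illustration of what a federation of databases means in
practice, the sketch below joins a core database against a separate
SNP database through a single connection. Everything in it is an
assumption made for the example: SQLite stands in for MySQL, and the
schema name snp, the tables gene and variation, their columns, and
the functions open_federation and snps_in_gene are hypothetical and
do not reflect the actual Ensembl schema.

```python
import sqlite3
from typing import List


def open_federation() -> sqlite3.Connection:
    """Open a toy 'core' database and attach a separate toy 'snp' database."""
    # The main in-memory database plays the role of the core database.
    conn = sqlite3.connect(":memory:")
    # A second, independent database is attached under the schema name "snp".
    conn.execute("ATTACH DATABASE ':memory:' AS snp")
    # Hypothetical tables; not the real Ensembl schema.
    conn.execute(
        "CREATE TABLE main.gene ("
        "gene_id INTEGER PRIMARY KEY, name TEXT, "
        "seq_start INTEGER, seq_end INTEGER)"
    )
    conn.execute(
        "CREATE TABLE snp.variation ("
        "snp_id INTEGER PRIMARY KEY, position INTEGER)"
    )
    return conn


def snps_in_gene(conn: sqlite3.Connection, gene_id: int) -> List[int]:
    """Return ids of SNPs whose position lies inside the given gene."""
    rows = conn.execute(
        "SELECT v.snp_id "
        "FROM main.gene AS g "
        "JOIN snp.variation AS v "
        "  ON v.position BETWEEN g.seq_start AND g.seq_end "
        "WHERE g.gene_id = ? "
        "ORDER BY v.snp_id",
        (gene_id,),
    ).fetchall()
    return [row[0] for row in rows]
```

The point of the sketch is only that annotation in the core database
and data in an add-on database stay in separate stores and are
combined at query time.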

The software is currently written in object-oriented Perl. At its
centre are the biological objects ("business objects"), which
represent Genes, Exons, Features etc. A layer of database access
objects (Adaptors) connects them to the underlying relational
databases. Analysis and annotation are done in the Ensembl-pipeline
modules. A comprehensive set of web-view modules gives a graphical
view of the data via Apache and mod_perl.
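
The split between business objects and Adaptors described above can
be sketched as follows. This is a minimal illustration in Python
rather than the project's Perl code, with SQLite standing in for
MySQL. The class names Exon, Gene and GeneAdaptor, the methods store
and fetch_by_id, the function create_schema, and the two-table schema
are all assumptions made for the example, not the Ensembl API.

```python
import sqlite3
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Exon:
    """Business object: holds exon data and knows nothing about SQL."""
    exon_id: int
    start: int
    end: int

    def length(self) -> int:
        # Coordinates are treated as inclusive in this sketch.
        return self.end - self.start + 1


@dataclass
class Gene:
    """Business object: a named collection of exons."""
    gene_id: int
    name: str
    exons: List[Exon] = field(default_factory=list)


def create_schema(connection: sqlite3.Connection) -> None:
    """Create the hypothetical tables used by this sketch."""
    connection.execute(
        "CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
    )
    connection.execute(
        "CREATE TABLE exon ("
        "exon_id INTEGER PRIMARY KEY, gene_id INTEGER NOT NULL, "
        "seq_start INTEGER NOT NULL, seq_end INTEGER NOT NULL)"
    )


class GeneAdaptor:
    """Database access object: the only place that issues SQL for genes."""

    def __init__(self, connection: sqlite3.Connection) -> None:
        self.connection = connection

    def store(self, gene: Gene) -> None:
        """Write a gene and its exons to the database."""
        self.connection.execute(
            "INSERT INTO gene (gene_id, name) VALUES (?, ?)",
            (gene.gene_id, gene.name),
        )
        self.connection.executemany(
            "INSERT INTO exon (exon_id, gene_id, seq_start, seq_end) "
            "VALUES (?, ?, ?, ?)",
            [(e.exon_id, gene.gene_id, e.start, e.end) for e in gene.exons],
        )

    def fetch_by_id(self, gene_id: int) -> Optional[Gene]:
        """Rebuild a Gene object, exons ordered by start, or return None."""
        row = self.connection.execute(
            "SELECT gene_id, name FROM gene WHERE gene_id = ?", (gene_id,)
        ).fetchone()
        if row is None:
            return None
        exon_rows = self.connection.execute(
            "SELECT exon_id, seq_start, seq_end FROM exon "
            "WHERE gene_id = ? ORDER BY seq_start",
            (gene_id,),
        ).fetchall()
        return Gene(row[0], row[1], [Exon(*r) for r in exon_rows])
```

Code that works with Gene and Exon objects never sees SQL, so in a
design like this only the adaptor layer would need to change when the
database engine changes.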

BIO-ENSEMBL

Bioperl is used throughout the system, and Bioperl interfaces are
implemented by the sequence-providing objects and by the features. The
Biojava project provided a prototype Java layer around the database. A
Java port of the whole Ensembl system will provide further Biojava
support. A basic CORBA server for Ensembl exists and supports BioCORBA.


I will present the current state of Ensembl code development, its
relationship to the Bio* projects, and future plans.