Newsletter:2002 Summer

Open Bioinformatics Foundation Report

Board Mission Statement

The Open Bioinformatics Foundation is a non-profit, volunteer-run organization focused on supporting open source programming in bioinformatics. The foundation grew out of the volunteer projects Bioperl, BioJava and Biopython and was incorporated to handle our modest requirements of hardware ownership, domain name management and funding for conferences and workshops. The Foundation does not participate directly in the development or structure of the open source work, but as the members of the foundation are drawn from the member projects, there is clear commonality of direction and purpose. Occasionally the Open-Bio board may make announcements about our direction or purpose (a recent one was on the licensing of academic software) when the board feels there is a need to clarify matters, but in general we prefer to remain simply the support organization for our member projects.

Currently the foundation has a board of 5 people running it with Ewan Birney as President, Chris Dagdigian as Treasurer, Andrew Dalke as Secretary, and Hilmar Lapp and Steve Brenner as board members. Our main activities are:

Underwriting and organizing the BOSC conference
Underwriting and organizing the hackathon conference
Management of O|B|F servers and other assets

We have an application pending with the US Internal Revenue Service (IRS) for tax-exempt status as a 501(c)(3) non-profit foundation. Included in this newsletter is a basic overview of our financial status and year 2002 activity. Official numbers that include the financial outcome of the BOSC’2002 conference will be available in our annual report which will be produced at the end of our fiscal year.

The next O|B|F board of directors meeting will occur in Edmonton, Canada at the site of our BOSC’2002 conference. Our meetings are open to the public. Details concerning time and venue will be posted at BOSC’2002. The email contact address for the board members is ‘board@open-bio.org’.

Financial Overview

Bank Balance as of July 21, 2002: $16,322.13

| Date | Payee | Amount | Description |
| --- | --- | --- | --- |
| 2002-01-26 | Chris Dagdigian | $128.84 | Reimburse Chris for paying for 2002 Hackathon lunch (1st day) |
| 2002-01-27 | Chris Dagdigian | $124.80 | Reimburse Chris for paying for 2002 Hackathon lunch (2nd day) |
| 2002-02-13 | Heller Ehrman Attorneys | $357.00 | Reimburse our lawyers for foundation incorporation fees |
| 2002-02-13 | Chris Dagdigian | $192.00 | Reimburse Chris for 1 year rental of O\|B\|F post office box |
| 2002-04-17 | MAPS Inc. | $150.00 | 1 year subscription fee to mail-abuse.org anti-spam blackhole list(s) |

OBF Financial Transactions (since Jan 1, 2002)

Upcoming expenses we foresee

Domain name renewal fees (minor, < $200)
Potential BOSC 2002 conference financial losses (unlikely)

We do not operate our BOSC conferences with the goal of making lots of money. Traditionally we aim to either break even on expenses or make a small profit. Attendance for BOSC’2002 is looking very good, but given some unforeseen travel and housing expenses for the 2002 conference there is a small chance that our expenses will exceed what we take in as registration fees.

Website and mailing list statistics

| Site | Unique Visitors | Page Views | Hits | HTTP Traffic |
| --- | --- | --- | --- | --- |
| bioperl.org | 75,659 | 475,493 | 667,491 | 19.28 GB |
| biojava.org | 54,057 | 411,691 | 566,718 | 25.14 GB |
| biopython.org | 24,331 | 33,156 | 154,156 | 3.61 GB |
| open-bio.org | 20,820 | 45,633 | 131,891 | 818.16 MB |

Website statistics Year 2002 to date (through July 21, 2002)

| Site | Unique Visitors | Page Views | Hits | HTTP Traffic |
| --- | --- | --- | --- | --- |
| bioperl.org | 6,770 | 45,886 | 61,161 | 2.21 GB |
| biojava.org | 6,687 | 34,934 | 48,222 | 1.86 GB |
| open-bio.org | 1,905 | 4,402 | 16,128 | 90.87 MB |

Website statistics Month of July 2002 (through July 21, 2002)

| List | Subscribers |
| --- | --- |
| Bioperl-l | 944 |
| Bioperl-announce-l | 827 |
| Biojava-l | 583 |
| BioPython | 252 |
| DAS | 238 |
| Bosc-announce | 215 |
| Bioperl-guts-l | 205 |
| BioXML-dev | 188 |
| BioPython-announce | 177 |
| BioXML-announce | 111 |
| Biopython-dev | 96 |
| BioBiz | 90 |
| I3C-techarch | 73 |
| Open-Bio-l | 62 |
| moby-l | 50 |
| Biocorba-l | 45 |
| Authors | 41 |
| I3c-pathways | 32 |
| Biocorba-announce-l | 29 |
| Biograph | 26 |
| Infrastructure | 22 |
| Root-l | 16 |
| Ontologies | 15 |
| I3C-roadmap | 11 |
| MOBY-guts | 9 |
| Naming-l | 9 |
| Volunteer | 7 |
| Webteam | 7 |
| Biosoap-l | 7 |
| Dynamite | 5 |
| Technical | 5 |
| Mailteam | 3 |
| DAS-announce | 1 |

Mailing list statistics (as of July 21, 2002)

2002 BioHackathon Report

Spanning 6 weeks and 2 continents, the first ever “biohackathon” was a great success and one that the O|B|F would like to see established as an annual event. The invitation-only gathering of open source bioinformatics developers was split over two sessions, the first at the O’Reilly Bioinformatics Technology Conference in Arizona, USA and the second in Cape Town, South Africa, organized by Electric Genetics. The hackathon was additionally supported by AstraZeneca and Dalke Scientific Software. All the code generated was immediately committed to the publicly accessible CVS system on open-bio (instructions at cvs.open-bio.org).

The hackathon drew together 20+ developers across a number of different open source projects. The aim was to develop an infrastructure for accessing sequence databases transparently that scales from a small single computer in a molecular biology lab to a large scale pipeline project. This infrastructure can be transparently shared between the different language projects - eg, building a sequence database in Bioperl but accessing it from BioJava. The hope is that one can both reduce the time it takes to build and test applications in different languages and, at the same time, reduce the overhead in managing and deploying sequence databases in bioinformatics installations. Aware of the need for snazzy acronyms for standards to allow people to dazzle their managers/sales force/bosses the participants named this the “Open Bioinformatics Database Access” scheme (OBDA for short).

Attendees settled on a standard set of 6 implementations to retrieve sequences, differing in their complexity, network requirements and throughput. In all cases they were taking an existing system from an open source project and wherever possible following existing standards. Having discussed the specifications of these methods, participants then implemented the system in 5 languages - Perl, Java, Python, Ruby and C (not all languages got all implementations due to limitations in programming time, but Perl, Java and Python had a full suite). The implementations were then tested between different languages to ensure programmatic and data transfer capabilities. Finally the different methods were performance tested and a number of performance bottlenecks removed.

At the same time a number of other projects were advanced. A framework for Bibliographic objects was discussed and Perl and Java code provided. The Genquire Perl GUI was adapted to work on top of aspects of the OBDA system. Bio::Graphics, a GIF drawing system for Perl was integrated into Bioperl. The OmniGene project became more plug-and-play with BioJava.

One important corollary of the biohackathon was strengthening the common conceptual view of our data. For the last five years all the projects have by and large been sticking to a common core of EMBL/GenBank format information in their data model. It was unclear how to extend this model into other areas without losing cross-project interoperability. The requirement of all projects to read and write to a relational database (BioSQL) forced us to re-examine our common data model away from the perspective of a data format. The result was in fact closer cooperation and a clearer understanding of how to extend our data models in a cross-project compatible manner. In particular it was decided to make ontology integration an explicit option for information, allowing more flexibility and richness in describing the additional data attached to sequences.

Finally, we had fun. South Africa was a real eye opener for us, with incredible scenery, lovely people and real attention to detail from our hosts, Electric Genetics. But we are also hackers, and all of us got a kick out of simply being able to work together with few distractions and an open 802.11b wireless network. Having a turn around time of minutes in a Q/A session, rather than potential days when people are working via email in different time zones was sensational.

All the Open-Bio.org projects and the O|B|F community in general were strengthened immeasurably by the hackathon. We would like to take this space to sincerely thank the hackathon organizers (Electric Genetics and O’Reilly) and sponsors (AstraZeneca and Dalke Scientific).

A group picture of the Arizona hackathon attendees (minus Andrew Dalke and Martin Senger). A picture gallery from the biohackathon can be found online at this site.

O|B|F Project Reports

This has been an important year for the O|B|F projects. Bioperl released its 1.0 stable release after 7 years of development, BioJava and Biopython have continued to produce new iterations of their software, and the cross-talk and collaboration fostered by the formal creation of O|B|F and the Biohackathons have encouraged the projects to grow together towards the collective goal of easy-to-use software tools for bioinformatics. A number of projects have also joined the O|B|F family, including BioMOBY, BioDAS, and BioRuby (hosted in Japan).

Bioperl

The Bioperl project has been very active over the past 9 months. We released our major 1.0 release in March of 2002 and 2 subsequent bugfix point releases in June and July. The most recent release contained over 400 modules and 160k lines of code. The project team has seen an influx of new ideas addressing new (for us) domains in life sciences programming including phylogenetic trees, sequence clusters, sequence rendering, fast and lightweight databases for sequence features, generalized parsers for sequence database search results (like BLAST and FastA), structure, and improvements all around the design of the system. We expect to be expanding the toolkit’s horizon from sequence analysis to tasks surrounding gene expression data, biological ontologies, and comparative genomics.

The BioHackathons held in Arizona and South Africa in January and February 2002 allowed many of the Bioperl Core developers to meet and muse on future areas of the toolkit as well as coordinate collaborative projects with other OBF developers. These joint projects include the Open Bioinformatics Database Access standard for sequence database access that all OBF projects are planning to implement. This standard, along with the associated BioSQL project, will help developers rely on a defined data access model and focus on the implementation of their client libraries.

A few new sub-projects have been initiated in the past 6 months.

bioperl-pipeline, managed mostly by the Fugu genome research group in Singapore, which is designed to assist centers building analysis pipelines of small to medium size and complexity.

bioperl-run, a collection of modules intended to wrap local and remote execution of analysis programs. This includes wrappers around the EMBOSS package, PAML, PHYLIP, BLAST, and remote execution on the NCBI BLAST Queue and Pasteur’s Pise system.

BioJava

BioJava is a set of open source libraries for bioinformatics developers and researchers, with a current emphasis on handling, analysis, and visualization of biological sequence data. With the project now in its third year, we have released the 1.2x stable series, which includes a wide range of incremental improvements and bug-fixes, plus more graphical components and support for the BioSQL sequence database technology.

More recent developments include support for OBDA, a suite of data-exchange technologies agreed at the O’Reilly and Electric Genetics hackathon meetings, and support for additional file formats. A companion project, biojava-lims, has been started to provide support for scientific workflow management.

BioJava is an open source project (LGPL). All contributions – code, documentation, or ideas – are welcome. For more information see http://www.biojava.org/

Biopython

The BioPython project was started in August 1999 to create a general open source toolkit in Python to help manage and analyze biomedical data. It provides modules that can help in every step of typical bioinformatics tasks: retrieving information from databases (local or over a network), parsing the data into general Python objects, analyzing the data with general algorithms, and writing the data back out into common formats. Currently, BioPython can handle nearly 30 databases and applications.

Because of the growth in the capabilities of BioPython, we are currently working on more general code to help manage the different databases and formats. For example, in the current development version, we have code that can autodetect a data format and then automatically parse the data into the correct data structure. Similarly, we are unifying the APIs to retrieve, manage, and analyze data. We are excited about these developments and believe they will make BioPython more accessible and powerful. Stay tuned…

The BioMOBY Project

BioMOBY is an Open Source (OS) research project which aims to explore architectures for the discovery and distribution of biological data using web-services; data and services are decentralized, but the availability of these resources, and the instructions for interacting with them, are registered in a central location. In the current architecture, the central registry (“MOBY-Central”) breaks with the web-services paradigm, as exemplified by Universal Description, Discovery and Integration (UDDI), by having a lighter, object-driven registry query system. This allows users to traverse expansive and disparate datasets where each possible next step is presented based on the data currently in-hand. Moreover, the path from in-hand to desired end-point data can be automatically discovered using the registry. In addition, the registry is itself capable of creating service description (WSDL) documents in response to specific client requests. This greatly simplifies service deployment, with the aim of encouraging the participation of service providers.

Data in BioMOBY is passed in the form of MOBY-Objects, which are (generally) lightweight XML and make up both the query and the response of a SOAP transaction. Object-types are organized in a hierarchy. The Object hierarchy, with both IS A and HAS A relationships, provides several powerful opportunities: discovery of ‘base’ Objects within more complex Objects, allowing complex Objects to be used as input to a broader range of services; backwards compatibility with old clients as new Objects are defined; and Server-generated Objects that need be only as complex as the Server is capable of, increasing the number of Objects and Services that a service provider can host. Important to note is that the ‘base’ MOBY-Object can be used as a shell around objects from any other object model system, allowing BioMOBY to transport Objects defined by, for example, OMG, with no modifications. Finally, cross-links may be included by the service provider in any output Object, enabling the client to branch into related data sources to retrieve supplementary information.

Service-types are also organized into a hierarchy. This allows automated discovery of new instances of service types through querying for a ‘base’ type, and enhances the human-readable descriptive capabilities of the Service vocabulary (e.g. Blast is both an “alignment” Service and a “sequence similarity” Service type, depending on what you were searching for).
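The IS A lookup described above can be sketched in a few lines. This is only an illustration of the idea, not the MOBY-Central API: the type names, the service names, and both dictionaries are hypothetical.

```python
# Hypothetical IS-A hierarchy: child type -> parent type (None marks the root).
OBJECT_PARENTS = {
    "Object": None,
    "Sequence": "Object",
    "DNASequence": "Sequence",
}

# Hypothetical registry: input object type -> names of services accepting it.
SERVICES_BY_INPUT = {
    "Object": ["GetCrossReferences"],
    "Sequence": ["RunBlast"],
    "DNASequence": ["Translate"],
}


def ancestors(object_type):
    """Return object_type followed by its ancestors, most specific first."""
    chain = []
    current = object_type
    while current is not None:
        chain.append(current)
        current = OBJECT_PARENTS[current]
    return chain


def discover_services(object_type):
    """List services that accept object_type or any type it IS A."""
    found = []
    for type_name in ancestors(object_type):
        found.extend(SERVICES_BY_INPUT.get(type_name, []))
    return found
```

Because a complex Object also counts as each of its base types, a client holding a "DNASequence" in this toy registry is offered the services registered for "Sequence" and "Object" as well.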

A prototype MOBY-Central is currently publicly available, and is regularly being enhanced as the requirements of the BioMOBY system become clear. Services are being deployed increasingly rapidly, though most are currently developed with the aim of solving several use-case biological queries. Currently MOBY-Services are available at TAIR, FlyBase, and PBI-NRC, and these can be discovered by querying MOBY-Central.

Details about the project, including the MOBY-Central API and all code, are available at http://www.biomoby.org.

BioRuby

The BioRuby project was started in late 2000, and the first year was mainly spent building basic frameworks. We knew that other open Bio* projects already existed and were leading the scene, but we wanted yet another toolkit in our favorite language, Ruby. Ruby, the object oriented scripting language born in Japan, has a lot of good features that are also suitable for bioinformatics, and we love its simple and powerful syntax.

In the 6 months since the BioHackathon, BioRuby has gained some OBDA capabilities including BioRegistry, BioFetch, and BioSQL. We also provide a BioFetch server at biofetch.bioruby.org. Among other features, remote Fasta/Blast with common APIs against the server in Japan looks mature now.

We will follow the rest of the OBDA specs, such as flatfile indexing, XEMBL, and BioCorba, in the near future. We are also working on supporting external applications like EMBOSS, HMMER etc. and taking on the challenge of handling pathway data in the KEGG database.

Open Bioinformatics Database Access standard - OBDA

Even in the relatively small world of bioinformatics different people prefer different languages and that’s not going to change. Some love the expressiveness of Perl, others the simple power of Python, and others the static typing of Java. Flexibility can be good, but it may mean the tools you want are not available in your language of choice.

There are many ways to let programs written in different languages work with each other. Two programs could exchange files in a well-defined format, or send XML over an HTTP connection, or talk to a common database using SQL, or use an integration tool like CORBA. The implementation choice depends on the requirements.

Twice during this year members of the different Bio* projects met together for a Biohackathon, with the explicit goal of identifying, defining, and implementing standard interfaces and protocols for information exchange between the projects. Here is a short summary of each project. For more information, see http://obda.open-bio.org.

BioSQL

The sequence database is a core part of almost every bioinformatics project. Many people store sequence data in a relational database system like MySQL, PostgreSQL, or Oracle. BioSQL is a schema definition for the sequences, features, cross-references, and other data types found in GenBank/EMBL and related databases. The different language projects then provide bindings on top of the SQL to simplify database searches and convert the remote database information into local objects.

Supported in: Bioperl, Biopython, BioJava, BioRuby
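To make the idea of a shared relational schema with a language binding on top concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names are hypothetical and heavily simplified; they are not the actual BioSQL schema, and the helper functions are illustrative only.

```python
import sqlite3

# Hypothetical, heavily simplified tables; the real BioSQL schema differs.
SCHEMA = """
CREATE TABLE entry (
    entry_id INTEGER PRIMARY KEY,
    accession TEXT UNIQUE NOT NULL,
    description TEXT
);
CREATE TABLE sequence (
    entry_id INTEGER NOT NULL REFERENCES entry(entry_id),
    residues TEXT NOT NULL
);
CREATE TABLE feature (
    entry_id INTEGER NOT NULL REFERENCES entry(entry_id),
    feature_type TEXT NOT NULL,
    start_pos INTEGER NOT NULL,
    end_pos INTEGER NOT NULL
);
"""


def open_database():
    """Create an in-memory database holding the toy schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def store_record(conn, accession, description, residues, features):
    """Insert one record; features is a list of (type, start, end) tuples."""
    cur = conn.execute(
        "INSERT INTO entry (accession, description) VALUES (?, ?)",
        (accession, description),
    )
    entry_id = cur.lastrowid
    conn.execute(
        "INSERT INTO sequence (entry_id, residues) VALUES (?, ?)",
        (entry_id, residues),
    )
    conn.executemany(
        "INSERT INTO feature (entry_id, feature_type, start_pos, end_pos) "
        "VALUES (?, ?, ?, ?)",
        [(entry_id, ftype, start, end) for ftype, start, end in features],
    )
    conn.commit()


def load_record(conn, accession):
    """Convert the database rows for one accession into a local object (a dict)."""
    row = conn.execute(
        "SELECT e.entry_id, e.description, s.residues "
        "FROM entry e JOIN sequence s ON s.entry_id = e.entry_id "
        "WHERE e.accession = ?",
        (accession,),
    ).fetchone()
    if row is None:
        return None
    entry_id, description, residues = row
    features = conn.execute(
        "SELECT feature_type, start_pos, end_pos FROM feature "
        "WHERE entry_id = ? ORDER BY start_pos, end_pos",
        (entry_id,),
    ).fetchall()
    return {
        "accession": accession,
        "description": description,
        "residues": residues,
        "features": features,
    }
```

Any language with an SQL driver could read the same tables, which is the point of agreeing on the schema rather than on a file format.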

Flatfile indexing

On the other hand, some labs don’t need the complexity of running a full database system and simply need a way to retrieve a flat-file record quickly from a set of files given an identifier or other attribute. The OBF flatfile indexing specification supports this sort of indexing and record lookup. As a result, you could use the BioJava implementation to build an index of all of GenBank, then when your perl-based web application needs record ‘AI129902’ use the Bioperl implementation to get that record and pull out the fields you need.

Supported in: Biopython, BioJava, Bioperl, C
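The indexing idea can be illustrated with a small sketch that records the byte offset and length of each record in a FASTA-style flat file and later seeks straight to a record by identifier. The in-memory dictionary used here is an assumption made for brevity; it is not the on-disk index format defined by the OBF specification, and the function names are hypothetical.

```python
def build_index(path):
    """Map each record identifier to the (offset, length) of its record in bytes.

    Assumes FASTA-style records whose header line starts with '>' followed
    by the identifier.
    """
    index = {}
    current_id = None
    start = 0
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line or line.startswith(b">"):
                # A new header or end of file closes the previous record.
                if current_id is not None:
                    index[current_id] = (start, offset - start)
                if not line:
                    break
                fields = line[1:].split()
                current_id = fields[0].decode("ascii") if fields else None
                start = offset
    return index


def read_record(path, index, record_id):
    """Return the text of one record by seeking directly to its offset."""
    offset, length = index[record_id]
    with open(path, "rb") as handle:
        handle.seek(offset)
        return handle.read(length).decode("ascii")
```

Since the index stores only identifiers, offsets, and lengths, one program can build it and a different program, in principle in a different language, can use it to retrieve records without rescanning the files.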

BioFetch

At other times, the easiest way to get a sequence record is over http through the standard CGI interface. BioFetch defines how to compose the CGI request, including the database name, record identifier, and output format. Clients send the GET string to the server, which returns the record in the requested format.

Client support: BioRuby, BioJava, Bioperl, Biopython

Server support: BioRuby, Bioperl
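Composing such a request is mostly a matter of building a query string. The sketch below assumes the three pieces of information are sent as query parameters named db, id, and format; those names and the server URL in the usage line are assumptions for illustration, so check the BioFetch specification for the exact parameter names.

```python
from urllib.parse import urlencode


def build_biofetch_url(base_url, database, record_id, output_format):
    """Compose a GET URL carrying the database, identifier, and output format.

    The parameter names (db, id, format) are assumed for this sketch.
    """
    query = urlencode({"db": database, "id": record_id, "format": output_format})
    return f"{base_url}?{query}"


# Hypothetical server address, used only to show the shape of the URL.
url = build_biofetch_url(
    "http://biofetch.example.org/biofetch", "genbank", "AI129902", "fasta"
)
```

A client would then issue an ordinary HTTP GET for that URL and read the record from the response body.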

XEMBL

SOAP is starting to replace CGI as a way for two programs to communicate over http. XEMBL defines a SOAP protocol to ask for an EMBL record and get the data in an XML format like BSML or AGAVE. The EBI has set up a server which serves up XEMBL as SOAP or as static XML through a simple CGI.

Supported by: Bioperl, BioJava (clients can read one of these formats and connect to the website)

BioCORBA

CORBA is a middleware layer through which clients and servers written in different languages, such as Perl, Java, C, and Python, can communicate and share objects. The BioCORBA project started in 1999 when a specification (using the Interface Definition Language, or IDL) was proposed. The specification has been merged with the OMG’s Life Sciences Research group (LSR) and describes sequences, features, annotations, databases, and alignments. This specification was implemented by BioJava, Bioperl, and Biopython using our client libraries to support the various object definitions. Using this specification, a CORBA server implementing the BioCORBA spec can, for example, serve as a sequence database server. This server can be represented by an object in a program, giving programmatic access to the database server, and this object can be, for example, either a local instance of an in-house sequence database or a remote database serving up the complete EMBL dataset.

The BioHackathon allowed us to solidify the specification, work out some bugs, and test cross-platform, cross-language compatibilities to ensure that all objects created by, for example, Biopython servers, would behave as expected when used by a Bioperl client. The language bindings are still being finished and tested but we expect to release the complete set of packages as part of OBDA by the end of the year.

Supported by: Bioperl, BioJava, Biopython

Registry

Unfortunately, we’ve just defined five new ways to retrieve a record given an identifier. Mostly, though, you don’t care to specify where the data comes from; you just want to get the data. The BioDirectory Registry is a simple system to specify the different ways to get a database record. Suppose you want GenBank record ‘AI129902’. The Registry knows which services provide access to GenBank and can try each in turn to get the sequence.

Supported in: Biopython, Bioperl, BioJava, BioRuby
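The "try each in turn" behaviour can be sketched as a simple fallback loop. The registry layout used here (a dictionary from database name to an ordered list of named fetch functions) and all names in the code are hypothetical; the real Registry is driven by its own configuration format.

```python
class RecordNotFound(Exception):
    """Raised when no registered service could supply the record."""


def fetch_via_registry(registry, database, record_id):
    """Try each service registered for a database until one returns the record.

    registry maps a database name to an ordered list of
    (service_name, fetch_function) pairs; this layout is an assumption.
    """
    failures = []
    for service_name, fetch in registry.get(database, []):
        try:
            return service_name, fetch(record_id)
        except (LookupError, OSError) as exc:
            # Remember why this service failed and fall through to the next.
            failures.append(f"{service_name}: {exc}")
    detail = "; ".join(failures) or "no services registered"
    raise RecordNotFound(f"{database}:{record_id} not available ({detail})")
```

The caller names only the database and the identifier; whether the record finally came from a relational database, a flat-file index, or a remote server is reported back but never has to be chosen up front.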

GMOD - GBrowse

The Generic Model Organism Database project (http://www.gmod.org/) is a collection of software, database schemas, and operating procedures designed to ease the task of building a model organism genome database. Ultimately it aims to be a “MOD-in-a-box”, a set of off-the-shelf components that will snap together to create a complete model organism database. Currently GMOD includes the Apollo genome editor, the web-based GBrowse genomic annotation browser, Bio::Graphics for generic feature rendering, a literature search and curation system, and a generic lab protocol documentation toolset. GBrowse recently released version 1.46, which supports semantic zooming, reading frame analysis, third-party annotation support, and a number of useful glyphs.

Apollo was developed as a collaboration between the Berkeley Drosophila Genome Project (part of the FlyBase consortium) and The Sanger Institute in Cambridge, UK. It allows researchers to explore genomic annotations at many levels of detail, and to perform expert annotation curation, all in a graphical environment. Apollo is being used by the FlyBase biologists to make the final annotations on the finished Drosophila melanogaster genome, and will also be the primary vehicle for sharing these annotations with the community. Because of Apollo’s modular, flexible framework, many research groups are using it as a starting point for customizing their own annotation visualization tools.

Apollo and GBrowse are available at SourceForge: http://sourceforge.net/projects/gmod/. Like all GMOD components, they are distributed under the terms of the Artistic License.

Bibliographic Query Service - BQS

Bibliographic search and citation are central to all scholarly and research activities. Within the domain of life sciences research, bibliographic citation is of particular importance for annotation of large bodies of experimentally developed and computationally derived data, and the rapidly increasing corpus of research literature makes efficient and effective bibliographic searches increasingly critical. This was the motivation for adding bibliographic modules to Bioperl. The Bioperl bibliographic service provides client-side modules allowing standardized access to repositories such as MEDLINE.

The core module Bio::Biblio is a central gateway for querying bibliographic repositories and retrieving citations from them. The default access method is based on SOAP technology (a Web Service approach), but the Bioperl architecture makes it easy to plug in other technologies (another one, biofetch, a traditional HTTP-based access method, is also available).

By default, the Bio::Biblio module queries the MEDLINE repository available as an experimental service from the European Bioinformatics Institute (EBI).

Additionally, there are several modules (grouped around Bio::Biblio::IO) for parsing and converting retrieved citations. These modules are independent of the access method to the repository, and can be used separately, for example to parse PubMed citations stored in local files.

Finally, there are modules that represent individual citations as Perl objects. The object representation promotes approved standards for bibliographic data, such as the Dublin Core metadata elements.

The main URL for the Bibliographic Query service is http://industry.ebi.ac.uk/openBQS/. The Perl modules are described in detail at http://industry.ebi.ac.uk/openBQS/Client_perl.html.

Pise

Pise (http://www.pasteur.fr/~letondal/Pise/) is an interface generator for programs running under Unix. More precisely, it is a software system which, given an XML description of a program’s parameters, generates source code for a user interface, as a component of a system where the user can easily chain programs by pull-down menus. Two GUI generators already exist: a Web interface generator (composed of rather basic HTML and CGI scripts), and a Tcl/Tk interface generator, which is currently used in a prototype tool, biok, in our laboratory. Recently, a perl/bioperl API generator has also been developed (a Python API is planned for the end of the year).
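The generator idea can be illustrated with a toy example that turns an XML description of a program's parameters into a basic HTML form. The XML vocabulary below (program and parameter elements with name, type, and default attributes), the program name, and the type mapping are all invented for this sketch; they are not the Pise description format.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical description of a program's parameters (not the Pise format).
DESCRIPTION = """
<program name="toyalign">
  <parameter name="input" type="file"/>
  <parameter name="gap_penalty" type="integer" default="10"/>
  <parameter name="verbose" type="switch"/>
</program>
"""

# Hypothetical mapping from parameter type to HTML input type.
INPUT_TYPES = {"file": "file", "integer": "number", "switch": "checkbox"}


def generate_form(description_xml):
    """Generate a basic HTML form with one input per described parameter."""
    root = ET.fromstring(description_xml.strip())
    program = escape(root.get("name"), quote=True)
    lines = [f'<form action="{program}" method="post">']
    for param in root.findall("parameter"):
        name = escape(param.get("name"), quote=True)
        input_type = INPUT_TYPES.get(param.get("type"), "text")
        default = param.get("default")
        value = ""
        if default is not None:
            value = f' value="{escape(default, quote=True)}"'
        lines.append(
            f'  <label>{name} <input type="{input_type}" name="{name}"{value}></label>'
        )
    lines.append('  <input type="submit" value="Run">')
    lines.append("</form>")
    return "\n".join(lines)
```

Because the description is data rather than code, the same XML could feed a second generator that emits a different kind of interface, which is the design the paragraph above describes.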

About 300 molecular biology programs have been defined under Pise, including various sequence analysis, phylogeny, alignment, structural analysis (RNA, secondary and tertiary structure) and gene prediction programs. Pise has been in production for more than 4 years at the Pasteur Institute (about 1000 submitted jobs a day during the last year) (http://bioweb.pasteur.fr/). The whole system, i.e. the generators and the complete set of already defined interfaces, is also installed at several other sites, notably for interfacing EMBOSS programs. Other users have developed interfaces for new programs (in genetic analysis, primer design, and imaging analysis). We are also aware of projects for building a new GUI generator.