OBF Google Summer of Code 2014 Wrap-up

In 2014, OBF had six students in the Google Summer of Code 2014™ (GSoC) program mentored under its umbrella of Bio* and related open-source bioinformatics community projects: Loris Cro (Bioruby) with mentors Francesco Strozzi and Raoul Bonnal; Evan Parker (Biopython) with mentors Wibowo Arindrarto and Peter Cock; Sarah Berkemer (BioHaskell) with mentors Christian Höner zu Siederdissen and Ketil Malde; and three students contributed to JSBML: Victor Kofia (mentors: Alex Thomas and Sarah Keating), Ibrahim Vazirabad (mentors: Andreas Dräger and Alex Thomas), and Leandro Watanabe (mentors: Nicolas Rodriguez and Chris Myers).

As a change from earlier years in which OBF participated in GSoC as a mentoring organization, in 2014 we purposefully defined our umbrella as much more inclusive of the wider bioinformatics open-source community, bringing it more in line with the annual Bioinformatics Open-Source Conference (BOSC). In part this was also motivated by "paying it forward", a concept central to growing healthy open-source communities: larger domain-agnostic language projects such as SciRuby and the Python Software Foundation (PSF) had extended an open hand to OBF mentors when OBF was not admitted as a GSoC mentoring organization in 2013. In the end, four of the six successful student applications were for projects outside of the traditional core Bio* projects, an outcome from which everyone won: we had a terrific crop of students, our community grew larger and stronger, and open-source bioinformatics was advanced in a more diverse way than would have been possible otherwise.

In addition to our students, huge kudos also go to our mentors (see above), and to Eric Talevich (Biopython) and Raoul Bonnal (Bioruby), who ran our program participation as administrators. They all invested significant amounts of time on behalf of our community and projects. Thank you!

Below follows a short summary of each of the 2014 student projects, starting with the three JSBML students.

JSBML and GSoC 2014

JSBML is an international community-driven, open-source project to develop a Java API library for reading, writing and manipulating SBML, a data format for representing and exchanging computational models in systems biology. SBML has been in use for over a decade but continues to evolve and grow, and hence so does JSBML. JSBML holds two annual development-oriented workshops, and the three 2014 JSBML GSoC students had the opportunity to participate in and present their work at the autumn event, COMBINE (Computational Modeling in Biology Network), which was held in Los Angeles, California, right at the end of GSoC. Furthermore, a scientific publication on a new JSBML release, currently under review at Bioinformatics, highlights some of the work done by the students. JSBML’s 2014 participation in GSoC was thus a great success and experience, both for the students and for the JSBML project and community.

Ibrahim Y. Vazirabad - "Improving the plugin interface for CellDesigner"

CellDesigner is a frequently used program in computational systems biology. It features an easy-to-use GUI, powerful graph editing functions, and rich simulation functionality, among others. To facilitate rapid prototyping of new algorithms in third-party applications, CellDesigner provides a plug-in interface that gives Java applications access to its robust interface and other features. However, the design and implementation of the plug-in interface made developing software for it very difficult and time-consuming. To remedy this, a draft version of a JSBML library had been created that allows prospective plug-in modules to be developed and tested initially as stand-alone software, which can then be turned into a CellDesigner plug-in with very little effort. The goal of Ibrahim’s project was to improve the interface provided by the library and, importantly, to revise it to support access to one of CellDesigner’s most interesting features, graphical network layout. As a result of Ibrahim’s work, new CellDesigner test cases and plug-ins that use this interface have already been implemented, including one that converts between CellDesigner’s proprietary data format and the official SBML layout extension.

Leandro H. Watanabe - "Arrays Package"

The arrays and dynamic package extensions to SBML have been proposed to overcome SBML’s limitation to static models, which is in contrast to the inherently dynamic nature of many biological systems. The goal of Leandro’s project was to implement the arrays package in JSBML. Rather than enabling models with new behaviors to be constructed, the purpose of the arrays package is to represent regular constructs more efficiently and more compactly than SBML core constructs can. To aid the integration of the arrays package into existing tools, Leandro also implemented the option of flattening an arrayed model to use only SBML core constructs, and a validation procedure that checks whether a model violates any of the rules imposed on array constructs. As a consequence, his work helped solidify the Arrays Specification document of the SBML standard.
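The idea of flattening can be illustrated with a toy sketch. It is written in Python rather than JSBML's Java, it is not Leandro's implementation, and the function name and the "id_index" naming scheme for the expanded elements are assumptions made purely for illustration. It expands one arrayed element into the scalar elements it stands for, one per combination of indices.

```python
from itertools import product


def flatten(element_id, dimension_sizes):
    """Expand one arrayed element into scalar ids (hypothetical naming scheme).

    dimension_sizes lists the size of each array dimension; an empty list
    means the element is not arrayed and is returned unchanged.
    """
    for size in dimension_sizes:
        # A very small stand-in for validation: sizes must be non-negative integers.
        if not isinstance(size, int) or size < 0:
            raise ValueError(f"invalid dimension size: {size!r}")
    ids = []
    for indices in product(*(range(size) for size in dimension_sizes)):
        ids.append("_".join([element_id, *map(str, indices)]))
    return ids


flat = flatten("X", [2, 3])      # a 2 x 3 array of elements named X
scalar = flatten("k", [])        # a non-arrayed element
```

Here a single arrayed declaration stands for six scalar elements, which is the compactness the arrays package aims for; a tool that only understands core constructs would work with the flattened list instead.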

Victor Kofia - "Redesign the implementation of mathematical formulas"

JSBML uses the concept of abstract syntax trees to work with mathematical expressions. For example, the image to the right shows a syntax tree representing the formula k8 · R1. Originally, JSBML implemented different kinds of formula components all in just one complex class with diverse type attributes, which was prone to introducing errors upon code changes and generally made maintenance of the software difficult. Victor implemented a math package for JSBML, in which different kinds of tree nodes that can occur in formulas (e.g., real numbers or algebraic symbols such as ‘plus’ or ‘minus’) are represented with their own, specialized classes. This has made handling of formulas much more straightforward, and also more efficient. In the future, this new representation could even be used for symbolic or numeric calculations.
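As a rough illustration of the design idea (one specialized class per kind of tree node instead of a single class with a type attribute), here is a minimal sketch. It is written in Python rather than JSBML's Java, and all class and method names are hypothetical; it does not reproduce the actual JSBML math package.

```python
from dataclasses import dataclass


class Node:
    """Base class for formula tree nodes (hypothetical, for illustration)."""

    def evaluate(self, env):
        raise NotImplementedError


@dataclass
class Real(Node):
    """A numeric literal."""
    value: float

    def evaluate(self, env):
        return self.value


@dataclass
class Symbol(Node):
    """A named quantity whose value is looked up in the environment."""
    name: str

    def evaluate(self, env):
        return env[self.name]


@dataclass
class Plus(Node):
    left: Node
    right: Node

    def evaluate(self, env):
        return self.left.evaluate(env) + self.right.evaluate(env)


@dataclass
class Times(Node):
    left: Node
    right: Node

    def evaluate(self, env):
        return self.left.evaluate(env) * self.right.evaluate(env)


# The formula k8 * R1 as a tree with one operator node and two symbol leaves.
tree = Times(Symbol("k8"), Symbol("R1"))
result = tree.evaluate({"k8": 0.5, "R1": 4.0})
```

Because each node type carries only the fields and behavior it needs, adding a new operator means adding a class rather than another branch in one large class, which is the maintenance benefit described above.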

Evan Parker - "Addition of a lazy loading sequence parser to Biopython’s SeqIO package"

Though Biopython is already equipped with sequence parsers for a wide array of formats, these generally parse entire records into memory. For large sequences such as entire chromosomes this quickly degrades performance. To allow sequences to be loaded on demand, Evan designed a general lazy-loading parser by refactoring the existing object model, and then added format-specific modifications to each individual parser. The approach he devised works by pre-indexing the sequence files and then loading only those sequence regions that the user requests. Benchmarking and performance comparisons showed that this approach yields significant performance gains when, as is common for genome-scale files, users are interested only in parts of the full sequence. Evan’s code is currently under review by Biopython core developers, and once merged will make parsing large sequences in Biopython much more tractable.
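The pre-index-then-seek idea can be sketched with the Python standard library alone. This is not Evan's code and not Biopython's API: the function names build_index and fetch are hypothetical, and the sketch assumes a well-formed FASTA file in which every sequence line of a record (except possibly the last) has the same length.

```python
import os
import tempfile


def build_index(path):
    """Scan the file once, recording where each record's sequence starts
    and how its lines are laid out (hypothetical helper)."""
    index = {}
    name = None
    with open(path, "rb") as handle:
        while True:
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                name = line[1:].split()[0].decode()
                index[name] = {"offset": handle.tell(), "length": 0,
                               "bases_per_line": None, "bytes_per_line": None}
            elif name is not None:
                entry = index[name]
                bases = len(line.rstrip(b"\r\n"))
                if entry["bases_per_line"] is None:
                    entry["bases_per_line"] = bases
                    entry["bytes_per_line"] = len(line)
                entry["length"] += bases
    return index


def fetch(path, index, name, start, end):
    """Return bases [start, end) of one record, reading only the bytes needed."""
    entry = index[name]
    end = min(end, entry["length"])
    if start >= end:
        return ""
    bpl, byl = entry["bases_per_line"], entry["bytes_per_line"]
    first = entry["offset"] + (start // bpl) * byl + start % bpl
    last = entry["offset"] + (end // bpl) * byl + end % bpl
    with open(path, "rb") as handle:
        handle.seek(first)
        chunk = handle.read(last - first)
    # Drop the line breaks that fall inside the requested byte range.
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.fasta")
    with open(path, "w", newline="\n") as out:
        out.write(">chr1 demo\nACGTACGTAC\nGGGGCCCCTT\nAA\n>chr2\nTTTT\n")
    index = build_index(path)
    region = fetch(path, index, "chr1", 8, 14)
    tail = fetch(path, index, "chr1", 18, 100)
```

The index costs one pass over the file, after which each request reads only a small byte range, which is why the gains are largest when users want a small part of a very large sequence.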

Loris Cro - "An ultra-fast scalable RESTful API to query large numbers of VCF datapoints"

Variant Call Format (VCF) files are commonly generated by genome sequencing projects to record sequence variations among different individuals, and they can get very large. The goal of Loris’ work was to develop code for Bioruby to determine the common variations (i.e., intersections) between multiple individuals and groups of individuals in a fast and scalable way. In the first phase of the project, Loris tested different technologies for storing large VCF files, among which MongoDB emerged as having superior performance. In the second phase Loris developed the code for efficiently storing VCF data into MongoDB, and then implemented algorithms for performing the intersection queries (see GitHub repo and Loris’ project blog). The code was developed using JRuby and uses the HTS-JDK library to parse the VCF data. In the course of the project, Loris also provided valuable feedback to the HTS-JDK team that led to improvements of the VCF parser and data model. The result of Loris’ GSoC work is now available to the community as a Ruby Gem, which has already been tested and used in large international genome re-sequencing projects, including Gene2Farm and WHEALBI.
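What an intersection query computes can be shown with a small in-memory sketch. It is written in Python with plain sets rather than JRuby and MongoDB, so it says nothing about how Loris' gem stores or queries data; the sample names, variant records, and the function shared_variants are made up for illustration. A variant is identified here by chromosome, position, reference allele, and alternate allele.

```python
# Hypothetical variant calls per individual: (chrom, pos, ref, alt).
calls = {
    "sample_a": {("chr1", 101, "A", "G"), ("chr1", 250, "C", "T"), ("chr2", 77, "G", "A")},
    "sample_b": {("chr1", 101, "A", "G"), ("chr2", 77, "G", "A"), ("chr3", 5, "T", "C")},
    "sample_c": {("chr1", 101, "A", "G"), ("chr1", 250, "C", "T")},
}


def shared_variants(calls, samples):
    """Return the variants present in every one of the named samples."""
    sets = [calls[name] for name in samples]
    if not sets:
        return set()
    return set.intersection(*sets)


common_all = shared_variants(calls, ["sample_a", "sample_b", "sample_c"])
common_ab = shared_variants(calls, ["sample_a", "sample_b"])
```

Holding every call set in memory like this does not scale to genome re-sequencing projects, which is the motivation for pushing both storage and the intersection into a database as described above.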

Sarah Berkemer - "Open source high-performance BioHaskell"

One of the challenges with sequence alignments for the purposes of sequence similarity searches is that for most known genes (i.e., sequences) relatively little is known about their biology, and the few for which a lot is known therefore tend to be only remotely related to a query sequence. Transitive alignments try to ameliorate this by aligning the query sequence against a large body of known but not deeply understood sequences, the intermediate set, which in turn are then aligned against the core of well-understood sequences. However, in contrast to aligning two sequences, aligning a sequence via a vast intermediate data set to a smaller core set is slow and memory-consuming. As part of her GSoC project, Sarah dug deep into the structure of the algorithm, and rewrote core parts to make use of fusing data structures and efficient tree-like data structures (see her project blog). Her work brought down the runtime for a benchmark by nearly a factor of 3, from 31 to 11 minutes, and, arguably even more important, reduced memory consumption from 53 to 22 gigabytes. This now allows running the program on consumer-grade high-memory PCs. With Sarah having finished her Master’s degree (congrats!!) in the meantime, she and her mentors are now in the process of writing a scientific application note and are planning to make the program available as a web service.
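The two-step structure of a transitive search (query to intermediate set, intermediate set to core) can be sketched as follows. The sketch is in Python rather than Haskell and works on precomputed similarity scores instead of real alignments. The scoring rule, which takes the weaker of the two links and keeps the best chain per core sequence, is purely an assumption for illustration and is not taken from Sarah's program; all names and numbers are made up.

```python
def transitive_hits(query_to_mid, mid_to_core):
    """Combine query->intermediate and intermediate->core scores.

    Assumption (for illustration only): a chain is as strong as its weaker
    link, and each core sequence keeps the best chain that reaches it.
    """
    best = {}
    for mid, score_query_mid in query_to_mid.items():
        for core, score_mid_core in mid_to_core.get(mid, {}).items():
            chain_score = min(score_query_mid, score_mid_core)
            if chain_score > best.get(core, float("-inf")):
                best[core] = chain_score
    return best


# Hypothetical scores: the query hits three intermediate sequences,
# two of which in turn hit well-understood core sequences.
query_to_mid = {"m1": 80, "m2": 55, "m3": 90}
mid_to_core = {"m1": {"c1": 60, "c2": 95}, "m2": {"c1": 70}, "m3": {}}

hits = transitive_hits(query_to_mid, mid_to_core)
```

Even in this toy form the work grows with the number of intermediate hits times the number of core hits per intermediate, which hints at why the real computation over a vast intermediate set is demanding in both time and memory.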

For BioHaskell, a rather small family within the much larger OBF umbrella, the chance to have a student contribute to functional programming for computational biology has been a tremendous opportunity and learning experience as well.