BioPerl has moved to GitHub

BioPerl has migrated to git and GitHub!  We have also set up mirrors of several key repositories at the great public git hosting site repo.or.cz.

If you are a current BioPerl developer (had a previous account for direct access to our prior Subversion repository), please sign up for a GitHub account and let us know your user ID.  Also, add the extra email (where ‘DEVNAME’ is your original Subversion account ID).  This should map any previous commits from the older Subversion and CVS repositories to your new GitHub account.

[Read More]

O|B|F Google Summer of Code Accepted Students

I’m pleased to announce the acceptance of OBF’s 2010 Google Summer of Code students, listed in alphabetical order with their project titles and primary mentors:

Mark Chapman (PM Andreas Prlic) - Improvements to BioJava including Implementation of Multiple Sequence Alignment Algorithms

Jianjiong Gao (PM Peter Rose) - BioJava Packages for Identification, Classification, and Visualization of Posttranslational Modification of Proteins

Kazuhiro Hayashi (PM Naohisa Goto) - Ruby 1.9.2 support of BioRuby

Sara Rayburn (PM Christian Zmasek) - Implementing Speciation & Duplication Inference Algorithm for Binary and Non-binary Species Tree

[Read More]

Illumina FASTQ files - Read Segment Quality Control Indicator

In another quirk to the FASTQ story, recent Illumina FASTQ files don’t actually use the full range of PHRED scores - and a score of 2 has a special meaning: the Read Segment Quality Control Indicator (RSQCI, encoded as ‘B’).
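To see why a score of 2 shows up as ‘B’, here is a minimal Python sketch. It assumes the offset-64 PHRED encoding used by the Illumina 1.3+ FASTQ variant; the function names are hypothetical and are not part of any OBF library.

```python
# Assumption: Illumina 1.3+ style FASTQ, where PHRED scores are stored
# as ASCII characters with an offset of 64.
ILLUMINA_OFFSET = 64
RSQCI_SCORE = 2  # the score with the special RSQCI meaning


def decode_illumina_quality(qual):
    """Turn an offset-64 quality string into a list of integer scores."""
    return [ord(char) - ILLUMINA_OFFSET for char in qual]


def flag_rsqci(qual):
    """Return one boolean per base: True where the score is the RSQCI value."""
    return [score == RSQCI_SCORE for score in decode_illumina_quality(qual)]


# chr(64 + 2) is 'B', which is why the indicator appears as 'B'.
print(decode_illumina_quality("hhfB"))
print(flag_rsqci("hhfB"))
```

A downstream script could use flags like these to decide how to treat the affected bases, rather than reading ‘B’ as an ordinary low quality score.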

Hats off to Dr Torsten Seemann for raising awareness of this issue in his post on the seqanswers.com forum, referring to a presentation by Tobias Mann of Illumina which says:

The Read Segment Quality Control Indicator:

[Read More]

Partial sequence files with Biopython

This is another blog post to highlight one of the neat tricks you’ll be able to do with Biopython 1.54 (which you can help test with the Biopython 1.54 beta release).

It is often useful to be able to extract a few records from a larger sequence file - for example, some sequences of interest from a full UniProt or GenBank dump. One obvious way to try to do this is to parse the file into an object representation (i.e. SeqRecord objects using Bio.SeqIO.parse(...)), filter to pick out the entries you want, and then write them back to disk (using Bio.SeqIO.write(...)). However, for complex file formats like GenBank this can be lossy (Bio.SeqIO does not support a 100% identical round trip), and Biopython doesn’t currently support writing out the SwissProt plain text format used by UniProt. So, that approach won’t work.
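The general way around a lossy round trip is to copy the raw text of each wanted record instead of re-serialising it. The sketch below illustrates that idea with the standard library only; it is not Biopython's API. It assumes SwissProt plain text, where each record starts with an ID line and ends with a line containing only //, and the function names are hypothetical.

```python
from io import StringIO


def iter_raw_swiss(handle):
    """Yield (entry_name, raw_text) for each SwissProt plain text record."""
    lines = []
    for line in handle:
        if not lines and not line.strip():
            continue  # skip blank lines between records
        lines.append(line)
        if line.rstrip() == "//":
            # Assumption: the first line is the ID line, whose second
            # whitespace-separated token is the entry name.
            entry_name = lines[0].split()[1]
            yield entry_name, "".join(lines)
            lines = []


def extract_raw(in_handle, out_handle, wanted):
    """Copy the untouched text of the wanted records; return how many."""
    wanted = set(wanted)
    count = 0
    for entry_name, raw_text in iter_raw_swiss(in_handle):
        if entry_name in wanted:
            out_handle.write(raw_text)
            count += 1
    return count


# Tiny made-up input, just to show the call pattern.
example = (
    "ID   A_HUMAN   Reviewed;   3 AA.\nSQ   SEQUENCE\n     MKV\n//\n"
    "ID   B_MOUSE   Reviewed;   2 AA.\nSQ   SEQUENCE\n     MK\n//\n"
)
output = StringIO()
print(extract_raw(StringIO(example), output, ["B_MOUSE"]))
print(output.getvalue())
```

Because the record text is never parsed into objects, whatever was in the input comes out byte for byte.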

[Read More]

Making Biopython SeqIO and AlignIO easier

One of the small changes coming in Biopython 1.54 (which you can try out already using the Biopython 1.54 beta) is to Bio.SeqIO and Bio.AlignIO. Previously the input and output functions required file handles, but they will now also accept filenames.

This is a case of practicality beats purity (to quote the Zen of Python), and is particularly handy when doing very short scripts or working at the Python prompt.

For example, filtering a FASTA file to take only entries with a minimum length of 100 can be done like this (with handles):

[Read More]

Biopython 1.54 beta released

A beta release for Biopython 1.54 is now available for download and testing.

Since Biopython 1.53 was released at the end of last year, several new features and more documentation have been added, plus the usual bug fixes. For full details see the NEWS file.

All the new features have been tested by the dev team but it’s possible there are cases that we haven’t been able to foresee and test, especially for the updated multiple sequence alignment object (which is what you’ll now get when parsing alignments with Bio.AlignIO), the new Bio.Phylo module, and the Bio.SeqIO support for Standard Flowgram Format (SFF) files.

[Read More]

BioPerl at GMOD Meeting 2010

BioPerl developers and users attended the BioPerl satellite meeting on January 13th, just prior to the GMOD Meeting.  Several items were covered on the agenda:

  • In order to start addressing whole genome data with more lightweight objects, we are planning on setting up a lightweight Bio::SeqI object that has a flexible DB backend (i.e. Bio::DB::SeqFeature::Store or similar).  We are also contemplating adding lazy parsing for some parsers, possibly using the Bio::PullParserI methods (or similar) that Sendu Bala created.
  • After a final 1.6 branch point release, we may ‘freeze’ BioPerl in a maintenance mode, primarily so that we can reorganize core into several more easily installed subdistributions on a branch.  New modules will essentially be additional separate repos that will depend on BioPerl core.  This reorganization has been discussed for a few years now, and as we edge closer to starting this (probably this spring) we’ll announce more details.
  • Some initial thoughts on how to handle circular genomes more efficiently.  We essentially do this already, but it isn’t foolproof.
  • Need some significant time dedicated towards GFF3-based coding (reimplement FeatureIO but allow some flexibility).  Rob Buels had started the initial run at splitting out FeatureIO, so the next step is a true reimplementation.
  • We don’t plan on including Moose support for the immediate future, feeling that it would be better to reimplement some of the classes from scratch using Moose and similar as a BioPerl 2.0, or possibly await the impending Rakudo Perl 6 alpha and start afresh using that instead of Moose.

Anything we missed?  Anything you would like to address?  Please add comments and we’ll discuss them on list.

[Read More]

BioRuby 1.4.0 released

We are pleased to announce the release of BioRuby 1.4.0. This new release contains many new features, in addition to bug fixes and improvements.

PhyloXML support: Support for reading and writing the PhyloXML file format has been added, developed by Diana Jaunzeikare, mentored by Christian M Zmasek and co-mentors, supported by Google Summer of Code 2009 in collaboration with the National Evolutionary Synthesis Center (NESCent).

FASTQ file format support: Support for reading and writing the FASTQ file format has been added. All three FASTQ format variants are supported. The code was written by Naohisa Goto, with the help of discussions on the open-bio-l mailing list. The prototype of the Bio::Fastq class was first developed during the BioHackathon 2009 held in Okinawa.

[Read More]

Sanger FASTQ format and the Solexa/Illumina variants

I’m delighted to announce an open access publication in Nucleic Acids Research describing the FASTQ file format based on the conventions agreed by the OBF projects:

The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Peter J. A. Cock (Biopython), Christopher J. Fields (BioPerl), Naohisa Goto (BioRuby), Michael L. Heuer (BioJava) and Peter M. Rice (EMBOSS). Nucleic Acids Research, doi:10.1093/nar/gkp1137

This will hopefully serve as a reference describing the original standard Sanger FASTQ, and the two variants from Solexa/Illumina, and how to inter-convert between them.
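As a rough illustration of what inter-converting involves, here is a minimal Python sketch. It assumes the usual encodings for the three variants: Sanger stores PHRED scores with an ASCII offset of 33, Illumina 1.3+ stores PHRED scores with an offset of 64, and the older Solexa variant stores Solexa scores with an offset of 64. The function names are hypothetical; the OBF libraries provide their own converters.

```python
import math

SANGER_OFFSET = 33  # PHRED scores, ASCII offset 33
OFFSET_64 = 64      # used by both Solexa and Illumina 1.3+ variants


def solexa_to_phred(q_solexa):
    """Map a Solexa score onto the PHRED scale (result is a float)."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def illumina13_to_sanger(qual):
    """Both variants hold PHRED scores, so only the ASCII offset changes."""
    return "".join(chr(ord(c) - OFFSET_64 + SANGER_OFFSET) for c in qual)


def solexa_to_sanger(qual):
    """Decode Solexa scores, map them to PHRED, round, and re-encode."""
    return "".join(
        chr(round(solexa_to_phred(ord(c) - OFFSET_64)) + SANGER_OFFSET)
        for c in qual
    )


print(illumina13_to_sanger("hhfB"))
print(solexa_to_sanger("h@;"))
```

The two scales agree closely at high quality and differ most at the low end, which is why the Solexa conversion needs the score mapping and not just an offset shift.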

[Read More]

Biopython 1.53 released

We are pleased to announce the availability of Biopython 1.53, a new stable release of the Biopython library, three months after the release of Biopython 1.52. This is our first release since migrating from CVS to git for source code control.

There have been some additions to our core objects - the Seq (and related UnknownSeq) objects gained upper and lower methods (like the string methods of the same name but alphabet aware) plus a new ungap method. The SeqFeature object now has an extract method to get the region of sequence it describes (useful for getting CDS nucleotide sequences from GenBank files). Also SeqRecord objects now support addition, giving a new SeqRecord with the combined sequence, all the SeqFeatures, and any common annotation.

[Read More]