Cross-links in GenomeDiagram

I’ve just finished writing up an example for the Biopython Tutorial of the new GenomeDiagram functionality added in Biopython 1.59. You can now control the start and end points of individual tracks, and you can add cross-links between regions of different tracks. This example attempts a simplified reproduction of Figure 6 in Proux et al. (2002), and shows three related phage genomes one above the other. Different classes of genes have been given different colors, while the strength of the red shaded cross-links indicates the percentage identity of the linked genes. [Read More]

Chromosome Diagrams in Biopython

One of the new things coming in Biopython 1.59 is improved chromosome diagrams, something you may have seen via Twitter. I’ve just been updating the Biopython Tutorial (also available as a PDF) to include an example that draws one; a PDF version of the figure is available too. This example just parses the Arabidopsis thaliana GenBank files to get the chromosome lengths and the tRNA gene placements. There are so many tRNA genes on the forward strand of Chr I that their labels are forced to overlap. [Read More]

Illumina FASTQ files - Read Segment Quality Control Indicator

In another quirk to the FASTQ story, recent Illumina FASTQ files don’t actually use the full range of PHRED scores, and a score of 2 has a special meaning: the Read Segment Quality Control Indicator (RSQCI, encoded as ‘B’). Hats off to Dr Torsten Seemann for raising awareness of this issue in his post on the seqanswers.com forum, referring to a presentation by Tobias Mann of Illumina which says: [Read More]
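To see what the indicator looks like in a read, here is a minimal sketch that trims the trailing run of ‘B’ characters from a quality string. It assumes the Illumina PHRED+64 encoding, under which a score of 2 is written as chr(64 + 2), i.e. ‘B’. The helper name and the sample read are made up for illustration, and this is not how Biopython itself treats such reads.

```python
ILLUMINA_OFFSET = 64  # assumed PHRED+64 encoding, so score 2 is chr(66) == "B"


def trim_rsqci(seq, qual):
    """Drop the trailing run of 'B' quality characters (hypothetical helper)."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ")
    # Number of characters left once the trailing 'B' run is stripped.
    keep = len(qual.rstrip("B"))
    return seq[:keep], qual[:keep]


# Made-up read whose last three bases carry the score-2 indicator.
seq, qual = trim_rsqci("ACGTACGTAC", "hhhgfdaBBB")
print(seq, qual)
# Remaining quality characters decoded back to integer PHRED scores.
print([ord(char) - ILLUMINA_OFFSET for char in qual])
```

Only a trailing run is removed here; an isolated ‘B’ in the middle of a read is left alone, which is a design choice of this sketch rather than anything stated in the excerpt.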

Partial sequence files with Biopython

This is another blog post to highlight one of the neat tricks you’ll be able to do with Biopython 1.54 (which you can help test with the Biopython 1.54 beta release). It is often useful to be able to extract a few records from a larger sequence file - for example, some sequences of interest from a full UniProt or GenBank dump. One obvious way to try to do this is to parse the file into an object representation (i. [Read More]

Working with FASTQ files in Biopython when speed matters

Biopython 1.51 onward includes support for Sanger, Solexa and Illumina 1.3+ FASTQ files in Bio.SeqIO, which allows a lot of neat tricks very concisely. For example, the tutorial (PDF) has examples finding and removing primer or adaptor sequences. However, because the Bio.SeqIO interface revolves around SeqRecord objects, there is often a speed penalty. For example, for FASTQ files, the quality string gets turned into a list of integers on parsing, and then re-encoded back to ASCII on writing. [Read More]
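To make that overhead concrete, here is a small standard-library sketch of the round trip for a Sanger-style quality string (PHRED+33). The helper names are invented for this example and are not Biopython functions. The point is only that decoding and re-encoding gives back the same string, so a script that never looks at the scores does that work for nothing.

```python
SANGER_OFFSET = 33  # Sanger FASTQ stores each PHRED score as chr(score + 33)


def decode_quality(qual):
    """ASCII quality string -> list of integer PHRED scores."""
    return [ord(char) - SANGER_OFFSET for char in qual]


def encode_quality(scores):
    """List of integer PHRED scores -> ASCII quality string."""
    return "".join(chr(score + SANGER_OFFSET) for score in scores)


qual = "IIIIHHG5+!"  # made-up quality string
scores = decode_quality(qual)
print(scores)

# Re-encoding reproduces the original string exactly.
round_trip = encode_quality(scores)
print(round_trip == qual)
```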

Simpler, optimized format conversion with Biopython

As per Peter’s recent post we are using this space to show off a couple of the new features in Biopython 1.52 before it is released. In this post we’ll look at the new convert() function that both Bio.SeqIO and Bio.AlignIO will get in Biopython 1.52. No one has ever complained that bioinformatics just doesn’t have enough file formats - you probably frequently find yourself converting sequence files to suit particular applications with Bio. [Read More]
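As a rough sketch of what such a call looks like, the snippet below converts one FASTQ record to FASTA. It assumes the Bio.SeqIO.convert(in_file, in_format, out_file, out_format) signature from the released Biopython documentation and needs Biopython installed. The record is made up, and in-memory handles stand in for real files.

```python
from io import StringIO

from Bio import SeqIO

# A single made-up Sanger FASTQ record, held in memory instead of a file.
fastq_handle = StringIO("@read1\nACGTACGT\n+\nIIIIHHGG\n")
fasta_handle = StringIO()

# convert() reads one format and writes another in a single call,
# returning the number of records it converted.
count = SeqIO.convert(fastq_handle, "fastq", fasta_handle, "fasta")
fasta_text = fasta_handle.getvalue()
print(count)
print(fasta_text)
```

With real data you would pass filenames instead of the two StringIO handles.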

Indexing sequence files with Biopython

The forthcoming release of Biopython 1.52 will include a couple of nice improvements to the Bio.SeqIO module, and here we’re going to introduce the new index function. This will of course be covered in the Biopython Tutorial & Cookbook (PDF) once this code is released. Suppose you have a large sequence file with many individual sequences in it. This could be next-generation sequence data for example, maybe a FASTQ, FASTA or QUAL file. [Read More]
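The general idea behind indexing a file like this can be sketched with the standard library alone: scan the file once, remember where each record starts, then seek straight to the record you want. The code below is an illustration for FASTA only, with invented function names and a made-up three-record file; it is not Biopython’s implementation.

```python
import os
import tempfile


def build_fasta_offsets(path):
    """Map each record ID to the byte offset of its '>' line (illustrative only)."""
    offsets = {}
    with open(path, "rb") as handle:
        while True:
            position = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                # The ID is the first whitespace-delimited word after '>'.
                record_id = line[1:].split(None, 1)[0].decode()
                offsets[record_id] = position
    return offsets


def fetch_record(path, offset):
    """Seek to one record and return its raw text without reading the rest."""
    lines = []
    with open(path, "rb") as handle:
        handle.seek(offset)
        lines.append(handle.readline())  # the '>' header line
        for line in handle:
            if line.startswith(b">"):  # next record begins, stop here
                break
            lines.append(line)
    return b"".join(lines).decode()


# Made-up FASTA file written to a temporary location.
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(">seq1 first\nACGT\nACGT\n>seq2 second\nGGCC\n>seq3 third\nTTAA\n")
    fasta_path = tmp.name

offsets = build_fasta_offsets(fasta_path)
record_text = fetch_record(fasta_path, offsets["seq2"])
os.remove(fasta_path)
print(record_text)
```

Only the small dictionary of offsets stays in memory, which is what makes this approach workable for very large files. In released versions of Biopython the equivalent call is Bio.SeqIO.index(filename, format), which returns a dictionary-like object keyed by record ID.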

Clever tricks with NCBI Entrez EInfo (& Biopython)

Constructing complicated NCBI Entrez searches can be tricky, but it turns out one of the Entrez Programming Utilities called Entrez EInfo can help. For example, suppose you want to search for mitochondrial genomes from a given taxon, either directly in the Entrez web interface or for use in a script with EFetch. I knew from past experience about using name[ORGN] in Entrez to search for an organism name - but how would you specify just mitochondria? [Read More]

Introducing (and expanding) the Biopython Cookbook

Hi all, you may have noticed we’re trying out using the wiki for Biopython cookbook entries. It’s a new idea, so at the moment there are only a few ‘recipes’ on offer. If you have some tricks you find yourself using time and again to solve a problem, why not share them? Similarly, if you find yourself coming up against a problem you can’t seem to solve easily with Biopython’s tools, send a message to one of the mailing lists proposing it as a cookbook example and someone just might solve it for you! [Read More]