Abstract for Bioinformatics Open Source Conference (BOSC) 2001

Introduction to AGAVE Genomic Annotation XML

Brian King
DoubleTwist, Inc.
2001 Broadway
Oakland, CA 94612
Phone: 510-628-0100

Overview

The recent availability of whole-genome sequence data and high-throughput sequence analysis challenges bioinformaticists with the problem of processing and managing large data sets. We designed AGAVE XML to represent an integrated analysis of an entire genome, and to aid processing and management of genomic data by emphasizing verifiability, comprehensiveness, and extensibility. Though similar to the GAME XML format, AGAVE offers additional features that assist with data integration and exchange. AGAVE is available under an open source license at http://www.agavexml.org.

Verifiable

Until recently, all genomic data was created by hand and therefore verified manually. It is not practical to manually verify the computational analysis of a complete genome, so it is important that the data model itself exclude errors. The first aid to verifiability is that the AGAVE DTD is parseable, so a validating XML parser can use the DTD to check that data matches the specification. This has proved valuable for our own genomic pipeline, because some types of errors in the data can be detected at the earliest stages.

Another level of verifiability is traceability. In our use of AGAVE we integrate data from public sources. Each source is identified by an element that holds the ID, version, and database code of the source record. When analysis is complete, a tester can trace sequence data back to its source database to compare annotations. The same element is also used to create hyperlinks, as described later.

Comprehensive

AGAVE is comprehensive because it can represent relevant biological data in public databases such as NCBI’s GenBank. Here are excerpts from a GenBank record in AGAVE format:

Homo sapiens clone RP11-17E16, WORKING DRAFT SEQUENCE, 10 unordered pieces.
HTGS_PGD; HTG; HTGS_PHASE1; HTGS_DRAFT.
caactctggtggtttggggctttggcatctaaactcttaggaaaaaggcacggtctcccttgacctttgtc ...
...

Extensible

AGAVE is extensible because it uses generic elements for computational results, which can capture results from new sequence annotation algorithms. Because AGAVE is XML-based, programs to manipulate and extract genomic data can be written using standard XML libraries, and data in the AGAVE format can be transformed to and from other XML-based formats using tools such as XSLT (Extensible Stylesheet Language Transformations). We are working on a future version of AGAVE that will be described in an XML Schema, so that the AGAVE datatypes can be extended as well.

Annotation

There are two types of annotation elements in AGAVE: seq_feature, which represents curated and typed annotations, and comp_result, which is a very generic element whose purpose is to capture all required information from sequence analysis programs. The comp_result and seq_feature elements are recursively defined, so both flat and hierarchical groups can be represented. Result properties are a simple means of storing scores and other scalar values. An example of a BLASTX annotation is represented in the XML below.

>gi|7513096|pir||JE0225 JH8 protein - human
>gi|7513096|pir||JE0225 JH8 protein - human
Query: 24850 SVM*DLRKQKDTLWKFYAESDEQKLMKNRKTLPNVKNKDLSQVLRDQICQCCSEHMPLNG 24671
S + +R K L +F+A SD K ++ R+TL K + L +VL + SE +P++G
Sbjct: 2 STLYSIRAHKAQLLRFFASSDSNKALEQRRTLHTPKLEHLDRVLYEWFLGKRSEGVPVSG 61
166 2e-38 33 52 0
...

Data Integration

We group sequence annotations into sequence_map elements. The sequence_map elements can be treated independently, so we can merge the results of new analyses into old documents or store the results of different computations in different files. The sequence_map element contains a readable label and, optionally, a computation description so that the annotations can be traced to their source.

...
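Because sequence_map elements are independent of one another, merging a new analysis into an existing document can amount to appending elements. The following is a minimal sketch in Python using the standard xml.etree.ElementTree library. The element names sciobj, bio_sequence, sequence_map, seq_feature, and comp_result come from this abstract, but the nesting shown, the "label" attribute, and the function name merge_sequence_maps are assumptions made for illustration and are not taken from the AGAVE DTD.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical documents: the nesting and the "label"
# attribute are assumptions for illustration, not the AGAVE DTD.
OLD_DOC = """<sciobj>
  <bio_sequence>
    <sequence_map label="genes"><seq_feature/></sequence_map>
  </bio_sequence>
</sciobj>"""

NEW_DOC = """<sciobj>
  <bio_sequence>
    <sequence_map label="blastx"><comp_result/></sequence_map>
  </bio_sequence>
</sciobj>"""


def merge_sequence_maps(old_xml, new_xml):
    """Append sequence_map elements from new_xml whose label is not yet in old_xml."""
    old_root = ET.fromstring(old_xml)
    new_root = ET.fromstring(new_xml)
    target = old_root.find("bio_sequence")
    seen = {m.get("label") for m in target.findall("sequence_map")}
    for seq_map in new_root.iter("sequence_map"):
        if seq_map.get("label") not in seen:
            target.append(seq_map)
            seen.add(seq_map.get("label"))
    return old_root


merged = merge_sequence_maps(OLD_DOC, NEW_DOC)
labels = [m.get("label") for m in merged.iter("sequence_map")]
print(labels)  # ['genes', 'blastx']
```

Keying the merge on the label makes it safe to repeat: running the same analysis file through the merge a second time adds nothing.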
Genomic Assembly

Sequence data is assembled from sequenced fragments to produce a final assembly of chromosomes. Identifying the original fragments is important to biologists, so we represent the hierarchical assembly of sequence data in the XML and provide an element that allows linking the sequence back to its source database.

...

The version of AGAVE currently under development adds new elements to allow representation and storage of the assembly of a complete genome. One element is a container for all the sub-intervals and annotation data mapped to a chromosome. Files containing the data on complete chromosomes would be unmanageable, so a second element defines a subset of the chromosome data for storage. Using the two elements in combination, we can create an integrated analysis of an entire genome but divide it into arbitrarily sized lengths for easy storage and exchange.

Comparison to GAME

The following table maps the high-level GAME elements to the corresponding AGAVE elements.

GAME                               AGAVE
--------------------------------------------------------------
game                               sciobj
seq                                bio_sequence
map_position                       map_location
annotation                         seq_feature
feature_span                       seq_feature
computational_analysis             sequence_map
computational_analysis/program     computation.algorithm
computational_analysis/database    computation.database
result_set                         comp_result
result_span                        comp_result
span                               seq_location or query_region

The main differences between AGAVE and GAME are that AGAVE:

1. is defined by a parseable DTD
2. has elements to show genomic assembly
3. links data back to source databases using a dedicated element
4. allows annotation hierarchies of arbitrary depth
5. shows the annotation relationship by XML containment rather than ID reference

Linking to Source Databases

There are no explicit URLs in AGAVE data.
Embedding hyperlinks in the data would make a data set hard to customize, for example to point references at local databases, so including URLs would have been too restrictive for our design. Instead we use a second XML document to map database identifiers to the corresponding URLs. Databases are identified by code, and a link specification determines how sequence IDs are translated into URLs. The specification allows for regular expression processing of IDs.

Database Types

Hyperlink Specifications
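
The translation from a database code and sequence ID to a URL can be sketched in Python as follows. The format of the link specification document is not reproduced in this abstract, so everything here is hypothetical: the in-memory table LINK_SPECS stands in for the second XML document, and the database codes, regular expressions, URL templates, and the function name resolve_link are invented for illustration.

```python
import re

# Hypothetical link specifications keyed by database code. Each entry
# holds a regular expression applied to the sequence ID and a URL
# template; neither the codes nor the URLs come from AGAVE itself.
LINK_SPECS = {
    "gb": (r"^(?P<acc>[A-Z]+\d+)(\.\d+)?$", "http://example.org/genbank?acc={acc}"),
    "local": (r"^(?P<acc>.+)$", "http://localhost/seqdb/{acc}"),
}


def resolve_link(db_code, seq_id, specs=LINK_SPECS):
    """Translate (database code, sequence ID) into a URL, or None if unmapped."""
    spec = specs.get(db_code)
    if spec is None:
        return None
    pattern, template = spec
    match = re.match(pattern, seq_id)
    if match is None:
        return None
    # Named groups captured from the ID fill the placeholders in the template.
    return template.format(**match.groupdict())


print(resolve_link("gb", "AC012345.2"))  # http://example.org/genbank?acc=AC012345
```

Pointing a data set at a local database then means supplying a different specification table, with no change to the annotation data itself.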