-

Difference between revisions of "Cross Projects"

From Open Bioinformatics Foundation
(Created page with "==== An ultra-fast scalable RESTful API to query large numbers of genomic variations ==== ; Rationale :VCF files are the typical output of whole genome resequencing projects...")
 
 
(4 intermediate revisions by the same user not shown)
Line 1: Line 1:
==== An ultra-fast scalable RESTful API to query large numbers of genomic variations ====
+
= Project Ideas =
 +
=== Cross BioInterchange ===
 
;  Rationale
 
;  Rationale
 
+
: [http://www.biointerchange.org/index.html BioInterchange] interchanges data using the Resource Description Framework (RDF): it automatically creates RDF triples from TSV, XML, GFF3, GVF, Newick and other file formats common in bioinformatics. BioInterchange helps you transform your data sets into linked data for sharing and data integration via the command line, a web service, or an API. BioInterchange was conceived and designed during NBDC/DBCLS's [http://2012.biohackathon.org/ BioHackathon 2012]. Joachim Baran provided the architecture and RDF serialization implementations; Geraint Duck provided JSON and XML deserialization implementations and contributed to architecture decisions; Kevin B. Cohen and Michel Dumontier gave guidance on ontology use and applications, and Michel brought forward and extended the Semanticscience Integrated Ontology (SIO). Jin-Dong Kim helped to define ontology relationships for RDFizing DBCLS' PubAnnotation category annotations. The main idea of this project is to have a central service which can be used both as a validator and as an interchange service for different languages.
:VCF files are the typical output of whole genome resequencing projects (http://www.1000genomes.org/node/101). They hold the information on all the mutations and variations ([http://en.wikipedia.org/wiki/Single-nucleotide_polymorphism SNPs] and [http://en.wikipedia.org/wiki/Indel InDels]) that are found by comparing the output of an [http://en.wikipedia.org/wiki/DNA_sequencing#Next-generation_methods NGS] platform with a reference genome. These files are not especially large (a typical uncompressed VCF file is a few gigabytes), but they are full of information on the millions of positions in the genome where mutations are found. Large resequencing projects can produce hundreds or thousands of these files, one for each sample sequenced.
 
:Existing tools (such as [http://vcftools.sourceforge.net VCFTools] or [http://samtools.sourceforge.net/samtools.shtml#4 BCFTools]) let you manipulate, convert and access the information stored in VCF files, but they are limited in functionality and speed when many files must be processed together. Examples are comparing the variations found in 100 samples to identify mutation sites shared among sub-groups of samples, or extracting all the mutations that are present in 50 samples but absent from the other 50.
 
  
 
;  Approach
 
;  Approach
:The project will develop a RESTful API to address the issues described in the rationale and to allow users to manipulate and compare hundreds of VCF files. Given the large amount of information that will need to be processed, fast and scalable JVM-based languages such as Scala or JRuby will be a good choice. A database engine will be required to support the information processing and data mining; traditional RDBMSs, NoSQL databases and key-value stores are all valid alternatives. The choice of database engine will be discussed between the student and the mentors and within the Bio projects community.
+
:The project will identify the most commonly used file formats across the languages currently supported under OBF, design a RESTful API, and plan an implementation for all the supported languages. BioInterchange was developed in Ruby, but the aim of the project is a language-agnostic system in which each converter can be implemented in the language best suited to that functionality. The service is expected to receive high traffic, so an appropriate refactoring or reimplementation using parallel techniques, or languages devoted to parallel programming, is a possibility.
 
   
 
   
 
;  Difficulty and needed skills
 
;  Difficulty and needed skills
:The project is of medium to high difficulty and is aimed at talented students. Previous knowledge of Scala or Ruby is not necessary, but a background in advanced programming languages (like C++, Java) is essential to develop the project.
+
:The project is of medium to high difficulty and is aimed at talented students. Previous knowledge of Ruby or another scripting language is preferred, and flexibility in learning other languages is required.
 
;  The project requires
 
;  The project requires
:Knowledge of advanced programming languages. Some experience with and knowledge of databases and data mining will help in managing the information in VCF files.
+
:Knowledge of advanced programming languages and meta-programming, and some familiarity with parallelization and web service design.
  
 
;  Mentors
 
;  Mentors
Francesco Strozzi, Raoul J.P. Bonnal
+
:  Raoul J.P. Bonnal, Francesco Strozzi, Toshiaki Katayama, Joachim Baran
  
 
----
 
----

Latest revision as of 11:17, 14 February 2014

Project Ideas

Cross BioInterchange

Rationale
BioInterchange interchanges data using the Resource Description Framework (RDF): it automatically creates RDF triples from TSV, XML, GFF3, GVF, Newick and other file formats common in bioinformatics. BioInterchange helps you transform your data sets into linked data for sharing and data integration via the command line, a web service, or an API. BioInterchange was conceived and designed during NBDC/DBCLS's BioHackathon 2012. Joachim Baran provided the architecture and RDF serialization implementations; Geraint Duck provided JSON and XML deserialization implementations and contributed to architecture decisions; Kevin B. Cohen and Michel Dumontier gave guidance on ontology use and applications, and Michel brought forward and extended the Semanticscience Integrated Ontology (SIO). Jin-Dong Kim helped to define ontology relationships for RDFizing DBCLS' PubAnnotation category annotations. The main idea of this project is to have a central service which can be used both as a validator and as an interchange service for different languages.
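To make "creating RDF triples from a TSV file" concrete, here is a minimal sketch in Python. It does not use BioInterchange or reproduce its output: the `http://example.org/` namespace, the function name `tsv_to_ntriples` and the one-predicate-per-column mapping are all assumptions chosen for illustration, and a real converter would map columns to ontology terms such as those in SIO.

```python
import csv
import io
from urllib.parse import quote

# Hypothetical namespace for illustration; not a namespace BioInterchange uses.
BASE = "http://example.org/"


def tsv_to_ntriples(tsv_text, subject_column):
    """Emit one N-Triples line per non-empty cell, keyed by the subject column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    triples = []
    for row in reader:
        # The value in the subject column identifies the record.
        subject = f"<{BASE}record/{quote(row[subject_column], safe='')}>"
        for column, value in row.items():
            # Skip the subject itself, surplus unnamed cells and empty cells.
            if column is None or column == subject_column or not value:
                continue
            predicate = f"<{BASE}vocab/{quote(column, safe='')}>"
            # Escape backslashes and double quotes inside the literal.
            literal = value.replace("\\", "\\\\").replace('"', '\\"')
            triples.append(f'{subject} {predicate} "{literal}" .')
    return triples


if __name__ == "__main__":
    rows = "id\tname\tlength\ng1\tBRCA1\t81189\n"
    print("\n".join(tsv_to_ntriples(rows, "id")))
```

Every value is emitted as a plain string literal here; typed literals (for example integers for lengths) and links between records are left out to keep the sketch short.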
Approach
The project will identify the most commonly used file formats across the languages currently supported under OBF, design a RESTful API, and plan an implementation for all the supported languages. BioInterchange was developed in Ruby, but the aim of the project is a language-agnostic system in which each converter can be implemented in the language best suited to that functionality. The service is expected to receive high traffic, so an appropriate refactoring or reimplementation using parallel techniques, or languages devoted to parallel programming, is a possibility.
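One possible shape for such a service is a registry of per-format converters behind a single route, so that a failed conversion also acts as a validation error. The sketch below is only an illustration under stated assumptions: the `POST /convert/<format>` route, the `handle` function, the `converter` decorator and the `http://example.org/` namespace are hypothetical and not part of BioInterchange, and the request handling is a plain function standing in for a real web framework.

```python
# Hypothetical namespace and route layout, for illustration only.
BASE = "http://example.org/"
CONVERTERS = {}


def converter(format_name):
    """Decorator that registers a converter function under a format name."""
    def register(func):
        CONVERTERS[format_name] = func
        return func
    return register


def literal(value):
    """Quote a string as an N-Triples literal."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


@converter("gff3")
def gff3_to_ntriples(text):
    """Convert GFF3 feature lines to triples; raise ValueError on bad input."""
    out = []
    for n, line in enumerate(text.splitlines(), start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        fields = line.split("\t")
        if len(fields) != 9:
            raise ValueError(f"line {n}: expected 9 tab-separated columns")
        seqid, _source, ftype, start, end = fields[:5]
        subject = f"<{BASE}feature/{n}>"
        for name, value in (("seqid", seqid), ("type", ftype),
                            ("start", start), ("end", end)):
            out.append(f"{subject} <{BASE}vocab/{name}> {literal(value)} .")
    return "\n".join(out)


def handle(method, path, body):
    """Return (status, body) for a request to the hypothetical service."""
    prefix = "/convert/"
    if not path.startswith(prefix):
        return 404, "unknown route"
    if method != "POST":
        return 405, "use POST"
    func = CONVERTERS.get(path[len(prefix):])
    if func is None:
        return 404, "unsupported format"
    try:
        return 200, func(body)
    except ValueError as err:
        # A conversion failure doubles as a validation report.
        return 422, str(err)
```

Because converters are looked up by name, a converter written in another language could be registered as a thin wrapper that calls out to an external process or service, which is one way to keep the interface language-agnostic.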
Difficulty and needed skills
The project is of medium to high difficulty and is aimed at talented students. Previous knowledge of Ruby or another scripting language is preferred, and flexibility in learning other languages is required.
The project requires
Knowledge of advanced programming languages and meta-programming, and some familiarity with parallelization and web service design.
Mentors
Raoul J.P. Bonnal, Francesco Strozzi, Toshiaki Katayama, Joachim Baran