Google Summer of Code 2020 Project ideas


Cross-Project Ideas

OBF is an umbrella organization which represents many different programming languages used in bioinformatics. In addition to working with each of the “Bio*” projects (listed below) we also accept “cross-project” ideas that cover multiple programming languages or projects. These collaborative ideas are broadly defined and can be thought of as “unfinished” — interested students should adapt the ideas to their own strengths and goals, and are responsible for the quality of the final proposed idea in their application.

Feel free to propose your own entirely new idea.


Protein Database Suitability by DeNovo Sequencing

Rationale

Mass spectrometry has become one of the bioanalytical tools of choice for many life scientists. Technological advancements in recent years have led to an increase in acquisition speed and resolution. Common tasks, like identifying and quantifying proteins and metabolites from biological or medical samples, are regularly performed in different fields of research, industry, and clinical diagnostics.

A crucial part of proteomics is controlling the quality of the underlying data. Peptides, for example, are usually identified from an MS/MS spectrum, from which the amino acid sequence can be reconstructed using a so-called ‘search engine’ such as Mascot, MSGF+ or X!Tandem. These tools are driven by a protein database, which is provided as input by the user.

This database (usually in FASTA format) should contain all protein sequences which are suspected to be in the actual sample. Providing too many unrelated protein sequences will lead to more ‘noise’ in the output data, while providing only a subset will cause a similar problem (i.e. false assignments). This is especially problematic for meta-proteomics (e.g. environmental samples such as sea water), where the content is not entirely known, or for rare and unexpected contaminations, such as Mycoplasma in cell cultures.

Background

OpenMS is an open-source C++ library for LC-MS data management and analysis. It offers an infrastructure for the rapid development of mass spectrometry related software. OpenMS is free software under the three-clause BSD license, runs under Windows, macOS and Linux, and is available on GitHub.

Approach/Goals

To solve this problem, an automated system to score the completeness of a given protein database and corresponding sample data is needed. A recent publication (Assessing protein sequence database suitability using de novo sequencing) in MCP suggested a useful method towards solving this problem by complementing the search using a database-free ‘De Novo’ approach.

This allows for the creation of an integrated tool in OpenMS to score the suitability/completeness of the protein database and the spectral quality. The building blocks required for this tool are mostly available in OpenMS.
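As a rough illustration of the kind of score such a tool could report, the sketch below computes the fraction of confident top-scoring identifications that come from the database rather than from de novo sequencing alone. It is written in Python for brevity (the tool itself would be C++ in OpenMS), and the `TopHit` record, its field names, and `database_suitability` are hypothetical names chosen for this example, not OpenMS API or the exact metric of the cited publication.

```python
from dataclasses import dataclass


@dataclass
class TopHit:
    """Best-scoring identification for one spectrum (hypothetical record)."""
    spectrum_id: str
    source: str        # "db": peptide found in the FASTA database; "novo": de novo only
    passes_fdr: bool   # True if the hit survived a target-decoy FDR filter


def database_suitability(hits):
    """Fraction of confident top hits that are explained by the database.

    Returns None when there are no confident hits to judge by.
    """
    confident = [h for h in hits if h.passes_fdr]
    db = sum(1 for h in confident if h.source == "db")
    novo = sum(1 for h in confident if h.source == "novo")
    if db + novo == 0:
        return None
    return db / (db + novo)


hits = [
    TopHit("s1", "db", True),
    TopHit("s2", "db", True),
    TopHit("s3", "novo", True),
    TopHit("s4", "novo", False),  # filtered out: did not pass FDR
    TopHit("s5", "db", True),
]
print(database_suitability(hits))  # 0.75
```

A value close to 1 would suggest that the database explains most confidently identified spectra, while a low value would hint that many good spectra are only explained de novo.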

Languages and skill

  • (At least) intermediate knowledge of C++ (C++11 and above) and its STL
  • Knowledge of how to prototype a solution and benchmark it using sensible metrics
  • Basic knowledge of mass spectrometry
  • Basic knowledge of peptide search engines and target-decoy approaches (can be acquired in the first days).

Difficulty

Medium

Mentors

Chris Bielow, Julianus Pfeuffer

Student Benefits

  • We aim to ensure that each student gets a great learning experience tailored to their ability, interest and experience.
  • Practical experience in one of the most awesome omics fields there is: proteomics, using a software project that is used around the world
  • Gain understanding of how real-world software is developed and how priorities are established
  • Improving your oral and written communication skills in a team environment

How to Apply

  • Provide a cover letter that explains why your skills would be a good fit. If you don’t have the skills, explain why you would like to learn those skills (2 pages maximum)
  • Provide a resume with a list of skills and experience (2 pages maximum)
  • Provide a breakdown of how you’d run this project – i.e. Features A, B delivered in the first two weeks, Features C, D delivered in later weeks. Show your proposal to mentors for feedback as they may be able to suggest improvements!
  • Provide links to any code you might have contributed to, e.g. GitHub or Bitbucket repos/commits

Contact

For questions regarding this project: https://gitter.im/OpenMS/GSOC2020_DeNovoDB or chris.bielow@fu-berlin.de, pfeuffer@informatik.uni-tuebingen.de

For general questions on OpenMS: the general OpenMS discussion list, open-ms-general@lists.sourceforge.net


OpenMS R-package 

Rationale

Mass spectrometry-based proteomics and metabolomics have become the bioanalytical high-throughput method of choice in many research fields. While a wide range of tools and workflows have been developed to identify and quantify analytes from primary data, new biological knowledge is often only extracted by appropriate downstream statistical analysis.

The statistical programming language R, sometimes referred to as the Swiss army knife of data scientists, provides cutting-edge methods to tackle the challenges of proteomics and metabolomics data analysis.

Approach/Goals

OpenMS, an open-source C++ library, exposes a large part of its functionality to the Python community via the automatically generated pyOpenMS package. To make our algorithms available to the R community, we showed in a simple feasibility study that pyOpenMS methods can be called from R using reticulate.

The goal of this project is to turn the early prototype into a full-fledged R package that includes tests, additional documentation, example visualizations, and a beginner tutorial. Ideally, the R package can be automatically built by the OpenMS build system with little human interaction and maintenance if the underlying C++ implementation changes. Additional glue code to ease interaction with existing projects like https://github.com/lgatto/RforProteomics might speed up the adoption process.

Languages and skill

  • Intermediate knowledge of R and Python
  • Basic knowledge of mass spectrometry

Difficulty

Medium

Mentors

Hannes Röst, Timo Sachsenberg

Student Benefits

  • Practical experience in proteomics and metabolomics, using an open-source software project that is used around the world
  • Gain insight into the development process of a medium-sized open-source project
  • Improving your oral and written communication skills in a team environment

How to Apply

  • Provide a cover letter that explains why your skills would be a good fit. If you don’t have the skills, explain why you would like to learn those skills (2 pages maximum)
  • Provide a resume with a list of skills and experience (2 pages maximum)
  • Provide a breakdown of how you’d run this project – i.e. Features A, B delivered in the first two weeks, Features C, D delivered in later weeks. Show your proposal to mentors for feedback as they may be able to suggest improvements!
  • Provide links to any code you might have contributed to, e.g. GitHub or Bitbucket repos/commits

Contact

For questions regarding this project: sachsenb@informatik.uni-tuebingen.de or hannes.rost@utoronto.ca

For general questions on OpenMS: the general OpenMS discussion list, open-ms-general@lists.sourceforge.net


NGLess: Expanding data types

Rationale

NGLess is a domain-specific language designed for next-generation sequencing (NGS) data processing. Currently NGLess recognizes FASTQ, SAM/BAM and tabular formats, allowing the construction of pipelines using these as intermediate files, inputs or outputs. In addition to built-in functionality, NGLess can be expanded by using external modules. These can be easily created by users to interact with code or tools beyond the NGLess core. Some examples of the flexibility of this approach are already visible in the official community-contributed collection of external modules (https://github.com/ngless-toolkit/ngless-contrib). However, because NGLess recognizes only a small set of data types, external modules are currently limited in scope and breadth. Examples of data types to add include GFF/GTF/BED, VCF, BIOM, HDF5 and MPEG-G, all of which are frequently used in bioinformatics and DNA sequencing analysis.

Approach/Goals

Using the same approach already in place for the SAM/BAM and FASTQ formats, we expect the student to do the following:

  1. Get acquainted with NGLess, its documentation, and some of its applications (before GSoC!)
  2. Become familiar with the specified data formats (GFF/GTF/BED, VCF, BIOM,…) and some of their uses.
  3. Implement support for these formats by using existing Haskell libraries or new ones.
  4. Design and extend the NGLess language to support the use of the new types.
  5. Document the new language functions/verbs and types.
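To give a feel for what "supporting a format" involves at the lowest level, here is a minimal sketch that turns one GFF3 line into a typed record. It is written in Python for brevity, whereas the actual work would be done in Haskell inside NGLess; `GffFeature` and `parse_gff3_line` are hypothetical names for this example and say nothing about how NGLess would model the type.

```python
from typing import Dict, NamedTuple, Optional


class GffFeature(NamedTuple):
    """One GFF3 feature line (hypothetical record type for illustration)."""
    seqid: str
    source: str
    type: str
    start: int              # 1-based, inclusive
    end: int                # 1-based, inclusive
    score: Optional[float]  # None when the column is "."
    strand: str
    phase: Optional[int]    # None when the column is "."
    attributes: Dict[str, str]


def parse_gff3_line(line):
    """Parse a single tab-separated GFF3 feature line into a GffFeature."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(fields))
    seqid, source, ftype, start, end, score, strand, phase, attrs = fields
    attributes = {}
    for pair in attrs.split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[key] = value
    return GffFeature(
        seqid, source, ftype, int(start), int(end),
        None if score == "." else float(score),
        strand,
        None if phase == "." else int(phase),
        attributes,
    )


example = "chr1\texample\tgene\t100\t900\t.\t+\t.\tID=gene0;Name=demo"
feature = parse_gff3_line(example)
print(feature.type, feature.start, feature.end, feature.attributes["ID"])
```

A real implementation would also need to handle comment and directive lines, URL-escaped characters, and the differences between GFF3, GTF and BED.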

Languages and skill

Requires knowledge of Haskell. Knowledge of YAML, shell scripting (BASH) and Python is not required but recommended. Some familiarity with next generation sequencing (NGS) and commonly used formats is desirable.

Difficulty

Medium to Hard

Mentors

Luis Pedro Coelho, Renato Alves

Student Benefits

  • We aim to ensure that each student gets a great learning experience tailored to their ability, interest and experience.
  • Practical experience in Haskell in real-world applications
  • Design of bioinformatic solutions and domain specific languages (DSL).
  • Implementing and extending a software project with increasing use around the world
  • Improving your oral and written communication skills in a team environment

How to Apply

  • Provide a cover letter that explains why your skills would be a good fit. If you don’t have the skills, explain why you would like to learn those skills. 2 pages maximum.
  • Provide a resume with a list of skills and experience. 2 pages maximum.
  • Provide a breakdown of how you’d run this project – i.e. Features A, B delivered in the first two weeks, Features C, D delivered in later weeks. Show your proposal to mentors for feedback as they may be able to suggest improvements!
  • Provide links to any code you might have contributed to, e.g. GitHub or Bitbucket repos/commits

Contact

Mailing list at: https://groups.google.com/forum/#!forum/ngless

Gitter: https://gitter.im/ngless-toolkit/community 

Personal email: Luis – luis@luispedro.org , Renato – renato.alves@embl.de 


NGLess: Integration with nixpkgs

Rationale

NGLess is a domain-specific language designed for next-generation sequencing (NGS) data processing. In addition to built-in functionality, NGLess can be expanded by using external modules. These can be easily created by users to interact with code or tools beyond the NGLess core. Some examples of the flexibility of this approach are already visible in the official community-contributed collection of external modules (https://github.com/ngless-toolkit/ngless-contrib). However, at the moment, installation of these is left to each individual user. This is not only a burden on the users; it also breaks reproducibility, which is a goal of NGLess that is strictly adhered to in its built-in functionality.

To address this and make external modules perfectly reproducible, this project would aim to integrate with nixpkgs. This would enable modules to be paired with a nix environment, which can be made completely reproducible at an almost negligible runtime cost. 

Approach/Goals

Conceptually, the desired result is that the user can add a default.nix environment description file (or an equivalent description) next to the existing module and this will be activated prior to executing the module.

  1. Get acquainted with NGLess, its documentation, and some of its applications (before GSoC!).
  2. Get acquainted with nixpkgs and nix-shell.
  3. Convert two modules to this approach as case studies.
  4. Based on this experience, design and implement a general-purpose framework for associating a nix environment with an NGLess module.
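One way to picture the "activate the environment before executing the module" step is a small wrapper that rewrites the module's command line when a `default.nix` sits next to it. The sketch below is an assumption about how this could look, not NGLess behaviour: `nix_wrapped_command` and its `exists` parameter are hypothetical, and only the `nix-shell` options `--pure` and `--run` are taken from the nix tooling itself.

```python
import os
import shlex


def nix_wrapped_command(module_dir, command, exists=os.path.isfile):
    """Return the argv to execute for a module command.

    If the module directory contains a default.nix, the command is run
    inside that environment via nix-shell; otherwise it is returned as is.
    `exists` can be swapped out to test the logic without touching the disk.
    """
    nix_file = os.path.join(module_dir, "default.nix")
    if not exists(nix_file):
        return list(command)
    # --pure hides most of the caller's environment, --run executes one command.
    return ["nix-shell", "--pure", nix_file, "--run", shlex.join(command)]


argv = nix_wrapped_command("mymodule", ["./run.sh", "in put.fq"], exists=lambda p: True)
print(argv)
```

Keeping the wrapper outside the core, as in this sketch, would match the note that the project can be implemented without changing the core NGLess code.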

Languages and skill

Requires knowledge of nixpkgs, YAML, and shell scripting (BASH). Haskell and Python are not required, but can be helpful (this can be implemented without changing the core NGLess code). Some familiarity with next-generation sequencing (NGS) and commonly used tools is desirable.

Difficulty

Easy to Medium

Mentors

Luis Pedro Coelho, Renato Alves

Student Benefits

  • We aim to ensure that each student gets a great learning experience tailored to their ability, interest and experience
  • Practical experience in using nixpkgs in real-world applications
  • Implementing and extending a software project with increasing use around the world
  • Improving your oral and written communication skills in a team environment

How to Apply

  • Provide a cover letter that explains why your skills would be a good fit. If you don’t have the skills, explain why you would like to learn those skills. 2 pages maximum.
  • Provide a resume with a list of skills and experience. 2 pages maximum.
  • Provide a breakdown of how you’d run this project – i.e. Features A, B delivered in the first two weeks, Features C, D delivered in later weeks. Show your proposal to mentors for feedback as they may be able to suggest improvements!
  • Provide links to any bug report you have submitted to any open source project
  • Provide links to any code you might have contributed to, e.g. GitHub or Bitbucket repos/commits

Contact

Mailing list at: https://groups.google.com/forum/#!forum/ngless

Gitter: https://gitter.im/ngless-toolkit/community 

Personal email: Luis – luis@luispedro.org , Renato – renato.alves@embl.de 


NGLess: (JIT) Compilation of NGLess

Rationale

NGLess is a domain-specific language designed for next-generation sequencing (NGS) data processing. Currently, NGLess is implemented by an interpreter. While this approach is fast enough for most uses, there are use cases for which it is too slow. For these, it would be valuable to replace the current interpreter with compilation, either to machine code or to bytecode (which would itself require an interpreter). This can be done either ahead of time or Just-in-Time (JIT).

Approach/Goals

The NGLess interpreter already includes a parser and type checker for NGLess. These should be reused and compilation can proceed from the abstract syntax tree (AST).

  1. Get acquainted with NGLess, its documentation, and some of its applications (before GSoC!)
  2. Choose a compilation target (e.g., LLVM, direct to assembly, or a bytecode interpreter), ideally one with which you have some experience; otherwise get acquainted with it beforehand.
  3. Reuse the existing parser and type checker, and compile from the parsed AST (abstract syntax tree).
  4. Prioritize the areas (e.g., loops) that will provide the biggest benefit.
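The bytecode option can be illustrated with a toy example: walk an already-parsed AST, emit instructions for a stack machine, and run them with a small interpreter. This is a generic sketch in Python over a made-up arithmetic AST (`Num`, `BinOp`), not the NGLess AST or any planned design; the real project would work in Haskell on the existing NGLess types.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Num:
    value: int


@dataclass
class BinOp:
    op: str          # "+" or "*"
    left: "Expr"
    right: "Expr"


Expr = Union[Num, BinOp]

OPCODES = {"+": "ADD", "*": "MUL"}


def compile_expr(node, code=None):
    """Post-order walk of the AST, emitting (opcode, argument) pairs."""
    if code is None:
        code = []
    if isinstance(node, Num):
        code.append(("PUSH", node.value))
    elif isinstance(node, BinOp):
        compile_expr(node.left, code)
        compile_expr(node.right, code)
        code.append((OPCODES[node.op], None))
    else:
        raise TypeError("unknown AST node: %r" % (node,))
    return code


def run(code):
    """Execute the bytecode on a simple operand stack."""
    stack = []
    for op, arg in code:
        if op == "PUSH":
            stack.append(arg)
        else:
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right if op == "ADD" else left * right)
    return stack.pop()


ast = BinOp("+", Num(2), BinOp("*", Num(3), Num(4)))  # 2 + 3 * 4
bytecode = compile_expr(ast)
print(bytecode)
print(run(bytecode))  # 14
```

The same split (a compile pass over the typed AST, followed by a separate execution step) applies whether the target is bytecode, LLVM IR, or machine code.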

Languages and skill

Requires knowledge of Haskell and compiler concepts. Knowledge of a target system such as LLVM is also necessary (although it need not be LLVM, in particular). Familiarity with bioinformatics is, however, not necessary.

Difficulty

Hard: this project requires advanced computer science knowledge

Mentors

Luis Pedro Coelho, Renato Alves

Student Benefits

  • We aim to ensure that each student gets a great learning experience tailored to their ability, interest and experience.
  • Practical experience with Haskell and (JIT) compilation in real-world applications
  • Implementing and extending a software project with increasing use around the world
  • Improving your oral and written communication skills in a team environment

How to Apply

  • Provide a cover letter that explains why your skills would be a good fit. If you don’t have the skills, explain why you would like to learn those skills. 2 pages maximum.
  • Provide a resume with a list of skills and experience. 2 pages maximum.
  • Provide a breakdown of how you’d run this project – i.e. Features A, B delivered in the first two weeks, Features C, D delivered in later weeks. Show your proposal to mentors for feedback as they may be able to suggest improvements!
  • Provide links to any bug report you have submitted to any open source project
  • Provide links to any code you might have contributed to, e.g. GitHub or Bitbucket repos/commits

Contact

Mailing list at: https://groups.google.com/forum/#!forum/ngless

Gitter: https://gitter.im/ngless-toolkit/community 

Personal email: Luis – luis@luispedro.org , Renato – renato.alves@embl.de 


NGLess: Improved reporting of results

Rationale

NGLess is a domain-specific language designed for next-generation sequencing (NGS) data processing. During execution NGLess is able to collect different kinds of statistics about the data. In order to present this data to users in an informative and friendly form, NGLess can currently produce an HTML report with some plots. However, in its current form, this report is very basic and therefore of limited use. To increase its usefulness, the report should be enhanced with interactive plots and alternative visualizations, making the best of the available information.

Languages and skill

Requires knowledge of HTML/JavaScript. Familiarity with plotting concepts or a JavaScript plotting/charting framework (Chart.js, D3.js, Vega-Lite, …) is recommended. Some familiarity with bioinformatics is desirable.

Difficulty

Medium

Mentors

Luis Pedro Coelho, Renato Alves

Student Benefits

  • We aim to ensure that each student gets a great learning experience tailored to their ability, interest and experience.
  • Practical experience in browser/web oriented technologies in real-world applications
  • Design of bioinformatic solutions and user interfaces for data presentation in a scientific context.
  • Implementing and extending a software project with increasing use around the world
  • Improving your oral and written communication skills in a team environment

How to Apply

  • Provide a cover letter that explains why your skills would be a good fit. If you don’t have the skills, explain why you would like to learn those skills. 2 pages maximum.
  • Provide a resume with a list of skills and experience. 2 pages maximum.
  • Provide a breakdown of how you’d run this project – i.e. Features A, B delivered in the first two weeks, Features C, D delivered in later weeks. Show your proposal to mentors for feedback as they may be able to suggest improvements!
  • Provide links to any bug report you have submitted to any open source project
  • Provide links to any code you might have contributed to, e.g. GitHub or Bitbucket repos/commits

Contact

Mailing list at: https://groups.google.com/forum/#!forum/ngless

Gitter: https://gitter.im/ngless-toolkit/community 

Personal email: Luis – luis@luispedro.org , Renato – renato.alves@embl.de 


SIMDe: Add new implementations of ISA extensions

Rationale

SIMDe is a header-only library which implements vendor-specific APIs for SIMD instruction set architecture extensions using both portable code which can be run anywhere and, where available, intrinsics native to the target architecture; for example, implementing Intel’s SSE API with ARM’s NEON API, or NEON on POWER.

We currently have complete portable implementations for MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, AVX, and FMA, but rely heavily on the compiler’s auto-vectorization abilities, which leaves a lot of performance on the table. We would like to add more accelerated implementations for other architectures, such as ARM, POWER, MIPS, etc.

There is currently a large amount of scientific (and other) code which uses Intel’s SIMD APIs. With SIMDe, porting this code to other architectures is almost trivial. Additional accelerated implementations would allow it to run much faster.

Approach/Goals

The portable implementations already exist, and we have tests in place to verify correctness which run on many different platforms. The student would need to consult the documentation for whatever architecture they are targeting and re-implement each function using intrinsics native to that architecture. The functions are generally fairly simple, and the work is straightforward and can be done in isolation; there are just a lot of functions to implement.

Languages and skill

Requires familiarity with the C programming language and a basic understanding of what SIMD is.

Difficulty

Easy

Mentors

Evan Nemerson, Michael R. Crusoe

Student Benefits

  • You’ll gain a pretty deep understanding of what operations can be accelerated using SIMD and what those operations do.
  • Practical experience in writing portable C and dealing with differences between various platforms and compilers.

How to Apply

  • Provide a cover letter that explains why your skills would be a good fit. If you don’t have the skills, explain why you would like to learn those skills. 2 pages maximum.
  • Provide a resume with a list of skills and experience. 2 pages maximum.
  • Pick the ISA extension(s) you’d like to implement and the ISA extensions you’d like to use to implement them.
  • Provide a breakdown of how you’d run this project – i.e. Features A, B delivered in the first two weeks, Features C, D delivered in later weeks. Show your proposal to mentors for feedback as they may be able to suggest improvements!
  • Provide links to any code you might have contributed to, e.g. GitHub or Bitbucket repos/commits
  • If you have any questions, please ask us here – https://github.com/nemequ/simde

SIMDe: Add new ISA extensions

Rationale

SIMDe is a header-only library which implements vendor-specific APIs for SIMD instruction set architecture extensions using both portable code which can be run anywhere and, where available, intrinsics native to the target architecture; for example, implementing Intel’s SSE API with ARM’s NEON API, or NEON on POWER.

We currently have complete portable implementations for MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, AVX, and FMA, and would like to add support for as many Instruction Set Architecture (ISA) extensions as we can, such as ARM NEON, IBM VMX (AltiVec), MIPS MSA, etc.

Supporting additional ISA extensions will make it easier to port code which targets those platforms to other platforms and to develop code for those platforms without need of emulators or continuous access to the platform.

Approach/Goals

This project would require the student to re-implement existing APIs using code which will run in any standard C compiler. You’ll also need to develop test cases to verify your implementation. There is a brief guide on the SIMDe Wiki which provides an overview of the process which the student would need to repeat for all the functions in the instruction set extension of their choice.
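The re-implement-and-test loop can be pictured with a small model of one function. The sketch below is written in Python purely to spell out the lane-wise semantics; it is not SIMDe code, which is C. It models the behaviour of Intel’s `_mm_adds_epi8` (saturating addition of sixteen signed 8-bit lanes); the function name `adds_epi8` and the list-of-integers vector representation are assumptions made for this illustration.

```python
LANES = 16               # a 128-bit vector holds sixteen 8-bit lanes
INT8_MIN, INT8_MAX = -128, 127


def adds_epi8(a, b):
    """Lane-wise signed 8-bit addition that clamps instead of wrapping."""
    if len(a) != LANES or len(b) != LANES:
        raise ValueError("both vectors must have %d lanes" % LANES)
    return [max(INT8_MIN, min(INT8_MAX, x + y)) for x, y in zip(a, b)]


# Test cases of the kind a portable implementation needs:
# ordinary values, positive overflow, and negative overflow.
ordinary = adds_epi8([1] * LANES, [2] * LANES)
overflow = adds_epi8([100] * LANES, [100] * LANES)
underflow = adds_epi8([-100] * LANES, [-100] * LANES)
print(ordinary[0], overflow[0], underflow[0])  # 3 127 -128
```

In the real project the same checks would compare the portable C implementation against known-good results for each function of the chosen instruction set extension.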

Languages and skill

Requires familiarity with the C programming language and a basic understanding of what SIMD is.

Difficulty

Medium

Mentors

Evan Nemerson

Student Benefits

  • You’ll gain a pretty deep understanding of what operations can be accelerated using SIMD and what those operations do.
  • Practical experience in writing portable C and dealing with differences between various platforms and compilers.

How to Apply

  • Provide a cover letter that explains why your skills would be a good fit. If you don’t have the skills, explain why you would like to learn those skills. 2 pages maximum.
  • Provide a resume with a list of skills and experience. 2 pages maximum.
  • Pick the ISA extension(s) you’d like to implement. There are several ideas on the SIMDe issue tracker, but if you’re interested in something else that would be fine, too.
  • Provide a breakdown of how you’d run this project – i.e. Features A, B delivered in the first two weeks, Features C, D delivered in later weeks. Show your proposal to mentors for feedback as they may be able to suggest improvements!
  • Provide links to any code you might have contributed to, e.g. GitHub or Bitbucket repos/commits

If you have any questions, please ask us here – https://github.com/nemequ/simde


pysradb – Enhancing search for next-generation sequencing datasets

Rationale

The NCBI Sequence Read Archive (SRA) is the primary archive of next-generation sequencing datasets. The metadata and raw sequencing data made available to the research community encourages reproducibility and provides avenues for testing novel hypotheses on publicly available data. We have developed pysradb as a versatile tool to retrieve metadata and download sequencing datasets from SRA. It is meant to serve as an analogous command-line version of NCBI’s SRA search and metadata interface which is accessible only through a web browser.

The current implementation of pysradb allows retrieving metadata and downloading data for one or multiple projects via the command line. It, however, lacks a search feature that would allow one to query for datasets based on a user-defined string as is currently possible on the web-based interface. 

Approach/Goals

Pysradb is written in Python and makes use of the requests library for retrieving responses from NCBI’s APIs. It also relies on the pandas library for handling metadata in tabular format. The search extension to pysradb will be implemented as a module inside pysradb that would be exposed to the user through a command-line interface. The search would allow SQL-like syntax, and the filtering criteria will be based on the fields available through NCBI’s API. The student will execute the following:

  1. Get acquainted with pysradb and its documentation (before GSoC!)
  2. Implement a new search module that will make calls to NCBI’s API and fetch the results
  3. Implement SQL-like syntax support for fine-tuning the search results
  4. Implement a module for summary statistics and graphs based on the search results
  5. Implement test cases for existing sub-commands and the new search interface
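The first steps of such a search module could look roughly like the sketch below: build a query URL for NCBI’s E-utilities `esearch` endpoint, then filter the returned records on field values. This is a minimal illustration under stated assumptions, not pysradb’s design: the helper names `build_sra_search_url` and `filter_records` are hypothetical, the record fields are invented sample data, and no network request is made here.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_sra_search_url(term, retmax=20):
    """Build an esearch URL that queries the SRA database for `term`."""
    params = {"db": "sra", "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH_URL + "?" + urlencode(params)


def filter_records(records, **criteria):
    """Keep records whose fields equal every given criterion.

    A stand-in for the WHERE clause of an SQL-like query.
    """
    return [r for r in records
            if all(r.get(key) == value for key, value in criteria.items())]


url = build_sra_search_url("ribosome profiling")
print(url)

# Invented sample metadata, only to exercise the filter.
records = [
    {"run": "RUN_A", "organism": "Homo sapiens", "strategy": "RNA-Seq"},
    {"run": "RUN_B", "organism": "Mus musculus", "strategy": "RNA-Seq"},
    {"run": "RUN_C", "organism": "Homo sapiens", "strategy": "WGS"},
]
print(filter_records(records, organism="Homo sapiens", strategy="RNA-Seq"))
```

In pysradb itself the URL would be fetched with requests and the records held in a pandas DataFrame, so the filter step would map naturally onto DataFrame selection.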

Languages and skill

Requires Python programming and some working knowledge of handling and parsing HTTP responses (XML/JSON). Some familiarity with next-generation sequencing technology and data formats is encouraged but not required.

Difficulty

Medium

Mentors

Saket Choudhary, Amal Thomas

Student Benefits

  • Authorship on a scientific manuscript if the search interface is implemented and documented as version 2 of pysradb’s existing manuscript or an entirely new manuscript
  • Implementing and extending a bioinformatics software project that is a requirement of every bioinformatics researcher in one way or another
  • Gaining practical experience in writing a scientific manuscript
  • Improving your oral and written communication skills in a team environment

How to Apply

  • Go through the list of currently open issues and documentation on pysradb’s Github page 
  • Provide a cover letter that explains why your skills would be a good fit. If you don’t have the skills, explain why you would like to learn those skills. 2 pages maximum.
  • Provide a resume with a list of skills and experience. 2 pages maximum.
  • Provide a breakdown of how you’d run this project – i.e. Features A, B delivered in the first two weeks, Features C, D delivered in later weeks. Show your proposal to mentors for feedback as they may be able to suggest improvements!
  • Provide links to any code you might have contributed to, e.g. GitHub or Bitbucket repos/commits

Contact

Issue tracker: https://github.com/saketkc/pysradb/issues
Personal email: Saket Choudhary – saketkc@gmail.com, Amal Thomas – amalthomas111@gmail.com


The fastest GPU Variation Graph explorer (VG)

Rationale

We are working on the variation graph (aka pangenome). Most of the work has been coded out in vgtools, but there are issues around scaling up for large datasets. In this project we aim to take a computer science approach towards scalability, making use of GPU architecture.

Approach/Goals

Current vgtools are written in C++. We want to create a rapid graph explorer, written in Rust or D instead, that makes use of GPU functionality or another RISC architecture (we’ll decide as we go along). The student will:

  1. Get acquainted with VG storage techniques (before GSoC)
  2. Contribute at least two patches (before GSoC)
  3. Roll out a memory model that can handle (sparse) graphs for rapid traversal
  4. Implement in Rust or D
  5. Create a search algorithm
  6. Optimize the algorithm for GPU architecture
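One common memory layout for sparse graphs is compressed sparse row (CSR): all adjacency lists are packed into one flat array, with a second array of offsets marking where each node’s neighbours begin. The sketch below is a generic Python illustration of that layout with a breadth-first traversal on top; it is an assumption about what a suitable memory model might look like, not a description of how vg stores graphs, and the project itself would be written in Rust or D.

```python
from collections import deque


def build_csr(num_nodes, edges):
    """Pack a directed edge list into CSR arrays (offsets, targets)."""
    counts = [0] * num_nodes
    for src, _ in edges:
        counts[src] += 1
    offsets = [0] * (num_nodes + 1)
    for i in range(num_nodes):
        offsets[i + 1] = offsets[i] + counts[i]
    targets = [0] * len(edges)
    cursor = offsets[:-1].copy()  # next free slot for each source node
    for src, dst in edges:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return offsets, targets


def bfs(offsets, targets, start):
    """Breadth-first traversal; returns nodes in visiting order."""
    seen = [False] * (len(offsets) - 1)
    seen[start] = True
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        # Neighbours of `node` occupy one contiguous slice of `targets`.
        for i in range(offsets[node], offsets[node + 1]):
            nxt = targets[i]
            if not seen[nxt]:
                seen[nxt] = True
                queue.append(nxt)
    return order


edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
offsets, targets = build_csr(5, edges)
print(offsets, targets)
print(bfs(offsets, targets, 0))
```

Flat, contiguous arrays of this kind are a typical starting point for GPU work, because they can be copied to device memory in one piece and indexed without pointer chasing.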

You can read more about VG on Erik’s blog.

Languages and skill

Requires Rust or D programming and an interest in graphs and GPUs.

Difficulty

Hard

Mentors

Pjotr Prins, Erik Garrison, George Githinji

Student Benefits

  • We aim to ensure that each student gets a great learning experience tailored to their ability, interest and experience.
  • Practical experience in Rust or D with GPU
  • Gain understanding of how real-world software is developed and how priorities are established
  • Improving your oral and written communication skills in a team environment

How to Apply

  • Provide a cover letter that explains why your skills would be a good fit. If you don’t have the skills, explain why you would like to learn those skills. 2 pages maximum.
  • Provide a resume with a list of skills and experience. 2 pages maximum.
  • Provide a breakdown of how you’d run this project – i.e. Features A, B delivered in the first two weeks, Features C, D delivered in later weeks. Show your proposal to mentors for feedback as they may be able to suggest improvements!
  • Provide links to any code you might have contributed to, e.g. GitHub or Bitbucket repos/commits
  • If you have any questions, please ask one of the listed mentors

Contact

pjotr.public531@thebird.nl. During GSoC we’ll use IRC and video conferencing.


Generative biolearn project

Rationale

Our project is focused on applying generative models to address the common problems of biomedical data:

  • Dimensionality: most biomedical datasets are high-dimensional but limited in sample size (HDLSS). The number of genes (and gene products) is usually much higher than the number of samples.
  • High level of imbalance. Some diseases are more frequent at a certain age and some tissues are easier to collect (e.g. blood and saliva), which leads to high overrepresentation of some classes in the data and distorts the work of the classifiers.
  • Privacy concerns. Even anonymized data can still be re-identified (https://en.wikipedia.org/wiki/Data_re-identification#Re-identification_efforts), which is why many biomedical data providers publish only a simplified, anonymized subset of the data with rounded values, while putting a huge bureaucratic burden on anybody who wants to access the full dataset. As a result, only a small number of organizations manage to get through the paperwork, while most of the machine learning community has no access to full datasets and cannot apply its expertise to curing diseases and ageing or to other meaningful research.

To illustrate the problem we give GTEX, the largest non-cancer human RNA-Seq dataset (54 tissues, 948 donors, 17,382 samples), as an example.

17K samples may look substantial, but the dataset is still HDLSS, as the number of genes in humans is around 67K (around 20K of them coding) and the number of transcripts is >198K. If you look at the violin plot above it is easy to spot that the distribution is highly skewed towards some age groups. The plots also look like “kris knives” because the GTEX consortium rounded age values to anonymize the data and cut health records, which diminishes the value of the public part of the dataset for the machine learning community.

Approach/Goals

To address the above-mentioned problems we suggest using generative models. Such models will be trained on original data and then be used to generate so-called “synthetic data”: data that is not real but that retains the properties of the original data as much as possible. Such data makes it possible to overcome (at least partially) the privacy restrictions and provides so-called over-sampling capabilities for addressing the imbalanced data problems.

Overall, it is possible to find large general expression and methylation datasets, but for specific research problems (e.g. specific diseases, aging) the datasets are much smaller. For example, the RECOUNT dataset (https://jhubiostatistics.shinyapps.io/recount/) has tens of thousands of RNA-Seq samples, but fewer than a thousand of them have age mentioned in the metadata and can thus be of any use for research on aging and aging-related diseases. For this reason we consider it promising to use a so-called transfer-learning approach: pre-train generative models on large general datasets (e.g. GTEX and TCGA) and use them for data generation with smaller datasets devoted to aging and other less popular problems.

Deciding on the generative models and training data is part of the project; however, we see some potential in models based on variational autoencoders and in large gene expression datasets like GTEX and TCGA.

Our goals with the project are to:

  • Create proof-of-concept generative models for expression data and evaluate how realistic the data is
  • Compare the oversampling performance of our methods with the traditional SMOTE and ADASYN over-sampling methods (i.e. the https://github.com/scikit-learn-contrib/imbalanced-learn implementations) and integrate the code with imbalanced-learn and/or other popular OSS libraries.
  • Develop workflows for training, publishing and validating synthetic data
  • Create a repository for publishing biological synthetic data models and provide tools for verifying them against real data
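
As a point of reference for the SMOTE baseline named in the goals, the core interpolation idea can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions (Euclidean distance, tiny made-up vectors, a hypothetical function name); it is not the imbalanced-learn implementation, and a generative model would replace the interpolation step with sampling from a learned distribution.

```python
import random


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE-style)."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of base inside the minority class, excluding base itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sq_dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy "expression" vectors for an under-represented age group (made-up numbers).
minority = [[1.0, 5.0], [1.2, 4.8], [0.9, 5.3], [1.1, 5.1]]
new_points = smote_like_oversample(minority, n_new=3)
print(new_points)
```

Every synthetic point lies on a line segment between two real samples, which is why such methods cannot produce anything outside the convex hull of the observed data.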

Languages and skill

Terms:

ADASYN is Adaptive Synthetic Sampling Approach for Imbalanced Learning (see https://imbalanced-learn.readthedocs.io/en/stable/over_sampling.html#from-random-over-sampling-to-smote-and-adasyn )

Generative models are models that learn the underlying data distribution, thus making them able to generate data from the same distribution. Variational autoencoders (VAE) and Generative adversarial networks (GAN) are good examples of such models.

Oversampling is a way to fight data imbalance problems by generating new samples in the classes which are under-represented.

GTEX is the Genotype-Tissue Expression (GTEx) project ( https://gtexportal.org/ ). GTEX provides large gene expression datasets of various tissues.

RNA-Seq is a sequencing technique which uses next-generation sequencing (NGS) to reveal the presence and quantity of RNA in a biological sample at a given moment.

SMOTE is Synthetic Minority Oversampling TEchnique, a popular oversampling method (see https://imbalanced-learn.readthedocs.io/en/stable/over_sampling.html#from-random-over-sampling-to-smote-and-adasyn )

TCGA is The Cancer Genome Atlas, a landmark cancer genomics program that molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types. TCGA has large datasets of methylation and gene expression data.

A variational autoencoder (VAE) can be defined as an autoencoder whose training is regularised to avoid overfitting and to ensure that the latent space has good properties that enable a generative process (see https://lilianweng.github.io/lil-log/2018/08/12/from-autoencoder-to-beta-vae.html )

Difficulty

Very Hard

Mentors

Anton Kulaga (antonkulaga@gmail.com), Vlada Tishenko, Dmitry Nowicki

Student Benefits

  • You will participate in cutting-edge research in a team of bioinformaticians
  • We plan to write a research paper if the project succeeds
  • You will get your hands dirty with modern generative models

How to Apply

  • Send an email to antonkulaga@gmail.com (with tischenko.vlada@gmail.com and sensus.sextus@gmail.com in CC) with “GSOC: Generative Biolearn Project” as the subject
  • Provide a resume with a list of education, skills and experience.
  • If there are gaps in your skills, explain (in a few sentences) how you intend to fill those gaps.
  • Provide links to any existing GitHub or similar repositories containing relevant code you have written.

    P.S. As an aging research lab, we will also be grateful for any bioinformatics/ML project ideas (even ones not connected with generative biolearn) that would help solve biological aging.

Generate CWL CommandLineTool descriptor from Rust clap CLI description

Rationale

CWL (Common Workflow Language) is an open standard for describing analysis workflows and tools in a way that makes them portable and scalable across a variety of software and hardware environments. The CommandLineTool is CWL’s standard for describing command line tools in terms of the software they install, the command line parameters they accept and the output they provide. The Rust clap package is commonly used by authors of command line tools in the Rust programming language to specify command line parameters. Much of the information needed to produce a CWL tool descriptor file is present in the clap command line description. Translating this information into CWL will enable faster interfacing between tools written in Rust (with clap) and workflow systems that support CWL.

Approaches / Goals

The clap package provides different approaches to building a command line descriptor object. This object and its attributes could be translated into a CWL command line tool. Since clap does not specify tool output, a template could be provided that combines information gleaned from clap with user specified details to produce a final tool.

The argparse2tool package provides an example of how Python argparse command line descriptions can be turned into CWL tools and might be useful inspiration for this project.
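
To make the translation concrete, here is a minimal sketch of the mapping, written in Python for brevity rather than in Rust. The plain-dict argument description (`name`, `long`, `index`, `takes_value`, `type`, `required`, `help`) and the function name are hypothetical stand-ins for the information a clap command definition carries; the project itself would read these attributes from clap. The output section is passed in by the user, mirroring the template idea above.

```python
import json


def clap_like_to_cwl(base_command, args, outputs):
    """Build a CWL CommandLineTool document (as a dict) from a plain-dict
    argument description. The dict layout is a hypothetical stand-in for
    what a clap command definition provides."""
    inputs = {}
    for arg in args:
        # Flags without a value map to CWL booleans; valued arguments default to string.
        cwl_type = arg.get("type", "string") if arg.get("takes_value") else "boolean"
        if not arg.get("required", False):
            cwl_type += "?"  # a trailing "?" marks an optional input in CWL
        if "long" in arg:
            binding = {"prefix": "--" + arg["long"]}
        else:
            binding = {"position": arg["index"]}  # positional argument
        entry = {"type": cwl_type, "inputBinding": binding}
        if "help" in arg:
            entry["doc"] = arg["help"]
        inputs[arg["name"]] = entry
    return {
        "cwlVersion": "v1.0",
        "class": "CommandLineTool",
        "baseCommand": base_command,
        "inputs": inputs,
        "outputs": outputs,  # user-supplied, since clap does not describe outputs
    }


args = [
    {"name": "input", "index": 1, "takes_value": True, "type": "File",
     "required": True, "help": "Input file"},
    {"name": "threads", "long": "threads", "takes_value": True, "type": "int",
     "help": "Worker threads"},
    {"name": "verbose", "long": "verbose", "help": "Print progress messages"},
]
outputs = {"result": {"type": "stdout"}}
tool = clap_like_to_cwl("mytool", args, outputs)
print(json.dumps(tool, indent=2))
```

Because YAML is a superset of JSON, the printed document can be saved as a CWL file as-is; a real generator would also need to handle subcommands, repeated arguments and default values.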

Languages and Skill

  • Basic to Intermediate knowledge of CWL, Rust and clap

Difficulty

Medium

Mentors

Peter van Heusden, Michael Crusoe

Student Benefits

  • Exposure to CWL and scientific workflow systems
  • Practical experience in writing Rust code

How to Apply

  • Provide a cover letter explaining your skills and, if there are gaps in your skills, how you intend to fill those gaps. 2 pages maximum.
  • Provide a resume with a list of education, skills and experience. 2 pages maximum.
  • Provide a breakdown of how you would approach this project, for example what milestones you would reach.
  • Provide links to any existing GitHub or similar repositories containing relevant code you have written.

Contact

Peter van Heusden pvh@sanbi.ac.za, Michael R. Crusoe mrc@commonwl.org


Fuzz the CWL Reference Runner

Rationale

CWL (Common Workflow Language) is an open standard for describing analysis workflows and tools in a way that makes them portable and scalable across a variety of software and hardware environments. The CWL reference runner (“cwltool”) is popular with both users and developers, but it could be much more robust.

Approaches / Goals

No interaction with the CWL reference runner (cwltool) should run forever, produce a segfault, or quit with just a plain Python exception traceback.
Fuzzing is a technique in computer testing and security where you generate a large number of random inputs and see how a program handles them. For example, if you had a JPEG parser, you might create a set of valid images and broken images, and make sure it either parses them or errors out cleanly. In C (and other memory-unsafe languages) fuzzing can often be used to discover segfaults, invalid reads, and other potential security issues. Fuzzing is also useful in Python, where it can discover uncaught exceptions and other API contract violations.
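
The idea can be sketched with a small mutation-based fuzzer. Everything here is an illustrative assumption: `fragile_loader` is a hypothetical stand-in target (it is not cwltool), the mutation alphabet is arbitrary, and a real project would more likely build on an existing fuzzing framework. The sketch treats `ValueError` as the clean, documented way to reject bad input and records any other exception as a finding.

```python
import json
import random


def mutate(text, rng):
    """Return a copy of text with one random character replaced, inserted or deleted."""
    chars = list(text)
    op = rng.choice(["replace", "insert", "delete"])
    pos = rng.randrange(len(chars)) if chars else 0
    noise = rng.choice('{}[]":,0123456789abc \n')
    if op == "replace" and chars:
        chars[pos] = noise
    elif op == "insert" or not chars:
        chars.insert(pos, noise)
    else:
        del chars[pos]
    return "".join(chars)


def fuzz(target, seeds, expected_errors, rounds=1000, seed=0):
    """Feed mutated seeds to target and collect inputs that raise unexpected exceptions."""
    rng = random.Random(seed)
    failures = []
    for _ in range(rounds):
        data = rng.choice(seeds)
        for _ in range(rng.randint(1, 4)):
            data = mutate(data, rng)
        try:
            target(data)
        except expected_errors:
            pass  # clean rejection of malformed input
        except Exception as exc:  # anything else is a contract violation worth reporting
            failures.append((data, repr(exc)))
    return failures


def fragile_loader(text):
    """Hypothetical target: parses JSON, then indexes the result without validating it."""
    doc = json.loads(text)
    return doc["class"]


seeds = ['{"class": "CommandLineTool", "baseCommand": "echo"}']
failures = fuzz(fragile_loader, seeds, expected_errors=(ValueError,))
print(len(failures), "inputs raised unexpected exceptions")
```

Each recorded input is a reproducible test case: it can be turned into a regression test once the target is changed to reject it with a clear error message.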

Languages and Skill

Python

Difficulty

Easy to Medium

Mentors

Michael Crusoe, Peter Amstutz

Student Benefits

  • Exposure to CWL and scientific workflow systems
  • Practical experience in fuzzing Python code

How to Apply

  • Provide a cover letter explaining your skills and, if there are gaps in your skills, how you intend to fill those gaps. 2 pages maximum.
  • Provide a resume with a list of education, skills and experience. 2 pages maximum.
  • Provide a breakdown of how you would approach this project, for example what milestones you would reach.
  • Provide links to any existing GitHub or similar repositories containing relevant code you have written.

Contact

Fuzzing cwltool · Issue #1170 · common-workflow-language/cwltool


Implementing a tuple-based backing database for InterMine

Rationale

InterMine is an open source integration system that uses PostgreSQL to store complex biological data. The InterMine Neo4j project attempted to replace the current relational database with a Neo4j graph database; it developed the data loaders for an InterMine Neo4j database and the RESTful API from which one can query an InterMine Neo4j database using PathQuery. However, mapping an InterMine data model to a graph is a very non-trivial task, requiring many undesirable hacks.

Approach/Goals

We’d like to try a different project in 2020 by attempting to implement a MongoDB-based datastore instead of Neo4j. A tuple-based datastore like MongoDB seems to be a much better fit for the hierarchical Java-based InterMine data model. 

A fully functional InterMine running from a MongoDB backend rather than from PostgreSQL is the expected outcome.
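
To illustrate why a hierarchical object model maps naturally onto nested documents, the sketch below flattens a small object graph into a single nested dictionary, which is the shape a document store holds. It is written in Python for brevity, although the project itself targets Java; the `Gene`, `Organism` and `Transcript` classes and the identifiers are simplified, hypothetical stand-ins rather than the actual InterMine model, and no database driver is involved.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Organism:
    name: str
    taxon_id: int


@dataclass
class Transcript:
    identifier: str
    length: int


@dataclass
class Gene:
    identifier: str
    symbol: str
    organism: Organism  # reference to another object
    transcripts: List[Transcript] = field(default_factory=list)  # collection


# Made-up identifiers, for illustration only.
gene = Gene(
    identifier="GENE0001",
    symbol="abc1",
    organism=Organism("Drosophila melanogaster", 7227),
    transcripts=[Transcript("TR0001", 1500), Transcript("TR0002", 900)],
)

# References and collections become embedded sub-documents and arrays.
document = asdict(gene)
print(json.dumps(document, indent=2))
```

In a relational schema the same object would be spread over several tables joined by keys; deciding which references to embed and which to keep as links between documents would be one of the main design questions of the project.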

Languages and skill

  • Good understanding of Java
  • Good understanding of RESTful APIs
  • Good understanding of tuple (NoSQL) databases, preferably MongoDB experience
  • GitHub
  • Note: no biology skills needed, this is straight-up database and API programming

Difficulty

Hard

Mentors

Sam Hokin, Andrew Farmer

Student Benefits

  • Practical experience with NoSQL databases and, to a certain extent, SQL databases.
  • First-hand experience with REST APIs
  • Improving your oral and written communication skills in a team environment which consists of remote members.

How to Apply

Contact

Sam Hokin <shokin@ncgr.org>, Andrew Farmer <adf@ncgr.org>
Please email both mentors for all communications.
Also join our InterMine chat on chat.intermine.org and meet others on the GSOC channel.


Generating RDF from InterMine database

Rationale

InterMine is an open source integration system that uses PostgreSQL to store complex biological data. The data model is generated from a core model XML file and any number of additions files which define extra classes and fields. The data stored in InterMine instances can be accessed via a web interface and RESTful APIs, and exported in several formats (JSON, XML, CSV, FASTA).

Approach/Goals

In order to improve data interoperability, we aim to:

  • Provide the data available in an InterMine instance in RDF format. Since InterMine instances are updated by periodic rebuilds, the RDF will be generated at build time;
  • Implement a new web service endpoint which executes a query and returns the result in JSON format, to be used by others in their federated queries.

The expected outcome is a bulk RDF download created at build time from InterMine data.
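
As a rough sketch of what build-time RDF generation involves, the snippet below turns one record into N-Triples lines. It is written in Python for brevity, although the project itself targets Java. The base URI, the URI layout and the function names are assumptions made for illustration; choosing stable URIs and reusing established vocabularies would be part of the actual project.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def escape_literal(value):
    """Escape backslashes, quotes and line breaks for an N-Triples string literal."""
    return (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def row_to_ntriples(base, class_name, identifier, attributes):
    """Serialise one record as N-Triples: a type statement plus one triple per attribute."""
    subject = f"<{base}{class_name.lower()}/{identifier}>"
    lines = [f"{subject} <{RDF_TYPE}> <{base}vocab/{class_name}> ."]
    for name, value in attributes.items():
        lines.append(f'{subject} <{base}vocab/{name}> "{escape_literal(value)}" .')
    return lines


# Hypothetical base URI and made-up record, for illustration only.
triples = row_to_ntriples(
    "http://example.org/", "Gene", "GENE0001", {"symbol": "abc1", "length": 1500}
)
print("\n".join(triples))
```

All values are written as plain string literals here; a fuller version would emit typed literals for numbers and URI references for links between objects.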

Languages and skill

  • RDF
  • Java
  • git
  • no biology skills needed

Difficulty

Medium

Mentors

Daniela Butano, Arunan Sugunakumar, Rahul Yadav

Student Benefits

  • Practical experience in RDF.
  • Improving your oral and written communication skills in a team environment which consists of remote members.

How to Apply

Contact

Daniela Butano <daniela@intermine.org>, Arunan Sugunakumar <arunans.14@cse.mrt.ac.lk>, Rahul Yadav <rahulyadavwk@gmail.com>
Please email all mentors for all communications.
Also join our InterMine chat on chat.intermine.org and meet others on the GSOC channel.


Javascript Data Visualisations

Rationale

Last year, we began to create a suite of JavaScript-based biological data visualisations for BlueGenes, InterMine’s ClojureScript-based user interface. New visualisations need to conform to the BlueGenes Tool API.

Approach/Goals

A suite of 2-5 JavaScript-based tools that can be used in production BlueGenes is the expected outcome. Clojure knowledge is not required.

Languages and skill

  • CSS
  • HTML
  • Javascript
  • DOM manipulation or library of your choice (e.g. React)

Difficulty

Medium

Mentors

Kevin Herald Reierskog, Adrián Rodríguez-Bazaga, Asher Pasha

Student Benefits

  • Practical experience in Javascript, CSS & HTML.
  • Improving your oral and written communication skills in a team environment which consists of remote members.

How to Apply

Contact

Kevin Herald Reierskog <khr29@cam.ac.uk>, Adrián Rodríguez-Bazaga <ar989@cam.ac.uk>
Please email all mentors for all communications.
Also join our InterMine chat on chat.intermine.org and meet others on the GSOC channel.


gtfbase – A curated resource of multispecies genomic regions 

Rationale

Each genome has some common features: exons that make the mRNA, coding domain sequence (CDS), and untranslated regions (UTRs) that are located towards both the 5’ and 3’ ends of the transcripts. ENSEMBL (http://ensembl.org/), Gencode (https://www.gencodegenes.org), and NCBI (https://www.ncbi.nlm.nih.gov/) are some of the key available resources that provide access to these features in the form of General Transfer Format (GTF) files (https://uswest.ensembl.org/info/website/upload/gff.html). While GTF files are comprehensive by themselves, a lot of analysis is focused on individual features. For example, an analysis of transcriptional regulation would focus more on exons, and possibly introns and non-coding RNA, than on UTRs, while an analysis of translational regulation would focus only on the CDS and possibly the UTRs. These analyses often require a BED file (https://uswest.ensembl.org/info/website/upload/bed.html). Though it is trivial to obtain a BED file from a GTF, there are currently no resources that provide ready access to BED files. Moreover, though GTF is supposed to be a standard format, there are differences in the annotation features for different species.

We have a collection of scripts currently available as part of gencode_regions repository: https://github.com/saketkc/gencode_regions  that provides ready access to BED files of 5’UTR/exons/CDS/3’UTRs across multiple species. We plan to generalize these scripts into a usable tool that can be used to generate BED files for a variety of use cases and serve as a readily updated database of BED files that will keep in sync with ENSEMBL’s GTF releases.

Approach/Goals

The current codebase is in Python and makes use of the gffutils library for processing GTFs. The GTFs themselves cannot be assumed to be free of errors, so during processing we need to handle issues such as overlapping regions and to infer missing feature annotations from known features. The student will do the following:

  1. Get acquainted with gencode_regions, GTF, BED file formats (before GSoC!)
  2. Convert the existing scripts to a library with an extensible API that can be exposed to command line
  3. Create a modular pipeline that will use the above library to create BEDs for GTFs of all organisms hosted on ENSEMBL: https://uswest.ensembl.org/info/data/ftp/index.html
  4. The following BED files should be supported at the minimum:
    1. 5’ UTR
    2. CDS
    3. Exons
    4. Introns
    5. Start codons
    6. Stop codons
    7. Non-coding RNA
    8. 3’ UTR
    9. First exons
    10. Last exons
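
The core of the conversion described above can be sketched without any dependencies. This is a minimal illustration, not the gencode_regions code: the function names are hypothetical, the input is assumed to be well-formed nine-column GTF, and overlapping exons are simply skipped rather than merged. It shows the two points that matter most: GTF coordinates are 1-based and inclusive while BED is 0-based and half-open, and features such as introns have to be inferred from the exons of each transcript.

```python
import re


def gtf_to_bed(gtf_lines, feature="exon"):
    """Convert GTF records of one feature type into BED6-style tuples."""
    bed = []
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        chrom, _src, feat, start, end, _score, strand, _frame, attrs = (
            line.rstrip("\n").split("\t")
        )
        if feat != feature:
            continue
        match = re.search(r'transcript_id "([^"]+)"', attrs)
        name = match.group(1) if match else "."
        # GTF is 1-based inclusive; BED is 0-based half-open: shift the start only.
        bed.append((chrom, int(start) - 1, int(end), name, ".", strand))
    return bed


def introns_from_exons(exon_bed):
    """Infer introns as the gaps between consecutive exons of each transcript."""
    by_transcript = {}
    for chrom, start, end, name, _score, strand in exon_bed:
        by_transcript.setdefault((chrom, name, strand), []).append((start, end))
    introns = []
    for (chrom, name, strand), spans in by_transcript.items():
        spans.sort()
        for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
            if next_start > prev_end:  # overlapping or adjacent exons yield no intron
                introns.append((chrom, prev_end, next_start, name, ".", strand))
    return introns


# Toy records with made-up identifiers.
gtf = [
    'chr1\ttoy\texon\t101\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\ttoy\texon\t301\t400\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\ttoy\tCDS\t151\t200\t.\t+\t0\tgene_id "G1"; transcript_id "T1";',
]
exons = gtf_to_bed(gtf, "exon")
introns = introns_from_exons(exons)
for row in exons + introns:
    print("\t".join(str(col) for col in row))
```

First and last exons can be derived from the same per-transcript grouping, taking strand into account: on the minus strand the first exon is the one with the highest coordinates.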

Languages and skill

Requires Python programming and some knowledge of Biology/genomics.

Difficulty

Hard

Mentors

Saket Choudhary, Amal Thomas

Student Benefits

  • Authorship on a scientific manuscript if we decide to go ahead with publication
  • Implementing and extending a bioinformatics software project that is a requirement of every bioinformatics researcher in one way or another
  • The student will gain practical experience in writing a scientific manuscript
  • Improving your oral and written communication skills in a team environment

How to Apply

  • Go through the current codebase of gencode_regions on GitHub
  • Provide a cover letter that explains why your skills would be a good fit. If you don’t have the skills, explain why you would like to learn those skills. 2 pages maximum.
  • Provide a resume with a list of skills and experience. 2 pages maximum.
  • Provide a breakdown of how you’d run this project – i.e. Features A, B delivered in the first two weeks, Features C, D delivered in later weeks. Show your proposal to mentors for feedback as they may be able to suggest improvements!
  • Provide links to any code you might have contributed to, e.g. GitHub or Bitbucket repos/commits
  • If you have any questions, please ask us here – saketkc@gmail.com or amalthomas111@gmail.com

Contact

Mailing list at: https://github.com/saketkc/gencode_regions/issues
Personal email: Saket Choudhary – saketkc@gmail.com, Amal Thomas – amalthomas111@gmail.com