Google Summer of Code 2022 Project Ideas

Shortcut to project ideas:

Configurable feature visualization to improve the user experience and performance of the feature viewer
Space- and time-efficient data format for mass spectrometry data (OpenMS)
Efficient data layout for mass spectrometry data (OpenMS)
GPU support for Toil-CWL-Runner (UCSC/CWL Project)
Improving automated wrapping of C++ code in Python (OpenMS/autowrap)
Migration of Journal Policy Tracker backend to Express + GraphQL
Journal tracker - finalise and deploy React front-end
Genestorian: data refinement
Citation and databasing functionality in luox

Cross-Project Ideas

OBF is an umbrella organization which represents many different programming languages used in bioinformatics. In addition to working with each of the “Bio*” projects (listed below) we also accept “cross-project” ideas that cover multiple programming languages or projects. These collaborative ideas are broadly defined and can be thought of as “unfinished” — interested contributors should adapt the ideas to their own strengths and goals, and are responsible for the quality of the final proposed idea in their application.

Feel free to propose your own entirely new idea.

Configurable feature visualization to improve the user experience and performance of the feature viewer

Project description

Analyzing positional features/annotations in sequences is important in bioinformatics. Visualizing such data is quite a challenging task, considering the large amount of data to be displayed. The feature viewer is an open-source JavaScript library developed to visualize biological data (referred to as features) mapped to a linear sequence (Paladin et al., 2020). For instance, it can be configured to visualize the location of protein domains or amino acid variations in a protein sequence. The feature viewer is being used in several popular bioinformatics resources such as neXtProt and COSMIC 3D.

Currently, the feature viewer offers only limited configuration options for the displayed features, such as the color, shape and on-click behavior. This is too restrictive for some of the possible use cases of the feature viewer, where more flexibility is required in the display of features. One such case is when different types of amino acid variants should each be displayed in their own color within the same feature track.

The overall goal of this project is to improve the configurability of the feature viewer, such that it allows greater flexibility in the visualization of detailed biological data. Specific aims of the project are:

Simplify the configuration of new tracks
The current version of the feature viewer requires new tracks to be hard-coded. Implementing a solution allowing new tracks to be added or existing tracks to be modified or deleted would ease the use of the viewer.
Add graphical representations of numerical features
Protein sequences can have features/annotations which are numerical, for instance the frequency with which an amino acid variant at a specific position is observed in a population. Such numerical data can be visualized as graphs of different types, such as line graphs or histograms.
Configure the visualization of certain features in a class of feature
Currently all the features displayed on a single track have the same color and/or shape; interesting features cannot be highlighted using a different color or shape.
Improve the speed of display by first displaying a summary or message
Currently all the data displayed has to be provided before display, which results in slow loading and rendering when there are tens of thousands of features. To improve the user experience, the viewer could initially show a message summarizing the data and only fetch and display the data when the user requests it.
Enable the user to download a snapshot of the current display
The user community has requested a download button which generates a snapshot of the feature viewer. This feature will allow users to include an image of the data displayed in the feature viewer in a publication or elsewhere.

Project size

350h

Languages and skills needed

JavaScript
HTML/CSS
Git/GitHub
Optionally: Java

Difficulty

Medium to Hard

Estimated Length

20 weeks

Mentors

Kasun Samarasinghe (kasun.wijesiriwardana@unige.ch)
Lydie Lane (lydie.lane@sib.swiss)

Contributor benefits

Gain experience in developing configurable libraries
Gain experience in handling large amounts of data in web applications
Gain experience in web application user experience
Gain experience with biological data such as gene and protein sequences

How to apply

For information on how to apply, please contact the mentors at the email addresses provided above.

Space- and time-efficient data format for mass spectrometry data (OpenMS)

Project description

OpenMS is a framework for computational mass spectrometry. Modern mass spectrometers produce large files (e.g., 100 GB) that can’t be easily stored or accessed in the established XML file format mzML. Recently, an update to mzML called mzMLb has been developed that uses HDF5 to store Blosc-compressed spectra in binary format. In this project, the GSoC contributor will add a reader and writer for the mzMLb file format to OpenMS. To some extent, code from the existing OpenMS reader and writer for mzML can be reused, and reference implementations by other parties can serve as inspiration.

Project size

350h

Languages and skills needed

C++
Git/GitHub
Optionally: CMake

Difficulty

Easy for experienced C++ programmers with good CMake and GitHub skills. Medium for experienced C++ programmers without CMake or GitHub skills.

Estimated Length

20 weeks

Mentors

Timo Sachsenberg, GitHub: https://github.com/timosachsenberg
Julianus Pfeuffer, GitHub: https://github.com/jpfeuffer
Samuel Wein, GitHub: https://github.com/poshul

Contributor Benefits

Gain experience working in a friendly team of developers.
Get a glimpse into the exciting field of bioinformatics / computational mass spectrometry.
Obtain detailed knowledge on HDF5, which is widely used to store large data efficiently.

How to Apply

Please introduce yourself in our Gitter channel: https://gitter.im/OpenMS/OpenMS and get in contact with us prior to writing a proposal.

Efficient data layout for mass spectrometry data (OpenMS)

Project description

OpenMS is a framework for computational mass spectrometry. It features a wide range of algorithms and data structures to process and analyze mass spectra. For some very computationally demanding parts, we performed manual code conversion to make the layout of our data better fit the data access patterns of our algorithms. We observed the biggest speedup when switching the data layout from an Array of Structs (AoS) to a Structure of Arrays (SoA). In this project, the GSoC contributor will adapt our core data structure for mass spectra to SoA. Ideally, the contributor should use a modern C++ zero-cost abstraction (e.g., building on https://github.com/crosetto/SoAvsAoS) that makes the old code work with no (or only minimal) manual changes.
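The difference between the two layouts, and the idea of keeping a per-peak interface on top of per-field storage, can be illustrated with a small sketch. It is written in Python purely for readability and says nothing about performance; the project itself targets C++. All class and function names here (Peak, PeakView, SpectrumSoA, total_ion_current, base_peak_mz) are hypothetical and are not OpenMS types.

```python
from array import array
from dataclasses import dataclass


@dataclass
class Peak:
    """Array-of-Structs element: one object per peak (hypothetical type)."""
    mz: float
    intensity: float


class PeakView:
    """Proxy that exposes one SoA entry through the per-peak interface."""
    __slots__ = ("_spectrum", "_index")

    def __init__(self, spectrum, index):
        self._spectrum = spectrum
        self._index = index

    @property
    def mz(self):
        return self._spectrum.mz[self._index]

    @property
    def intensity(self):
        return self._spectrum.intensity[self._index]

    @intensity.setter
    def intensity(self, value):
        # Writes go straight through to the underlying field array.
        self._spectrum.intensity[self._index] = value


class SpectrumSoA:
    """Structure-of-Arrays spectrum: one contiguous array per field."""

    def __init__(self, peaks=()):
        self.mz = array("d")
        self.intensity = array("d")
        for peak in peaks:
            self.append(peak.mz, peak.intensity)

    def append(self, mz, intensity):
        self.mz.append(mz)
        self.intensity.append(intensity)

    def __len__(self):
        return len(self.mz)

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        return PeakView(self, index)

    def __iter__(self):
        return (PeakView(self, i) for i in range(len(self)))


def total_ion_current(spectrum):
    """Field-wise pass: reads only the intensity array."""
    return sum(spectrum.intensity)


def base_peak_mz(spectrum):
    """Legacy-style code written against the per-peak interface."""
    return max(spectrum, key=lambda peak: peak.intensity).mz
```

In this sketch, base_peak_mz stands in for existing code that iterates over peak objects and keeps working unchanged, while total_ion_current shows the kind of field-wise access that benefits from the SoA layout.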

Project size

175h

Languages and skills needed

C++
Git/GitHub

Difficulty

Medium: requires a good understanding of modern C++.

Estimated Length

12 weeks

Mentors

Timo Sachsenberg, GitHub: timosachsenberg
Hannes Roest, GitHub: hroest
Julianus Pfeuffer, GitHub: jpfeuffer
Aditya R Rudra, GitHub: adityaofficial10

Contributor Benefits

Gain experience working in a friendly team of developers.
Get a glimpse into the exciting field of bioinformatics / computational mass spectrometry.
Learn and apply modern C++ features to a real-life project.

How to Apply

Please introduce yourself in our Gitter channel: https://gitter.im/OpenMS/OpenMS and get in contact with us prior to writing a proposal.

GPU support for Toil-CWL-Runner (UCSC/CWL Project)

Project description

Teach the TOIL workflow system how to support GPUs with CWL. Optionally this can be expanded to HPC clusters. This will enable researchers to run specialized workflows that need occasional GPU support efficiently on university computing clusters.

Project size

Either 175 (basic support) or 350 hours (advanced job routing)

Languages and skills needed

Python
Optional: HPC, Clustering

Difficulty

Medium, or easier if you have experience with HPC or clustering.

Mentors

Michael Crusoe, https://github.com/mr-c
Lon Blauvelt, https://github.com/DailyDreaming

Contributor Benefits

Exposure to workflow engines and distributed computing

How to Apply

To get started, visit https://github.com/DataBiosphere/toil, review the readme and the contributing guidelines: https://toil.readthedocs.io/en/master/contributing/contributing.html#contributing

Improving automated wrapping of C++ code in Python (OpenMS/autowrap)

Project description

Autowrap is a Python package for the automated wrapping of whole C++ projects into Python via Cython. C++ developers basically need to provide a Cython header file for each C++ header file to specify what needs to be wrapped and how. Autowrap then analyses the syntax tree generated by the Cython parser for those “header” files and generates Cython source code for it. Cython then creates the necessary source code to be compiled with e.g. CPython to create a Python extension module to be imported by the end-user. While the wrappers created by autowrap are rather simple, passing templated and nested STL objects like vectors, maps, or tuples between Cython/Python and C++ with autogenerated code can become rather complex. Autowrap offers recursion for nested vectors but cannot handle mixed data structures yet. It also lacks support for newer STL containers like tuples and only offers simple vector to (Python) list conversions, while numpy arrays, e.g. via the buffer protocol, would sometimes be more suitable. We are seeking a motivated GSoC contributor proficient in at least Python to tackle those improvements.
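To give a feel for why mixed, nested containers call for recursion, here is a toy sketch that translates a Cython-style template type string into a Python type hint string. It is not autowrap code and does not reflect autowrap's actual converter design; the mapping tables and the function names split_top_level and python_type are assumptions made for illustration only.

```python
# Hypothetical mapping tables; a real wrapper generator would need many more entries.
CONTAINERS = {"vector": "list", "map": "dict", "tuple": "tuple", "pair": "tuple"}
SCALARS = {"int": "int", "double": "float", "float": "float",
           "bool": "bool", "string": "bytes"}


def split_top_level(args):
    """Split 'a, b[c, d], e' on commas that are not inside brackets."""
    parts, depth, start = [], 0, 0
    for pos, char in enumerate(args):
        if char == "[":
            depth += 1
        elif char == "]":
            depth -= 1
        elif char == "," and depth == 0:
            parts.append(args[start:pos].strip())
            start = pos + 1
    parts.append(args[start:].strip())
    return parts


def python_type(cpp_type):
    """Recursively translate e.g. 'vector[map[int, double]]' into a type hint string."""
    cpp_type = cpp_type.strip()
    if "[" not in cpp_type:
        # A KeyError here signals a scalar type this sketch does not know.
        return SCALARS[cpp_type]
    name, _, rest = cpp_type.partition("[")
    if not rest.endswith("]"):
        raise ValueError(f"unbalanced brackets in {cpp_type!r}")
    # Each template argument may itself be a container, hence the recursion.
    inner = [python_type(arg) for arg in split_top_level(rest[:-1])]
    return f"{CONTAINERS[name.strip()]}[{', '.join(inner)}]"
```

Because every template argument is handled by the same function, a vector of maps of vectors needs no special case. Generating the actual conversion code for such types is the harder part that the project would address.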

Project size

350h

Languages and skills needed

Python (advanced knowledge, for code generation)
Cython (basic knowledge, potentially possible to be acquired, syntax similar to Python; to be generated)
C++ (basic knowledge, potentially possible to be acquired; to be wrapped)
Git/GitHub

Difficulty

Medium: requires a deep understanding of Python and in the beginning at least some basic knowledge about its differences to C++ (regarding memory management and typing)

Estimated Length

20 weeks

Mentors

Julianus Pfeuffer, GitHub: jpfeuffer
Axel Walter, GitHub: axelwalter
Timo Sachsenberg, GitHub: timosachsenberg

Contributor Benefits

Gain experience working in a friendly team of developers.
Get a deep understanding of two very commonly used programming languages and how to interface between them.
It is also possible to learn how to create your own Python package with C++ extensions.

How to Apply

Please introduce yourself in our Gitter channel: https://gitter.im/OpenMS/OpenMS and get in contact with us prior to writing a proposal.

Migration of Journal Policy Tracker backend to Express + GraphQL

Project description

The Journal Policy Tracker is your go-to place where you can find all the open-source scientific journals and their policies. Currently the backend of this project is built on Flask and SQLite3, with SQLAlchemy as the ORM. This project aims to migrate the backend from Flask and an SQL database to Express and GraphQL (using express-graphql) with a NoSQL database such as MongoDB.

At the end of the program, the mentee is expected to do a successful migration of the existing server to an Express & GraphQL based backend.

Repository: codeisscience/journal-policy-tracker-backend
Existing API documentation: journal-policy-tracker.herokuapp.com/swagger

Project size

175h

Languages and skills needed

Required skills:

JavaScript
Express.js
GraphQL
MongoDB
Documentation (Markdown knowledge preferred)

Useful skills:

Deploying Node.js apps to Heroku
Familiarity with testing frameworks like Jest, Chai, Mocha in order to write unit tests, integration tests and e2e tests.

Difficulty

Easy if you have experience with GraphQL
Medium if you are familiar with Express.js

Estimated Length

Flexible, depending on contributor needs

Mentors

Pritish Samal, pritish.samal918@gmail.com
Yo Yehudi, yochannah@gmail.com

Contributor Benefits

Experience developing in backend js frameworks and GraphQL.

How to Apply

Please always email all mentors in the same mail if you would like to ask questions or discuss the project.
You can also join the Code is Science Slack workspace.

Journal tracker - finalise and deploy React front-end

Project description

The Journal Policy Tracker is your go-to place where you can find all the open-source journals and their policies. Currently the frontend of this project is built on React and React Bootstrap. This project aims to finalise the frontend after GSoC 2021, add state management and a user dashboard, and decouple the frontend from CSS frameworks for layout and presentation by using CSS Grid and Flexbox instead.

Tasks:

Study the existing user-interface for the journal policy tracker
Add functionality to the existing website while developing the components
Migrate the existing frontend CSS libraries to vanilla CSS
Work on the user-management dashboard
Use context/Redux for state management

Repository: codeisscience/journal-policy-tracker-frontend
Frontend Preview: journal-policy-tracker.netlify.app

Project size

175h

Languages and skills needed

Required skills:

JavaScript
React.js
CSS3
HTML5
Grid
Flex box
Documentation (Markdown knowledge preferred)

Useful skills:

Familiarity with HTML5 and CSS3 semantics
Familiarity with UI/UX
Familiarity with writing tests

Difficulty

Easy if you have experience with Grid, Flex box, and CSS page layout
Medium if you are familiar with CSS page layout

Estimated Length

Flexible, depending on contributor needs

Mentors

Isaac Miti, ikayztm@gmail.com
Yo Yehudi, yochannah@gmail.com

Contributor Benefits

Experience designing and developing front-end interfaces.

How to Apply

Please visit the repo: codeisscience/journal-policy-tracker-frontend and make at least one contribution, and email the mentors to discuss your project proposal.
You can also join the Code is Science Slack workspace.

Genestorian: data refinement

Project description

Genestorian is a web application to manage a collection of model organism strains and recombinant DNA in a life sciences laboratory.

New DNA sequences (inside or outside cells) are always generated by combining existing sequences. Genestorian leverages existing semantic web tools for synthetic biology and libraries for DNA visualisation to provide an intuitive interface where researchers can plan, document and revisit their experiments. Here you can find a short summary of the problem we are addressing, adapted for non-biologists.

An important challenge for the project is to migrate data from spreadsheets, where most labs keep their collections, to the database. In this project, the contributor will develop a first version of a tool to perform the data refinement required to migrate from spreadsheets to the database.
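One ingredient of such data refinement is approximate string matching, which maps inconsistently typed spreadsheet values onto a controlled vocabulary. The following is a minimal sketch using Python's standard difflib module. The function names normalise and refine, the 0.8 cutoff and the example values in the usage note are assumptions for illustration and are not part of Genestorian.

```python
import difflib


def normalise(name):
    """Lower-case and drop separators so trivial variants compare equal."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def refine(raw_values, vocabulary, cutoff=0.8):
    """Map each raw spreadsheet value to a known term, or None for manual review."""
    lookup = {normalise(term): term for term in vocabulary}
    result = {}
    for raw in raw_values:
        # get_close_matches returns at most n candidates scoring >= cutoff.
        matches = difflib.get_close_matches(
            normalise(raw), list(lookup), n=1, cutoff=cutoff
        )
        result[raw] = lookup[matches[0]] if matches else None
    return result
```

For example, with the vocabulary ["kanMX6", "natMX6", "hphMX6"], the raw values "KanMX-6" and "kanMX" would both be mapped to "kanMX6", while "unknown marker" would be returned as None. Near matches can still be biologically distinct entries, so a real tool would likely ask the researcher to confirm anything that is not an exact match.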

Project size

350h

Required skills

Good knowledge of text processing in a programming language (preferably Python).
Willingness to learn the biology concepts that underlie the data models.

Useful skills

Experience with data refinement and approximate string matching.
Willingness to interact with the experimental researchers whose data will be refined.

Estimated Length

Flexible, depending on contributor needs

Mentors

Manuel Lera Ramirez, manulera14@gmail.com
Yo Yehudi, yo@openlifesci.org

Expected outcome

Development of a first version of a tool to perform the data refinement required to migrate from spreadsheets to the database. The task could focus only on the program for refinement, but also developing a web interface for migration is a possibility. In addition to mentorship, we will organise two half-day sessions with a professional Research Software Engineer for helping and advising the contributor, and for code review.

Difficulty level

Medium if you have experience with string matching, easier if you know a bit of biology.

How to Apply

Please always email all mentors in the same mail if you would like to ask questions or discuss the project.

Citation and databasing functionality in luox

luox is a free, open-access and open-source platform for documenting and reporting light-related quantities derived from a light spectrum. It is written in JavaScript and React and runs directly in the browser. It is targeted at biomedical researchers looking for a convenient way to make their research with light(ing) interventions reproducible. Researchers can request a DOI (digital object identifier) for an uploaded spectrum, which is stored in a compressed/hashed form in the URL. The goal of this project is to add simple database functionality to luox so that the web interface displays the DOI for any spectrum that has already been assigned one.

Platform: https://luox.app/
Repository: https://github.com/luox-app/luox
Article describing the platform: https://doi.org/10.12688/wellcomeopenres.16595.2

Project size

350h

Required skills:

Good knowledge of programming in JavaScript
Good knowledge of web development
Version control with Git

Useful skills:

Knowledge of DOIs (digital object identifiers)

Estimated Length

Flexible, depending on contributor needs

Mentors

Manuel Spitschan, manuel.spitschan@tuebingen.mpg.de
Yo Yehudi, yo@openlifesci.org

Expected outcome

Functionality to look up DOI from the compressed/hashed spectrum through a table
Display of the associated DOI in the web interface
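The lookup described in the two outcomes above can be pictured as a small key-value table. The sketch below is in Python for brevity, although luox itself is written in JavaScript. It assumes the spectrum is available as the encoded string from the URL and uses a SHA-256 digest as the table key; the names spectrum_key and DoiTable are hypothetical, and how luox actually encodes spectra is not shown here.

```python
import hashlib


def spectrum_key(encoded_spectrum):
    """Derive a fixed-length table key from the encoded spectrum string."""
    return hashlib.sha256(encoded_spectrum.encode("utf-8")).hexdigest()


class DoiTable:
    """In-memory stand-in for the database table mapping spectra to DOIs."""

    def __init__(self):
        self._by_key = {}

    def register(self, encoded_spectrum, doi):
        """Record the DOI assigned to a spectrum; reject conflicting entries."""
        key = spectrum_key(encoded_spectrum)
        existing = self._by_key.setdefault(key, doi)
        if existing != doi:
            raise ValueError(f"spectrum already registered with DOI {existing}")

    def lookup(self, encoded_spectrum):
        """Return the DOI for this spectrum, or None if none was assigned."""
        return self._by_key.get(spectrum_key(encoded_spectrum))
```

The web interface would call something like lookup with the encoded spectrum from the current URL and show the DOI only when the result is not None.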

Difficulty

Medium

How to Apply

Please always email all mentors in the same mail if you would like to ask questions or discuss the project.
