<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Gsoc-2021 on Open Bioinformatics Foundation</title><link>https://www.open-bio.org/tag/gsoc-2021/</link><description>Recent content in Gsoc-2021 on Open Bioinformatics Foundation</description><generator>Hugo</generator><language>en-US</language><managingEditor>board@open-bio.org (Open Bioinformatics Foundation)</managingEditor><webMaster>board@open-bio.org (Open Bioinformatics Foundation)</webMaster><lastBuildDate>Wed, 23 Jun 2021 08:58:43 +0000</lastBuildDate><atom:link href="https://www.open-bio.org/tag/gsoc-2021/feed.xml" rel="self" type="application/rss+xml"/><item><title>Working on a CWL-Toil project with the Open Bioinformatics Foundation</title><link>https://www.open-bio.org/2021/06/23/working-on-a-cwl-toil-project-with-the-open-bioinformatics-foundation/</link><pubDate>Wed, 23 Jun 2021 08:58:43 +0000</pubDate><author>board@open-bio.org (Open Bioinformatics Foundation)</author><guid>https://www.open-bio.org/2021/06/23/working-on-a-cwl-toil-project-with-the-open-bioinformatics-foundation/</guid><description>&lt;p&gt;&lt;em&gt;This is a guest post from Mihai Popescu, a GSoC student with CWL, which participates under the OBF umbrella. Cross-posted on the CWL forums: &lt;a href="https://cwl.discourse.group/t/working-on-a-cwl-toil-project-with-the-open-bioinformatics-foundation/390"&gt;https://cwl.discourse.group/t/working-on-a-cwl-toil-project-with-the-open-bioinformatics-foundation/390&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I am Mihai Popescu, a second-year master’s student at VU Amsterdam in &lt;a href="https://vuweb.vu.nl/en/education/master/parallel-and-distributed-computer-systems"&gt;Parallel and Distributed Computer Systems&lt;/a&gt;. I am happy that my GSoC proposal was accepted and that I have started working on the project. I have attended some CWL meetings since submitting the proposal and have gotten to know a small part of the community. I would like to thank my mentor Michael for introducing me to the CWL community and answering a lot of my questions about workflows.&lt;/p&gt;
&lt;p&gt;The objective of my &lt;a href="https://summerofcode.withgoogle.com/projects/#6469533377757184"&gt;2021 GSoC project&lt;/a&gt; is to implement data streaming for &lt;code&gt;toil-cwl-runner&lt;/code&gt;, the Toil command for running CWL workflows. Streaming should speed up analyses by avoiding slow disk/storage I/O and by letting a tool start executing without first waiting for its input data to finish downloading. The initial focus is on data stored in AWS S3. &lt;a href="https://toil.readthedocs.io/en/latest/"&gt;Toil&lt;/a&gt; is an open-source pure-Python workflow engine. &lt;a href="https://www.commonwl.org/"&gt;Common Workflow Language&lt;/a&gt; (CWL) is an open standard for describing analysis workflows.&lt;/p&gt;
&lt;p&gt;Sarah Wait Zaranek from &lt;a href="https://arvados.org/"&gt;Arvados&lt;/a&gt; helped me get a real-world CWL &lt;a href="https://github.com/arvados/arvados-tutorial/tree/main/WGS-processing"&gt;workflow&lt;/a&gt; that uses streaming. It took a while to get used to the Arvados platform, and I actually ran a much bigger workflow than intended on their “playground” public instance. To keep things simple at the start and to be able to run it locally on my computer, I ended up using a single step from the workflow. I split that step into two individual components (&lt;a href="https://github.com/mhpopescu/toil-gsoc-tests/blob/2561e007167834ca777de8e2f2a7e03fb65aab2f/bwamem.cwl"&gt;first&lt;/a&gt; and &lt;a href="https://github.com/mhpopescu/toil-gsoc-tests/blob/2561e007167834ca777de8e2f2a7e03fb65aab2f/samtools-view.cwl"&gt;second&lt;/a&gt;) so that I could test the streaming feature.&lt;/p&gt;
&lt;p&gt;Toil has two runners: the pure-Python &lt;code&gt;toil&lt;/code&gt; and &lt;code&gt;toil-cwl-runner&lt;/code&gt;. The &lt;code&gt;toil&lt;/code&gt; runner supports file streaming. The proposed solution for enabling file streaming in &lt;code&gt;toil-cwl-runner&lt;/code&gt; is to use named pipes. I tested whether this would work by simulating the behavior. I started with a simple &lt;a href="https://github.com/mhpopescu/toil-gsoc-tests/blob/2561e007167834ca777de8e2f2a7e03fb65aab2f/fifo-cat.py"&gt;example&lt;/a&gt; that runs a &lt;code&gt;cat&lt;/code&gt; command without Toil, with the input file streamed through a named pipe. &lt;a href="https://github.com/mhpopescu/toil-gsoc-tests/blob/2561e007167834ca777de8e2f2a7e03fb65aab2f/fifo-sam.py"&gt;Then&lt;/a&gt; I streamed the input for the samtools step from the split workflow. Finally, I streamed both the input and the output for the same step, running a Python Toil &lt;a href="https://github.com/mhpopescu/toil-gsoc-tests/blob/2561e007167834ca777de8e2f2a7e03fb65aab2f/samtools-view-fifo-out.py"&gt;workflow&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Streaming the output looks similar to this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;def writeOutputToPipe(self, fin, foutStream, fileStore):
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    with open(fin, &amp;#39;rb&amp;#39;) as fi:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        while True:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;            # Read fixed-size chunks so data is forwarded as it arrives
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;            data = fi.read(65536)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;            if not data:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;                break
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;            foutStream.write(data)
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Before the job runs, another thread is started to run this function. Streaming the input works similarly.&lt;/p&gt;
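&lt;p&gt;To make the pattern concrete, here is a minimal, self-contained sketch of streaming an input file to a consumer through a named pipe, with the copy running in a separate thread. It is only an illustration and not the Toil implementation: the names &lt;code&gt;copy_to_pipe&lt;/code&gt;, &lt;code&gt;stream_through_fifo&lt;/code&gt; and &lt;code&gt;count_bytes&lt;/code&gt; are hypothetical, the consumer is a plain Python function standing in for a tool such as samtools, and it assumes a POSIX system where &lt;code&gt;os.mkfifo&lt;/code&gt; is available.&lt;/p&gt;

```python
import os
import tempfile
import threading


def copy_to_pipe(src_path, pipe_path, chunk_size=65536):
    """Copy src_path into the named pipe one chunk at a time."""
    # Opening a FIFO for writing blocks until a reader opens the other end.
    with open(src_path, 'rb') as src, open(pipe_path, 'wb') as pipe:
        while True:
            data = src.read(chunk_size)
            if not data:
                break
            pipe.write(data)


def stream_through_fifo(src_path, consume):
    """Feed src_path to consume(pipe_path) through a temporary named pipe."""
    workdir = tempfile.mkdtemp()
    pipe_path = os.path.join(workdir, 'input.fifo')
    os.mkfifo(pipe_path)  # POSIX only
    # Daemon thread: it must not keep the process alive if the consumer
    # fails before ever opening the pipe.
    writer = threading.Thread(
        target=copy_to_pipe, args=(src_path, pipe_path), daemon=True)
    writer.start()
    try:
        result = consume(pipe_path)
        writer.join()
        return result
    finally:
        os.remove(pipe_path)
        os.rmdir(workdir)


def count_bytes(path):
    """Hypothetical consumer: read the path like a file and count bytes."""
    total = 0
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(65536), b''):
            total += len(chunk)
    return total
```

&lt;p&gt;Calling &lt;code&gt;stream_through_fifo(some_file, count_bytes)&lt;/code&gt; returns the size of the file, but the consumer only ever sees the pipe path, which is the property that would let a tool start reading before the whole input is available locally.&lt;/p&gt;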
&lt;p&gt;Now that I have verified that named pipes can be used to stream files, I will look at how to implement this in the source of &lt;code&gt;toil-cwl-runner&lt;/code&gt; itself, &lt;code&gt;cwltoil.py&lt;/code&gt;. The next steps are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Investigate the block of code that parses the CWL file&lt;/li&gt;
&lt;li&gt;Check whether the &lt;code&gt;streamable&lt;/code&gt; option is set and, if so, possibly add a new flag to the internal &lt;code&gt;toil&lt;/code&gt; file structures&lt;/li&gt;
&lt;li&gt;Investigate the block of code that downloads the file&lt;/li&gt;
&lt;li&gt;Add the streaming functionality when the &lt;code&gt;streamable&lt;/code&gt; option is set&lt;/li&gt;
&lt;/ul&gt;</description></item></channel></rss>