Large-Scale Solutions For Genomic Analysis

John Vivian, PhD Student, Biomolecular Engineering & Bioinformatics

Wednesday, July 19, 2017 – 11:00am

Engineering 2, Room 599

Host – Professor Benedict Paten


Rapidly advancing genomic technologies have reduced sequencing costs, which has led to a huge increase in publicly available genomic data. Methods are needed to manage and process this wealth of data and to scale downstream analyses so that they use it efficiently. My research addresses this need. First, I’ll discuss Toil, a distributed workflow platform I helped develop that capitalizes on existing cloud infrastructure to run computational pipelines efficiently at massive scale. I’ll also discuss an application of Toil: a robust, portable, open-source, and reproducible RNA-seq workflow I developed, which was used to analyze 20,000 patient samples from four major studies. These results were then made available to the public through the UC Santa Cruz Genome Browser and the UC Santa Cruz Xena platform. Next, I’ll discuss strategies for scaling differential gene expression analysis to handle thousands of samples, and how this analysis can be applied to cancer samples from TCGA for which corresponding normal tissue samples are few or absent. Finally, I’ll discuss how convolutional and recurrent neural networks can be used to tackle the problem of nanopore sequence alignment, and how Toil may serve as an ideal platform to support distributed training of these neural networks.