Dissertation Defense: Classifying Cancer Genomic Alterations Using Machine Learning and Multi-Omic Data

David Haan, PhD Candidate, Systems Biology, Stuart Lab

Abstract

In 2018, an estimated 1,762,450 new cases of cancer will be diagnosed in the United States and 606,880 people will die from these diseases. Cancer is a group of diseases characterized by the overgrowth of abnormal cells as the result of genomic mutations. Mutations that initiate tumorigenesis are called driver mutations, whereas those that do not are called passenger mutations. Driver mutations define a tumor’s subtype and can be used as therapeutic targets; thus, distinguishing driver mutations from passenger mutations is of utmost importance as we strive to improve cancer treatment. As the cost of genome sequencing decreases, the amount of available tumor data increases, making it possible to conduct large-scale computational analyses with machine learning to identify novel tumor characteristics. There have been numerous recent collaborations to collect, sequence, and analyze human tumors. The largest of these collaborations, The Cancer Genome Atlas (TCGA), is a comprehensive analysis of 9,000 patients and 33 subtypes that catalogs mutation data, DNA, mRNA, methylation, and protein expression. Whereas the TCGA data come mostly from whole-exome sequencing, the International Cancer Genome Consortium (ICGC) has begun contributing data from whole-genome sequencing of a few thousand tumors. Using both the TCGA and ICGC data, I performed four new variant classification analyses using both unsupervised machine learning techniques and a novel supervised machine learning technique to identify tumor subtypes, driver mutations, and potential therapeutic targets. First, I present an analysis in which I used supervised machine learning to determine the genomic features most important for accurate gene fusion detection among a set of fusion detection methods. Second, I present an unsupervised machine learning method in which I classify non-coding variants of splicing factors as potential driver mutations in a number of tumor types. Third, I use unsupervised machine learning to analyze telomere data from ICGC whole-genome sequencing and identify four subtypes of telomere maintenance mechanisms (TMM) among 2,500 tumor samples. Lastly, I present a new variant classification method called LURE, which uses supervised machine learning to classify variants based on existing signatures from known driver mutations.

Advisor

Josh Stuart

To accommodate a disability, please contact Ben Coffey at the UC Santa Cruz Genomics Institute (becoffey@ucsc.edu, 831-459-1477).