Genomics is in transition. The growth in data, driven by the need for vast sample sizes to achieve statistical significance and by the explosion of clinical sequencing, is far outpacing Moore’s law. Large projects like The Cancer Genome Atlas have generated petabyte-scale datasets that very few groups have the capacity to analyze independently. Looking forward, improvements in computing technology alone will be insufficient to satisfy the community’s exponentially growing needs for computing throughput and storage capacity. The cost of computing on this rapidly growing data is compounded by the expanding complexity of genomic workflows: typically, dozens of programs must be precisely configured and run to reproduce an analysis. Together, the drastic increases in data volume and workflow complexity pose a serious threat to scientific reproducibility.
However, the shift to cloud platforms, the creation of distributed execution systems and the advent of lightweight virtualization technologies, such as containers, offer ways to tackle these challenges. With UC Berkeley’s AMPLab we are pioneering ADAM, a genomics platform built on Apache Spark that can radically improve the efficiency of standard genomic analyses. To support portable, scalable and reproducible workflows we have created Toil, a cross-cloud workflow engine that supports several burgeoning standards for workflow definition. We argue that the flexibility afforded by such a system is not only efficient but transformative: it makes larger, more comprehensive analyses feasible, and it lets other groups quickly reproduce results using precisely the original computations. To enable discovery and sharing of scientific containers we are supporting Dockstore, a project pioneered by OICR. Dockstore is part of the Global Alliance for Genomics and Health effort, which we are co-leading, to develop APIs and standards for genomic containers.