Data Sharing: Fast Forward Consent

Chair: Nazneen Rahman (The Institute of Cancer Research, The Royal Marsden NHS Foundation Trust)
For summaries of the rest of the 4th Plenary Meeting (October 18th), see the Global Alliance for Genomics & Health.

David Haussler (University of California, Santa Cruz) reviewed the technical process and progress of GA4GH toward data sharing in translational medicine. The evolving vision centers on fostering agreement on how genetic variants and other data elements are represented, so that an application programming interface (API) can define how the data are exchanged, transferred, and manipulated. The API can be used with public data sets, and it can also provide appropriate levels of authorized access to the private data sets hosted by medical research institutions, patient advocacy registries, hospitals, health IT companies, national health services, and other constituents.

The code is all open source and is hosted on GitHub, a collaborative web platform where hundreds of people worldwide have contributed code under the constant evaluation of the most active GA4GH contributors. Still in development toward version 1.0, the GA4GH Genomics API is a constellation of objects making up an ecosystem. Haussler urged individuals and organizations to get involved in developing the API further. Task teams tackle specific technical issues, and the API supports demonstration projects such as Beacons, Matchmaker Exchange, the BRCA Challenge, and the Cancer Gene Trust.
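To make the Beacon idea concrete, here is a minimal sketch of a service that answers only yes or no to the question "has this allele been observed here?". It is an illustration under stated assumptions, not the GA4GH Beacon specification: the class names, field names, and the example coordinates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlleleQuery:
    """Hypothetical query shape; field names are illustrative only."""
    reference_name: str   # chromosome label, e.g. "1"
    start: int            # 0-based position (made-up coordinate below)
    reference_bases: str
    alternate_bases: str


class ToyBeacon:
    """Answers only yes/no, so no individual-level record leaves the host."""

    def __init__(self, observed):
        # frozen dataclasses are hashable, so they can live in a set
        self._observed = set(observed)

    def query(self, q: AlleleQuery) -> dict:
        return {"exists": q in self._observed}


beacon = ToyBeacon([AlleleQuery("1", 1000, "G", "A")])
print(beacon.query(AlleleQuery("1", 1000, "G", "A")))
print(beacon.query(AlleleQuery("1", 1000, "G", "T")))
```

The point of the design is that the response carries a single boolean, which lets a private data set participate in discovery without exposing its contents.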

The underlying bricks and mortar of the API are data standards, which started when the 1000 Genomes Project built BAM and VCF, two file formats that became the de facto world standards for sequencing reads and genetic variants, respectively. GA4GH volunteers are maintaining those formats and also building an abstract version of the data so that they can be articulated in the more flexible API format.
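As a rough illustration of what an abstract version of file-format data could look like, the sketch below turns one VCF data line into a plain record that could be serialized as JSON. The eight fixed VCF columns and the 1-based POS convention come from the VCF format itself; the output field names (reference_name, start, end, and so on) are assumptions chosen for this example and are not taken from the GA4GH schema.

```python
import json


def vcf_line_to_record(line: str) -> dict:
    """Convert one VCF data line into a format-neutral dict (hypothetical schema)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("expected at least 8 tab-separated VCF columns")
    chrom, pos, vid, ref, alt = fields[:5]
    start = int(pos) - 1  # VCF POS is 1-based; this record uses 0-based start
    return {
        "reference_name": chrom,
        "start": start,
        "end": start + len(ref),  # half-open interval covering the REF allele
        "names": [] if vid == "." else vid.split(";"),
        "reference_bases": ref,
        "alternate_bases": [] if alt == "." else alt.split(","),
    }


line = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3"
print(json.dumps(vcf_line_to_record(line), indent=2))
```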

A new project aims to create a single ethnically diverse reference using all the world’s genomes, rather than an individual genome from one particular ethnic group, as is the case for the current reference genome. Another project addresses a surprisingly basic question: what constitutes a genetic allele, expressed in precise machine-readable form? Subtle differences exposed by Reece Hart and his Variation Modeling Collaboration revealed the potential for ambiguity and confusion among the major constituencies storing and exchanging genetic variants. This must be solved before the shared data are used in clinical practice, Haussler said. A recent addition was the streaming API, which transfers large data files efficiently.
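One well-known source of this kind of ambiguity is that the same insertion or deletion inside a repeat can be written at several positions. The sketch below shows a common left-align-and-trim normalization that maps such equivalent descriptions to one form. It is offered as an illustration of the problem, not as the method used by the Variation Modeling Collaboration; the function name and the toy reference sequence are assumptions.

```python
def normalize(ref_seq: str, pos: int, ref: str, alt: str):
    """Left-align and trim an allele; pos is the 0-based offset of ref in ref_seq."""
    if ref_seq[pos:pos + len(ref)] != ref:
        raise ValueError("ref allele does not match the reference sequence")
    if ref == alt:
        raise ValueError("ref and alt are identical")
    while True:
        if ref and alt and ref[-1] == alt[-1]:
            # drop a shared trailing base
            ref, alt = ref[:-1], alt[:-1]
        elif not ref or not alt:
            # an allele became empty: pull in one base from the left
            if pos == 0:
                raise ValueError("cannot extend past the start of the sequence")
            pos -= 1
            ref, alt = ref_seq[pos] + ref, ref_seq[pos] + alt
        else:
            break
    # drop shared leading bases, keeping at least one base in each allele
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


toy_reference = "GCACACAT"  # made-up sequence containing a CA repeat
# Three ways of writing "delete one CA unit" from the repeat:
print(normalize(toy_reference, 4, "ACA", "A"))
print(normalize(toy_reference, 2, "ACA", "A"))
print(normalize(toy_reference, 0, "GCA", "G"))
```

All three calls describe the same resulting sequence, and after normalization they compare equal, which is the property two databases need before they can match variants reliably.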

One task team aims to bring code to the data, on the expectation that code may move more freely than data because of privacy restrictions and cross-border restrictions on genomic data. Code would move in Docker containers, connected in processes called workflows. All of this requires an API that allows the complex code not only to run reproducibly in different data centers, but also to understand layers of authorization and access.
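A minimal sketch of the idea follows, assuming a workflow is an ordered list of containerized steps and that each data center grants the caller one access tier. The tier names, the Step fields, and the image names are all hypothetical and are not drawn from any GA4GH specification.

```python
from dataclasses import dataclass

# Hypothetical access tiers, lowest to highest.
TIERS = ("public", "registered", "controlled")


@dataclass(frozen=True)
class Step:
    name: str
    image: str          # container image to run (made-up name)
    required_tier: str  # minimum tier needed to read this step's input data


def runnable_steps(workflow, granted_tier):
    """Return step names in order, stopping before the first unauthorized step."""
    allowed = []
    for step in workflow:
        if TIERS.index(granted_tier) < TIERS.index(step.required_tier):
            break  # later steps depend on this one, so stop here
        allowed.append(step.name)
    return allowed


workflow = [
    Step("align", "example/aligner:1.0", "registered"),
    Step("call-variants", "example/caller:1.0", "controlled"),
]
print(runnable_steps(workflow, "registered"))
print(runnable_steps(workflow, "controlled"))
```

The same workflow description can be sent to several data centers; each one decides locally how much of it the caller is allowed to run, and the data never leave the host.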

A new project, Variant Interpretation for Cancer Consortium, led by investigators at Washington University in St. Louis, U.S., and Universitat Pompeu Fabra in Spain, is uniting a number of the best databases for actionable interpretation of somatic cancer variants to create a one-stop, carefully reviewed and authoritative place to find the information. As of the Plenary, GA4GH counted 92 international genomic data initiatives. GA4GH wants the world to collaborate on the interface, allowing all constituents to communicate, and also wants to encourage competition on implementation.