Originally published on Science Policy Forum

Early data-sharing efforts have led to improved variant interpretation and development of treatments for rare diseases and some cancer types (1–3). However, such benefits will only be available to the general population if researchers and clinicians can access and make comparisons across data from millions of individuals.

Despite much progress, genomic and clinical data are still generally collected and studied in silos: by disease, by institution, and by country. Regulatory data-privacy requirements do not seamlessly lend themselves to the secure sharing of data within and across institutions and countries (4). Current practices in research and medicine hinder the sharing of data in ways that tangibly recognize an individual’s contributions. Tools and analytical methods are nonstandardized and incompatible, and the data are often stored in incompatible file formats. If we stay this course, the likely outcome will be an assortment of balkanized systems akin to those developed for U.S. electronic health records, which, although designed to advance human health by sharing clinical data across institutions, have by all measures fallen short of that goal because of a lack of interoperability.

A FEDERATED DATA ECOSYSTEM. The Global Alliance for Genomics and Health (GA4GH) was established in 2013 to enable responsible and effective sharing of genomic and clinical data in a way that is as simple as using the World Wide Web. GA4GH, which now brings together hundreds of individuals and organizations, was built on the hypothesis that the data underlying genomic medicine must be federated. That is, whereas data may be distributed across many databases and computers around the world, they must be virtually connected through software interfaces that allow seamless, authorized access. In contrast to large centralized data repositories, a federated system will allow legal data control to remain within the originating jurisdiction (see the figure). International consortia such as the International Cancer Genome Consortium (ICGC) have already adopted federated databases because the model allows local databases to maintain autonomy (5).

TOOL DEVELOPMENT AND USE. As a first step, the GA4GH Regulatory and Ethics Working Group (REWG) developed a framework document that provides basic principles and core elements for responsible data sharing (6, 7) and is founded on Article 27 of the 1948 Universal Declaration of Human Rights (8). This focus on human rights represents a paradigm shift with respect to data sharing, as most previous discussions focused solely on protection from harm without acknowledging the right to benefit from the fruits of scientific and medical advances. In practical terms, increased data sharing will enable researchers to make better predictions about disease risk, prevention, and treatment by virtue of having access to larger data sets. And through data exchanges that link the clinical and research communities, clinicians will be able to make better precision medicine decisions for individual patients.

Additionally, the Data Working Group (DWG) has developed a standardized application programming interface (API), which offers a defined protocol to allow disparate technology services of institutions around the globe to communicate with one another to exchange genotypic and phenotypic information.

The API and the framework document are being used in demonstration projects spearheaded by GA4GH members.

Beacon Project. The Beacon Project (http://ga4gh.org/#/beacon) is developing an open technical specification for sharing genetic variant data sets collected from large-scale population-sequencing projects, clinical diagnostic settings, and variant curation efforts available to the community. A beacon is a Web-accessible service that allows data sets to be queried for the presence or absence of a specific allele. A user of a beacon can ask it questions of the form, “Have you observed this nucleotide (e.g., C) at this genomic location (e.g., position 32,936,732 on chromosome 13)?” to which the beacon must respond with either “yes” or “no.” In the 2 years since the project’s launch, 23 organizations have lit more than 60 beacons serving more than 200 data sets. The data sets served through beacons may be queried individually or in aggregate via the Beacon Network, a federated search engine (http://beacon-network.org/#). Currently, all Beacon users must agree to a single set of data-use conditions. However, work is under way to allow Beacon users to choose from a predetermined set of conditions that restrict potential data use on the basis of the consent of individuals represented in the data (9).
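The yes/no question a beacon answers can be sketched in a few lines of Python. This is a minimal illustration rather than the Beacon specification: the `AlleleQuery` and `Beacon` classes, the `has_allele` method, and the in-memory set of observed alleles are hypothetical names chosen for the example, and the data are made up.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AlleleQuery:
    """One beacon question: was this base seen at this position?"""
    chromosome: str
    position: int
    allele: str


@dataclass
class Beacon:
    """Hypothetical beacon holding the alleles observed in one data set."""
    name: str
    observed: set = field(default_factory=set)

    def has_allele(self, query: AlleleQuery) -> bool:
        # The only information released is presence or absence.
        key = (query.chromosome, query.position, query.allele.upper())
        return key in self.observed


# Illustrative data only; not taken from any real data set.
beacon = Beacon("example-beacon", {("13", 32936732, "C")})
answer = beacon.has_allele(AlleleQuery("13", 32936732, "C"))
print("yes" if answer else "no")
```

The response carries a single bit per query, which is what lets a data holder advertise what it has without releasing the underlying records.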

[Figure: A federated data ecosystem. To share genomic data globally, this approach furthers medical research without requiring compatible data sets or compromising patient identity.]

In contrast to traditional “all-or-nothing” approaches to sharing data [e.g., open or password-protected access to variant call format (VCF) files], Beacon uses a tiered access approach, in which increasingly detailed information is made available to users at more stringent levels of authentication and authorization and with a formal specification of data-use conditions. A registered access level that would fall between open and controlled access is under development.
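One way to picture tiered access is as a filter that releases more fields as the caller's authorization level rises. The sketch below is an assumption-laden illustration: the tier names follow the open, registered, and controlled levels mentioned above, but the record fields and the mapping of fields to tiers are hypothetical and not drawn from the Beacon specification.

```python
from enum import IntEnum


class AccessTier(IntEnum):
    OPEN = 0        # any Web user
    REGISTERED = 1  # identified user who accepted data-use conditions
    CONTROLLED = 2  # user approved for this specific data set


# Hypothetical record for one allele; the fields are illustrative.
RECORD = {"exists": True, "allele_count": 7, "sample_ids": ["S1", "S2"]}

# Minimum tier required to see each field (an assumption for this sketch).
REQUIRED_TIER = {
    "exists": AccessTier.OPEN,
    "allele_count": AccessTier.REGISTERED,
    "sample_ids": AccessTier.CONTROLLED,
}


def visible_fields(record: dict, tier: AccessTier) -> dict:
    """Return only the fields the caller's tier is allowed to see."""
    return {k: v for k, v in record.items() if tier >= REQUIRED_TIER[k]}


print(visible_fields(RECORD, AccessTier.OPEN))
print(visible_fields(RECORD, AccessTier.CONTROLLED))
```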

By adopting a federated model, Beacon overcomes the inefficiency and expense experienced when data generators must transfer whole copies of their data sets into a single, centralized repository. The federated approach also circumvents the often-prohibitive privacy and security risks that arise when such transfers force data to cross institutional and, sometimes, national or continental boundaries. Also, because Beacon is compatible with any underlying representation of alleles or allelic annotations, it is not limited by particular file formats. Finally, Beacon allows data discovery without exposing identifiable information, because it does not require data generators to share fully described data representations or annotations.
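The federated pattern described here can be sketched as a thin aggregator that forwards one question to many locally held data sets and collects only their yes/no answers. The function names `make_beacon` and `federated_query`, the site names, and the alleles are hypothetical; a real network would issue the queries over the Web rather than through in-process function calls.

```python
from typing import Callable, Dict, Tuple

Query = Tuple[str, int, str]  # (chromosome, position, allele)


def make_beacon(observed: set) -> Callable[[Query], bool]:
    """Build a hypothetical local beacon; its data stay inside this closure."""
    def answer(query: Query) -> bool:
        return query in observed
    return answer


def federated_query(
    beacons: Dict[str, Callable[[Query], bool]], query: Query
) -> Dict[str, bool]:
    """Ask every beacon the same question and collect the yes/no answers."""
    return {name: ask(query) for name, ask in beacons.items()}


# Illustrative sites and alleles; not real data.
network = {
    "site-a": make_beacon({("13", 32936732, "C")}),
    "site-b": make_beacon({("17", 41276045, "T")}),
}
answers = federated_query(network, ("13", 32936732, "C"))
print(answers, "-> any hit:", any(answers.values()))
```

No data set is copied to the aggregator, so each site keeps control of its records and of the format in which they are stored.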

BRCA Challenge. The BRCA Challenge aims to advance understanding of the genetic basis of breast, ovarian, and other cancers that are driven by germline variants in BRCA1 and BRCA2. The project’s first product is the BRCA Exchange (http://brcaexchange.org), a publicly accessible Web portal that provides a simple interface for patients, clinicians, and researchers to access curated, expert interpretations of BRCA1/2 genetic variants, as well as supporting evidence. An expanded research arm of the portal was recently launched to allow any Web user to access data from the original submitters. A third tier of access will be made available to registered users who require access to potentially identifiable case-level data and are working on variant interpretation. In addition to developing the BRCA Exchange, the BRCA Challenge team members are working to understand the liability concerns faced by federated databases of this kind, such as misclassifications or failure to regularly validate and update classifications.

Matchmaker Exchange. Matchmaker Exchange (MME) is a collaborative effort of consortia, including members of the International Rare Diseases Research Consortium (www.irdirc.org/) and related laboratories in the rare disease space, where the majority of cases studied lack a clear etiology after initial analysis (10). Given a suspicious variant in a candidate disease gene, matching two cases that share the variant, as well as an overlapping phenotype, may be sufficient evidence to causally implicate the gene. To facilitate such discovery, researchers in the rare disease community have established a series of platforms that allow users to identify cases with phenotypes and disrupted genes in common. MME was established to connect rare disease databases, such that a query to one would enable searches of the others, without having to deposit data into each one.

At this time, three Matchmaker Services (GeneMatcher, Phenome Central, and DECIPHER) have implemented the MME API. To ensure accurate comparison of patients assessed by different clinicians, similar phenotypes are determined by matching identical or ontologically similar terms with the Human Phenotype Ontology. MME users must deposit their data into an existing MME service, and tools on the MME website (http://matchmakerexchange.org) help guide users toward the database that is most appropriate for a given case. Although the system is currently geared toward clinicians and researchers, the team is also working with patients to establish patient-led matchmaking endeavors with support from such organizations as Free the Data and the ClinGen Genome Connect Registry.
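Matching on "ontologically similar" terms can be illustrated by expanding each case's terms with their ancestors in the ontology and then measuring the overlap. The sketch below is only one possible scoring scheme (a Jaccard index) and is not presented as the method used by any MME service; the ontology is a toy with placeholder identifiers, not real Human Phenotype Ontology terms.

```python
# Toy ontology: child term -> parent term.
# These are placeholder IDs, not real Human Phenotype Ontology terms.
PARENT = {
    "T:focal_seizure": "T:seizure",
    "T:generalized_seizure": "T:seizure",
    "T:seizure": "T:neuro",
    "T:ataxia": "T:neuro",
    "T:neuro": "T:root",
    "T:short_stature": "T:growth",
    "T:growth": "T:root",
}


def with_ancestors(terms: set) -> set:
    """Expand a set of terms with every ancestor up to the root."""
    expanded = set()
    for term in terms:
        while term is not None:
            expanded.add(term)
            term = PARENT.get(term)
    return expanded


def phenotype_similarity(case_a: set, case_b: set) -> float:
    """Jaccard overlap of ancestor-expanded term sets (one of many possible scores)."""
    a, b = with_ancestors(case_a), with_ancestors(case_b)
    return len(a & b) / len(a | b) if a | b else 0.0


identical = phenotype_similarity({"T:focal_seizure"}, {"T:focal_seizure"})
related = phenotype_similarity({"T:focal_seizure"}, {"T:generalized_seizure"})
unrelated = phenotype_similarity({"T:focal_seizure"}, {"T:short_stature"})
print(identical, related, unrelated)
```

Two cases described with different but related terms still score well above two cases that share only the ontology root, which is the behavior needed when patients are assessed by different clinicians.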

Matchmaking has already led to the diagnosis of several previously undiscovered rare diseases (10). Successful matching will increase considerably as the volume of cases connected through MME increases. Additionally, MME will soon enable “hypothesis-free” matching in which the genotype aspects of matching can occur by direct query of variants within a VCF that meet certain criteria, even if no candidate gene has been labeled as such. This will require MME services to support queries of entire genomic data sets.

Finally, with input from the REWG, MME has developed a two-tiered informed-consent policy that defines the type of consent needed for using MME and when no consent is needed. If the data are associated with a unique or sensitive phenotype or with sequence-level data, consent from the patient is required to share them for research purposes. However, if only standard phenotype terms and candidate gene names are used, consent to clinical care allows for matchmaking. Still, challenges remain in balancing discovery with privacy and data protection.
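The two-tier rule in this paragraph reduces to a short decision function. The sketch below encodes only what the text states; the `CaseSubmission` class, its field names, and the returned labels are hypothetical, and an actual policy would involve judgments (for example, what counts as a sensitive phenotype) that code cannot settle.

```python
from dataclasses import dataclass


@dataclass
class CaseSubmission:
    """Hypothetical description of what a submitter wants to share."""
    has_sequence_level_data: bool
    has_unique_or_sensitive_phenotype: bool


def required_consent(case: CaseSubmission) -> str:
    """Apply the two-tier rule described in the text."""
    if case.has_sequence_level_data or case.has_unique_or_sensitive_phenotype:
        return "patient consent to share for research"
    # Only standard phenotype terms and candidate gene names are shared.
    return "consent to clinical care"


print(required_consent(CaseSubmission(False, False)))
print(required_consent(CaseSubmission(True, False)))
```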

A variety of issues arise when data must cross multiple communities (e.g., patient privacy, distinct international laws, individual academic success in gene discovery, user authentication, and consistent standards for data exchange across distinct databases). Although GA4GH has been convening stakeholders to address these challenges, more groups and data sets must still be brought in.

REMAINING CHALLENGES. Shringarpure and Bustamante (11) used simulations to show that, in some scenarios, querying a public beacon for as few as 250 variants already known to be present in an individual’s genome could reveal information distinctive to that individual. GA4GH members have been developing solutions to this potential security breach since the project’s inception, including aggregating data among multiple beacons, tracking usage to restrict systematic searches, and introducing tiers of secured access that require users to be authorized for data access. However, these measures necessarily limit the scope of information that can be shared widely. Innovative policy and regulatory measures, as well as technological solutions, are needed to securely handle individual genomic and clinical data.
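Of the mitigations listed, usage tracking is the simplest to illustrate. The sketch below caps the number of distinct allele queries a user may send to one beacon; the `QueryBudget` class and the limit of three are arbitrary assumptions for the example, and choosing a limit that actually prevents re-identification would depend on the data set and on the kind of analysis reported in (11).

```python
from collections import defaultdict


class QueryBudget:
    """Hypothetical per-user cap on distinct allele queries to one beacon."""

    def __init__(self, max_distinct_queries: int):
        self.max_distinct_queries = max_distinct_queries
        self._seen = defaultdict(set)  # user -> queries already asked

    def allow(self, user: str, query: tuple) -> bool:
        """Return True if the query may run, recording it against the budget."""
        seen = self._seen[user]
        if query in seen:
            return True  # a repeated question reveals nothing new
        if len(seen) >= self.max_distinct_queries:
            return False  # systematic probing is cut off here
        seen.add(query)
        return True


budget = QueryBudget(max_distinct_queries=3)  # arbitrary limit for the example
results = [budget.allow("user-1", ("13", 32936700 + i, "C")) for i in range(5)]
print(results)
```

A cap of this kind illustrates the trade-off noted above: the tighter the budget, the harder a systematic search becomes, but the less a legitimate user can learn.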

A second challenge is scalability. For every problem there will be domain-specific challenges that may require uniquely applicable tools. For instance, the field of dementia research may demand new solutions that integrate data from brain MRI technology. Furthermore, it is expected that individual fields will have previously developed standards, which may demand that GA4GH adapt its existing solutions in order to be compliant. Applying existing GA4GH approaches in new contexts will require solutions that are easily portable, customizable, and interoperable. GA4GH must also focus on solutions that can benefit many different patient groups, jurisdictions, health systems, and environmental and socioeconomic realities, such as interoperability with mobile devices, which are now broadly available even in developing nations. Open technology, built-in interoperability, and ease of use of data-sharing tools are essential.

Data sharing has inherent costs related to data curation, hosting, and computation. Hoopen et al. described substantial costs of post-data curation, leading to a proposal for lower-cost submitter-driven annotation as a more sustainable curation solution (12). Many databases currently recover costs through user fees (13), creating either a need to charge and share revenue or a two-tiered system that may limit some downstream users from accessing the full complement of information. Member projects, such as ICGC’s PanCancer Analysis of Whole Genomes (PCAWG), have implemented federated cloud-based solutions that reduce the cost of analyzing a single sample from U.S. $200 with traditional academic high-performance computing models to under U.S. $20. Cloud-based approaches also have the benefit of being compatible with some country-specific legal frameworks (14). Several business models to support genomics big data research have been proposed, including a subscription model, which may inherently limit access, and a “freemium” model, which charges not for data access but for associated services, such as curation and interpretation (15).

Notwithstanding the emergence of new business models for private and public sector partnerships to support some data-sharing costs, government agencies may need to support some features of the ecosystem (e.g., curation) so that clinicians and patients have access to as much free, curated data as possible. In addition to economic incentives, more can be done to establish greater academic recognition for data sets through citations and microattribution, in which quantitative credit is attached to every data-use accession (16, 17).

Finally, ensuring engagement among the entire global community is necessary from a social justice and medical perspective, although this will likely require distinct legal, cultural, and business models. In some countries, health care and research organizations are interested in GA4GH as a means to link nascent national efforts in precision medicine, such as the Brazilian Initiative on Precision Medicine (www.fcm.unicamp.br/gtc/evento/1/trabalho/8), with other international groups. Training and infrastructure needs related to data storage, management, security, and policies are common to many jurisdictions. Technology and economic incentives can make it possible for an international, federated network of genomic and clinical data to become a network for learning that will illuminate causes of disease and potential interventions for prevention and treatment.