The following is an edited transcript of Scientific Director David Haussler’s remarks during a panel discussion on October 14, 2020, as part of a National Academies workshop titled “Data in Motion: New Approaches to Advancing Scientific, Engineering and Medical Progress.” The panel discussion, moderated by Stuart Feldman (Schmidt Futures), was titled “The Need for Fast Response Science: COVID and Other Challenges/Drivers.” Panelists were David Haussler (University of California, Santa Cruz), Ana Bonaca (Harvard University), and Mark Zelinka (Lawrence Livermore National Laboratory).
Edits were made for readability, with remaining content reflecting what Dr. Haussler said.
Thanks to the Academy for setting this up. It’s a great honor to be able to talk with you today. I’ll try to be brief. But there’s a lot to go over.
Basically, I’m a very angry individual at this point. And I’m angry about obstacles to the flow of scientific data. I’ve been in a decades-long fight against ownership of information in the life sciences. This is a critical issue of our time. We won’t be able to get the large data sets and the AI-enabled analysis of them unless we solve this problem.
It all started 20 years ago when I was a member of the International Human Genome Sequencing Consortium sequencing the human genome. At the last minute, our group, and specifically a student in my group named Jim Kent, put together the pieces of the human genome produced by the Consortium to form the first draft of the DNA sequence of the human chromosomes. That was just in time for a June 26, 2000 meeting at the White House, where it was announced that the competitor Celera Genomics and the Consortium had both finished a first draft of the human genome at the same time.
The thing about that was that Celera Genomics planned to charge a subscription fee for people to read humanity’s genetic heritage, the product of billions of years of evolution. And I think that was inappropriate.
Because we computationally assembled it, we had the honor of posting the first draft of the human genome on the Internet free, without any restrictions for scientific or any other use, on July 7, 2000. Following that, we started to look at the genomes of as many species as possible, and over the years we’ve started to understand the very dictionary, the periodic table, if you will, for life sciences by sequencing the genomes from species everywhere on the planet.
This is absolutely important: if we want to save the endangered species on this planet, we need to understand their genomes. But recently, as many of you may know, there is a movement to add what’s called digital sequence information (DSI) to the Nagoya Protocol for the protection of genetic resources, which would allow people to patent the genome sequences of life forms within their jurisdictions and restrict transfer and use of genetic information. And I ask you, what would chemistry be like if Lithuania owned the atomic structure of lithium and Germany owned the atomic structure of germanium? It doesn’t make any sense for countries to actually own the genetic sequences that naturally occur in species within their borders. Biology is not parceled up into nation states.
We are now spending a huge amount of time and a huge amount of money on collecting information about genetic sequences involved in human disease. I also had the honor of spending a decade with the Cancer Genome Atlas and the International Cancer Genome Consortium, where we got our first glimpse of what cancer looks like at the genetic level. Cancer is a genetic disease, 100%. We had the honor of becoming the first trusted partner of the NIH, which is a legal status, allowing us to share information about individual genomes of people who have specific diseases with the rest of the world. That project, called CGHub, produced the first petabyte-scale international genome sharing. Our output exceeded the total amount of information that was produced by the National Center for Biotechnology Information at the time.
We went on and built something which we called the Cancer Gene Trust that would include information about how cancer patients were treated and what the medical outcome was. We were never allowed to include this information in CGHub. The Cancer Gene Trust went precisely nowhere. You can google it and you’ll find a few dozen entries on a free website. But the idea of actually freely exchanging information about cancer genetics and cancer outcomes has never taken hold, and I think that’s a travesty. The fact that we have deadly diseases occurring in individuals all over the world and we cannot get specific information about this is absolutely insane.
I went on to be one of the co-founders of the Global Alliance for Genomics and Health, which now has more than 600 institutional members from countries all over the world. One of our key projects was to try to resolve this issue of genetic data sharing. We started a project called the BRCA Exchange. The BRCA gene, of course, is the gene that’s most involved in susceptibility to breast and ovarian cancer in women. That gene was owned by a company called Myriad Genetics for many years until the US Supreme Court struck down their patent. During those years they amassed the world’s most comprehensive information about the various genetic changes that occur in this gene and which of them are pathogenic. Can you get that information? No. The BRCA Exchange has now collected more than 40,000 distinct genetic variants that occur in the BRCA gene, and it makes that information freely available for scientific research and to practitioners.
Over the years we also built the Human Genome Browser, which is a way to collect all freely available information about the human genome for interactive exploration. That has become an essential utility in human genetics, a huge driver of science. It’s kind of like the Google Maps of the human genome. You can explore it interactively, make hypotheses and essentially test them electronically right there on the spot. More than 10,000 scientists around the world will use it today. And they will use it for a significant amount of time, generating more than a million page hits.
When the pandemic hit, we built an analogous tool, which we call the SARS-CoV-2 Genome Browser. We put all the worthwhile data that was being rapidly published in bioRxiv and medRxiv on the browser as quickly as possible. You have an issue during a pandemic like this with the bioRxiv resource: information just comes in so fast, it’s very hard to assimilate it, and it takes work to integrate it. We worked very hard to integrate it so everyone could compare and contrast data sets across different studies. Carl Zimmer called it one-stop shopping for COVID-19 molecular biology. That resource is also extensively used, and we’ll have about 1,800 page hits a day on that; today will be no exception.
But one of the things that also shocked me is that when we put on the first 50,000 SARS-CoV-2 genomes that had been collected around the world, we got complaints from GISAID, the organization that controls that information, saying we could not share this information. These virus genome sequences are the information that allows the world to actually watch the genetic evolution of the virus in real time, trace its movements from place to place and watch for mutations that may cause resistance or intractability to certain therapies or diagnostics. When we had the opportunity to correlate all that virus genetic information with all of the experimental molecular biology information from bioRxiv and medRxiv in one place, they tried to shut us down.
This is insane. As I said, I am an angry person. We need to make it clear once and for all that Nature’s information is open to all.