CAPTION: Organizations that support the GRC assembly and its gene annotations. Abbreviations: e!, Ensembl Project; GRC, Genome Reference Consortium; HGNC, Human Genome Organisation (HUGO) Gene Nomenclature Committee; INSDC, International Nucleotide Sequence Database Collaboration; NCBI, National Center for Biotechnology Information; UCSC, University of California, Santa Cruz.
Daniel R. Zerbino, Adam Frankish, and Paul Flicek | Annual Review of Genomics and Human Genetics | May 18, 2020
HISTORY OF THE HUMAN GENOME AND ITS ANNOTATION
The history of the sequencing and annotation of the human genome is marked by breathtaking acceleration after a prolonged theoretical inception. Through astute experimental design, many components of the human genome were discovered long before its base pairs could be read. When the DNA was finally readable, these abstract concepts were mapped onto actual sequences, creating a multilayered annotation linking sequence to phenotype.
The concept of genes evolved from theoretical consideration to molecular components (54). In 1866, Gregor Mendel published his laws of genetics (97), and three years later, Friedrich Miescher isolated nucleic acids (31). The term gene itself was coined as early as 1909 by Wilhelm Johannsen (79, 130) to designate the characteristics of the gametes that affect the resulting organism. Even though geneticists did not know the exact molecule involved, statistical analyses of inheritance patterns allowed them to determine that genes were stored in a linear fashion and to start computing genetic maps of gene proximity (143).
It was only in the mid-twentieth century that the experiments of Avery et al. (8) (1944) and Hershey & Chase (65) (1952) demonstrated the role of DNA in carrying genetic information. Once the role of DNA was proven, genes became physical components. Protein-coding genes could be characterized by the genetic code, which was determined in 1965 (109, 135), and could thus be defined by the open reading frames (ORFs). However, exceptions to Francis Crick’s central dogma of genes as blueprints for protein synthesis (30) were already being uncovered: first tRNA (27) and rRNA (87) and then a broad variety of noncoding RNAs (38).
The genome also provides mechanisms to regulate when and where genes are expressed, thus refining their phenotypic effects. In 1939, Conrad Hal Waddington (161) coined the term epigenetics to designate the study of cell type differentiation (67). In 1970, John Gurdon (61) demonstrated that differentiation did not involve changes to DNA, raising the question of how a multicellular organism, whose genome is (nearly) identically replicated across all cells, could express a wide diversity of cell types, tissues, and so on. Epigenetics thus became the study of information conserved across mitosis and not carried by the DNA sequence. Confusingly, the term later came to additionally (and simultaneously) refer to the study of non-Mendelian inheritance across generations (45, 70).
The control mechanism of gene expression levels was illuminated by François Jacob and colleagues through the discovery of the lac operon (78), and a model of gene expression regulation was produced: a promoter sequence upstream of the gene to recruit polymerase, and operator sequences to recruit transcription factors. Farther away from the promoter, enhancers were found, first in viruses in 1981 (13, 59) and then in eukaryotes in 1983 (9, 55, 98), to affect transcriptional output at the promoter regardless of distance or orientation.
The genome contains functional regions relevant to its integrity. Centromeric regions, for example, are necessary to recruit the kinetochores to ensure proper separation of chromatin during mitosis, to keep sister chromatids together ahead of mitosis (10), and finally to ensure their own rapid replication during S phase (145). Telomeric regions have long been interpreted to protect the ends of chromosomes, but our understanding of their function is still evolving (133).
Finally, a large amount of the genome is derived from transposable elements. In 1953, Barbara McClintock (95) published the first observation of genes moving in the genome. It was later discovered that transposable elements correspond to repeated sequences that are able to copy themselves within a cell’s genome.
Reading the Genome
Shortly after the discovery of the importance of nucleic acids, the first (RNA) genome was sequenced in 1976 (47), and sequencing methods were refined to allow for large-scale data production (131). Despite this rapid progress, characterizing the entire sequence of a large genome was still a complex and costly endeavor due to the necessity of collecting genetic linkage maps (101), which meant that the first eukaryotic genome (of yeast) was published only in 1996 (58).
As genome sequencing technology improved, one obvious challenge was to sequence the human genome (155). Despite all the technical obstacles, as early as 1985, scientists such as Robert Sinsheimer at the University of California, Santa Cruz (UCSC), started discussing the feasibility of sequencing the human genome (28). This idea gathered support, and in 1988, a joint project of the US National Institutes of Health and Department of Energy was created to sequence the human genome over a period of 15 years, around which parallel efforts in China, France, Germany, Great Britain, and Japan rallied. The project progressed slowly, sequencing less than 15% of the genome over the next 11 years, until competition from the Celera Corporation created uncertainty about the availability of the sequence and spurred a significant ramping up of resources and processes, leading to the back-to-back release of two draft sequences on June 26, 2000 (76, 157).