Nanopore sequencing and assembly of a human genome with ultra-long reads

Abstract:

We report the sequencing and assembly of a reference genome for the human GM12878 Utah/CEPH cell line using the MinION (Oxford Nanopore Technologies) nanopore sequencer. 91.2 Gb of sequence data, representing 30× theoretical coverage, were produced. Reference-based alignment enabled detection of large structural variants and epigenetic modifications. De novo assembly of nanopore reads alone yielded a contiguous assembly (NG50 3 Mb). We developed a protocol to generate ultra-long reads (N50 > 100 kb, read lengths up to 882 kb). Incorporating an additional 5× coverage of these ultra-long reads more than doubled the assembly contiguity (NG50 6.4 Mb). The final assembled genome was 2,867 million bases in size, covering 85.8% of the reference. Assembly accuracy, after incorporating complementary short-read sequencing data, exceeded 99.8%. Ultra-long reads enabled assembly and phasing of the 4-Mb major histocompatibility complex (MHC) locus in its entirety, measurement of telomere repeat length, and closure of gaps in the reference human genome assembly GRCh38.
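
The contiguity and coverage figures quoted above (N50, NG50, theoretical coverage) follow standard definitions, which can be sketched as below. This is a minimal illustration and not the authors' pipeline: the function names are hypothetical, the toy contig lengths are invented for the example, and the coverage check assumes the 3.1 Gb genome size given in the main text.

```python
def nx50(lengths, total=None):
    """Return the largest length L such that sequences of length >= L
    together span at least half of `total`.

    With total=None the sum of `lengths` is used (N50). Passing an
    estimated genome size instead gives NG50.
    """
    if total is None:
        total = sum(lengths)
    half = total / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0  # the sequences span less than half of `total`


def n50(lengths):
    """N50: half of the assembled (or sequenced) bases lie in sequences this long or longer."""
    return nx50(lengths)


def ng50(lengths, genome_size):
    """NG50: like N50, but measured against the genome size rather than the assembly size."""
    return nx50(lengths, genome_size)


def theoretical_coverage(total_bases, genome_size):
    """Average depth if the sequenced bases were spread evenly over the genome."""
    return total_bases / genome_size


# Toy contig lengths (invented for illustration only).
toy_contigs = [80, 70, 50, 40, 30, 20]
toy_n50 = n50(toy_contigs)                      # measured against 290 assembled bases
toy_ng50 = ng50(toy_contigs, genome_size=400)   # measured against an assumed genome size

# 91.2 Gb of data over an assumed 3.1 Gb genome.
coverage = theoretical_coverage(91.2e9, 3.1e9)
```

Because NG50 is measured against the genome size rather than the assembly size, it allows assemblies of different total length to be compared on the same footing, which is presumably why the abstract reports it for the assemblies.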

Main:

The human genome is used as a yardstick to assess performance of DNA sequencing instruments1,2,3,4,5. Despite improvements in sequencing technology, assembling human genomes with high accuracy and completeness remains challenging. This is due to size (3.1 Gb), heterozygosity, regions of GC% bias, diverse repeat families, and segmental duplications (up to 1.7 Mbp in size) that make up at least 50% of the genome6. Even more challenging are the pericentromeric, centromeric, and acrocentric short arms of chromosomes, which contain satellite DNA and tandem repeats of 3–10 Mb in length7,8. Repetitive structures pose challenges for de novo assembly using “short read” sequencing technologies, such as Illumina’s. Such data, while enabling highly accurate genotyping in non-repetitive regions, do not provide contiguous de novo assemblies. This limits the ability to reconstruct repetitive sequences, detect complex structural variation, and fully characterize the human genome.

Single-molecule sequencers, such as Pacific Biosciences’ (PacBio), can produce read lengths of 10 kb or more, which makes de novo human genome assembly more tractable9. However, single-molecule sequencing reads have significantly higher error rates compared with Illumina sequencing. This has necessitated development of de novo assembly algorithms and the use of long noisy data in conjunction with accurate short reads to produce high-quality reference genomes10. In May 2014, the MinION nanopore sequencer was made available to early-access users11. Initially, the MinION nanopore sequencer was used to sequence and assemble microbial genomes or PCR products12,13,14 because the output was limited to 500 Mb to 2 Gb of sequenced bases. More recently, assemblies of eukaryotic genomes including yeasts, fungi, and Caenorhabditis elegans have been reported15,16,17.

Recent improvements to the protein pore (a laboratory-evolved Escherichia coli CsgG mutant named R9.4), library preparation techniques (1D ligation and 1D rapid), sequencing speed (450 bases/s), and control software have increased throughput, so we hypothesized that whole-genome sequencing (WGS) of a human genome might be feasible using only a MinION nanopore sequencer17,18,19.

We report sequencing and assembly of a reference human genome for GM12878 from the Utah/CEPH pedigree, using MinION R9.4 1D chemistry, including ultra-long reads up to 882 kb in length. GM12878 has been sequenced on a wide variety of platforms, and has well-validated variation call sets, which enabled us to benchmark our results20.
