By Branwyn Wagman
Can a class of 17 graduate students and undergrads make a continuous genome from the A’s, C’s, T’s, and G’s that comprise the banana slug genome?
The short answer: partially.
The banana slug genome seems to be as quirky as the banana slug itself and has already yielded curious findings.
In spring quarter 2015, Banana Slug Genomics (Biomolecular Engineering 235) set out to assemble the genome of the UC Santa Cruz mascot, Ariolimax dolichophallus.
Specimen preparation, genome sequencing, and assembly fully employed the $21,443 raised from 139 donors during a crowd-funding effort in October and November 2014.
Coached by biomolecular engineering faculty Ed Green and Kevin Karplus, the students wrestled with bioinformatics techniques to tackle this project.
Undergrad Jared Copher prepared the specimen and ran the sequencer in biomolecular engineering professor Nader Pourmand’s laboratory during winter quarter.
Copher also became the team’s slug biology expert. “The genome might explain relationships to other related animals that are now ambiguous, such as land slugs and sea slugs,” he said.
The first problem in extracting DNA from an organism is that the chromosomes are too big for the state-of-the-art Illumina sequencing machine, so Copher had to cut them up before feeding them to the sequencer.
The Illumina sequencer yielded relatively short fragments of DNA representing perhaps 40 times coverage of the genome, which is in the neighborhood of 2 billion base pairs long. The job was to solve the multi-layered puzzle to figure out where all the pieces fit—similar to solving a massive jigsaw puzzle where each piece has 40 overlapping copies that are one billionth the size of the entire image—without a picture on the box cover.
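The coverage figure above implies a rough data volume. The sketch below works through that arithmetic using the article's round numbers (40 times coverage, about 2 billion base pairs). The read length of 100 base pairs is an assumption made only for illustration; the article does not state it.

```python
def total_bases(genome_size_bp, coverage):
    """Total bases sequenced = genome size x average coverage."""
    return genome_size_bp * coverage


def estimated_read_count(genome_size_bp, coverage, read_length_bp):
    """Approximate number of reads needed to reach the given coverage."""
    return total_bases(genome_size_bp, coverage) / read_length_bp


GENOME_SIZE_BP = 2_000_000_000   # "in the neighborhood of 2 billion base pairs"
COVERAGE = 40                    # "perhaps 40 times coverage"
ASSUMED_READ_LENGTH_BP = 100     # hypothetical; not given in the article

sequenced = total_bases(GENOME_SIZE_BP, COVERAGE)
reads = estimated_read_count(GENOME_SIZE_BP, COVERAGE, ASSUMED_READ_LENGTH_BP)

print(f"Bases sequenced: {sequenced:.1e}")
print(f"Reads at {ASSUMED_READ_LENGTH_BP} bp each: {reads:.1e}")
```

Under these assumptions, the puzzle has hundreds of millions of pieces, which helps explain why memory and compute time became the limiting factors for the class.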
“If the sequence of your organism already exists, it’s more of a computationally trivial project,” student Emilio Feal said, explaining the common practice of using known genomes of similar species as a scaffold to construct new genomes. This is similar to having a partial photo on the puzzle box.
But Feal added, “In the case of the banana slug, we do not have an existing related organism, so we have to do de novo sequencing.”
Five teams formed to apply different genome assembly software packages to the task, with the aim of finding the most effective. They chose methods that had performed well in the Assemblathon contest from the Genome 10K project.
Class members also sequenced the banana slug’s RNA—a method that helps identify genes and points to how segments of the genome fit together—and will be doing some additional RNA sequencing over the summer.
Charles Cole explained, “Sequencing the RNA tells us two important things: One, what sort of proteins are found in the banana slug; and two, because the RNA is complementary, created from the DNA in the cell, we can map it back to the genome to find the genes.”
Natasha Dudek focused on assembling the banana slug’s mitochondrial genome—a much smaller genome present in the slug’s cells.
The mitochondrial genome contains the COX1 gene sequence, a component often used as a “barcode” to identify a species.
Dudek was able to identify the COX1 sequence for A. dolichophallus, which will make it possible to easily identify other specimens of the same species with a simple genomic test.
Christopher Eisenhart built a prototype banana slug browser for the UCSC Genome Browser, an open access tool for visualizing and studying genomes.
Gepoliano Chaves explained that the first step in analyzing the data is to check the quality of the data they received. In particular, he said, repeating segments of DNA code are a problem for genome assembly.
“Jim Kent says DNA sequence is more like a song that is in poetry than a book that is in prose,” Chaves said, referring to the director of the UCSC Genome Browser, who assembled the first draft of the human genome. “Mary had a little lamb little lamb little lamb Mary had a little lamb whose fleece was white as snow.”
“Repeats are a big problem in assembly,” Kyle McGovern explained. “We believe there are a lot of them in the banana slug genome, and it is hindering our assembly.”
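A toy illustration of why repeats hinder assembly, using the nursery rhyme quoted above in place of DNA. This is only a word-level sketch, not how any of the class's assemblers work: it records which word follows each run of `k` words and reports the runs that have more than one possible continuation. The function names are made up for this example.

```python
from collections import defaultdict


def successor_graph(tokens, k=1):
    """Map each run of k consecutive tokens to the set of tokens seen right after it."""
    graph = defaultdict(set)
    for i in range(len(tokens) - k):
        graph[tuple(tokens[i:i + k])].add(tokens[i + k])
    return graph


def ambiguous_contexts(tokens, k=1):
    """Return the contexts of length k that can be followed by more than one token."""
    graph = successor_graph(tokens, k)
    return {context: nxt for context, nxt in graph.items() if len(nxt) > 1}


rhyme = (
    "mary had a little lamb little lamb little lamb "
    "mary had a little lamb whose fleece was white as snow"
).split()

# With one-word context, "lamb" can be followed by three different words,
# so a reconstruction cannot tell which path to take.
print(ambiguous_contexts(rhyme, k=1))

# With six-word context, every run is unique and the ambiguity disappears.
print(ambiguous_contexts(rhyme, k=6))
```

The longest repeated phrase in the rhyme is five words, so any context longer than that spans the repeat. The same intuition is behind the use of longer DNA segments to resolve repeating regions.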
The assembly effort
Because most of the assembly methods were designed for different types of genomes or were developed by groups that may not have documented them fully, installing the assemblers and getting them up and running took a significant portion of spring quarter.
By the end of the quarter, no team had completed the assembly, but a couple had gotten close.
The availability of computer time turned out to be a critical limiting factor for the assembly teams. Many of the assemblers required significant RAM and time to run. So in addition to seeking methods that would accurately assemble the banana slug genome, they looked for the most efficient methods.
One team spent considerable time correcting the source code of their assembler, Meraculous, but one step in the process consistently failed, and according to team member Jake Houser, the results they did get were unpromising. Members of that team re-deployed to different assembly teams.
The ABySS assembler seemed promising, because it was developed for energy efficiency on large compute clusters and optimized for larger genomes such as the banana slug’s. But the students found it hard to install and configure on the cluster they had available to them, and ABySS requires more memory than the class had available.
Sidra Hussain explained, “We first did an assembly with about a quarter of the data we had. It still used way too much RAM. It failed when [we] tried to run with all the data.”
The assembler SOAPdenovo, a collection of genome alignment tools developed by BGI (formerly Beijing Genomics Institute), seemed promising according to team member Charles Markello.
He explained it allowed the students to use both long segments of genome, which help resolve repeating regions, and short overlapping segments, which help resolve errors and low coverage in portions of the genome. SOAPdenovo also turned out to be memory-efficient, making it possible to process the massive amount of data.
The SOAPdenovo team estimated the banana slug genome at 2.3 billion base pairs.
The assembler SGA seemed promising, because it was designed to use much less memory than other assemblers, and memory was in short supply for the class. The tradeoff, according to an SGA team member, is that it uses much more CPU time: it uses 10 percent of the energy, but it takes 10 times longer.
On the last day of class, the SGA assembler was still churning through the full data set, and the team plans to keep it going into the summer.
The final assembly team employed DISCOVAR, developed at the Broad Institute of MIT and Harvard.
Student Robert Calef explained DISCOVAR was easy for them to install and use. “In 20 years when home genome projects are a thing, it will be more like this.”
Days after the class officially ended, the DISCOVAR team came out with an assembly 2.4 billion bases long.
Green summed up the class by saying, “We have collected an amazing amount of data that will live on beyond the class.”
As a treat for the last day of class, in addition to the banana bread one of the students provided, course professor Ed Green brought the class some data analysis of his own.
Green explained, “If half the reads in certain positions differ from the other half, it shows half are from mom and half from dad.” But when he mapped banana slug reads back to the assembly, the result showed a high level of homozygosity: the reads agreed with one another at nearly all positions.
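The idea in Green's explanation can be sketched in a few lines. The code below is not his analysis. It is a minimal illustration that takes the bases that reads report at a single position and asks whether the second most common base makes up a large share of them. The function names and the 0.3 cutoff are assumptions chosen for the example.

```python
from collections import Counter


def minor_allele_fraction(bases):
    """Fraction of reads carrying the second most common base at one position."""
    counts = Counter(bases)
    if len(counts) < 2:
        return 0.0  # every read agrees (or there are no reads)
    second_most_common_count = counts.most_common()[1][1]
    return second_most_common_count / len(bases)


def looks_heterozygous(bases, min_fraction=0.3):
    """Flag a position where roughly half the reads disagree with the rest.

    min_fraction is a hypothetical cutoff; a real analysis would also
    account for sequencing errors and read depth.
    """
    return minor_allele_fraction(bases) >= min_fraction


# Half the reads say A, half say G: consistent with one copy from each parent.
print(looks_heterozygous("AAAAGGGG"))

# A single disagreeing read is more likely a sequencing error.
print(looks_heterozygous("AAAAAAAG"))

# All reads agree: a homozygous position.
print(looks_heterozygous("AAAAAAAA"))
```

A genome in which almost no positions pass a check like this would show the kind of homozygosity described here.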
Known hermaphrodites, banana slugs have been observed mating with themselves, but no one knows how often that happens in the wild.
Green’s results seem to indicate banana slugs may self-mate more often than not, and this specimen may be the result of a highly inbred population.
On the other hand, the homozygosity he found may simply be an artifact of the genome assembly process.
“This may be a strong suggestion about the banana slug biological history,” Green said. “We have seen them mate with others, but I have never before seen a wild-caught diploid organism with so much homozygosity.”
The banana slug genome assembly is well on the way, but the work is not done. The banana slug assembly crew plans to meet weekly throughout the summer to keep the project going.
“SGA is still running as of last week,” Karplus said a few weeks after class ended. “As of now, the DISCOVAR assembly is the best of them.”
Karplus explained that completing the banana slug genome will require both combining the most successful methods and collecting more data that covers longer DNA segments.
Christopher Kan explained that the next step, once the results have come in from the assemblers that seemed to work, is to choose the best assemblies and merge them to create a unified genome.
Once the assembly is in hand, Kan said, “It’s like a book without an index.”
Then the job will be to make it possible for researchers to find their way around the genome by marking the genes and other features.
Once all that is done, Kan hopes class members will publish their findings, “so we can get other researchers interested in the banana slug and its genome.”
A wiki page details the banana slug genome sequencing project efforts.