UCSC TECH REPORT UCSC-CRL-02-30 Published 07/13/2002 09:00 AM
Krishna M. Roskin, Mark Diekhans, W. James Kent, and David Haussler
Funding for this project was provided by NHGRI Grant 1P41HG02371. We thank Simon Whelan, Nick Goldman, Laura Elnitski, Ross Hardison, Webb Miller, Scott Schwartz, Francesca Chiaromonte, Aran Smit, Eric Lander, Bob Waterston and Francis Collins for their input and data.
We construct several score functions for use in locating unusually conserved regions in a genome-wide search of aligned DNA from two species. We test these functions on regions of the human genome aligned to mouse. These score functions are derived from properties of neutrally evolving sites in the mouse and human genomes, and can be adjusted to the local background rate of conservation. The aim of these functions is to identify regions of the human genome that are conserved by evolutionary selection, because they have an important function, rather than by chance. We use them to get a very rough estimate of the amount of DNA in the human genome that is under selection.
4 Context-dependent I-score
5 Including insertions and deletions in the score
6 Further Extensions
7 Tests of the Selected Score Functions
8 Estimating the Fraction of the Human Genome Under Selection
As part of the Mouse Genome Project, groups at several universities have been studying alignments between the draft genomes of human and mouse. A full report will be submitted by the Mouse Genome Sequencing Consortium at a later time. Here we report some preliminary results we obtained using early versions of this data (see footnote 1). We designed several score functions, described below, that could be applied to short aligned regions (tens to thousands of bases) to measure how diverged they were between the two species. Emphasis was on counting the number of observed base substitutions in various ways, although gaps are also considered in some versions of the score functions. We have been especially interested in looking at the distributions of these score functions on regions of aligned DNA that we have reason to believe are not under selection, but rather are evolving neutrally. We looked at two types of “neutral” sites:
(1) 4d-sites: third bases of the 8 four-fold degenerate codon families, i.e., the sites marked “x” in the codons GCx (ALA), CCx (PRO), TCx (SER), ACx (THR), CGx (ARG), GGx (GLY), CTx (LEU), GTx (VAL), where x can be any base without changing the amino acid.
(2) AR-sites: “ancient repeat” sites from retrotransposons or DNA transposons that were inserted in the genome before the human-mouse split and appear in syntenic positions in both species.
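The definition of 4d-sites in item (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `fourfold_sites` is hypothetical, the input is assumed to be an in-frame coding sequence beginning at the first base of a codon, and no check is made that the corresponding mouse codon is also four-fold degenerate.

```python
# Two-base prefixes of the 8 four-fold degenerate codon families listed above:
# GCx, CCx, TCx, ACx, CGx, GGx, CTx, GTx.
FOURFOLD_PREFIXES = {"GC", "CC", "TC", "AC", "CG", "GG", "CT", "GT"}


def fourfold_sites(cds):
    """Return 0-based positions of third codon bases that are 4d-sites.

    Assumption: `cds` is an in-frame coding sequence whose first base is
    the first position of a codon. A trailing partial codon is ignored.
    """
    cds = cds.upper()
    sites = []
    for start in range(0, len(cds) - 2, 3):
        # A codon is four-fold degenerate when its first two bases
        # already determine the amino acid.
        if cds[start:start + 2] in FOURFOLD_PREFIXES:
            sites.append(start + 2)
    return sites
```

For example, in the sequence `ATGGCTAAACCG` (codons ATG, GCT, AAA, CCG), only GCT and CCG belong to the listed families, so the sketch reports positions 5 and 11.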
The properties of these sites will be described more fully in subsequent papers. In particular, we noticed that substitutions at a given site are dependent on the flanking bases, so some of our score functions take this into consideration. We hope to give a more complete treatment of this subject in a future paper as well. Here we use information from our study of neutrally evolving sites to construct some simple score functions for human-mouse aligned regions.
The score functions are:
- normalized divergence (Section 2)
- I-score (Section 3)
- context-dependent I-score (Sections 4 and 6)
- context-dependent I-score with gap penalties (Section 5)
We first define these functions for gap-less aligned regions only, then we discuss ways of extending them to include gap costs. In the initial results below, to apply the score function to a gapped alignment, we just remove the gaps and indels first (see example in Section 5 below).
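A minimal sketch of this gap-removal step is given here, assuming the alignment is held as two equal-length strings with `-` marking a gap. The function name `strip_gaps` and this representation are assumptions made for illustration; the report's own worked example is the one in Section 5.

```python
def strip_gaps(human, mouse, gap="-"):
    """Drop every alignment column in which either sequence has a gap.

    Assumption: `human` and `mouse` are the two rows of a pairwise
    alignment, so they have the same length.
    """
    if len(human) != len(mouse):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where both species have a base.
    kept = [(h, m) for h, m in zip(human, mouse) if h != gap and m != gap]
    human_out = "".join(h for h, _ in kept)
    mouse_out = "".join(m for _, m in kept)
    return human_out, mouse_out
```

Applied to the rows `AC-GT` and `ACTG-`, the sketch drops the third and fifth columns and returns two gap-less rows of three bases each, to which a gap-less score function could then be applied.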
In the final section, we use one of our score functions (the context-dependent I-score) to get a crude estimate of the fraction of the human genome that is under selection. To do this, we scored all non-overlapping 100bp windows with at least 30 aligned bases in the human genome draft, and plotted the empirical distribution of the scores we obtained (see Figure 13). We noticed an extra mass in the region where the scores for more highly conserved windows lie. This extra mass is absent when we plot the distribution of the scores from only the windows from ancient repeats, which are our model for typical scores from neutrally evolving DNA (see the bell-shaped curve in Figure 13 representing the score distribution for windows of neutrally evolving DNA). We suspect this extra mass represents windows containing DNA that is under selection. Indeed, windows containing coding exons and known regulatory elements (kindly provided by Laura Elnitski at Penn State University) do tend to have scores in the range where we see this extra mass in the genome-wide score distribution (Figure 12). We obtain a crude estimate of the size of this extra mass by simply scaling the curve in Figure 13 for the density of the neutral distribution to fit within the overall density for the genome-wide scores, using the value at the origin. The neutral density is symmetric about this value, and the fit to the genome-wide density for all windows is quite good on the side representing scores from highly diverged regions, nearly all of which are likely to be neutral. On the side of more highly conserved regions, this scaling of the neutral density leaves out the extra mass that is likely to represent windows that are under selection. By subtracting the two densities, we find that this “selected” mass represents about 5% of the human genome.
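The scaling-and-subtraction step can be illustrated with a small Python sketch. It rests on several assumptions that are not taken from the report: both score distributions are given as histograms over the same bins, "using the value at the origin" is read as matching the two densities in the bin containing the origin, and the function name `selected_mass` and the numbers in the usage note are purely synthetic.

```python
def selected_mass(genome_counts, neutral_counts, origin_bin):
    """Estimate the excess mass of the genome-wide score density over a
    scaled neutral density.

    Assumptions: both histograms share the same bins, and the neutral
    density is scaled so the two densities agree in `origin_bin`.
    Returns (per-bin excess, total excess).
    """
    if len(genome_counts) != len(neutral_counts):
        raise ValueError("histograms must use the same bins")
    genome_total = sum(genome_counts)
    neutral_total = sum(neutral_counts)
    # Normalize the counts into empirical densities that sum to 1.
    genome = [c / genome_total for c in genome_counts]
    neutral = [c / neutral_total for c in neutral_counts]
    # Scale the neutral density to match the genome-wide density at the origin.
    scale = genome[origin_bin] / neutral[origin_bin]
    excess = [g - scale * n for g, n in zip(genome, neutral)]
    return excess, sum(excess)
```

With the synthetic histograms `genome_counts = [95, 190, 380, 190, 145]` and `neutral_counts = [1, 2, 4, 2, 1]` and `origin_bin = 2`, the scaled neutral density accounts for all but the last bin, and the total excess is 0.05 up to floating-point rounding. These toy counts were constructed to produce that value and are not the report's data.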
It is clear that considerable further work needs to be done to validate and improve this estimate, including more sophisticated analysis of the densities, better and more complete assemblies of the genomes of both species, more exploration of the sensitivity of the method to the choice of windows and score functions, and a better understanding of the properties of neutrally evolving DNA, so that it may be more precisely distinguished from DNA under selection. Extending these methods to alignments of multiple species would sharpen the results considerably as well. We suspect that this will be required to reliably distinguish selected regions from neutral regions on a window-by-window basis, rather than in a “bulk statistical estimate” as we attempt here.
1. The mouse data was taken from the October Phusion assembly and aligned to the UCSC August Golden Path assembly using BLAT. This data can be found at http://genome.ucsc.edu/cgi-bin/hgGateway?db=mm1.
For more, visit the UCSC BSOE Technical Reports site, and download UCSC-CRL-02-30, Score Functions for Assessing Conservation in Locally Aligned Regions of DNA from Two Species