William J. Bruno
Theoretical Biology and Biophysics (T-10), MS K-710, Los Alamos National Laboratory, Los Alamos, NM 87545; billb@LANL.GOV
A great deal of information is in principle contained in the evolutionary history of a gene or DNA regulatory element. Such a history may be viewed as a vast series of point mutation experiments, with successful variants being retained and unsuccessful ones being removed from the ensemble.
A simple quantity to investigate is the frequency of each nucleotide in each position of an alignment, corrected for sample bias caused by the evolutionary interrelationships of the sequences. These corrected frequencies should correspond to the frequencies one would observe in a large set of very distantly related sequences sharing a common function, and they may be interpreted as the "fitness" of a nucleotide in a given position. Previous methods for estimating such a fitness have relied on heuristic "sequence weighting" methods to correct for evolutionary relationships.
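To make the heuristic "sequence weighting" idea concrete, the following is a minimal sketch of one widely used scheme, position-based weighting in the style of Henikoff and Henikoff. It is offered as an illustration of the general class of prior methods, not as any specific method referred to above. The function names and the toy alignment are hypothetical, and the alignment is assumed to be gap-free and restricted to A, C, G, and T.

```python
from collections import Counter

def position_based_weights(alignment):
    """Position-based sequence weights, normalized to sum to 1.

    Each column contributes 1 / (k * n) to a sequence, where k is the number
    of distinct nucleotides in the column and n is the number of sequences
    sharing that sequence's nucleotide. Redundant sequences are down-weighted.
    """
    weights = [0.0] * len(alignment)
    for column in zip(*alignment):
        counts = Counter(column)
        k = len(counts)  # distinct nucleotides observed in this column
        for i, base in enumerate(column):
            weights[i] += 1.0 / (k * counts[base])
    total = sum(weights)
    return [w / total for w in weights]

def weighted_frequencies(alignment, weights, alphabet="ACGT"):
    """Per-column nucleotide frequencies, each sequence counted by its weight."""
    freqs = []
    for column in zip(*alignment):
        col = {b: 0.0 for b in alphabet}
        for base, w in zip(column, weights):
            col[base] += w  # assumes no gaps or ambiguity codes
        freqs.append(col)
    return freqs

# Hypothetical toy alignment: three identical sequences and one divergent one.
alignment = ["ACGT", "ACGT", "ACGT", "TCGA"]
weights = position_based_weights(alignment)
freqs = weighted_frequencies(alignment, weights)
print(weights)
print(freqs[0])
```

In this toy case the three identical sequences share weight, so the divergent sequence counts for more than one quarter of the first column. A heuristic of this kind uses no explicit tree, which is the limitation a likelihood-based treatment is meant to address.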
A more general, likelihood-based approach to estimating corrected nucleotide frequencies is presented. The method employs a modified EM algorithm to generate estimates of the corrected frequencies, as well as an estimate of the phylogenetic tree relating the sequences. Tree topologies are supplied to the program, either by the user or automatically by calls to existing distance-based phylogeny programs, and branch lengths are optimized taking the nucleotide frequencies at each position into account. Although the underlying model of mutation and selection is highly simplified, it captures the discrete nature of the process.
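As an illustration of how position-specific frequencies can enter a tree likelihood, the sketch below computes the likelihood of a single alignment column on a fixed tree using Felsenstein's pruning recursion. The substitution model is an assumption made for the example: an F81-style process in which a replacement nucleotide is drawn from the position's frequencies pi. The abstract does not specify the model, and the EM procedure itself is not reproduced here. The tree encoding, function names, branch lengths, and frequencies are all hypothetical.

```python
import math

BASES = "ACGT"

def transition_prob(i, j, t, pi):
    """P(base j at branch end | base i at branch start) for branch length t.

    Assumed F81-style model: with probability exp(-t) nothing happens,
    otherwise the base is redrawn from the position's frequencies pi.
    """
    stay = math.exp(-t)
    return stay * (1.0 if i == j else 0.0) + (1.0 - stay) * pi[j]

def partial_likelihoods(node, site, pi):
    """Pruning step: {base: P(observed leaves below node | base at node)}.

    A leaf is a sequence name (str); an internal node is a tuple of
    (child, branch_length) pairs.
    """
    if isinstance(node, str):
        return {b: 1.0 if site[node] == b else 0.0 for b in BASES}
    result = {b: 1.0 for b in BASES}
    for child, t in node:
        below = partial_likelihoods(child, site, pi)
        for b in BASES:
            result[b] *= sum(transition_prob(b, c, t, pi) * below[c] for c in BASES)
    return result

def site_likelihood(tree, site, pi):
    """Likelihood of one column, with the root base drawn from pi."""
    root = partial_likelihoods(tree, site, pi)
    return sum(pi[b] * root[b] for b in BASES)

# Hypothetical three-sequence tree: ((seq1, seq2), seq3) with made-up branch lengths.
tree = (((("seq1", 0.1), ("seq2", 0.1)), 0.5), ("seq3", 0.6))
site = {"seq1": "A", "seq2": "A", "seq3": "A"}

pi_conserved = {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05}
pi_uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

lik_conserved = site_likelihood(tree, site, pi_conserved)
lik_uniform = site_likelihood(tree, site, pi_uniform)
print(lik_conserved, lik_uniform)
```

A likelihood of this form depends jointly on the branch lengths and on pi, which is why the two can be optimized together. An EM-style scheme would alternate between inferring the hidden ancestral states and re-estimating pi and the branch lengths; that step is omitted from this sketch.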
The resulting model for the constraints on the sequences can be used to improve their alignment. Furthermore, the amount of covariation between different sites in a sequence, corrected for evolutionary relationships, can be estimated.
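One common way to quantify covariation between two alignment columns is mutual information. The sketch below is a generic illustration, not the likelihood-based estimate described here: it corrects for relatedness only to the extent that the optional per-sequence weights do, and those weights are assumed to sum to 1. The function name and the example columns are hypothetical.

```python
import math
from collections import defaultdict

def weighted_mutual_information(col_x, col_y, weights=None):
    """Mutual information (in bits) between two alignment columns.

    col_x and col_y hold one nucleotide per sequence, in the same order.
    weights, if given, are per-sequence weights assumed to sum to 1;
    otherwise every sequence counts equally.
    """
    n = len(col_x)
    if weights is None:
        weights = [1.0 / n] * n
    joint = defaultdict(float)
    px = defaultdict(float)
    py = defaultdict(float)
    for x, y, w in zip(col_x, col_y, weights):
        joint[(x, y)] += w
        px[x] += w
        py[y] += w
    return sum(
        p * math.log2(p / (px[x] * py[y]))
        for (x, y), p in joint.items()
        if p > 0.0
    )

# Hypothetical columns: the first pair covaries perfectly, the second not at all.
mi_coupled = weighted_mutual_information("AACC", "GGTT")
mi_independent = weighted_mutual_information("AACC", "GTGT")
print(mi_coupled, mi_independent)
```

With closely related sequences, unweighted mutual information tends to overstate covariation, because shared ancestry alone makes columns appear to vary together. That is the motivation for correcting the estimate for evolutionary relationships.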
This work was supported in part by DOE contract W-7405-ENG-36.