Genome Informatics Section 

DOE Human Genome Program Contractor-Grantee Workshop VII 
January 12-16, 1999  Oakland, CA


109. Multiple Sequence Alignment with Confidence Estimates 

David J. States 
Institute for Biomedical Computing, Washington University in St. Louis, St. Louis, Missouri 
states@ibc.wustl.edu 

Multiple sequence alignment (MSA) is the basis for many aspects of molecular sequence analysis, including phylogenetics, motif detection, and molecular modeling. Because the space of possible multiple sequence alignments is very large and the information accessible through sequence data is limited, there are often regions of a multiple sequence alignment that are not well determined. Here we develop a theory for assessing the confidence of multiple sequence alignments, describe software that implements this algorithm, and discuss the application of these methods. 

A hierarchical approach to MSA is used in which each constituent sequence is related to the full alignment as a leaf in a tree of nearest neighbor relationships. The algorithm uses a progressive strategy for building the multiple alignment. Hidden Markov models (HMMs) are used to describe each sequence or collection of sequences. At each phase in the alignment calculation, all current models are compared with each other using dynamic programming to find the maximum-scoring local alignment. A new HMM is derived from the pair of models with the highest alignment score, and this new model replaces both of the previous models. The iteration is repeated until only a single HMM remains. A site-specific confidence estimate, C, for pairwise alignments is calculated by comparing the likelihood for the optimal alignment passing through a pair of residues with the sum of the likelihoods for all alternative pairings of either the query or target residue. 
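The progressive strategy described above can be sketched as a greedy merge loop. This is a minimal sketch, not the author's software: the name progressive_merge, the score and merge callbacks, and the toy similarity table are all assumptions made for illustration. In the method described, the models would be HMMs, score would be the maximum-scoring local alignment between two models, and merge would derive a new HMM from the chosen pair.

```python
from itertools import combinations

def progressive_merge(models, score, merge):
    """Merge the best-scoring pair of models until one remains.

    `score` and `merge` are hypothetical callbacks standing in for the
    model-to-model alignment score and the derivation of a new model.
    Returns the final model and the list of merged pairs, in order.
    """
    models = list(models)
    history = []
    while len(models) > 1:
        # Compare all current models with each other; keep the top pair.
        a, b = max(
            combinations(range(len(models)), 2),
            key=lambda p: score(models[p[0]], models[p[1]]),
        )
        history.append((models[a], models[b]))
        merged = merge(models[a], models[b])
        # The new model replaces both of the previous models.
        models = [m for k, m in enumerate(models) if k not in (a, b)]
        models.append(merged)
    return models[0], history

# Toy stand-ins (assumed data): each "model" is a set of sequence names.
SIMILARITY = {("a", "b"): 9, ("a", "c"): 3, ("a", "d"): 1,
              ("b", "c"): 4, ("b", "d"): 2, ("c", "d"): 8}

def toy_score(x, y):
    # Average pairwise similarity between the members of two models.
    pairs = [tuple(sorted((s, t))) for s in x for t in y]
    return sum(SIMILARITY[p] for p in pairs) / len(pairs)

def toy_merge(x, y):
    # Stand-in for deriving a new model from a pair of models.
    return x | y

leaves = [frozenset(n) for n in "abcd"]
final, history = progressive_merge(leaves, toy_score, toy_merge)
```

With this toy table, a and b are merged first, then c and d, and finally the two groups, so history traces the tree of nearest neighbor relationships in which each input is a leaf.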

 
 
Here the score for a pair of residues i and j is the optimal score for an alignment passing through that pair, calculated using a forward and backward dynamic programming algorithm [Vingron and Argos, Bishop and Thompson]. Note that the alternatives include the possibility that the site is deleted or inserted as well as being a matched pair of residues. C has the form of a probability and is bounded by 0 < C < 1. The overall confidence for a site in the multiple sequence alignment is calculated as the product of the confidences in all of the pairwise alignments making up the full MSA. 
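One possible reading of the confidence calculation is sketched below. The exact expression for C is not shown in this text, so the normalization used here (the likelihood of the chosen pairing divided by the sum over every pairing of either residue, with gaps included) is an assumption, as are the function names and the toy likelihood values. The product over pairwise confidences follows the description directly.

```python
import math

def pairwise_confidence(like, row_gap, col_gap, i, j):
    """Confidence that query residue i pairs with target residue j.

    like[i][j] : likelihood of the best alignment through the pair (i, j)
    row_gap[i] : likelihood of the best alignment with i left unpaired
    col_gap[j] : likelihood of the best alignment with j left unpaired
    ASSUMPTION: the denominator sums every pairing of i or j exactly once.
    """
    alternatives = (
        sum(like[i])                    # i paired with any target residue
        + sum(row[j] for row in like)   # j paired with any query residue
        - like[i][j]                    # (i, j) was counted twice above
        + row_gap[i] + col_gap[j]       # i or j inserted or deleted
    )
    return like[i][j] / alternatives

def site_confidence(pairwise_values):
    # Product of the confidences of the pairwise alignments for one site.
    return math.prod(pairwise_values)

# Toy likelihoods (assumed values) for a 2 x 2 pairwise comparison.
like = [[8.0, 1.0],
        [1.0, 6.0]]
row_gap = [0.5, 0.5]
col_gap = [0.5, 0.5]

c00 = pairwise_confidence(like, row_gap, col_gap, 0, 0)
overall = site_confidence([0.9, 0.8, 0.5])
```

Because every pairwise confidence lies strictly between 0 and 1, the product can only shrink as more pairwise alignments contribute to a site, so a single poorly determined pairwise step lowers the confidence of the whole column.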
 

The algorithm provides an efficient way to build HMMs for large families of unaligned sequences. A web site providing access to this tool is available at http://www.ibc.wustl.edu/service/msa. 


 