Genome Informatics Section
DOE Human Genome Program Contractor-Grantee Workshop
119. Ribosomal RNA Alignment Using Stochastic Context Free Grammars
Michael P.S. Brown
I present a method for aligning ribosomal RNA using Stochastic Context-Free Grammars (SCFG's), a well-principled probabilistic method that models pairwise interactions in a computationally efficient manner. I show that this method performs better than several other alignment methods. It has applications in areas such as phylogenetic tree reconstruction. A webserver is located at http://www.cse.ucsc.edu/research/compbio/ssurrna.html.
SCFG's have been used previously for modeling structures such as tRNA (Sakakibara94, Eddy+Durbin94) and have been demonstrated to have the highest specificity of any method (Lowe97). This performance comes from the pairwise modeling ability of SCFG's as well as their probabilistic foundations, which allow specific estimation of parameters such as gap and mutation costs. Unfortunately, SCFG's have a relatively high computational cost, O(n^3), where n is the length of the sequence. Previous work has reduced this cost by preprocessing databases with a fast approximate method and presenting only likely strings to the SCFG for further processing (Lowe97). I extend this idea in a new direction using Hidden Markov Models (HMM's).
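To make the cubic cost concrete, here is a minimal sketch of a Nussinov-style dynamic program, which can be viewed as parsing with a very small RNA grammar. It is an illustration only and not the grammar or scoring used in this work: the function name `max_pairs`, the `min_loop` parameter, and the count-the-pairs score are all assumptions chosen for brevity. The two outer loops visit every subsequence (i, j) and the inner loop tries every split point k, which is where the O(n^3) time comes from.

```python
def max_pairs(seq, min_loop=3):
    """Toy Nussinov-style DP: maximum number of nested complementary pairs.

    Illustrative stand-in for SCFG parsing; a real SCFG would sum or
    maximize probabilities over grammar productions instead of counting pairs.
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = best score for the subsequence seq[i..j], inclusive
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):        # O(n) subsequence lengths
        for i in range(n - span):              # O(n) start positions
            j = i + span
            score = max(best[i + 1][j],        # position i left unpaired
                        best[i][j - 1])        # position j left unpaired
            if (seq[i], seq[j]) in pairs:      # i pairs with j
                score = max(score, best[i + 1][j - 1] + 1)
            for k in range(i + 1, j):          # O(n) bifurcation points
                score = max(score, best[i][k] + best[k + 1][j])
            best[i][j] = score
    return best[0][n - 1]


if __name__ == "__main__":
    print(max_pairs("GGGAAAUCC"))  # 3
```

For a small subunit rRNA of well over a thousand bases, the triple loop already amounts to billions of inner steps before any grammar-specific work is counted, which is why restricting the set of cells that must be filled matters.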
HMM's are used not only to preprocess the database but also to constrain the SCFG computation in a principled way using posterior decodings. These constraints allow large molecules such as rRNA to be analyzed with the full power of complex SCFG models in a reasonable amount of time. I analyze several methods for RNA structure prediction and show that SCFG's have the highest specificity and generalization capabilities, using the Ribosomal Database Project alignment of small subunit rRNA as a gauge (Maidak97).
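The general idea of a posterior decoding can be sketched as follows. This is a generic forward-backward computation for a small discrete HMM, not the model used in this work: the function names `posterior_decode` and `allowed_states`, the threshold value, and the way a downstream parser would consume the result are all assumptions made for illustration. For each sequence position the code computes the posterior probability of each HMM state and keeps only the states above a threshold; a constrained SCFG parser could then skip any dynamic-programming cell that is inconsistent with those sets.

```python
def posterior_decode(obs, init, trans, emit):
    """Return post[t][s] = P(state s at position t | obs) via forward-backward.

    Unscaled probabilities are used for clarity, so this sketch is only
    suitable for short sequences; long ones need scaling or log space.
    """
    n, k = len(obs), len(init)
    if n == 0:
        return []
    fwd = [[0.0] * k for _ in range(n)]
    bwd = [[1.0] * k for _ in range(n)]
    for s in range(k):
        fwd[0][s] = init[s] * emit[s][obs[0]]
    for t in range(1, n):
        for s in range(k):
            fwd[t][s] = emit[s][obs[t]] * sum(
                fwd[t - 1][r] * trans[r][s] for r in range(k))
    for t in range(n - 2, -1, -1):
        for s in range(k):
            bwd[t][s] = sum(
                trans[s][r] * emit[r][obs[t + 1]] * bwd[t + 1][r]
                for r in range(k))
    total = sum(fwd[n - 1])  # likelihood of the whole observation sequence
    return [[fwd[t][s] * bwd[t][s] / total for s in range(k)]
            for t in range(n)]


def allowed_states(post, threshold=0.05):
    """For each position, keep the states whose posterior meets the threshold."""
    return [{s for s, p in enumerate(row) if p >= threshold} for row in post]


if __name__ == "__main__":
    # Hypothetical two-state, two-symbol HMM used only to exercise the code.
    init = [0.5, 0.5]
    trans = [[0.9, 0.1], [0.1, 0.9]]
    emit = [[0.9, 0.1], [0.1, 0.9]]
    post = posterior_decode([0, 0, 1, 1], init, trans, emit)
    print(allowed_states(post, threshold=0.5))
```

The threshold trades speed against risk: a higher value leaves fewer cells for the expensive parser to fill, but may exclude the correct alignment when the HMM is uncertain.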
Alignment of ribosomal RNA is important for several reasons. Historically, rRNA was used by Carl Woese to relate all organisms and reconstruct the tree of life (Woese77). Recently, Norman Pace pointed to an opportunity for an environmental genome survey in which rRNA is gathered from the environment to provide a sequence-based snapshot of microbial biodiversity (Pace97).
In order to relate organisms based on their biosequence identity, a multiple sequence alignment is necessary. Indeed, alignment is a very important process in correct phylogenetic tree reconstruction (Morrison97). Current methods of computing this alignment involve a combination of computer alignment with human fine tuning (O'Brien98). This leads to a computational bottleneck as evidenced by the large number of unaligned rRNA sequences in the Ribosomal Database Project. Full analysis of widescale environmental biodiversity projects will exacerbate this problem.
Stochastic Context-Free Grammars are an automatic method of determining RNA alignment using a well-principled probabilistic model that accounts for pairwise interactions in a computationally efficient manner. SCFG's perform better than the other methods examined and have several important application areas, including phylogenetic tree reconstruction.
(Sakakibara94) Y. Sakakibara et al. Nucleic Acids Research 22:5112-5120 (1994).