Gary D. Stormo
Department of Molecular, Cellular and Developmental Biology, University of Colorado, Boulder, CO 80309
We have developed an approach for predicting coding regions in genomic DNA that draws on multiple types of evidence, combines them into a single scoring function, and then returns both the optimal solution and ranked suboptimal solutions under that scoring function.[1] By including separate scoring functions for loci with different G+C content, the method improves prediction overall, and especially for low G+C genes, which are usually difficult. The use of similarity matches, in the form of BLAST "hits," considerably increases the reliability of the predictions. Alternative splicing pathways often show up in the suboptimal plots. The approach is shown to be robust to substitution errors in the sequence, but highly susceptible to frame-shift errors. It can easily be extended to other problems in which a sequence is to be partitioned into domains belonging to a set of possible functional classes. It can also be modified so that the probability of the correct parsing is maximized over a training set of examples.[2] This modification allows a stochastic grammar to be used to describe the class of possible parses, and allows the weights given to the various types of evidence to be adjusted to obtain the highest reliability.
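The idea of partitioning a sequence into domains under a single scoring function can be illustrated with a small dynamic-programming sketch. This is not the published method: it assumes that the combined evidence has already been reduced to one score per position and per class, and it charges a fixed, hypothetical `switch_penalty` for each class boundary. The function name `best_parse`, the class labels, and the input layout are all assumptions made for illustration, and only the single optimal parse is returned, not the ranked suboptimal ones.

```python
def best_parse(scores, switch_penalty=1.0):
    """Return (total_score, segments) for the highest-scoring labeling.

    scores: list with one dict per sequence position, mapping a class
            name to that position's combined evidence score (assumed input).
    switch_penalty: hypothetical cost charged at each class boundary.
    segments: list of (start, end, class) with end exclusive.
    """
    classes = list(scores[0])
    best = {c: scores[0][c] for c in classes}  # best score ending in class c
    back = []                                  # back-pointers per position

    for pos_scores in scores[1:]:
        new_best, pointers = {}, {}
        for c in classes:
            # Score of arriving in class c from each possible previous class.
            def arrive(p, c=c):
                return best[p] - (switch_penalty if p != c else 0.0)
            prev = max(classes, key=arrive)
            new_best[c] = arrive(prev) + pos_scores[c]
            pointers[c] = prev
        back.append(pointers)
        best = new_best

    # Trace back from the best final class to recover the labeling.
    last = max(classes, key=lambda c: best[c])
    total = best[last]
    labels = [last]
    for pointers in reversed(back):
        last = pointers[last]
        labels.append(last)
    labels.reverse()

    # Collapse runs of identical labels into segments.
    segments, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((start, i, labels[start]))
            start = i
    return total, segments


if __name__ == "__main__":
    toy = [
        {"exon": 2, "intron": 0},
        {"exon": 2, "intron": 0},
        {"exon": 0, "intron": 2},
        {"exon": 0, "intron": 2},
        {"exon": 2, "intron": 0},
    ]
    print(best_parse(toy, switch_penalty=1.0))
```

In this toy setting the penalty plays the role of a prior against over-fragmented parses; keeping the k best scores per cell instead of one would be one way to obtain ranked suboptimal parses.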
We are now exploring methods for reliably predicting other classes of sequence regions, especially promoters. These include approaches based on minimal length encoding algorithms and on Markov chains for the various classes. Some new advances have been made and will be described.
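As a rough illustration of class-specific Markov chains, the sketch below trains a first-order chain for each of two classes and scores a query sequence by the log-odds between them. The abstract does not specify the chain order, the training data, or the scoring rule, so the first-order assumption, the pseudocount, the function names, and the toy training sequences are all hypothetical.

```python
import math

ALPHABET = "ACGT"


def train_markov(seqs, pseudocount=1.0):
    """Estimate first-order transition log-probabilities from example sequences."""
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for s in seqs:
        for a, b in zip(s, s[1:]):
            counts[a][b] += 1
    model = {}
    for a in ALPHABET:
        row_total = sum(counts[a].values())
        model[a] = {b: math.log(counts[a][b] / row_total) for b in ALPHABET}
    return model


def log_likelihood(seq, model):
    """Sum of transition log-probabilities (the first base is not scored)."""
    return sum(model[a][b] for a, b in zip(seq, seq[1:]))


def log_odds(seq, model_a, model_b):
    """Positive values favor model_a, negative values favor model_b."""
    return log_likelihood(seq, model_a) - log_likelihood(seq, model_b)


if __name__ == "__main__":
    # Toy training sets, invented for illustration only.
    promoter_like = train_markov(["TATATATA", "TATAAATA"])
    background = train_markov(["GCGCGCGC", "GGCCGGCC"])
    print(log_odds("TATATA", promoter_like, background))
    print(log_odds("GCGCGC", promoter_like, background))
```

A positive log-odds value indicates that the query looks more like the first class than the second; real use would require far larger training sets and probably higher-order chains.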
* Supported by a grant from the U.S. Department of Energy under contract ER61606.
[1] E.E. Snyder and G.D. Stormo, J. Mol. Biol. 248, 1-18 (1995).
[2] G.D. Stormo and D. Haussler, Proceedings of the Second International Conference on Intelligent Systems in Molecular Biology, pp. 369-375 (1994).