David Kulp and David Haussler, Martin Reese** and Frank Eeckman**
Baskin Center for Computer Engineering and Information Sciences, University of California, Santa Cruz
We present a statistical model of genes in DNA. A Generalized Hidden Markov Model (GHMM) provides the framework for describing the grammar of a legal parse of a DNA sequence. Probabilities are assigned to transitions between states in the GHMM and to the generation of each nucleotide base given a particular state. Machine learning techniques are applied to optimize these probabilities using a standardized training set. Given a new candidate sequence, the best parse is deduced from the model by using a dynamic programming algorithm to identify the path through the model with maximum probability.
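The dynamic programming step can be illustrated with a minimal sketch of a Viterbi-style search over variable-length segments. This is not the authors' implementation: the two states ("N" for noncoding, "C" for coding), the per-base frequency tables, the forbidden self-transitions, and the names `ghmm_viterbi`, `log_sensor`, and `max_len` are all assumptions made up for this example.

```python
import math

NEG_INF = float("-inf")

def ghmm_viterbi(seq, states, log_init, log_trans, log_sensor, max_len):
    """Return (best_log_prob, parse); parse is a list of (state, start, end)."""
    n = len(seq)
    # best[j][s]: best log probability of a parse of seq[:j] whose final
    # segment is generated by state s; back[j][s] remembers how we got there.
    best = [{s: NEG_INF for s in states} for _ in range(n + 1)]
    back = [{s: None for s in states} for _ in range(n + 1)]
    for j in range(1, n + 1):
        for s in states:
            # Try every allowed start i of a final segment seq[i:j] in state s.
            for i in range(max(0, j - max_len), j):
                emit = log_sensor(s, seq[i:j])
                if i == 0:
                    cand = log_init[s] + emit
                    if cand > best[j][s]:
                        best[j][s], back[j][s] = cand, (0, None)
                    continue
                for r in states:
                    cand = best[i][r] + log_trans[r][s] + emit
                    if cand > best[j][s]:
                        best[j][s], back[j][s] = cand, (i, r)
    # Trace back from the best final state to recover the segmentation.
    s = max(states, key=lambda t: best[n][t])
    score, parse, j = best[n][s], [], n
    while j > 0:
        i, r = back[j][s]
        parse.append((s, i, j))
        j, s = i, r
    parse.reverse()
    return score, parse

# Toy content sensors (hypothetical numbers): "N" favors A/T, "C" favors G/C.
BASE_PROBS = {
    "N": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "C": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}

def log_sensor(state, segment):
    """Log likelihood of a whole segment under the state's toy content sensor."""
    return sum(math.log(BASE_PROBS[state][base]) for base in segment)

states = ["N", "C"]
log_init = {"N": math.log(0.5), "C": math.log(0.5)}
# Self-transitions are forbidden so that consecutive segments alternate states.
log_trans = {"N": {"N": NEG_INF, "C": 0.0}, "C": {"N": 0.0, "C": NEG_INF}}

score, parse = ghmm_viterbi("AATTGGCCAATT", states, log_init, log_trans,
                            log_sensor, max_len=6)
print(parse)
```

In this toy setting the sensor scores a whole segment at once, which is the feature that distinguishes a generalized HMM from a standard one; a real system would substitute trained sensors and a richer state graph.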
For each state, the likelihood of a string of nucleotide bases is determined by consulting a sensor, which returns the probability of any subsequence of bases in that particular state. A content sensor returns the likelihood of a variable-length subsequence, e.g., a protein-coding region scored by its coding potential. A signal sensor returns the posterior probability of a fixed-length subsequence such as a splice site. The interpretation of such a posterior probability requires careful attention to the local null model implicit in each signal sensor, but we show how signal sensors can be properly transformed and integrated with content sensors.
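One generic way to put a posterior-valued signal sensor on the same footing as a likelihood-valued content sensor is to divide out the prior by Bayes' rule, which yields a likelihood ratio against the sensor's null model. The sketch below shows only that textbook identity; it is an assumption for illustration and not necessarily the transformation used by the authors. The function name and the example prior are hypothetical.

```python
import math

def posterior_to_log_likelihood_ratio(posterior, prior):
    """Convert P(signal | x) into log[ P(x | signal) / P(x | null) ].

    By Bayes' rule, posterior odds = likelihood ratio * prior odds, so the
    log likelihood ratio is the log posterior odds minus the log prior odds.
    `prior` is the fraction of true signals the sensor implicitly assumed
    (for example, the fraction of positive examples in its training set).
    """
    if not (0.0 < posterior < 1.0 and 0.0 < prior < 1.0):
        raise ValueError("posterior and prior must lie strictly between 0 and 1")
    log_posterior_odds = math.log(posterior / (1.0 - posterior))
    log_prior_odds = math.log(prior / (1.0 - prior))
    return log_posterior_odds - log_prior_odds

# A posterior of 0.5 is strong evidence if true signals were rare (prior 0.01),
# but no evidence at all if the sensor assumed balanced classes (prior 0.5).
rare = posterior_to_log_likelihood_ratio(0.5, 0.01)
balanced = posterior_to_log_likelihood_ratio(0.5, 0.5)
print(rare, balanced)
```

The same posterior value can therefore carry very different weight depending on the null model, which is why the prior has to be made explicit before the score is combined with content-sensor likelihoods.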
We describe an implementation of such a gene-finding model and report its results. The exon sensor is a codon frequency model conditioned on windowed nucleotide frequency and the preceding codon. The promoter signal sensor is a time-delay neural network. Two neural networks, as described by Brunak et al., are used for splice site prediction. We show that this simple model performs quite well. On a cross-validated standard test set of 305 genes [ftp://www-hgc.lbl.gov/pub/genesets] in eukaryotic DNA, our gene-finding system correctly identified 77% of protein-coding bases with a specificity of 71%. It identified 48% of exons exactly, with a specificity of 49%.
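The base-level figures above can be read with the following sketch, which assumes the convention common in gene-finding evaluations: sensitivity is the fraction of true coding bases that were predicted, and specificity is the fraction of predicted coding bases that are truly coding. The abstract does not spell out its definitions, so this reading is an assumption, and the positions below are made-up toy data, not the 305-gene test set.

```python
def sensitivity_specificity(true_coding, predicted_coding):
    """Base-level accuracy from two sets of coding positions.

    sensitivity = TP / (TP + FN): share of true coding bases recovered.
    specificity = TP / (TP + FP): share of predicted coding bases that are
    correct (the gene-finding usage assumed here, not TN / (TN + FP)).
    """
    tp = len(true_coding & predicted_coding)
    fn = len(true_coding - predicted_coding)
    fp = len(predicted_coding - true_coding)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, specificity

# Hypothetical annotation: two true exons covering 20 bases in total.
true_coding = set(range(10, 20)) | set(range(40, 50))
# Hypothetical prediction: first exon shifted by 2 bases, second one truncated.
predicted_coding = set(range(12, 22)) | set(range(40, 46))

sn, sp = sensitivity_specificity(true_coding, predicted_coding)
print(f"sensitivity={sn:.3f} specificity={sp:.3f}")
```

Note that neither toy exon is predicted exactly even though most bases are correct, which is why exact-exon accuracy is typically much lower than base-level accuracy.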
The GHMM is flexible and modular, so new sensors and additional states can be inserted easily. In addition, it provides simple solutions for integrating cardinality constraints, reading frame constraints, "indels", multiple gene recognition, and homology searching.
* Supported by a grant from the Office of Health and Environmental Research of the U.S. Department of Energy under contract DE-FG03-95ER62112
** Human Genome Center, Lawrence Berkeley National Laboratory, Berkeley, California
 D. Haussler. Generalized Hidden Markov Models for DNA Parsing. Workshop on Gene-Finding and Gene Structure Prediction, University of Pennsylvania, Philadelphia, October, 1995.
 M. Reese. Novel Neural Network Prediction Systems for Human Promoters and Splice Sites. Workshop on Gene-Finding and Gene Structure Prediction, University of Pennsylvania, Philadelphia, October, 1995.
 S. Brunak, J. Engelbrecht, and S. Knudsen. J. Mol. Biol., 220, 49-65, 1991.