The electronic form of this document may be cited in the following style:
Human Genome Program, U.S. Department of Energy, DOE Human Genome Program Contractor-Grantee Workshop IV, 1994.
Abstracts scanned from text submitted for November 1994 DOE Human Genome Program Contractor-Grantee Workshop. Inaccuracies have not been corrected.
Prediction of Coding Regions in Genomic DNA: Optimal and Suboptimal Parses
Eric E. Snyder and Gary D. Stormo
Department of Molecular, Cellular and Developmental Biology
University of Colorado, Boulder, CO 80309
We have developed an approach for predicting coding regions in genomic DNA that uses multiple types of evidence, combines them into a single scoring function, and then returns both the optimal solution and ranked suboptimal solutions under that scoring function. The current version of the program predicts four classes of sequence: introns and three types of exons (first, last, and internal). It uses a variety of statistical tests for these classes, including tests for the signals that define their ends and for biases in their contained sequences. A neural network is used to weight the different types of statistical tests to optimize performance, which we find to be as good as or better than that of other published methods when tested on new examples. However, we find that one of the most important features of this system is the ability to examine multiple solutions, which is provided by the dynamic programming approach. These multiple, ranked solutions often indicate which portions of the predictions are most reliable, and in cases where the highest-scoring prediction is not correct, the correct prediction can often be found in a high-ranking suboptimal solution. Furthermore, alternative splicing patterns can often be found among the high-ranking suboptimal solutions. We have tested the robustness of the method when there are sequencing errors in the data and have shown that the system can be trained to optimize performance for data with specified error rates. We are now exploring methods for reliably predicting other classes of sequence regions, especially promoters. These include approaches based on minimal length encoding algorithms and on Sequence Landscape methods. Recent results from these approaches will be described.
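As a rough illustration of how a dynamic programming approach can return ranked suboptimal parses alongside the optimal one, the sketch below keeps the k best partial parses for each label at each position instead of only the single best. It is not the authors' program: the two labels ("E" for exon, "I" for intron), the per-position scores, the flat switch penalty, and the function name k_best_parses are all hypothetical stand-ins for the combined scoring function described above, and no neural network weighting is modeled.

```python
import heapq


def k_best_parses(scores, switch_penalty, k):
    """Return up to k highest-scoring label strings, best first.

    scores: list of dicts, one per position, mapping label -> score (toy values).
    switch_penalty: amount subtracted whenever adjacent labels differ (assumed).
    """
    labels = sorted(scores[0])
    # table[i][label] holds up to k entries (score, prev_label, prev_rank), best first.
    table = [{lab: [(scores[0][lab], None, None)] for lab in labels}]
    for i in range(1, len(scores)):
        column = {}
        for lab in labels:
            candidates = []
            for prev in labels:
                penalty = 0 if prev == lab else switch_penalty
                for rank, (prev_score, _, _) in enumerate(table[i - 1][prev]):
                    total = prev_score - penalty + scores[i][lab]
                    candidates.append((total, prev, rank))
            column[lab] = heapq.nlargest(k, candidates, key=lambda c: c[0])
        table.append(column)

    # Pool the final entries of every label and keep the k best overall.
    finals = []
    for lab in labels:
        for rank, entry in enumerate(table[-1][lab]):
            finals.append((entry[0], lab, rank))
    finals = heapq.nlargest(k, finals, key=lambda f: f[0])

    # Follow the back-pointers to recover each ranked parse.
    results = []
    for total, lab, rank in finals:
        path = []
        i = len(scores) - 1
        while lab is not None:
            path.append(lab)
            _, prev_lab, prev_rank = table[i][lab][rank]
            lab, rank = prev_lab, prev_rank
            i -= 1
        results.append((total, "".join(reversed(path))))
    return results


# Toy scores for four positions (made-up numbers, not real sequence statistics).
toy_scores = [
    {"E": 5, "I": 0},
    {"E": 3, "I": 1},
    {"E": 0, "I": 4},
    {"E": 1, "I": 3},
]
for total, parse in k_best_parses(toy_scores, switch_penalty=3, k=3):
    print(total, parse)
# prints:
# 12 EEII
# 10 EIII
# 9 EEEE
```

In this toy run, the first position is labeled "E" in all three ranked parses while the later positions vary, which is the kind of agreement among ranked solutions that could be read as a hint about which parts of a prediction are more stable.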