Functional Genomics Section
DOE Human Genome Program Contractor-Grantee Workshop
148. Prediction of Protein Structural Domains
Robert Miller (1), Winston A. Hide (1), and David C. Torney (2)
The advent of functional genomics has raised new challenges for sequence analysis. In particular, there is a premium on making good use of small collections of example sequences of known function to classify new sequences and predict their functions. Established classification techniques have thus far not performed as well as needed, even when data are relatively abundant, as in exon prediction. We have therefore developed new example-based Bayesian statistical techniques for classification. These approaches can use conserved sequence motifs when they are present, but such overt similarities are not required, because our techniques capture and employ all the statistical properties exhibited by a collection of example sequences. Thus, the likelihood that any sequence is a member of a given functional class is derived from examples of that class. Because many classes of structurally or functionally related biological sequences have only a relatively small number of examples, the prior specification of "what the statistical properties of a class might comprise" is critical. Our techniques include judicious choices for this prior, informed by insights about the statistical and physical properties of the sequences. One promising application of our techniques is the development of automatic clustering methods for use with a class of sequences. Such clustering will enable the discovery of heterogeneity within a class, improving the ability to predict class membership and allowing new classes to be derived.
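The role of the prior when examples are scarce can be illustrated with a deliberately simplified sketch. It is not the authors' method: it assumes residues are independent, uses a symmetric Dirichlet prior (the hypothetical `pseudocount` parameter), and the function names and toy sequences are invented for illustration. The techniques described above capture richer statistical properties than single-residue frequencies.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def fit_class(examples, pseudocount=1.0):
    """Posterior-mean residue frequencies under a symmetric Dirichlet prior.

    The pseudocount is the assumed prior: with few examples it keeps
    unseen residues from receiving zero probability.
    """
    counts = Counter()
    for seq in examples:
        counts.update(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}

def log_likelihood(seq, freqs):
    """Log-likelihood of a sequence, treating residues as independent."""
    return sum(math.log(freqs[ch]) for ch in seq if ch in AMINO_ACIDS)

def classify(seq, models, class_priors=None):
    """Posterior probability of each class for one sequence (Bayes' rule)."""
    names = list(models)
    if class_priors is None:
        # Assumption: every class is equally likely a priori.
        class_priors = {n: 1.0 / len(names) for n in names}
    log_post = {
        n: math.log(class_priors[n]) + log_likelihood(seq, models[n])
        for n in names
    }
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_post.values())
    weights = {n: math.exp(v - top) for n, v in log_post.items()}
    norm = sum(weights.values())
    return {n: w / norm for n, w in weights.items()}

# Toy example sequences, invented purely for illustration.
models = {
    "helix": fit_class(["AEELLKKA", "ALEEAKKL"]),
    "sheet": fit_class(["VIVTVYVI", "TVYIVVTI"]),
}
posterior = classify("AELKA", models)
```

Raising `pseudocount` pulls every class toward the uniform distribution, which is one simple way to see why the choice of prior matters most when a class has few examples.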
To establish and refine our techniques, and to provide the basis for predicting structural and functional aspects of new protein sequences, we created datasets of sequence-dissimilar examples of known secondary structures by applying DSSP to Brookhaven PDB files. We obtained 64,775 residues of alpha-helix, 47,304 residues of beta-sheet, and 45,549 residues of coil, exhibiting recognized structural features such as helix capping mechanisms. Our techniques classify regions of novel protein sequences into these three categories. We will report the details of the implementation and performance, making comparisons with established approaches. Data may be submitted for analysis by our methods via the World Wide Web (http://www.sanbi.ac.za/karoo). Supported by the U.S. D.O.E. Office of Biological and Environmental Research under contract W-7405-ENG-36.
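As a rough illustration of the dataset construction described above, the sketch below reduces a per-residue DSSP assignment string to three states and splits a sequence into labelled segments. The mapping (H, G, I to helix; E, B to sheet; everything else to coil) is one commonly used reduction and is an assumption here: the abstract does not state which DSSP codes were pooled. The function names and the toy input are hypothetical.

```python
from itertools import groupby

# Assumed three-state reduction of DSSP codes; not stated in the abstract.
HELIX_CODES = set("HGI")   # alpha-helix, 3-10 helix, pi-helix
SHEET_CODES = set("EB")    # extended strand, isolated beta-bridge

def reduce_dssp(assignment):
    """Map a DSSP assignment string to H (helix), E (sheet), or C (coil)."""
    reduced = []
    for code in assignment:
        if code in HELIX_CODES:
            reduced.append("H")
        elif code in SHEET_CODES:
            reduced.append("E")
        else:
            # Turns, bends, and unassigned residues are treated as coil.
            reduced.append("C")
    return "".join(reduced)

def segments(sequence, assignment):
    """Split a sequence into (state, residues) runs of constant state."""
    if len(sequence) != len(assignment):
        raise ValueError("sequence and assignment must have equal length")
    paired = zip(reduce_dssp(assignment), sequence)
    return [
        (state, "".join(residue for _, residue in run))
        for state, run in groupby(paired, key=lambda pair: pair[0])
    ]

# Toy input, invented for illustration only.
toy_runs = segments("ACDEFGHIKLM", "HHGG-TEEB S")
```

Segments collected this way for each state could then serve as the example sequences for a per-class model.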