Functional Genomics Section 

DOE Human Genome Program Contractor-Grantee Workshop VII 
January 12-16, 1999, Oakland, CA


148. Prediction of Protein Structural Domains 

Robert Miller (1), Winston A. Hide (1), and David C. Torney (2)
(1) South African National Bioinformatics Institute, University of the Western Cape, and (2) Joint Genome Institute, Los Alamos National Laboratory, Los Alamos, New Mexico
dct@lanl.gov 

New challenges of sequence analysis have arisen with the advent of functional genomics. In particular, there is a premium on making good use of small collections of example sequences of known function to classify new sequences and predict their functions. Established classification techniques have thus far not performed as well as needed, even with relatively abundant data, as in the case of exon prediction. We have therefore developed new example-based Bayesian statistical techniques for classification. These approaches can use conserved sequence motifs when they are present, but such overt similarities are not required, because our techniques capture and employ all the statistical properties exhibited by a collection of example sequences. Thus, the likelihood that any sequence belongs to a given functional class is derived from examples of that class. Because many classes of structurally or functionally related biological sequences have only a relatively small number of examples, the prior specification of "what the statistical properties of a class might comprise" is critical. Our techniques include judicious choices for this prior, using insights about the statistical and physical properties of the sequences. One promising application of our techniques is the development of automatic clustering methods for use with a class of sequences. Such clustering will enable the discovery of heterogeneity within a class, improving the prediction of class membership and allowing new classes to be derived.
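The general idea of example-based Bayesian classification with an explicit prior can be illustrated with a minimal sketch. This is not the authors' method, which the abstract does not specify; it assumes the simplest possible model, in which each class is described only by its amino-acid composition, with a symmetric Dirichlet prior (the `pseudocount` parameter) and equal prior probability for every class. The function names and the toy sequences are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def log_predictive(query, examples, pseudocount=1.0):
    """Log posterior-predictive probability of `query` given class examples.

    Assumes a composition-only model: residues are exchangeable draws from
    an unknown amino-acid distribution with a symmetric Dirichlet prior.
    Residues outside the 20 standard amino acids are ignored.
    """
    counts = Counter()
    for seq in examples:
        counts.update(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)

    logp = 0.0
    for ch in query:
        if ch not in AMINO_ACIDS:
            continue
        # Predictive probability of this residue given everything seen so far.
        logp += math.log((counts[ch] + pseudocount) / total)
        # Condition on the residue just scored before scoring the next one.
        counts[ch] += 1
        total += 1
    return logp


def classify(query, classes, pseudocount=1.0):
    """Posterior probability of each class, assuming equal class priors."""
    scores = {
        name: log_predictive(query, examples, pseudocount)
        for name, examples in classes.items()
    }
    # Subtract the maximum before exponentiating for numerical stability.
    best = max(scores.values())
    weights = {name: math.exp(s - best) for name, s in scores.items()}
    norm = sum(weights.values())
    return {name: w / norm for name, w in weights.items()}


if __name__ == "__main__":
    # Toy, made-up example sequences; real classes would use curated data.
    toy_classes = {
        "class_a": ["AELLKKLAEE", "EELLRKAAEL"],
        "class_b": ["VTVYVGSTVT", "TVKVYVNGTV"],
    }
    for name, prob in classify("AEELLKKA", toy_classes).items():
        print(f"{name}: {prob:.3f}")
```

With few examples per class, the pseudocount dominates the estimated frequencies, which is one concrete way to see why the choice of prior matters for small collections. A composition-only model ignores residue order and context, so it captures far less than "all the statistical properties" of a collection.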

To establish and refine our techniques, and to provide a basis for predicting structural and functional aspects of new protein sequences, we created datasets of sequence-dissimilar examples of known secondary structures, using DSSP applied to Brookhaven PDB files. We obtained 64,775 residues of alpha-helix, 47,304 residues of beta-sheet, and 45,549 residues of coil, exhibiting recognized structural features such as helix capping mechanisms. Our techniques classify regions of novel protein sequences into these three categories. We will report the details of the implementation and its performance, making comparisons with established approaches. Data may be submitted for analysis by our methods via the World Wide Web (http://www.sanbi.ac.za/karoo). Supported by the U.S. DOE Office of Biological and Environmental Research under contract W-7405-ENG-36.


 