DOE Human Genome Program Contractor-Grantee
84. Annotating DNA with Protein Coding Domains
Winston A. Hide¹, Robert Miller¹, Gary L. Sandine², and David C. Torney²
¹South African National Bioinformatics Institute, University of the Western Cape, South Africa and ²Los Alamos National Laboratory, Los Alamos, NM
Genomic DNA sequence is now becoming readily available for the human and fly genomes. Reliably finding genes and annotating gene information, however, remains at a premium. Coding domains within gene sequences are detected both by gene prediction programs, which locate exons using predictive models, and by similarity to known expressed sequences.
Predictive gene detection methods are not yet sensitive enough to predict all exons of a given gene accurately. In addition, once exons are predicted, only sequence comparison provides reliable corroboration. Once located, exonic DNA sequences need to be correctly translated into their corresponding proteins. The proteins may then be compared with known protein sequences corresponding to known structures, as determined by direct or modelled homology.
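The translation step mentioned above can be illustrated with a minimal Python sketch. It assumes the standard genetic code and an exon sequence that is already in frame on the forward strand; the function name `translate` is hypothetical and is not part of any tool described here.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"   # codons starting with T
    "LLLLPPPPHHQQRRRR"   # codons starting with C
    "IIIMTTTTNNKKSSRR"   # codons starting with A
    "VVVVAAAADDEEGGGG"   # codons starting with G
)
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna, frame=0):
    """Translate a forward-strand DNA string, starting at offset `frame`.

    Codons containing characters other than A, C, G, T become 'X';
    stop codons become '*'. A trailing partial codon is ignored.
    """
    dna = dna.upper()
    protein = []
    for start in range(frame, len(dna) - 2, 3):
        codon = dna[start:start + 3]
        protein.append(CODON_TABLE.get(codon, "X"))
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGGCCAAATAA"))
```

In practice the correct frame and strand are not known in advance, so a real pipeline would have to consider several frames or rely on the exon predictor's frame assignment.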
Our annotation approach employs the novel paradigm of direct annotation of DNA based upon the secondary-structure properties of its translation (e.g., helix, sheet, and turn). To accomplish this, we have developed Bayesian classification methods for biological sequences. These methods use examples for 'training'. We have used several secondary-structure classes of polypeptides from the CATH database (Orengo et al., 1997). (The latter is a valuable resource because it classifies secondary structures in a strict hierarchy and presents homologous superfamilies found in genome sequences.)
We have successfully completed an analysis of such classes. The integrals of the Bayes'-rule formulas are approximated by finding the global maximum of the integrand, which is the product of the probabilities of the sequences in the sample. This has been a challenge for numerical analysis, but a constrained quadratic programming routine yielded near-optimal points. For example, for peptides of length four taken from alpha-helix sequences, each point consists of 300 parameters characteristic of the sequences in the sample. These parameters yield, for every peptide of length four, a posterior likelihood of belonging to the class.
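To make the idea of a posterior class likelihood for a length-four peptide concrete, the following Python sketch applies Bayes' rule to a deliberately simplified model. It is not the 300-parameter model or the constrained optimization described above: it assumes independent positions, estimates per-position residue frequencies with a pseudocount, and uses tiny invented training samples. All names (`train_positional_model`, `posterior`, the toy peptides, the equal priors) are hypothetical.

```python
from collections import Counter
from math import exp, log

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def train_positional_model(peptides, pseudocount=1.0):
    """Estimate smoothed residue frequencies at each peptide position."""
    length = len(peptides[0])
    total = len(peptides) + pseudocount * len(AMINO_ACIDS)
    model = []
    for pos in range(length):
        counts = Counter(p[pos] for p in peptides)
        model.append({aa: (counts[aa] + pseudocount) / total
                      for aa in AMINO_ACIDS})
    return model


def log_likelihood(model, peptide):
    """Log probability of a peptide under an independent-positions model."""
    return sum(log(model[pos][aa]) for pos, aa in enumerate(peptide))


def posterior(models, priors, peptide):
    """Bayes' rule: P(class | peptide) for every class in `models`."""
    log_joint = {c: log(priors[c]) + log_likelihood(m, peptide)
                 for c, m in models.items()}
    peak = max(log_joint.values())          # subtract for numerical stability
    unnorm = {c: exp(v - peak) for c, v in log_joint.items()}
    norm = sum(unnorm.values())
    return {c: v / norm for c, v in unnorm.items()}


# Invented toy samples, for illustration only (not CATH data).
helix_sample = ["AEEL", "LEEA", "AEKL", "ELLK"]
sheet_sample = ["VTVY", "IVTV", "TVYV", "VIVT"]
models = {"helix": train_positional_model(helix_sample),
          "sheet": train_positional_model(sheet_sample)}
priors = {"helix": 0.5, "sheet": 0.5}

if __name__ == "__main__":
    print(posterior(models, priors, "AEEL"))
```

The pseudocount keeps every residue probability nonzero, so a peptide containing a residue unseen in training still receives a finite score.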
The DNA sequences for these polypeptides may also be used for 'training'. Our direct approach, however, has been to submit genomic sequence to exon prediction engines such as 'Genome Annotator Pipeline' and, in addition, to generate a large number of processed expressed sequence tag fragments, using clustering to reduce redundancy and to generate consensus sequences. Proteins predicted from both the exons and the clustered consensus sequences are submitted for analysis using our statistical methods.
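The redundancy-reduction step can be pictured with a small Python sketch of greedy clustering by shared k-mers. This is only a stand-in for real expressed sequence tag clustering: it ignores reverse complements and sequencing errors and does not build a consensus. The function names, the k-mer length, and the overlap threshold are all assumptions chosen for illustration.

```python
def kmer_set(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def greedy_cluster(fragments, k=6, min_shared=0.5):
    """Assign each fragment to the first cluster it overlaps sufficiently.

    Overlap is the number of shared k-mers divided by the size of the
    smaller k-mer set. Fragments shorter than k form their own clusters.
    """
    clusters = []  # each entry: {"kmers": set of k-mers, "members": list}
    for frag in fragments:
        kmers = kmer_set(frag, k)
        for cluster in clusters:
            smaller = min(len(kmers), len(cluster["kmers"]))
            if smaller and len(kmers & cluster["kmers"]) / smaller >= min_shared:
                cluster["members"].append(frag)
                cluster["kmers"] |= kmers   # grow the cluster's k-mer profile
                break
        else:
            clusters.append({"kmers": set(kmers), "members": [frag]})
    return [c["members"] for c in clusters]


if __name__ == "__main__":
    reads = ["ATGGCCAAAGGTCTGA", "GCCAAAGGTCTGATTC", "TTTTCCCCGGGGAAAATT"]
    for members in greedy_cluster(reads):
        print(members)
```

Because assignment is greedy, the result depends on input order; a production clustering method would typically use alignment-based criteria instead.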
The results are presented via a web system that reveals likely structural domains within exons and coding expressed sequence. We have implemented a web tool that either accepts raw DNA sequence and generates predicted coding regions from expressed sequences, or accepts predicted exonic information. It then processes the input for statistical states of structural class and displays the predicted states for each of the structurally encoded parameters.
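One way to picture the per-position output of such a tool is a sliding window over the predicted protein, scoring each length-four peptide. The Python sketch below is illustrative only: `annotate_windows` is a hypothetical helper, and the scoring function is an arbitrary placeholder (the fraction of residues in a made-up set), not the statistical method described in this abstract.

```python
def annotate_windows(protein, score, width=4):
    """Score every window of `width` residues along a protein sequence.

    Returns a list of (start_index, peptide, score) tuples, one per window.
    """
    return [
        (start, protein[start:start + width], score(protein[start:start + width]))
        for start in range(len(protein) - width + 1)
    ]


# Placeholder residue set, chosen arbitrarily for the example.
PLACEHOLDER_SET = set("AELK")


def placeholder_score(peptide):
    """Fraction of residues that fall in the placeholder set."""
    return sum(aa in PLACEHOLDER_SET for aa in peptide) / len(peptide)


if __name__ == "__main__":
    for start, peptide, value in annotate_windows("AELKVV", placeholder_score):
        print(start, peptide, value)
```

Any function that maps a peptide to a number, such as a posterior class likelihood, could be passed in place of `placeholder_score`.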
Our next steps will be to