Bioinformatics Section 

DOE Human Genome Program Contractor-Grantee Workshop VIII
February 27-March 2, 2000  Santa Fe, NM



84. Annotating DNA with Protein Coding Domains

Winston A. Hide1, Robert Miller1, Gary L. Sandine2, and David C. Torney2

1South African National Bioinformatics Institute, University of the Western Cape, South Africa and 2Los Alamos National Laboratory, Los Alamos, NM

dct@ipmati1.lanl.gov

Genomic DNA sequence is now becoming readily available for the human and fly genomes. Reliably finding genes and annotating gene information remains, however, at a premium. Coding domains within gene sequences are detected both by gene prediction programs, which locate exons based on predictive models, and by similarity to known expressed sequences.

Predictive gene detection methods are not yet sensitive enough to predict all exons of a given gene accurately. In addition, once exons are predicted, only sequence comparison provides reliable corroboration. Once located, exonic DNA sequences need to be correctly translated into their corresponding proteins. The proteins may then be compared with known protein sequences corresponding to known structures, as determined by direct or modelled homology.
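The translation step mentioned above can be sketched as follows. This is a minimal illustration, assuming the standard genetic code and an ungapped DNA string; the function names (`translate`, `reverse_complement`, `six_frame_translations`) are hypothetical and are not taken from the authors' software.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order
# (TTT, TTC, TTA, TTG, TCT, ...). '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna, frame=0):
    """Translate one reading frame (0, 1, or 2) of a DNA string.

    Codons containing anything other than A, C, G, T become 'X'.
    Trailing bases that do not fill a codon are ignored.
    """
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Translate all three frames on both strands."""
    rc = reverse_complement(dna)
    frames = {}
    for f in range(3):
        frames[f"+{f + 1}"] = translate(dna, f)
        frames[f"-{f + 1}"] = translate(rc, f)
    return frames
```

For a predicted exon, only the frame consistent with the gene model would normally be kept; translating all six frames is useful when the frame or strand is not yet known.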

Our annotation approach employs the novel paradigm of direct annotation of DNA based upon the secondary-structure properties of its translation (e.g., helix, sheet, and turn). To accomplish this, we have developed Bayesian classification methods for biological sequences. These methods use examples for 'training'. We have used several secondary-structure classes of polypeptides from the CATH database (Orengo et al., 1997). (The latter is a valuable resource because it has strictly hierarchically classed secondary structures and presents homologous superfamilies found in genome sequences.)

We have successfully completed an analysis of such classes. The integrals of the Bayes'-rule formulas are approximated by finding the global maximum of the integrand, which is the product of the probabilities of the sequences in the sample. This has been a challenge for numerical analysis, but a constrained quadratic programming routine yielded near-optimal points. For example, for peptides of length four, taken from alpha-helix sequences, each point consists of 300 parameters characteristic of the sequences in the sample. These parameters yield, for every peptide of length four, the posterior likelihood that it belongs to the class.
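To make the idea of scoring length-four peptides against a trained class concrete, here is a deliberately simplified sketch. It is not the authors' model: it assumes independent positions (4 x 20 = 80 parameters rather than the 300 described above), uses a smoothed point estimate in place of the constrained quadratic programming step, and compares the class against a uniform background. All names, the pseudocount, and the prior are assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
K = 4  # peptide length used in the example above


def kmers(protein, k=K):
    """Return every overlapping peptide of length k from a protein string."""
    return [protein[i:i + k] for i in range(len(protein) - k + 1)]


def fit_position_model(peptides, pseudocount=1.0):
    """Estimate per-position amino-acid probabilities from training peptides.

    With pseudocount 1 this is the posterior mean under a uniform
    Dirichlet prior (an assumption, not the method in the abstract).
    """
    total = len(peptides) + pseudocount * len(AMINO_ACIDS)
    model = []
    for pos in range(K):
        counts = Counter(p[pos] for p in peptides)
        model.append({a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS})
    return model


def log_likelihood(peptide, model):
    """Log-probability of a peptide under a position-independent model."""
    return sum(math.log(model[i][a]) for i, a in enumerate(peptide))


def posterior_in_class(peptide, class_model, background_model, prior=0.5):
    """Bayes' rule: P(class | peptide) for a two-way class/background choice."""
    log_in = math.log(prior) + log_likelihood(peptide, class_model)
    log_out = math.log(1.0 - prior) + log_likelihood(peptide, background_model)
    m = max(log_in, log_out)  # subtract the max for numerical stability
    w_in, w_out = math.exp(log_in - m), math.exp(log_out - m)
    return w_in / (w_in + w_out)


# Uniform background: every amino acid equally likely at every position.
UNIFORM = [{a: 1.0 / len(AMINO_ACIDS) for a in AMINO_ACIDS} for _ in range(K)]
```

In use, `fit_position_model` would be trained on peptides drawn from one structural class, and `posterior_in_class` would then be evaluated on each window returned by `kmers` for a translated exon. Peptides are assumed to contain only the 20 standard amino-acid letters.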

The DNA sequences for these polypeptides may also be used for 'training'. Our direct approach, however, has been to submit genomic sequence to exon prediction engines such as 'Genome Annotator Pipeline', and, in addition, to generate a large number of processed expressed sequence tag fragments for reduced redundancy and consensus generation using clustering. Proteins predicted from both the exons and the clustered consensus sequences are submitted for analysis using our statistical methods.

The results are presented via a web system which reveals likely structural domains within exons and coding expressed sequence. We have implemented a web tool that either accepts raw DNA sequence and generates predicted coding regions from expressed sequences, or accepts predicted exonic information. In both cases it processes the input for statistical states of structural class and displays the predicted states for each of the structurally encoded parameters.

Our next steps will be to:

  • Determine the sensitivity and selectivity of the statistic relative to current secondary-structure prediction tools (those that rely on models or empirical derivatives)
  • Analyze 100,000 EST consensus sequences, producing peptides and predicting their domains
  • Determine the efficacy of employing an implementation of our methods for synergistic support of gene finding tools
  • Map gene prediction outputs onto structurally predicted states to determine jointly predicted exons.
  • Annotate jointly predicted exons onto known gene and protein structures
  • Determine the efficacy of combining our methods with other methods for finding genes
  • Implement these methods in other important contexts, such as functional promoter class characterization and annotation.
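The sensitivity and selectivity comparison in the first step could be computed per residue along the following lines. This is a sketch under stated assumptions: the abstract does not define either measure, so sensitivity is taken as TP / (TP + FN) and selectivity as TP / (TP + FP), a convention often used in gene-finding evaluations. The function name and the one-letter state strings are hypothetical.

```python
def sensitivity_selectivity(predicted, reference, state="H"):
    """Per-residue sensitivity and selectivity for one structural state.

    predicted, reference: equal-length strings of one-letter state labels
    (for example 'H' helix, 'E' sheet, 'C' other; labels are an assumption).
    Returns (sensitivity, selectivity); a value is NaN when undefined.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")

    pairs = list(zip(predicted, reference))
    tp = sum(p == state and r == state for p, r in pairs)  # correctly called
    fp = sum(p == state and r != state for p, r in pairs)  # over-called
    fn = sum(p != state and r == state for p, r in pairs)  # missed

    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    selectivity = tp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, selectivity
```

The same counts could be taken over exons rather than residues when comparing jointly predicted exons with known gene structures.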

 

 

 


The online presentation of this publication is a special feature of the Human Genome Project Information Web site.