The electronic form of this document may be cited in the following style:
Human Genome Program, U.S. Department of Energy, DOE Human Genome Program Contractor-Grantee Workshop IV, 1994.
Abstracts scanned from text submitted for November 1994 DOE Human Genome Program Contractor-Grantee Workshop. Inaccuracies have not been corrected.
Analysis and Annotation of Nucleic Acid Sequence
David J. States, Ron Cytron, Pankaj Agarwal and Hugh Chou
Institute for Biomedical Computing, Washington University in St. Louis
We are developing improved methods for analyzing nucleic acid sequences based on sequence similarity and very large-scale classification techniques. Our previously developed methods for sequence classification provide a basis for defining families of protein sequences and modular domains conserved during evolution. Functional modules of gene and gene-product structure, as well as regulatory signals within the genome, can be recognized as recurrent patterns in anonymous nucleic acid sequence using large-scale classification techniques.
As a first step toward extending our methods to a general nucleic acid annotation database, consensus sequences have been derived for each family in the protein sequence classification. This is essentially a multiple sequence alignment problem. Some families are quite large (>1000 members), so a computationally efficient algorithm is needed. We have chosen ClustalW as the tool for this task. Our groups are defined using a minimum spanning tree representation that identifies the most similar members in each family. This tree can be imported directly into ClustalW, or a fast heuristic comparison of all family members can be recalculated internally (where tested, the two have been equivalent). A hierarchical multiple sequence alignment scored with the BLOSUM62 matrix is then performed, and a consensus sequence based on the whole family or any subtree of the family is derived dynamically. An important advantage of this approach is that it allows the user to define the granularity of classification interactively. In some cases a very broad classification (e.g. grouping all serine proteases together) may be desired, while in other cases a much finer granularity (e.g. tracing species variation within the trypsin subfamily) may be needed.
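The idea of choosing a classification granularity on a spanning tree and deriving a consensus for the resulting group can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not call ClustalW: the pairwise distances, the cutoff values, the pre-aligned toy sequences, and the function names (`minimum_spanning_tree`, `subfamilies`, `consensus`) are all assumptions made for illustration, and the consensus here is a simple column-wise majority vote rather than a BLOSUM62-scored alignment.

```python
from collections import Counter


def _find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def minimum_spanning_tree(n, edges):
    """Kruskal's algorithm over (distance, i, j) edges; returns the tree edges."""
    parent = list(range(n))
    tree = []
    for dist, i, j in sorted(edges):
        ri, rj = _find(parent, i), _find(parent, j)
        if ri != rj:
            parent[ri] = rj
            tree.append((dist, i, j))
    return tree


def subfamilies(n, tree, cutoff):
    """Cut tree edges longer than `cutoff`; return the remaining groups."""
    parent = list(range(n))
    for dist, i, j in tree:
        if dist <= cutoff:
            parent[_find(parent, i)] = _find(parent, j)
    groups = {}
    for x in range(n):
        groups.setdefault(_find(parent, x), []).append(x)
    return sorted(groups.values())


def consensus(aligned):
    """Column-wise majority consensus of equal-length, pre-aligned sequences."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aligned))


# Hypothetical pairwise distances among five family members (indices 0..4).
edges = [(0.1, 0, 1), (0.2, 2, 3), (0.6, 1, 2),
         (0.7, 0, 3), (0.8, 3, 4), (0.9, 0, 4)]
tree = minimum_spanning_tree(5, edges)

fine = subfamilies(5, tree, cutoff=0.3)    # fine granularity
broad = subfamilies(5, tree, cutoff=1.0)   # one broad family

# Hypothetical pre-aligned members of one subfamily.
subfamily_consensus = consensus(["ACDE", "ACDF", "GCDE"])
```

Raising or lowering `cutoff` plays the role of the interactive granularity choice: a small value isolates tight subfamilies, and a large value merges everything into a single family.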
Improved methods for nucleic acid sequence comparison are being developed, and a repeat analysis software toolkit has been written. The Dayhoff PAM formalism has been extended to codon-based sequence comparisons, and scoring systems have been developed for sequences related as protein-coding regions, non-coding transcribed sequences, regions of untranscribed sequence, and regions of similar predicted three-dimensional structure. These methods are being tested on C. elegans and human sequence.
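The abstract does not describe how the codon-based scores are built, so the following Python sketch only illustrates the log-odds step that underlies Dayhoff-style scoring, applied to codons instead of amino acids. The two toy sequences, the gap-free in-frame alignment, and the function names (`codons`, `log_odds_scores`) are assumptions; the Markov extrapolation to larger PAM distances is omitted.

```python
import math
from collections import Counter


def codons(seq):
    """Split an in-frame sequence into codons, dropping any trailing partial codon."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def log_odds_scores(aligned_pairs):
    """Return {(x, y): log2(q_xy / (p_x * p_y))} for observed codon pairs.

    q_xy is the symmetrized frequency of the aligned pair and p_x, p_y are
    the background codon frequencies taken from the same data.
    """
    pair_counts = Counter()
    single_counts = Counter()
    for a, b in aligned_pairs:
        pair_counts[(a, b)] += 1   # count both orientations so that
        pair_counts[(b, a)] += 1   # the scores are symmetric
        single_counts[a] += 1
        single_counts[b] += 1
    total_pairs = sum(pair_counts.values())
    total_singles = sum(single_counts.values())
    scores = {}
    for (a, b), count in pair_counts.items():
        q = count / total_pairs
        p_a = single_counts[a] / total_singles
        p_b = single_counts[b] / total_singles
        scores[(a, b)] = math.log2(q / (p_a * p_b))
    return scores


# Hypothetical gap-free alignment of two short coding sequences.
seq1 = "GCTGCTGCTAAA"
seq2 = "GCTGCTGCCAAA"
scores = log_odds_scores(zip(codons(seq1), codons(seq2)))
```

Pairs that occur more often than expected from background frequencies receive positive scores; pairs never observed together have no entry and would need pseudocounts in any realistic setting.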
An annotated, intelligently non-redundant sequence database is being built. This database will complement existing public databases using automated classification technology and manual review. An associated database of all pairwise sequence similarities is also being maintained.
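The criteria used to make the database non-redundant are not stated in the abstract. As one simple illustration of the idea, the Python sketch below drops exact duplicates and entries wholly contained in a longer retained entry. The containment rule, the toy records, and the function name `non_redundant` are assumptions, and the quadratic scan would not scale to a real database.

```python
def non_redundant(records):
    """Keep each sequence unless it is a substring of an already kept one.

    `records` maps an identifier to a sequence. Longer sequences are
    considered first; ties are broken by identifier for reproducibility.
    """
    kept = {}
    ordered = sorted(records.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for seq_id, seq in ordered:
        if not any(seq in other for other in kept.values()):
            kept[seq_id] = seq
    return kept


# Hypothetical entries: "c" duplicates "a", and "b" is a fragment of "a".
records = {
    "a": "MKTAYIAKQR",
    "b": "TAYIAK",
    "c": "MKTAYIAKQR",
    "d": "GGGG",
}
kept = non_redundant(records)
```

A production system would likely rely on similarity thresholds and curated review rather than exact containment, in line with the automated classification and manual review described above.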
An improved user interface for the classification analysis tool has been developed using the WWW, Perl, and HTML. This approach has allowed us to readily link our classification data to the NCBI, EMBL, and other network-accessible databases. The use of Perl allows computationally efficient retrieval systems to be rapidly prototyped and implemented.
Classification data and source code are being distributed through an anonymous FTP site (ibc.wustl.edu) as well as through the WWW interface at http://ibc.wustl.edu.