The electronic form of this document may be cited in the following style:
Human Genome Program, U.S. Department of Energy, DOE Human Genome Program Contractor-Grantee Workshop IV, 1994.
Abstracts scanned from text submitted for November 1994 DOE Human Genome Program Contractor-Grantee Workshop. Inaccuracies have not been corrected.
Gene Recognition, Modeling, and Homology Search in the GRAIL-genQuest System
Manesh Shah, J. Ralph Einstein, Sherri Matis, Ying Xu, Xiaojun Guan, Donna Buley, Sergey Petrov, Loren Hauser, Richard J. Mural, and Edward C. Uberbacher
Engineering Physics and Mathematics, and Biology Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831-6364. e-mail: GRAILMAIL@ornl.gov
GRAIL-genQuest is a modular expert system being constructed to analyze and characterize genomic and cDNA sequences. Recognition of gene features, gene modeling, and DNA, protein, and motif database searches are supported by an e-mail server system and a graphical client-server version. Within the last year significant improvements have been made in the sensitivity and accuracy of coding region recognition and gene modeling.
GRAIL E-mail Server and Feature Recognition: GRAIL provides an on-line e-mail service for locating the protein coding regions of DNA sequences. This interface uses a multiple sensor-neural network (Uberbacher and Mural, 1991, PNAS 88:11261-11265) to find coding regions and a rule-based interpreter to reduce this output to a table. A new version of the coding recognition portion of this system is capable of finding 94% of all exons, and 80% of exons shorter than 100 bases. Improvements have also been made in the accuracy with which a given exon is described: 54% of exons are predicted with both edges correct to the base, and an additional 40% with one edge exactly correct.
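The edge-accuracy figures quoted above can be read as a per-exon classification: each true exon is matched against the overlapping predictions and counted by how many of its two boundaries are reproduced exactly. The following Python sketch illustrates one way to tabulate such counts. It is not the evaluation code used for GRAIL; the function name, the inclusive coordinate convention, and the example coordinates are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Exon = Tuple[int, int]  # (start, end), inclusive coordinates (assumed convention)


def classify_exon_edges(true_exons: List[Exon], predicted: List[Exon]) -> Dict[str, int]:
    """Count true exons by how many edges the best overlapping prediction matches exactly."""
    counts = {"both": 0, "one": 0, "overlap_only": 0, "missed": 0}
    for t_start, t_end in true_exons:
        best = None  # highest number of exact edges among overlapping predictions
        for p_start, p_end in predicted:
            if p_start <= t_end and t_start <= p_end:  # the two intervals overlap
                exact = int(p_start == t_start) + int(p_end == t_end)
                best = exact if best is None else max(best, exact)
        if best is None:
            counts["missed"] += 1
        elif best == 2:
            counts["both"] += 1
        elif best == 1:
            counts["one"] += 1
        else:
            counts["overlap_only"] += 1
    return counts


# Hypothetical coordinates, for illustration only.
true_exons = [(10, 50), (100, 180), (300, 340), (500, 520)]
predicted = [(10, 50), (100, 175), (305, 335)]
result = classify_exon_edges(true_exons, predicted)
print(result)
```

Dividing each count by the number of true exons gives percentages of the kind reported in the text.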
E-mail GRAIL can analyze up to 100 kbp of sequence at a time and several sequences may be included in a single e-mail message. The amino acid sequence of predicted coding regions can be searched against the SwissProt database using the Intel iPSC/860 parallel computer, and be searched for functional motifs using the PROSITE database. These search capabilities are part of the new genQuest sequence comparison server system (described below) which can be accessed directly or through GRAIL.
Gene Modeling and Client-Server GRAIL: In addition to the current coding region recognition capabilities based on a multiple sensor-neural network and rule base, modules for the recognition of features such as splice junctions, transcription and translation start and stop, and other control regions have been constructed and incorporated into an expert system for reliable computer-based modeling of genes. The gene modeling version of GRAIL is available through an X-window-based client-server system. The client-server system allows the user to try a number of different scenarios for a given sequence region and ask "what if" type questions interactively.
A gene assembly program (GAP) which combines the outputs from the various feature recognition modules and attempts to predict the sequence of the spliced mRNA from the genomic DNA sequence is part of this system. Heuristic methods and dynamic programming are used to construct first-pass gene models which include the potential for insertions and deletions of initially predicted exons. These actions result in a net improvement in gene characterization, particularly in the recognition of very short coding regions. Genes modeled by this system have an average correlation coefficient compared to the actual gene of 0.93. The models contain 94% of all exons regardless of size and only 3% false positive information. After model construction 79% of exons have both edges correctly defined to the base and an additional 19% have one edge correct. In addition, other features of interest such as poly-A addition sites and repetitive DNA elements can be located using this program. Translation of gene models and database searches are also supported through access to the genQuest server (described below). Generation of an annotation report in "feature table" format is supported. Client software can be downloaded by anonymous ftp from arthur.epm.ornl.gov, and help information can be obtained by sending a message with the word help in the first line to the GRAIL@ornl.gov e-mail address.
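As a rough illustration of how dynamic programming can assemble a gene model from scored exon candidates, the sketch below selects the highest-scoring set of non-overlapping candidates (weighted interval scheduling). This is a deliberately simplified stand-in, not the GAP algorithm: the actual program also draws on splice junction and other feature modules and on heuristics not described here. The function name, the scoring scheme, and the example candidates are hypothetical.

```python
from bisect import bisect_left
from typing import List, Tuple

Candidate = Tuple[int, int, float]  # (start, end, score), inclusive coordinates


def chain_exons(candidates: List[Candidate]) -> Tuple[float, List[Candidate]]:
    """Pick non-overlapping exon candidates with maximal total score."""
    exons = sorted(candidates, key=lambda e: e[1])  # order by end position
    ends = [e[1] for e in exons]
    n = len(exons)
    best = [0.0] * (n + 1)  # best[i]: best total score using the first i candidates
    take = [False] * (n + 1)  # take[i]: whether candidate i is used in best[i]
    prev = [0] * (n + 1)  # prev[i]: index to continue the traceback from
    for i, (start, _end, score) in enumerate(exons, start=1):
        j = bisect_left(ends, start)  # number of candidates ending before this start
        with_it = best[j] + score
        if with_it > best[i - 1]:
            best[i], take[i], prev[i] = with_it, True, j
        else:
            best[i] = best[i - 1]
    chain: List[Candidate] = []
    i = n
    while i > 0:  # walk back through the stored decisions
        if take[i]:
            chain.append(exons[i - 1])
            i = prev[i]
        else:
            i -= 1
    chain.reverse()
    return best[n], chain


# Hypothetical candidates: the second overlaps both the first and the third.
candidates = [(1, 100, 0.9), (90, 200, 0.6), (150, 300, 0.8), (310, 400, 0.7)]
total, model = chain_exons(candidates)
print(total, model)
```

In this example the overlapping, lower-scoring candidate is dropped, which loosely mirrors the "deletion of initially predicted exons" mentioned above; a fuller model would also enforce reading-frame and splice-site compatibility between adjacent exons.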
The GenQuest Sequence Comparison Server: The genQuest server is an integrated sequence comparison server which can be accessed via e-mail and through an X-windows graphical client-server system. The basic purpose of the server system is to facilitate rapid and sensitive comparison of DNA and protein sequences to existing DNA, protein, and motif databases. Databases accessed by this system include the LANL daily updated DNA sequence database (GSDB), SwissProt, the dbEST expressed sequence tag database, protein motif libraries and motif analysis systems (Prosite, BLOCKS), a repetitive DNA library (from J. Jurka), and sequences in the PDB protein structural database. These options are designed to provide a comprehensive description of newly obtained sequences through homology methods, and can also be accessed from the XGRAIL graphical client tool. The system uses a specialized parallel computing environment at the Oak Ridge National Laboratory and is supported and curated by research teams in the genome community.
The genQuest e-mail server supports a variety of sequence query types. For searching protein databases, queries may be sent as amino acid or DNA sequence. DNA sequence can be translated in a user-specified frame or in all 6 frames. DNA-DNA searches are also supported. User-selectable methods for comparison include the Smith-Waterman dynamic programming algorithm, FastA, versions of BLAST, and the IBM dFLASH protein sequence comparison algorithm. A variety of search options can be specified, including gap penalties and option switches for Smith-Waterman, FastA, and BLAST, the number of alignments and scores to be reported, desired target databases for query, choice of PAM and BLOSUM matrices, and an option for masking out repetitive elements. Multiple target databases can be accessed within a single query.
E-mail turn-around times for the system are rapid: less than 1 minute for protein searches, about 1 to 2 minutes for protein-DNA searches, and several minutes for reasonable-length DNA-DNA searches. GenQuest can be accessed by e-mail at the Q@ornl.gov e-mail address, and instructions can be obtained by sending a message to that address with the word help in the first line. Further help for GRAIL or genQuest can be obtained by sending e-mail to the GRAILMAIL@ornl.gov address, and the graphical client tool can be downloaded by anonymous ftp from arthur.epm.ornl.gov.
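To make the six-frame translation and Smith-Waterman options concrete, the following Python sketch translates a DNA query in all six frames using the standard genetic code and scores each translation against a protein sequence with a basic Smith-Waterman local alignment. It is an illustration only and does not reproduce the genQuest implementation: the simple match/mismatch scores and the linear gap penalty are assumptions standing in for the PAM or BLOSUM matrices and gap options named above, and all function names are hypothetical.

```python
from typing import Dict, Tuple

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {a + b + c: AMINO[16 * i + 4 * j + k]
          for i, a in enumerate(BASES)
          for j, b in enumerate(BASES)
          for k, c in enumerate(BASES)}


def reverse_complement(dna: str) -> str:
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(CODONS.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frames(dna: str) -> Dict[str, str]:
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score with a linear gap penalty (assumed scoring)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


def best_frame(dna: str, protein: str) -> Tuple[str, int]:
    """Return the frame label and score of the best-scoring translation."""
    scores = {name: smith_waterman_score(seq, protein)
              for name, seq in six_frames(dna).items()}
    return max(scores.items(), key=lambda item: item[1])


print(six_frames("ATGGCCTAA"))
print(best_frame("ATGGCCTAA", "MA"))
```

Searching all six frames lets a DNA query be compared with a protein database without knowing its strand or reading frame in advance, at the cost of six alignments per database entry.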
(This research was supported by the Office of Health and Environmental Research, United States Department of Energy, under contract DE-AC05-84OR21400 with Martin Marietta Energy Systems, Inc.)