INTELLIGENT COMPUTER TRACKS DOWN GENES IN DNA SEQUENCES             
   
   The sequence of DNA bases that makes up the human genome is a coded
   message containing a huge amount of information. However, sequencing a
   region that contains just one of the perhaps 100,000 genes in the
   genome yields only a string of thousands of A's, G's, T's, and C's,
   from which biological meaning must somehow be derived.
   Separating the biologically relevant information from the rest of the
   sequence is further complicated by the fact that only 3 to 5% of the
   bases in the region of a gene actually contain instructions for the
   manufacture of proteins--the molecules that govern the chemical
   processes necessary for life. Determining which portions of the sequence
   are biologically relevant is a significant challenge in this area of
   research.    
   
   Although the technology to locate these relevant sequences, or "exons,"
   is critical to the success of the Human Genome Project, it has never
   been clear how this would be accomplished or even if it would ever be
   possible. However, ORNL researchers Richard Mural of the Biology
   Division and Ed Uberbacher, Ralph Einstein, Reinhold Mann, and Xiaojun
   Guan, all of the Engineering Physics and Mathematics Division, have
   taken a major step toward putting the pieces of this puzzle together.      
               
   They have developed an intelligent computational system called GRAIL
   (Gene Recognition and Analysis Internet Link), which integrates
   biological insight into a state-of-the-art, computer-based artificial
   intelligence system. Mural characterizes the effort as "a hybrid effort
   combining the computational and biological sciences--a realm of science
   that ORNL has not dealt with before."               
   
   "At a basic level, there's a new kind of biology going on here, based on
   genetic sequence data in which all the information used by living cells
   is stored," Uberbacher says. "Learning how to computationally extract
   the important biological information stored in these sequences is a
   tremendous challenge with great rewards." So far, ORNL researchers have
   correctly predicted the locations of at least 80% of the genes in the 5
   million bases they have analyzed. The team's work was featured in the
   November 8, 1991, issue of Science magazine.               
   
   The GRAIL system first analyzes a genetic sequence using a group of
   mathematical algorithms that measure properties of DNA sequences that
   code for proteins and then combines the results of these algorithms
   using a neural network. "A neural network is a computer simulation of
   how neurons in the brain function and learn to recognize objects,"
   Uberbacher says. "Machine learning is accomplished in the neural network
   by showing the system many examples of exons. Eventually the network
   learns the characteristics that distinguish exons from the rest of the
   DNA sequence." After this "training" process, the neural network can
   identify exons in uncharacterized regions of sequence.
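
   The sensor-plus-network idea described above can be illustrated with a
   minimal Python sketch. It is not the GRAIL code: the three sensors
   (overall GC content, third-codon-position GC content, and the fraction
   of in-frame codons that are not stop codons), the TinyNetwork class,
   and the synthetic training windows are all assumptions made for
   illustration. They stand in for the actual algorithms and training
   exons, which this article does not specify.

```python
import math
import random

STOP_CODONS = {"TAA", "TAG", "TGA"}

# Illustrative "sensors" (assumed, not GRAIL's): each maps a window to [0, 1].
def gc_content(window):
    return sum(base in "GC" for base in window) / len(window)

def third_position_gc(window):
    thirds = window[2::3]  # every third base, reading frame 0
    return sum(base in "GC" for base in thirds) / len(thirds)

def stop_free_fraction(window):
    codons = [window[i:i + 3] for i in range(0, len(window) - 2, 3)]
    return sum(codon not in STOP_CODONS for codon in codons) / len(codons)

SENSORS = (gc_content, third_position_gc, stop_free_fraction)

def features(window):
    return [sensor(window) for sensor in SENSORS]

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

class TinyNetwork:
    """One hidden layer, trained by stochastic gradient descent."""

    def __init__(self, n_inputs, n_hidden=4, seed=0):
        rng = random.Random(seed)
        # Each weight row carries one extra entry for the bias term.
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
                   for _ in range(n_hidden)]
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        xb = x + [1.0]
        hidden = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in self.w1]
        out = sigmoid(sum(w * v for w, v in zip(self.w2, hidden + [1.0])))
        return hidden, out

    def predict(self, x):
        return self.forward(x)[1]

    def train(self, samples, epochs=300, rate=0.5):
        for _ in range(epochs):
            for x, target in samples:
                hidden, out = self.forward(x)
                xb, hb = x + [1.0], hidden + [1.0]
                delta_out = out - target  # cross-entropy loss, sigmoid output
                # Hidden-layer deltas use the output weights before they change.
                delta_hidden = [delta_out * self.w2[j] * h * (1.0 - h)
                                for j, h in enumerate(hidden)]
                for j in range(len(hb)):
                    self.w2[j] -= rate * delta_out * hb[j]
                for j, row in enumerate(self.w1):
                    for i in range(len(xb)):
                        row[i] -= rate * delta_hidden[j] * xb[i]

def toy_window(coding, rng, length=60):
    """Synthetic data only: 'coding' windows avoid stop codons and end each
    codon in G or C; other windows are random AT-rich sequence."""
    if coding:
        codons = []
        while len(codons) < length // 3:
            codon = rng.choice("ACGT") + rng.choice("ACGT") + rng.choice("GC")
            if codon not in STOP_CODONS:
                codons.append(codon)
        return "".join(codons)
    return "".join(rng.choice("AATTCG") for _ in range(length))

def score_windows(sequence, net, size=60, step=3):
    """Slide a window along the sequence and score each position."""
    return [(start, net.predict(features(sequence[start:start + size])))
            for start in range(0, len(sequence) - size + 1, step)]

if __name__ == "__main__":
    rng = random.Random(1)
    samples = [(features(toy_window(label, rng)), float(label))
               for label in [True, False] * 20]
    net = TinyNetwork(n_inputs=len(SENSORS))
    net.train(samples)
    sequence = (toy_window(False, rng, 120) + toy_window(True, rng, 120)
                + toy_window(False, rng, 120))
    for start, score in score_windows(sequence, net, step=30):
        print(f"{start:4d}  {score:.2f}")
```

   In this toy setup the scores should rise over the middle, coding-like
   stretch of the test sequence. The design point is the one the
   researchers describe: people choose what to measure, and the network
   learns how to weight the measurements.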
         
   
   "What distinguishes our approach is that we are combining many features
   of sequences that are known to encode proteins," says Mural. "We become
   partners in the neural network," says Uberbacher. "We provide what we
   know about the biology of exon sequences, and the neural network learns
   how best to use the information--each partner contributes knowledge to
   the process."       
   
   Although traditional statistical and mathematical methods exist for
   analyzing sequences, these often result in large numbers of "false
   positives"--areas identified as exons that really are not. Because each
   area identified as "positive" must be experimentally verified, there is
   a premium on selecting an effective, yet conservative, method of
   analysis--one that locates a high percentage of the total number of
   exons, with a low false positive rate. The GRAIL system provides such a
   method.      
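
   The trade-off described above can be made concrete with a little
   bookkeeping. The Python sketch below scores a set of predictions
   against experimentally verified exons. The overlap criterion (any
   shared base counts as a hit), the function names, and the coordinates
   are illustrative assumptions; the article does not say how GRAIL's
   predictions were scored.

```python
def overlaps(a, b):
    """True if half-open intervals a=(start, end) and b=(start, end) share a base."""
    return a[0] < b[1] and b[0] < a[1]

def evaluate(predicted, true_exons):
    """Return (sensitivity, false_positive_fraction).

    sensitivity: share of true exons hit by at least one prediction.
    false_positive_fraction: share of predictions that hit no true exon.
    """
    found = sum(any(overlaps(exon, p) for p in predicted) for exon in true_exons)
    false_pos = sum(not any(overlaps(p, exon) for exon in true_exons)
                    for p in predicted)
    sensitivity = found / len(true_exons) if true_exons else 0.0
    fp_fraction = false_pos / len(predicted) if predicted else 0.0
    return sensitivity, fp_fraction

if __name__ == "__main__":
    # Made-up coordinates, for illustration only.
    true_exons = [(100, 250), (400, 520), (900, 1010), (1500, 1620), (2000, 2100)]
    predicted = [(110, 240), (410, 500), (880, 1000), (1300, 1350), (2010, 2090)]
    sensitivity, fp_fraction = evaluate(predicted, true_exons)
    print(f"sensitivity={sensitivity:.2f}  false positives={fp_fraction:.2f}")
```

   Because every predicted region has to be checked at the bench, the
   second number translates directly into wasted experiments, which is
   why a method is judged on both figures together.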
   
   Recently, GRAIL was used to analyze a sequence of bases thought to
   contain the Huntington's disease gene. The system located a number of
   potential exons, clustered in such a way as to suggest several genes.
   Subsequent experimental work based on GRAIL's predictions has verified
   exons at 80% of the predicted locations. Once the location of an exon is
   found, a combination of experimental and informational techniques can be
   used to determine the portion of the sequence that makes up the gene.      
               
   
   Researchers outside ORNL can also take advantage of GRAIL: users can
   send DNA sequence files by electronic mail directly to the system,
   and it automatically e-mails its analysis back. Turnaround time is
   usually only a few minutes for sequences containing fewer than 100,000
   base pairs, and single or multiple sequences may be sent.                  
  
   
   Over 200 laboratories worldwide currently take advantage of the GRAIL
   system in their search for the causes of genetic diseases such as
   Huntington's disease, various muscular dystrophies, and fragile X
   syndrome. Mural attributes GRAIL's increasing popularity to its ease of
   use. "It's very simple for the users. They can do some very simple
   procedures and get some highly interpretable results."                  
   
   According to Mural and Uberbacher, the future holds an upgraded version
   of GRAIL that will incorporate feedback from the system's users. They
   are also working on a companion system to GRAIL that will allow users to
   assemble model genes from the exons found throughout the sequence, using
   a machine-learning approach similar to that employed by GRAIL.
   

   ------------------------------------------------------------------------
   
   Date Posted:  1/10/94  (ktb)