Computing the Genome
By Ed Uberbacher
Ed Uberbacher shows a model of transcribing genes. Photograph by Tom Cerniglio.
ORNL is part of a team that is designing and preparing to implement a new computational engine to rapidly analyze large-scale genomic sequences, keeping up with the flood of data from the Human Genome Project.
Just a few short years ago, most of us knew very little about our genes and their impact on our lives. But more recently, it has been virtually impossible to escape the popular media's attention to a number of breathtaking discoveries of human genes, especially those related to diseases such as cystic fibrosis, Huntington's chorea, and breast cancer. What has brought this about is the Human Genome Project, an international effort started in 1988 and sponsored in the United States by the Department of Energy (DOE) and the National Institutes of Health (NIH). The goal of this project is to elucidate the information that makes up the genetic blueprint of human beings.
The Human Genome Project's success in sequencing the chemical bases of DNA is virtually revolutionizing biology and biotechnology. It is creating new knowledge about fundamental biological processes. It has increased our ability to analyze, manipulate, and modify genes and to engineer organisms, providing numerous opportunities for applications. Biotechnology in the United States, virtually nonexistent a few years ago, is expected to become a $50 billion industry before 2000, largely because of the Human Genome Project.
Despite this project's impact, the pace of gene discovery has actually been rather slow. The initial phase of the project, called mapping, has been primarily devoted to fragmenting chromosomes into manageable ordered pieces for later high-throughput sequencing. In this period, it has often taken years to locate and characterize individual genes related to human disease. Thus, biologists began to appreciate the value of computing to mapping and sequencing.
A good illustration of the emerging impact of computing on genomics is the search for the gene for adrenoleukodystrophy (related to the disease in the movie Lorenzo's Oil). A team of researchers in Europe spent about two years searching for the gene using standard experimental methods. Then they managed to sequence the region of the chromosome containing the gene. Finally, they sent information on the sequence to the ORNL server containing the ORNL-developed computer program called Gene Recognition and Analysis Internet Link (GRAIL). Within a couple of minutes, GRAIL returned the location of the gene within the sequence.
The Human Genome Project has entered a new phase. Six NIH genome centers were funded recently to begin high-throughput sequencing, and plans are under way for large-scale sequencing efforts at DOE genome centers at Lawrence Berkeley National Laboratory (LBNL), Lawrence Livermore National Laboratory (LLNL), and Los Alamos National Laboratory (LANL); these centers have been integrated to form the Joint Genome Institute. As a result, researchers are now focusing on the challenge of processing and understanding much larger domains of the DNA sequence. It has been estimated that, on average from 1997 to 2003, new sequence of approximately 2 million DNA bases will be produced every day. Each day's sequence will represent approximately 70 new genes and their respective proteins. This information will be made available immediately on the Internet and in central genome databases.
Such information is of immeasurable value to medical researchers, biotechnology firms, the pharmaceutical industry, and researchers in a host of fields ranging from microorganism metabolism to structural biology. Because only a small fraction of genes that cause human genetic disease have been identified, each new gene revealed by genome sequence analysis has the potential to significantly affect human health. Within the human genome is an estimated total of 6000 genes that have a direct impact on the diagnosis and treatment of human genetic diseases. The timely development of diagnostic techniques and treatments for these diseases is worth billions of dollars for the U.S. economy, and computational analysis is a key component that can contribute significantly to the knowledge necessary to effect such developments.
In addition to health-related biotechnology, other application areas of great importance to DOE include bioremediation, waste control, energy supplies, and health risk assessment. Correspondingly, in addition to human DNA sequencing, sequencing of microorganisms and other model organisms that are important to biotechnology is also ramping up at a very rapid rate. For example, the recent sequencing of Methanococcus jannaschii, a methane-producing microorganism from deep-sea volcanic vents that flourishes without sunlight, oxygen, or surrounding organic material, suggests that life has three branches, not two. Such microorganisms, called archaea, are genetically different from bacteria and from eukaryotes (which include plants, animals, and people).
The entire genomes for Haemophilus influenzae and yeast have also been fully sequenced, although the significance of many genes remains a mystery. The potential for the discovery of new enzymes and chemical processes important for biotechnology (e.g., new types of degradative enzymes), as well as new insights into disease-causing microbes, makes these efforts highly valuable economically and socially.
The rate of several megabase pairs per day at which the Human Genome and microorganism sequencing projects will soon be producing data will exceed current sequence analysis capabilities and infrastructure. Sequences are already arriving at a rate and in forms that make analysis very difficult. For example, a recent posting of a large clone (large DNA sequence fragment) by a major genome center was made in several hundred thousand base fragments, rather than as one long sequence, because the sequence database was unable to input the whole sequence as a single long entry. Anyone who wishes to analyze this sequence to determine which genes are present must manually reassemble the sequence from these many small fragments, an absolutely ridiculous task. The sequences of large genomic clones are being routinely posted on the Internet with virtually no comment, analysis, or interpretation; and mechanisms for their entry into public-domain databases are in many cases inadequately defined. Valuable sequences are going unanalyzed because methods and procedures for handling the data are lacking and because current methods for doing analyses are time-consuming and inconvenient. And in real terms, the flood of data is just beginning.
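The reassembly chore itself is entirely mechanical, which is exactly why it should not fall to human readers. A minimal sketch in Python (the indexed-fragment format here is hypothetical, invented for illustration) shows that the whole task reduces to sorting and concatenation:

```python
# Sketch: reassemble a clone posted as many small, ordered fragments.
# The (index, sequence) pairing is a hypothetical posting format.

def reassemble(fragments):
    """Concatenate sequence fragments, given as (index, sequence)
    pairs, into one contiguous sequence string in index order."""
    return "".join(seq for _, seq in sorted(fragments))

# Fragments may arrive in any order; sorting restores the clone.
parts = [(2, "GGTA"), (1, "ACGT"), (3, "TTAC")]
assembled = reassemble(parts)  # "ACGTGGTATTAC"
```

The point is not that the code is hard to write, but that each of thousands of researchers should not have to write and rerun it by hand for every posted clone.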
Computers can be used very effectively to indicate the location of genes and of regions that control the expression of genes and to discover relationships between each new sequence and other known sequences from many different organisms. This process is referred to as sequence annotation. Annotation (the elucidation and description of biologically relevant features in the sequence) is the essential prerequisite before the genome sequence data can become useful, and the quality with which annotation is done will directly affect the value of the sequence. In addition to considerable organizational issues, significant computational challenges must be addressed if the DNA sequences produced are to be successfully annotated. It is clear that new computational methods and a workable process must be implemented for effective and timely analysis and management of these data.
In considering computing related to the large-scale sequence analysis and annotation process, it is useful to examine previously developed models. Procedures for high-throughput analysis have been most notably applied to several microorganisms (e.g., Haemophilus influenzae and Mycoplasma genitalium) using relatively simple methods designed to facilitate basically a single pass through the data (a pipeline that produces a one-time result or report). However, this is too simple a model for analyzing genomes as complex as the human genome. For one thing, the analysis of genomic sequence regions needs to be updated continually through the course of the Genome Project: the analysis is never really done. On any given day, new information relevant to a sequenced gene may show up in any one of many databases, and new links to this information need to be discovered and presented. Additionally, our capabilities for analyzing the sequence will change with time. The analysis of DNA sequences by computer is a relatively immature science, and we in the informatics community will be able to recognize many features (like gene regulatory regions) better in a year than we can now. There will be a significant advantage in reanalyzing sequences and updating our knowledge of them continually as new sequences appear from many organisms, methods improve, and databases with relevant information grow. In this model, sequence annotation is a living thing that will develop richness and improve in quality over the years. The single pass-through pipeline is simply not the appropriate model for human genome analysis, because the rate at which new and relevant information appears is staggering.
Computational Engine for Genomic Sequences
Researchers at ORNL, LBNL, Argonne National Laboratory (ANL), and several other genome laboratories are teaming to design and implement a new kind of computational engine for analysis of large-scale genomic sequences. This sequence analysis engine, which has become a Computational Grand Challenge problem, will integrate a suite of tools on high-performance computing resources and manage the analysis results. In addition to the need for state-of-the-art computers at several supercomputing centers, this analysis system will require dynamic and seamless management of continuous, distributed high-performance computing processes, efficient parallel implementations of a number of new algorithms, complex distributed data mining operations, and the application of new inferencing and visualization methods. A process of analysis that will be started in this engine will not be completed for seven to ten years.
Fig. 1. The sequence analysis engine will input a genomic DNA sequence from many sites using Internet retrieval agents, maintain it in a data warehouse, and facilitate a long-term analysis process using high-performance computing facilities.
Fig. 2. In the sequence analysis engine, a central task manager coordinates analysis tasks such as pattern recognition and gene modeling and also initiates sequence comparison and data mining using multiple external databases.
The data flow in this analysis engine is shown in Fig. 1. Updates of sequence data will be retrieved through the use of Internet retrieval agents and stored in a local data warehouse. Most human genome centers will daily post new sequences on publicly available Internet or World Wide Web sites, and they will establish agreed-upon policies for Internet capture of their data. These data will feed the analysis engine that will return results to the warehouse for use in later or long-term analysis processes, visualization by researchers, and distribution to community databases and genome sequencing centers. Unlike the pipeline analysis model, the warehouse maintains the sequence data, analysis results, and data links so that continual update processes can be made to operate on the data over many years.
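The core of the warehouse step is keeping track of what has already been captured so that each day's polling cycle stores only what is new. A schematic Python sketch (the accession names, posting format, and warehouse structure are all illustrative, not the engine's actual design) conveys the idea:

```python
# Sketch: fold each day's postings into a local warehouse, keeping
# track of which accessions are new. Names and structure are
# illustrative only.

def update_warehouse(warehouse, postings):
    """Add newly posted sequences (accession -> sequence) to the
    warehouse; return the list of accessions that were new."""
    new = [acc for acc in postings if acc not in warehouse]
    for acc in new:
        # Each entry keeps the sequence plus a slot for analysis
        # results that will accumulate over the years.
        warehouse[acc] = {"sequence": postings[acc], "analyses": {}}
    return new

warehouse = {}
day1 = {"HSCLONE01": "ACGT..."}
update_warehouse(warehouse, day1)   # ["HSCLONE01"] is new
day2 = {"HSCLONE01": "ACGT...", "HSCLONE02": "GGCC..."}
update_warehouse(warehouse, day2)   # only ["HSCLONE02"] is new
```

Because entries persist rather than being reported once and discarded, later reanalysis passes can attach new results to the same records, which is precisely what distinguishes the warehouse from a single-pass pipeline.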
The analysis engine will combine a number of processes into a coherent system running on distributed high-performance computing hardware at ORNL's Center for Computational Sciences (CCS), LBNL's National Energy Research Scientific Computing Center, and ANL's Center for Computational Science and Technology facilities. A schematic of these processes is shown in Fig. 2. A process manager will conditionally determine the necessary analysis steps and direct the flow of tasks to massively parallel process resources at these several locations. These processes will include multiple statistical and artificial-intelligence-based pattern-recognition algorithms (for locating genes and other features in the sequence), computation for statistical characterization of sequence domains, gene modeling algorithms to describe the extent and structure of genes, and sequence comparison programs to search databases for other sequences that may provide insight into a gene's function. The process manager will also initiate multiple distributed information retrieval and data mining processes to access remote databases for information relevant to the genes (or corresponding proteins) discovered in a particular DNA sequence region. Five significant technical challenges must be addressed to implement such a system. A discussion of those challenges follows.
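The "conditional" part of the process manager can be pictured very simply: each analysis step is a named computation, and the manager runs whichever steps a sequence record still lacks. The toy Python sketch below (step names and conditions are invented for illustration; the real manager dispatches to parallel machines at several sites) captures that control logic:

```python
# Sketch: a toy process manager that runs only the analysis steps a
# record still needs. Step names and functions are illustrative.

def run_pending(record, steps):
    """Run each named step whose result is not yet stored, so that
    re-invoking the manager later only fills in what is missing."""
    for name, func in steps.items():
        if name not in record["results"]:
            record["results"][name] = func(record["sequence"])
    return record

# Two stand-in "analysis steps": sequence length and G+C fraction.
steps = {
    "length": len,
    "gc": lambda s: (s.count("G") + s.count("C")) / len(s),
}

rec = {"sequence": "ACGTGGCC", "results": {}}
run_pending(rec, steps)
```

Because the manager is idempotent over completed steps, a newly added algorithm can be dropped into the step table years later and applied to every stored region without redoing finished work.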
Seamless high-performance computing. Megabases of DNA sequence being analyzed each day will strain the capacity of existing supercomputing centers. Interoperability between high-performance computing centers will be needed to provide the aggregate computing power, managed through the use of sophisticated resource management tools. The system must be tolerant of machine and network failures so that no data or results are lost.
Parallel algorithms for sequence analysis. The recognition of important features in a sequence, such as genes, must be highly automated to eliminate the need for time-consuming manual gene model building. Five distinct types of algorithms (pattern recognition, statistical measurement, sequence comparison, gene modeling, and data mining) must be combined into a coordinated toolkit to synthesize the complete analysis.
Fig. 3. Approximately 800 bases of DNA sequence (equivalent to 1/3,800,000 of the human genome), containing the first gene coding segment of four in the human Ras gene. The coding portion of the gene is located between bases 1624 and 1774. The remaining DNA around this does not contain a genetic message and is often referred to as junk DNA.
Fig. 4. (a) The artificial neural network used in GRAIL combines the results of a number of statistical tests to locate regions in the sequence that contain portions of genes with the genetic code for proteins. (b) A diagram showing the prediction of GRAIL on the human Ras gene (part of which is shown in Fig. 3). The peaks correspond to gene coding segments, and the connected bars represent the model of the gene predicted by the computer.
One of the key types of algorithms needed is pattern recognition. Methods must be designed to detect the subtle statistical patterns characteristic of biologically important sequence features, such as genes or gene regulatory regions. DNA sequences are remarkably difficult to interpret through visual examination. For example, in Fig. 3, it is virtually impossible to tell that part of the sequence is a gene coding region. However, when examined in the computer, DNA sequence has proven to be a rich source of interesting patterns, having periodic, stochastic, and chaotic properties that vary in different functional domains. These properties and methods to measure them form the basis for recognizing the parts of the sequence that contain important biological features.
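To make the idea concrete, one of the simplest window statistics of the kind such recognizers build on is the G+C fraction computed along a sliding window (coding regions in many genomes are richer in G and C than their surroundings). This Python sketch is only an illustration; the window and step sizes are arbitrary, and real systems such as GRAIL combine many subtler measures:

```python
# Sketch: G+C fraction in a sliding window, one of the simplest
# statistical measures used to flag candidate coding regions.
# Window and step sizes are arbitrary illustrative choices.

def gc_profile(seq, window=120, step=10):
    """Return (start position, GC fraction) pairs across the
    sequence, one pair per window placement."""
    profile = []
    for start in range(0, len(seq) - window + 1, step):
        w = seq[start:start + window]
        gc = (w.count("G") + w.count("C")) / window
        profile.append((start, gc))
    return profile

# Peaks in this profile hint at coding DNA; a real recognizer feeds
# many such measures into a classifier rather than using one alone.
profile = gc_profile("AT" * 30 + "GC" * 60 + "AT" * 30)
```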
In genomics and computational biology, pattern recognition systems often employ artificial neural networks or other similar classifiers to distinguish sequence regions containing a particular feature from those regions that do not. Machine-learning methods allow computer-based systems to learn about patterns from examples in DNA sequence. They have proven to be valuable because our biological understanding of the properties of sequence patterns is very limited. Also, the underlying patterns in the sequence corresponding to genes or other features are often very weak, so several measures must be combined to improve the reliability of the prediction. A well-known example of this is ORNLs GRAIL gene detection system, deployed originally in 1991, which combined seven statistical pattern measures using a simple feed-forward neural network [Fig. 4(a)]. GRAIL is able to determine regions of the sequence that contain genes [Fig. 4(b)], even genes it has never seen before, based on its training from known gene examples.
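The structure of such a network is simple to sketch. In the Python fragment below, several statistical measures go in and a single coding-likelihood score comes out; the weights are placeholders (GRAIL's were learned from known gene examples), and the two-hidden-unit layout is illustrative rather than GRAIL's actual architecture:

```python
import math

# Sketch: a feed-forward network of the GRAIL sort. Seven statistical
# measures in, one coding-likelihood score out. All weights here are
# illustrative placeholders, not trained values.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def feed_forward(measures, hidden_weights, output_weights):
    """One hidden layer of sigmoid units; returns a score in (0, 1),
    where higher values suggest a gene coding region."""
    hidden = [sigmoid(sum(w * m for w, m in zip(ws, measures)))
              for ws in hidden_weights]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)))

# Seven measures (e.g., compositional and periodicity statistics),
# two hidden units, one output unit.
measures = [0.8, 0.1, 0.6, 0.4, 0.9, 0.2, 0.7]
hidden_w = [[0.5] * 7, [-0.3] * 7]
score = feed_forward(measures, hidden_w, [1.2, -0.8])
```

Training adjusts the weights so that windows over known coding regions score high and others score low; the trained network then generalizes to genes it has never seen.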
High-speed sequence comparison represents another important class of algorithms used to compare one DNA or protein sequence with another in a way that extracts how and where the two sequences are similar. Many organisms share many of the same basic genes and proteins, and information about a gene or protein in one organism provides insight into the function of its relatives or homologs in other organisms. Experiments in simpler organisms often provide insight into the importance of a gene in humans, so sequence comparison is a very important tool. Often the most accurate and sensitive methods for making this comparison are carried out using massively parallel computational platforms. To get a sense of the scale, examination of the relationship between 2 megabases of sequence (one day's finished sequence) and a single database of known gene sequence fragments (called ESTs) requires the calculation of 10^15 DNA base comparisons. And there are quite a number of databases to consider.
Fig. 5. Alignments of the protein sequences in the Ras family. First is the alignment of human Ras and mouse Ras, followed by human Ras with a related protein in yeast. The relationship in the latter case is much more difficult to detect, requiring computationally intensive sequence comparison methods. The letters in the sequence represent the one-letter code for the 20 amino acids.
Two examples of sequence comparison for members of the same protein family are shown in Fig. 5. One shows a very close relative of the human protein sequence query; the second shows a much weaker (and evolutionarily more distant) relationship. The sequence databases (which contain sequences used for such comparisons) are growing at an exponential rate, making it necessary to apply ever-increasing computational power to this problem.
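The computational core behind the most sensitive of these comparisons is dynamic programming, as in the classic Smith-Waterman local alignment method. The Python sketch below computes only the best local alignment score (the scoring values are illustrative, and production codes add traceback, affine gaps, and substitution matrices), but it shows why the work scales as the product of the two sequence lengths:

```python
# Sketch: Smith-Waterman local alignment score. Every cell of an
# (len(a)+1) x (len(b)+1) table is filled in, which is why comparing
# megabases against whole databases demands parallel hardware.
# Scoring values below are illustrative.

def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # A cell never goes below zero: that is what makes the
            # alignment "local" rather than end-to-end.
            h[i][j] = max(0,
                          h[i - 1][j - 1] + s,   # match/mismatch
                          h[i - 1][j] + gap,     # gap in b
                          h[i][j - 1] + gap)     # gap in a
            best = max(best, h[i][j])
    return best
```

For two sequences of lengths m and n, the table has m times n cells, so 2 megabases of new sequence against a large EST database yields the enormous comparison counts cited above.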
Data mining and information retrieval. Methods are needed to locate and retrieve information relevant to newly discovered genes. If similar genes or proteins are discovered through sequence comparison, often experiments have been performed on one or more homologs that can provide insight into the newly discovered gene or protein. Relevant information is contained in more than 100 databases scattered throughout the world, including DNA and protein sequence databases, genome mapping databases, metabolic pathway databases, gene expression databases, gene function and phenotype databases, and protein structure databases. These data can provide insight into a gene's biochemical or whole organism function, pattern of expression in tissues, protein structure type or class, functional family, metabolic role, and potential relationship to disease phenotypes. Using the Internet, researchers are developing automated methods to retrieve, collate, fuse, and dynamically link such database information to new regions of sequence. This is an important component of ORNL's Functional Genomics Initiative, because it helps to link experimental studies in the mouse to the larger context of the world's genomic information.
The target data resources are very heterogeneous (i.e., structured in a variety of ways), and some are merely text-based and poorly formatted, making the identification of relevant information and its retrieval difficult. Intelligent information retrieval technology is being applied to this domain to improve the reliability of such systems. One challenge here is that information relevant to an important gene or protein may appear in any database at any time. As a result, systems now being developed dynamically update the descriptions of genes and proteins in our data warehouse and continually poll remote data resources for new information.
Data warehousing. The information retrieved by intelligent agents or calculated by the analysis system must be collected and stored in a local repository from which it can be retrieved and used in further analysis processes, seen by researchers, or downloaded into community databases. Numerous data of many types need to be stored and managed in such a way that descriptions of genomic regions and links to external data can be maintained and updated continually. In addition, large volumes of data in the warehouse must be accessible to the analysis systems running at multiple sites at a moment's notice. Our plan involves using the High-Performance Storage System being implemented at ORNL's CCS.
Fig. 6. Diagram showing a region of 60 kilobases of sequence (horizontal axis) with several genes predicted by GRAIL (the up and down peaks represent gene coding segments of the forward and reverse strand of the double-stranded DNA helix). Visualization tools like the GRAIL interface provide researchers with an overview of large genomic regions and often have hyperlinks to underlying detailed information.
Visualization for data and collaboration. The sheer volume and complexity of the analyzed information and links to data in many remote databases require advanced data visualization methods to allow user access to the data. Users need to interface with the raw sequence data; the analysis process; the resulting synthesis of gene models, features, patterns, genome map data, and anatomical or disease phenotypes; and other relevant data. In addition, collaborations among multiple sites are required for most large genome analysis problems, so collaboration tools, such as video conferencing and electronic notebooks, are very useful. A display of several genes and other features from our GRAIL Internet server is shown in Fig. 6. Even more complex and hierarchical displays are being developed that will be able to zoom in from each chromosome to see the chromosome fragments (or clones) that have been sequenced and then display the genes and other functional features at the sequence level. Linked (or hyperlinked) to each feature will be detailed information about its properties, the computational or experimental methods used for its characterization, and further links to many remote databases that contain additional information. Analysis processes and intelligent retrieval agents will provide the feature details available in the interface and dynamically construct links to remote data.
The development of the sequence analysis engine is part of a rapid shift in the biological sciences toward a paradigm that makes much greater use of computation, networking, simulation and modeling, and sophisticated data management systems. Unlike any other existing system in the genomics arena, the sequence analysis engine will link components for data input, analysis, storage, update, and submission in a single distributed high-performance framework that is designed to carry out a dynamic and continual discovery process over a 10-year period. The approach outlined is flexible enough to use a variety of hardware and data resources; configure analysis steps, triggers, conditions, and updates; and provide the means to maintain and update the description of each genomic region. Users (individuals and large-scale producers) can specify recurrent analysis and data mining operations that continue for years. The combined computational process will provide new knowledge about the genome on a scale that is impossible for individual researchers using current methods. Such a process is absolutely necessary to keep up with the flood of data that will be gathered over the remainder of the Human Genome Project.
EDWARD C. UBERBACHER is head of the Computational Biosciences Section in ORNL's Life Sciences Division. He received his Ph.D. degree in chemistry from the University of Pennsylvania. In 1980, he conducted postdoctoral studies at the University of Pennsylvania Department of Biophysics and ORNL's Biology Division through the University of Tennessee–ORNL Graduate School of Biomedical Sciences. In this work, he investigated the structure and function of genetic materials using crystallography and tomographic image reconstruction in the electron microscope. In 1985 he became a consultant at ORNL's Center for Small-Angle Scattering Research, pursuing structural and dynamic studies of macromolecules in solution through use of neutron and X-ray scattering techniques. In 1987, he also became a research assistant professor at the Graduate School of Biomedical Sciences and an investigator in the Biology Division, where he focused on X-ray and neutron crystallography, scattering, and other biophysical methods. In 1988 he became a consultant in ORNL's Engineering Physics and Mathematics (EP&M) Division, where he developed artificial intelligence and high-performance computing methods for genomic DNA sequence analysis. In 1991 he joined the staff of the EP&M Division as the Informatics Group leader and received an R&D 100 award for the development of the Gene Recognition and Analysis Internet Link (GRAIL) system for analyzing DNA sequences. In 1997 he assumed his current position at ORNL. He is also an adjunct associate professor in the Graduate School of Biomedical Sciences at the University of Tennessee at Knoxville.