DOE Human Genome Program Contractor-Grantee
87. Infrastructure and Tools for High Throughput Computational Genome Analysis
Doug Hyatt, Phil Locascio, Victor Olman, Manesh Shah, and Inna Vokler
Computational Biosciences Section, Life Sciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831
The Computational Biosciences Section at Oak Ridge National Laboratory provides computational genome analysis resources to the DOE Joint Genome Institute, other major genome centers, and the international biology community. These resources are also used internally to support the analysis of sequences in the ORNL Genome Channel and Genome Catalog systems. With the imminent publication of the draft human genome sequence in the spring of 2000, the challenges of computationally analyzing biological data have become critical. We have constructed a computational infrastructure to meet these new demands for processing sequence and other biological data, both for genome centers and for the biological community at large. Building on OBER's timely investment in a high-performance computing resource at ORNL, we have developed the Genomic Integrated Supercomputing Toolkit (GIST) to address this critical throughput challenge and to provide advanced capabilities for the Genome Analysis Toolkit (GAT). Both systems are described below.
Genome Analysis Toolkit (GAT)
The Genome Analysis Toolkit incorporates a wide variety of analysis tools: exon and gene prediction tools, other feature recognition systems, and database homology search systems. The exon and gene recognition systems include Grail, GrailExp, and Genscan, as well as the microbial gene prediction systems Generation and Glimmer. The Grail suite of tools, which detects CpG islands, polyA sites, and simple and complex repeats and performs BAC-end analysis, has also been incorporated, along with the NCBI STS e-PCR, RepeatMasker, and tRNAscan-SE systems. Database homology systems include NCBI BLAST with BEAUTY post-processing. Supported organisms include human, mouse, Arabidopsis, Drosophila, and most sequenced microbial organisms.
Access to these resources is provided by the GAT client-server system. The Genome Analysis Toolkit is structured as a layered system. The innermost layer is the tool layer, which comprises the binary executables for the individual tools and the associated configuration and data files those tools require. The binaries are compiled for all supported hardware platforms and operating systems. The service layer, implemented in Perl, provides a platform-independent mode of tool execution. When a service script is invoked by the server, it determines the platform on which it is running and calls the appropriate tool binary. Rigorous error checking has been added at this layer to guarantee that errors in tool execution are caught and reported to the server.
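The service layer's behavior can be sketched as follows. This is an illustrative reconstruction in Python rather than the actual Perl service scripts; the tool names, binary paths, and dictionary-based lookup are assumptions made for the example.

```python
import platform
import subprocess

# Hypothetical mapping from (tool, platform) to a compiled binary.
# The real GAT service layer ships binaries for each supported
# platform; these paths are illustrative only.
TOOL_BINARIES = {
    ("grail", "Linux"): "tools/linux/grail",
    ("grail", "AIX"): "tools/aix/grail",
}

def run_tool(tool, input_file):
    """Select the binary for the current platform, run it, and
    report any execution error back to the calling server."""
    system = platform.system()
    binary = TOOL_BINARIES.get((tool, system))
    if binary is None:
        raise RuntimeError(f"no {tool} binary for platform {system}")
    result = subprocess.run([binary, input_file],
                            capture_output=True, text=True)
    if result.returncode != 0:
        # Rigorous error checking: tool failures are surfaced,
        # never silently dropped.
        raise RuntimeError(f"{tool} failed: {result.stderr.strip()}")
    return result.stdout
```

The key design point is that the caller never names a platform-specific binary; the service layer resolves it at invocation time, so the same request works on any supported host.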
Access to individual services is provided through a master-slave server layer. The master server receives all analysis requests from clients and distributes them among the heterogeneous pool of slave machines to best utilize the available compute resources and to achieve optimal throughput. Compute-intensive analysis tasks like BLAST searches are directed to the GIST server, running on ORNL's IBM RS/6000 SP infrastructure, described below.
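The master server's dispatch policy described above might be sketched as follows. The least-loaded selection, the task-name routing, and the "gist" destination label are assumptions for illustration; the abstract does not specify the actual scheduling algorithm or protocol.

```python
import heapq

class Master:
    """Illustrative master-server dispatcher: compute-intensive
    tasks go to the GIST server, everything else to the
    least-loaded slave in the heterogeneous pool."""

    COMPUTE_INTENSIVE = {"blastn", "blastp", "blastx"}

    def __init__(self, slaves):
        # Min-heap of (pending task count, slave name).
        self.pool = [(0, s) for s in slaves]
        heapq.heapify(self.pool)

    def dispatch(self, task):
        if task in self.COMPUTE_INTENSIVE:
            return "gist"  # route heavy searches to the SP infrastructure
        load, slave = heapq.heappop(self.pool)
        heapq.heappush(self.pool, (load + 1, slave))
        return slave
```

Routing by load rather than round-robin matters in a heterogeneous pool: a slow machine simply accumulates fewer pending tasks instead of becoming a bottleneck.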
A generic, platform-independent command-line client interface, written in Perl, can be used to submit individual analysis requests to the server. A specialized batch-processing tool, ornl_pipeline, has been developed to facilitate the specification of customized analysis pipelines. On invocation, ornl_pipeline reads a user-specified configuration file consisting of a set of analysis directives; a single directive can specify a logical chain of analyses to be performed on a given sequence. The pipeline then interacts with the server, submitting the specified requests along with the associated input data and collecting the server responses. The output of one analysis is typically fed as input to the next analysis in the chain, in pipelined fashion. All results are then suitably organized and reported.
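The chained evaluation of a directive can be sketched as below. The "step1 | step2" directive syntax and the toy analysis functions are invented for illustration; the real ornl_pipeline configuration format is not described in the abstract.

```python
def run_chain(directive, sequence, analyses):
    """Run each analysis named in the directive in order, feeding
    the output of one step as the input to the next (a hypothetical
    'a | b | c' chain syntax is assumed here)."""
    data = sequence
    for step in (s.strip() for s in directive.split("|")):
        data = analyses[step](data)
    return data

# Toy stand-ins for real analyses: mask a repeat, then "predict"
# by reporting positions of A residues in the masked sequence.
analyses = {
    "mask": lambda seq: seq.replace("TTTT", "NNNN"),
    "predict": lambda seq: [i for i, c in enumerate(seq) if c == "A"],
}
result = run_chain("mask | predict", "AATTTTGA", analyses)
```

This mirrors the pipelining described above: each directive step consumes the previous step's output, and the client only sees the organized final result.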
GIST (Genomic Integrated Supercomputing Toolkit)
The initial tools included in GIST are a framework of high-performance biological application servers that include massively parallel BLAST codes (versions of BLASTN, BLASTP, and BLASTX), which are at the heart of analysis processes such as gene modeling with GRAIL-EXP. We are currently adding gene modeling tools (e.g., GRAIL-EXP) and plan to add multiple sequence alignment, protein classification, protein threading, and phylogeny reconstruction (for both gene trees and species trees).
The GIST resources are utilized by the GAT server transparently, permitting the gradual introduction of new algorithms and tools without jeopardizing existing operations. Because the query infrastructure is logically decoupled, the resulting system scales well and has many fault-tolerant characteristics. In tests running multiple instances of tools that require BLAST, we have demonstrated that removing any dependent service does not cause loss of data. Instead, when processing power is removed, services degrade gracefully as long as some instance of the service remains available, and options permit "never fail" operation to cope with network failures and long-running operations. GIST's logical structure can be thought of as having three overall components: client, administrator, and server. All components share a common infrastructure consisting of a naming service and a query agent, with the administrator exercising policy control over agent behavior and the namespace profile.
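The graceful-degradation behavior can be sketched as a query agent that retries a request against whichever service instances the naming service still lists. The function names and the use of ConnectionError to model an unreachable instance are assumptions for illustration, not GIST's actual interfaces.

```python
def query_with_failover(request, instances, send):
    """Try each live service instance in turn; the request is lost
    only if no instance at all remains available.

    `instances` stands in for the list a naming service would
    return; `send` is a hypothetical transport callable that raises
    ConnectionError when an instance is unreachable."""
    last_error = None
    for instance in instances:
        try:
            return send(instance, request)
        except ConnectionError as e:
            # Instance is down: degrade gracefully and keep trying
            # rather than dropping the request.
            last_error = e
    raise RuntimeError("no service instance available") from last_error
```

Removing a machine from the pool thus only reduces throughput; results are lost only in the total-outage case, which the "never fail" options are meant to cover by queueing and retrying.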
The tools and servers are transparent to the user but able to manage the large amounts of processing and data produced in the various stages of enriching experimental biological information with computational analysis. The goal of GIST is not only to provide one-stop shopping to a genome sequence-data framework and interoperable tools but also to run the codes in the toolkit on platforms where the kinds of questions users can ask are not greatly affected by hardware limitations.
Located at Oak Ridge National Laboratory within both the Center for Computational Sciences and the Computational Biosciences Section, the computational infrastructure consists of the centerpiece IBM SP3, several SGI SMP machines, a DEC Alpha workstation cluster, and a trial Linux PC cluster. We are rapidly approaching beta-stage deployment testing; after evaluating performance and stability, we hope to deploy the framework at NERSC, at other high-performance computing sites, and with other collaborators.
(Research sponsored by the Office of Biological and Environmental Research, USDOE under contract number DE-AC05-96OR22464 with Lockheed Martin Energy Research Corp.)