DOE Human Genome Program Contractor-Grantee
88. Genome Information Warehouse: Information and Databases to Support Comprehensive Genome Analysis and Annotation
Miriam Land, Denise Schmoyer, Morey Parang, Jay Snoddy, Sergey Petrov, Richard Mural, and Ed Uberbacher
Computational Biosciences and Toxicology and Risk Analysis Sections, Life Sciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37830
The Genome Information Warehouse (GIW) supports the ORNL-based genome annotation and analysis effort by integrating experimental data and computational predictions within a single framework. GIW is a heterogeneous collection of databases and data stores. Its primary purpose is to provide data management for user interfaces and other analytical functions for genome information and genome sequence annotation. User interfaces currently supported by this data warehouse include Genome Channel, Genome Catalog, the U.S. node of the Genome DataBase, and an SRS mirror of community databases. The information found in GIW includes comprehensive annotation for human and mouse genomic sequences and completed microbial genomes. While the genomic sequences themselves are available from NCBI, EBI, or DDBJ, the genome features that can be inferred from each sequence, especially predicted genes and proteins, are not being annotated at a rate that matches the rate of sequencing. As the world's knowledge-base about genes, proteins, and their interrelationships continues to grow, new insights can be gained by analyzing and reanalyzing all existing data with a consistent, managed process. One function of GIW is to provide automated operational support for a consistent annotation process that uses the Analysis Pipeline and its analysis tools to acquire this information.
GIW assumes that computed and annotated links are not permanent. Because the underlying databases and knowledge change, results are likely to change as well. For example, archival data sets such as the nonredundant database (NR) at NCBI continue to grow and change, so a new Blast analysis of a specific gene model can identify additional proteins with good similarity. As the knowledge-base about genes grows, gene modeling methods continue to be refined and improved, which provides the impetus for recalculating gene predictions. Libraries of BAC ends and repetitive sequences also continue to grow, and reexamining established sequences against them can yield new analysis insights. GIW supports rerunning annotation in order to give researchers information and insight that were not available when a sequence was first published.
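The reannotation rationale above can be illustrated with a minimal sketch in Python. It is not the GIW implementation: it assumes, purely for illustration, that each annotation run records which release of each reference data set it used, and it flags runs whose reference data has since changed. The function name `stale_runs`, the record layout, and the sequence and release labels are all hypothetical.

```python
def stale_runs(runs, current_releases):
    """Return the annotation runs whose reference data has changed since they ran.

    `runs` is a list of dicts with a "references" mapping of
    data set name -> release label used by that run (hypothetical layout).
    """
    return [
        run for run in runs
        if any(current_releases.get(db) != release
               for db, release in run["references"].items())
    ]


# Made-up bookkeeping records; names and release labels are illustrative only.
runs = [
    {"sequence": "contig_A", "tool": "blast", "references": {"nr": "release-1"}},
    {"sequence": "contig_B", "tool": "blast", "references": {"nr": "release-2"}},
    {"sequence": "contig_C", "tool": "repeat_scan", "references": {"repeats": "release-1"}},
]
current_releases = {"nr": "release-2", "repeats": "release-1"}

to_rerun = stale_runs(runs, current_releases)
print([run["sequence"] for run in to_rerun])  # ['contig_A']
```

Only `contig_A` is selected, because it was analyzed against an older release of the hypothetical `nr` data set than the one now current.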
A significant challenge for GIW is to reanalyze existing sequences in a timely fashion while keeping the underlying archival data from legacy databases current. Many of these critical archival databases lack a robust update mechanism; for example, new and modified sequences from NCBI must be recognized and processed, and must not be confused with previous versions of the same sequences or contigs. Changes to underlying databases may also occur during an analysis cycle. To maintain consistency across all sequences, we need to create analysis versions, or epochs, each of which uses a consistent archival dataset. Another challenge is to present rapidly evolving information to the user in a way that provides some consistency in navigation and retrieval of data. A further major challenge is to continue developing flexible data structures for biology that can adapt both to the evolving understanding of how biological entities relate to each other and to newly desired user functions.
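One way to picture the epoch idea is the following Python sketch. It is an illustration under stated assumptions, not a description of how GIW is built: sequences are assumed to be identified by an accession plus an integer version, every version is retained, and an epoch is simply a frozen list of the versions that were current when it was opened. The class and function names (`SequenceStore`, `Epoch`, `open_epoch`, `sequences_for`) and the accession `AC000001` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SequenceRecord:
    accession: str
    version: int
    residues: str


class SequenceStore:
    """Keeps every version of a sequence so that older versions are never overwritten."""

    def __init__(self):
        self._records = {}  # (accession, version) -> SequenceRecord

    def add(self, record):
        key = (record.accession, record.version)
        existing = self._records.get(key)
        if existing is not None and existing != record:
            # Same accession and version but different content: refuse to guess.
            raise ValueError(f"conflicting content for {key}")
        self._records[key] = record

    def get(self, accession, version):
        return self._records[(accession, version)]

    def latest_versions(self):
        """Map each accession to the highest version currently stored."""
        latest = {}
        for accession, version in self._records:
            if version > latest.get(accession, -1):
                latest[accession] = version
        return latest


@dataclass(frozen=True)
class Epoch:
    label: str
    pinned: tuple  # sorted (accession, version) pairs fixed when the epoch opens


def open_epoch(label, store):
    """Freeze the versions that are current right now into a new epoch."""
    return Epoch(label, tuple(sorted(store.latest_versions().items())))


def sequences_for(epoch, store):
    """Return exactly the records pinned by the epoch, ignoring later updates."""
    return [store.get(accession, version) for accession, version in epoch.pinned]


store = SequenceStore()
store.add(SequenceRecord("AC000001", 1, "ACGT"))
epoch_1 = open_epoch("epoch-1", store)

# An update arrives while epoch-1 analyses are still running.
store.add(SequenceRecord("AC000001", 2, "ACGTT"))
epoch_2 = open_epoch("epoch-2", store)

print([r.version for r in sequences_for(epoch_1, store)])  # [1]
print([r.version for r in sequences_for(epoch_2, store)])  # [2]
```

Because an epoch pins explicit versions rather than "whatever is newest", an update that arrives in the middle of an analysis cycle does not change what that cycle sees; it is picked up only when the next epoch is opened.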
The GIW primarily uses Oracle 8i to store and manage new experimental and computational data that is created at ORNL. Archival data from other legacy databases in GIW is stored and managed with SRS, flat files, GDB (Sybase-backed), XML files, and others; this archival data must be stored and updated to facilitate the value-added computational cross-linking and annotation.
A few of the completed user interfaces to these GIW databases can be accessed through http://genome.ornl.gov/.
(Research sponsored by the Office of Biological and Environmental Research, USDOE under contract number DE-AC05-96OR22464 with Lockheed Martin Energy Research Corp.)