Terry Gaasterland, Natalia Maltsev, Ross Overbeek, Evgeni Selkov
Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439
Through PUMA, MAGPIE, and metabolic reconstruction algorithms, we carry genome interpretation beyond the identification of gene products to a customized view of an organism's functional properties.
MAGPIE is a system designed to reside locally at the site of a genome project and actively carry out analysis of genome sequence data as it is generated.[1] DNA sequences produced in a sequencing project mature through a series of stages, each of which requires different analysis activities. Even after DNA has been assembled into contiguous fragments and eventually into a single genome, it must be regularly reanalyzed, because any new data in public sequence databases may provide clues to the identity of genes. Over a year, for 2 megabases with 4-fold coverage, MAGPIE will request on the order of 100,000 outputs from remote analysis software, manipulate and manage the output, update the current analysis of the sequence data,[3] and monitor the project sequence data for changes that initiate reanalysis.
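The change monitoring described above can be illustrated with a minimal sketch. It is not MAGPIE's implementation: the function names, the use of SHA-256 digests, and the dictionary layout are all assumptions made for illustration. The sketch covers only changes in the project's own sequence data, not updates to public databases.

```python
import hashlib


def digest(sequence):
    """Return a fingerprint of a DNA sequence (hypothetical helper)."""
    return hashlib.sha256(sequence.upper().encode("ascii")).hexdigest()


def sequences_needing_reanalysis(current, analyzed_digests):
    """List the ids of sequences that are new or changed since last analysis.

    current          -- dict mapping sequence id -> current sequence string
    analyzed_digests -- dict mapping sequence id -> digest recorded when the
                        sequence was last analyzed (assumed bookkeeping)
    """
    stale = []
    for seq_id, sequence in current.items():
        # A missing or different digest means the sequence must be reanalyzed.
        if analyzed_digests.get(seq_id) != digest(sequence):
            stale.append(seq_id)
    return sorted(stale)
```

A caller would record the new digest for each sequence once its reanalysis completes, so that unchanged sequences are skipped on the next pass.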
PUMA is a Web-based system offering integrated access to metabolic pathways, multiple sequence alignments, compounds, sequences, and gene products, together with a general overview function. Effective interpretation of genomic sequence requires a functional overview, the ability to embed sequence data within a metabolic framework, alignments that integrate specific genes and corresponding proteins within a broader context, and a phylogenetic perspective. Beyond creating an integrated universe of biological data relevant to sequence interpretation, PUMA aims to support customized functional overviews for a large number of organisms. Over 200 such customized overviews are supported in the current release; these include every organism that is well represented in the sequence databases. A PUMA functional overview for an organism is generated by projecting the functions that have been assigned to its gene products onto a general functional overview.
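As a rough illustration of this projection step, the sketch below treats the general functional overview as a mapping from category names to sets of functions and keeps only the functions assigned to an organism's gene products. The data layout and the name `project_overview` are assumptions for illustration, not PUMA's actual representation.

```python
def project_overview(general_overview, assigned_functions):
    """Project an organism's assigned functions onto a general overview.

    general_overview   -- dict mapping category name -> set of function names
    assigned_functions -- set of function names assigned to the organism's
                          gene products
    Returns a dict with only the categories that contain at least one
    assigned function, each mapped to a sorted list of those functions.
    """
    projected = {}
    for category, functions in general_overview.items():
        present = functions & assigned_functions
        if present:
            projected[category] = sorted(present)
    return projected
```

Categories with no assigned function are dropped here; a fuller overview might instead keep them as empty slots to show what has not yet been found.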
General functional overviews can be constructed around a number of possible perspectives. The PUMA functional overview is motivated by the need for functional slots to hold the 12,000 alignments (from the collection generated and maintained by Randy Smith and his colleagues) and the 1,600 metabolic pathway diagrams (extracted and provided by Evgeni Selkov from the Enzymes and Metabolic Pathways database). Since no single perspective determines an obviously superior functional organization, PUMA implements and graphically presents alternative organizations.
Once the functional overview has been established, it remains to pinpoint the organism's exact metabolic pathways and establish how they interact.[2] This task, which we call metabolic reconstruction, begins by producing a set of established enzymes (i.e., enzymes with strong similarities in identified coding regions to existing sequences for which the enzymatic function is known) and putative enzymes (i.e., enzymes with weak similarity to sequences of known function). From these initial "hits", within a phylogenetic perspective, we identify an initial set of pathways. This set can be used to generate a set of expected enzymes (i.e., enzymes that have not been clearly detected, but that would be expected given the set of hypothesized pathways) and missing enzymes (i.e., enzymes that occur in the pathways but for which no sequence has yet been biochemically identified for any organism). Further reasoning identifies tentative connective pathways, as well as pathways that are necessary given the organism's growth medium requirements.
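The four enzyme categories above can be made concrete with a small sketch. It is an illustration under stated assumptions, not the authors' reconstruction algorithm: similarity is reduced to a single score in [0, 1], the thresholds `strong`, `weak`, and `min_support` are hypothetical, a pathway is hypothesized when a minimum fraction of its enzymes has a hit, and the phylogenetic perspective is omitted.

```python
def classify_enzymes(hit_scores, pathways, sequenced_enzymes,
                     strong=0.8, weak=0.4, min_support=0.5):
    """Sort enzymes into established, putative, expected, and missing.

    hit_scores        -- dict enzyme -> best similarity score (assumed 0..1)
    pathways          -- dict pathway name -> set of enzymes in the pathway
    sequenced_enzymes -- enzymes with a known sequence in some organism
    All three thresholds are hypothetical values chosen for illustration.
    """
    established = {e for e, s in hit_scores.items() if s >= strong}
    putative = {e for e, s in hit_scores.items() if weak <= s < strong}
    detected = established | putative

    # Hypothesize a pathway when enough of its enzymes were detected.
    hypothesized = {
        name: enzymes for name, enzymes in pathways.items()
        if enzymes and len(enzymes & detected) / len(enzymes) >= min_support
    }

    expected, missing = set(), set()
    for enzymes in hypothesized.values():
        for enzyme in enzymes - detected:
            # Undetected enzymes with a known sequence elsewhere are expected;
            # those with no known sequence in any organism are missing.
            if enzyme in sequenced_enzymes:
                expected.add(enzyme)
            else:
                missing.add(enzyme)

    return {
        "established": established,
        "putative": putative,
        "expected": expected,
        "missing": missing,
        "pathways": set(hypothesized),
    }
```

In this sketch the expected enzymes are natural targets for a more sensitive search of the genome, while the missing enzymes cannot be found by sequence similarity at all.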
*Work supported in part by a grant from the Director, Office of Energy Research, Office of Health and Environmental Research of the U.S. Department of Energy under contract XX-XX-NN-NNNNN.
[1] T. Gaasterland and C. Sensen. MAGPIE: A Multipurpose Automated Genome Project Investigation Environment for Ongoing Sequencing Projects. In Bacterial Genomes: Physical Structure and Analysis, ed. G. Weinstock et al. (to appear).
[2] T. Gaasterland, J. Lobo, N. Maltsev, and G. Chen. Assigning Function to CDS Through Qualified Query Answering. In Proc. 2nd Int. Conf. Intell. Syst. for Mol. Bio., Stanford U. (1994).
[3] T. Gaasterland and E. Selkov. Automatic Reconstruction of Metabolic Structure from Incomplete Genome Sequence Data. In Proc. Int. Conf. Intell. Syst. for Mol. Bio., Cambridge, England (1995).