Human Genome 1993 Program Report: Informatics
Date Published: March 1994
*DBEMP: Data Base on Enzymes and Metabolic Pathways
Evgenij E. Selkov
Laboratory for Mathematical Simulation of Multienzyme Systems; Institute of Theoretical and Experimental Biophysics; Russian Academy of Sciences; 142292 Pushchino, Moscow Region, Russia
The ability to relate sequence data to a comprehensive representation of functional gene roles and coded proteins is highly desirable. Data describing metabolic pathways and their regulation will also be vital in creating a framework for studying disorders that result from disruptions of specific metabolic systems. Since the early 1980s, the research group under E. Selkov has been encoding quantitative data from more than 14,000 journal articles into the DataBase on Enzymes and Metabolic Pathways (DBEMP). This database contains by far the most extensive data set relating to enzymes and metabolism and provides a critical functional context for the emerging volume of sequence data.
During 1993, the Russian team achieved the following.
The collaborating team at Argonne National Laboratory (ANL) has incorporated selections from DBEMP into the ANL Genobase, a system of integrated biological databases that allows users to query numerous data repositories about genetic sequences and proteins. As a result, 192 selected metabolic pathways have been integrated with the Swiss Protein Data Bank, portions of the EMBL Nucleic Acid Data Bank (including sequences for E. coli and five other organisms), the Enzyme Data Bank, the Blocks database, the ECO2DB data bank, and the Prosite motif-pattern data bank.
New Approaches to Recognizing Functional Domains in Biological Sequences
Gary D. Stormo
Department of Molecular, Cellular, and Developmental Biology; University of Colorado; Boulder, CO 80309-0347
303/492-1476, Fax: -7744, Internet: stormo@boulder.colorado.edu
Problems in identifying coding regions and other important functional domains in genomic DNA sequences will be approached using a combination of dynamic programming and neural network methods. Dynamic programming returns optimal partitioning of sequences into regions of different classes, given a particular weighting of evidence for those classes. The neural network is used to find the weights that maximize the performance of dynamic programming predictions. Dynamic programming can also be used to obtain suboptimal sequence partitioning, which can be very effective in assessing prediction reliability in different regions and in providing alternative partitioning models in cases where a high degree of reliability is not achieved. Combining an optimization procedure like dynamic programming with a machine-learning procedure like neural networks should be applicable to a wide range of problems beyond those being studied in this project.
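The partitioning step described above can be illustrated with a minimal sketch. It assumes that per-position evidence scores for each class are already available (in the project these weights would come from the neural network, which is not modeled here). The function name `segment`, the `switch_penalty` parameter, and the class labels are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Sequence


def segment(scores: Sequence[Dict[str, float]], switch_penalty: float) -> List[str]:
    """Label every position with a class so that the total score is maximal.

    scores[i][c] is the assumed evidence for class c at position i.
    switch_penalty is subtracted each time the label changes between
    neighbouring positions, which favours contiguous regions.
    """
    if not scores:
        return []
    classes = list(scores[0])
    best = {c: scores[0][c] for c in classes}  # best score ending in class c
    back: List[Dict[str, str]] = []            # back-pointers for the traceback

    def step_cost(prev: str, cur: str) -> float:
        return 0.0 if prev == cur else switch_penalty

    for row in scores[1:]:
        new_best: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for c in classes:
            # Choose the best predecessor class for ending in class c here.
            prev = max(classes, key=lambda p: best[p] - step_cost(p, c))
            new_best[c] = best[prev] - step_cost(prev, c) + row[c]
            pointers[c] = prev
        best = new_best
        back.append(pointers)

    # Trace back from the best final class to recover the optimal labelling.
    label = max(classes, key=lambda c: best[c])
    labels = [label]
    for pointers in reversed(back):
        label = pointers[label]
        labels.append(label)
    labels.reverse()
    return labels


example = [
    {"coding": 1.0, "noncoding": 0.0},
    {"coding": 1.0, "noncoding": 0.0},
    {"coding": 0.4, "noncoding": 0.6},
    {"coding": 1.0, "noncoding": 0.0},
]
print(segment(example, switch_penalty=1.0))
```

With a nonzero penalty, the weak noncoding evidence at the third position is absorbed into the surrounding coding region. Setting the penalty to zero labels each position independently. Suboptimal partitionings, as mentioned in the abstract, would require keeping more than one candidate per class and are not covered by this sketch.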
Human Genome Center Informatics Group
Edward H. Theil, Arun Aggarwal, Donn Davy, Suzanna Lewis, Victor Markowitz, John McCarthy, Sam Pitluck, Eugene Veklerov, and Manfred Zorn
Human Genome Center; Lawrence Berkeley Laboratory; Berkeley, CA 94720
510/486-7501, Fax: -5936, Internet: ehtheil@lbl.gov
The Informatics Group of the Lawrence Berkeley Laboratory (LBL) Human Genome Center develops software to address problems related to the electronic capture, representation, and organization of data generated by the genome laboratories. The group works with biologists and engineers at the center to provide many forms of computer assistance, including custom software, access to external databases, and tools for portability of data. In addition, the group is concerned with longer-range problems of sequence and clone assembly, database design, and tools for data management.
The Flydb database has been developed to represent physical map information for the model organism Drosophila melanogaster. Flydb, which can display physical maps and in situ images, contains a contig-assembly algorithm based on the clone-limited sequence tagged site (STS) mapping strategy used at LBL. Another database, 21Bdb, displays physical maps of chromosome 21 based on STS markers and includes yeast artificial chromosomes, polymorphic repeats, and P1 clones known to contain these markers. 21Bdb also allows maps to be manipulated by moving STSs from one relative ordering to another with "click and drag." These databases support the full functionality associated with ACEDB (Caenorhabditis elegans database) as well as the ability to retrieve digitized images used in the mapping process.
Other projects include a sequence-assembly system based on the directed strategy used at LBL (a collaboration with Baylor College of Medicine), new ways to visualize the results of sequence analysis, and high-level tools to model laboratory protocols as part of data flow.
Software for Sequence Assembly Based on the Directed Approach
Eugene Veklerov, Suzanna Lewis, Christopher Martin, Sam Pitluck, and Edward Theil
Human Genome Computing Group; Lawrence Berkeley Laboratory; Berkeley, CA 94720
Veklerov: 510/486-7532, Fax: -6816, Internet: veklerov@cse.lbl.gov
Martin: 510/486-5654, Fax: -6816, Internet: chrism@guyana.lbl.gov
Existing software packages do not fully support the directed DNA sequencing strategy in use at Lawrence Berkeley Laboratory (LBL). Specifically, they are inadequate in the following areas:
Algorithms: The assembly algorithms were originally designed for the shotgun strategy and not to take advantage of all the information available to biologists using a directed strategy. Algorithms that properly use this information can overcome performance difficulties when the sequences become very long or when repeated sequences cause ambiguities.
Data Model: The sequencing strategy developed at LBL relies on a hierarchy of maps of increasingly higher resolution. The various pieces of sequencing software must be able to incorporate all these maps into a comprehensive data model.
User Interface: The large volume of data generated by large-scale sequencing requires that all data be available in a simple graphical form. The most time-consuming operations should be fully automated while still allowing the biologist to override automatic procedures.
We have written several programs that alleviate some difficulties in applying the Staden xdap package to our strategy. These programs perform several disjoint functions, including:
The programs will be incorporated into new, much more flexible software designed to remedy some of the inadequacies of existing packages. Because of Smalltalk's fast production of prototypes and superior data-modeling capabilities, we are using it to implement the system in a collaboration with Charles Lawrence's group at Baylor College of Medicine.
Using Metadata To Automatically Generate User Interfaces for Genomic Databases
Manfred D. Zorn
Information and Computing Sciences Division; Lawrence Berkeley Laboratory; Berkeley, CA 94720
510/486-5041, Fax: -4004, Internet: mdzorn@lbl.gov, BITNET: mdzorn@lbl
The Human Genome Project has a growing need to manage and distribute information. Databases for this purpose are often cumbersome for biologists to use or require extensive effort to build friendlier user interfaces. Adaptations of database structure to the changing needs of an evolving research area lead to costly modifications of user-interface applications.
We are developing software for the automatic generation of graphical, user-friendly, forms-based user interfaces from high-level database definitions. An extended-entity-relationship (EER) model captures real-world objects and defines the underlying database. The EER schema, which constitutes part of the metadata, is used to create a user-interface object model that is stored in a configuration file. A generic user-interface application reads in the configuration file to produce a user interface for a particular database. The object definition in the configuration file defines not only the elements in the user interface but also an internal self-describing data structure and mappers that specify the translation between the database and the user-interface formats. Procedures that access the database and retrieve information are created by specifying queries in an EER-based query language for the objects in the configuration file. Thus the user interface and the connection to the underlying database are generated automatically, and database changes are easily propagated to create a modified user interface.
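The idea of deriving a form from schema metadata can be sketched in a few lines. This is only an illustration under stated assumptions: the schema layout, the widget names, and the function `build_form_config` are hypothetical and do not represent the project's configuration-file format or its EER tools.

```python
from typing import Dict, List

# Hypothetical metadata: entity name -> attribute name -> attribute type.
SCHEMA: Dict[str, Dict[str, str]] = {
    "Clone": {"name": "string", "length_kb": "integer", "is_verified": "boolean"},
}

# Assumed mapping from attribute types to generic form widgets.
WIDGET_FOR_TYPE: Dict[str, str] = {
    "string": "text_field",
    "integer": "number_field",
    "boolean": "checkbox",
}


def build_form_config(schema: Dict[str, Dict[str, str]], entity: str) -> List[Dict[str, str]]:
    """Derive one form-field description per attribute of the given entity."""
    return [
        {
            "label": attribute.replace("_", " ").capitalize(),
            "column": attribute,  # where the value is stored in the database
            "widget": WIDGET_FOR_TYPE.get(attr_type, "text_field"),
        }
        for attribute, attr_type in schema[entity].items()
    ]


for field in build_form_config(SCHEMA, "Clone"):
    print(field)
```

Because the form description is computed from the schema, adding an attribute to the hypothetical `Clone` entity changes the generated form without any edit to interface code, which is the propagation property the abstract describes.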
In the past year, the dramatic increase in the rate of new-sequence generation has presented a major challenge for sequence analysis. Increasingly long sequences are being analyzed as finished sequences become larger than 100 kb, and the databases used for sequence-similarity searches double in size almost every year. Sophisticated computing technology for tackling these problems already exists in faster machines, parallel processing, and distributed computing. However, optimal access requires detailed knowledge of particular resources.
POET, the Parallel Object-Oriented Environment and Toolkit, is modeled after the X11 toolkit and enables both high- and low-level control of computational methods. The object-oriented programming paradigm offers data encapsulation and methods for hiding implementation details to present a unified object view to the user. Existing software can be adapted to exploit the power of parallel processing. Sequence analysis can thus be performed in reasonable time and transparently to the user: POET divides either the query sequence or the database into multiple pieces that run on parallel computers or on a number of workstations in a distributed environment.
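The strategy of dividing the database into pieces can be sketched as follows. This is not POET code: it uses Python's standard `concurrent.futures` thread pool as a stand-in for parallel or distributed workers, and an exact substring search as a stand-in for a real similarity search. The function names `split`, `search_piece`, and `parallel_search` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence, Tuple

Entry = Tuple[str, str]  # (sequence name, sequence)
Hit = Tuple[str, int]    # (sequence name, offset of first match)


def split(items: Sequence[Entry], pieces: int) -> List[Sequence[Entry]]:
    """Cut the database into at most `pieces` contiguous chunks."""
    size = max(1, -(-len(items) // pieces))  # ceiling division, at least 1
    return [items[i:i + size] for i in range(0, len(items), size)]


def search_piece(query: str, piece: Sequence[Entry]) -> List[Hit]:
    """Search one chunk; exact matching stands in for a similarity search."""
    return [(name, seq.find(query)) for name, seq in piece if query in seq]


def parallel_search(query: str, database: Sequence[Entry], pieces: int = 4) -> List[Hit]:
    """Search all chunks concurrently and merge the hits in database order."""
    with ThreadPoolExecutor(max_workers=pieces) as pool:
        partial = list(pool.map(lambda chunk: search_piece(query, chunk),
                                split(database, pieces)))
    return [hit for hits in partial for hit in hits]


database = [("a", "ACGTACGT"), ("b", "TTTT"), ("c", "GGACGG")]
print(parallel_search("ACG", database, pieces=2))
```

The caller sees a single function call and a single merged result list, so the division of work stays hidden, as the abstract intends. Dividing the query sequence rather than the database would also need logic for matches that span piece boundaries, which this sketch omits.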
We are developing BioPOET, a prototype system that integrates sequence analysis into a friendly user interface and performs comparisons of large sequences. The user interface, developed in ParcPlace Smalltalk (produced by ParcPlace Systems), allows parameter specification for several analysis options and for launching the analysis program. A graphical display presents the results to the user.
The objective of this project is to identify computational problems of fundamental importance to molecular biologists engaged in the Human Genome Project, devise new algorithmic approaches for solving these problems, program and test the algorithms that are developed, and make useful computer code available to the biology community. An educational component of this project is the training of Ph.D.'s in computer science who will be qualified to take up careers in computational biology.
Nearly all our research concerns the design and adaptation of data structures and algorithms for solving problems in sequence analysis or "stringology." This includes problems in string alignment and matching, local similarity search, restriction site mapping, clone ordering, and fragment assembly. Our emphasis is on finding solutions that are programmable, useful, and effective, as well as elegant and theoretically satisfying.
Building on prior results, we plan the following lines of investigation.
1. Local Similarity Search: Adaptation of Chang-Lawler filtering technique for approximate pattern matching. O(kn) dynamic programming algorithm when k is prespecified bound on number of errors. Dynamic programming for nonlinear scoring functions.
2. Detection of Random Repeats and Palindromes: Improvement of suffix-tree and other algorithms for random repeats, approximate palindromes, contiguous tandem repeats, etc.
3. Comparison of Alignments: Measure of similarity of two alignments. Dynamic programming table analysis to generate most dissimilar optimal alignments. Alignment comparison and its relation to parametric analysis carried out by PARAL.
4. Multiple String Alignment: Bounded-error heuristics for alternative scoring functions. Application of multiple common substring computation. Multiple alignments and consensus strings related to phylogenetic tree.
5. Clone Ordering: Dynamic programming for generation of least-cost agreement with probe data. Adaptation of traveling salesman algorithms.
6. Sequencing by Hybridization: Information theoretic analysis of hybridization-array design. Pooling of oligos and/or clones. Application of graph algorithms.
7. Fast Fourier Transform (FFT): Combinatorial interpretation of FFT algorithm when used to generate match counts. Match counts as filter for matching algorithms. FFT as subprocedure in other algorithms.
8. Evolutionary Reconstruction Under High-Order Mutations: Algorithms to find the least-cost reconstruction of a set of sequences where high-order mutations such as inversions, repetitions, and recombinations are permitted in addition to point mutations.
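As an illustration of the approximate pattern matching named in the first item, the following sketch finds every position in a text where a pattern ends with at most k errors (substitutions, insertions, or deletions). It uses the basic column-by-column dynamic programming recurrence, which takes time proportional to pattern length times text length. It is not the O(kn) algorithm or the Chang-Lawler filter mentioned above, and the function name `approx_matches` is hypothetical.

```python
from typing import List


def approx_matches(pattern: str, text: str, k: int) -> List[int]:
    """Return 0-based end positions in text where pattern matches with <= k errors."""
    m = len(pattern)
    # column[i] = fewest errors aligning pattern[:i] to some substring of the
    # text ending at the current position; before any text, that is i deletions.
    column = list(range(m + 1))
    ends: List[int] = []
    for j, ch in enumerate(text):
        new = [0]  # an empty pattern prefix matches anywhere at no cost
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            new.append(min(column[i - 1] + cost,  # match or substitution
                           column[i] + 1,         # extra character in text
                           new[i - 1] + 1))       # pattern character skipped
        column = new
        if column[m] <= k:
            ends.append(j)
    return ends


print(approx_matches("ACGT", "TTACGTTT", 0))
print(approx_matches("ACGT", "TTACGTTT", 1))
```

Only one column of the table is kept at a time, so memory use is proportional to the pattern length. Faster methods restrict attention to the part of each column whose value can still be at most k.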
The goal of this work is to extend, refine, and apply the principal investigator's research to linguistic analysis of biological sequences. A software system will be created to perform sophisticated pattern-recognition and related functions at abstraction and expression levels beyond current general-purpose pattern-matching systems for biological sequences; it will also perform with more-uniform language, environment, and graphical user interface and with greater flexibility, extensibility, embeddability, and ability to incorporate other algorithms than possible with current special-purpose analytic software. Specific aims are:
1. Extended development of the graphical user interface and visualization tools. A current dynamic parse-visualization tool will be enhanced and supplemented with static data-visualization routines for high-level iconic depiction of parse results. A graphical interface will be implemented to support interactive grammar development and refinement in a rapid-prototyping mode.
2. Development of embeddability "hooks" for incorporation of and by other algorithms. The system will be made into a platform for applying other algorithms in a hierarchical fashion; focusing them on regions of interest; providing a uniform environment for input, output, and parameter management; and assembling results into the grammar's structural model. The grammar system will be made embeddable in other platforms where appropriate.
3. Incorporation of advanced parser technology and application to eukaryotic gene parsing. Current developments in areas such as island and probabilistic parsing will be embedded in the system, driven by the specific practical problem of efficiently recognizing protein-coding eukaryotic genes. Current statistical and heuristic gene-finding algorithms will be adapted to grammatical expression to allow for greater flexibility and contextually structured application.
4. Extension of input formats accepted and header information processed by the parser. For graphical depiction and high-level parsing, the current GenBank® flat-file entry parser will be extended to handle a variety of other formats and extract additional information from features tables. Facilities will also be developed for transparent connection to relational databases and ASN.1-formatted data streams.
5. Extension of the grammar system to encompass protein sequence at multiple levels. The parser will be extended to accept single-letter protein code as the primary sequence for describing motifs. Longer-term goals include the development of secondary structure grammars and the potential description of tertiary structures using coordinate grammars.
6. Collaborations aimed at specific biological and computational problems. To drive system development farther in biologically relevant directions, collaborations for grammar development will be undertaken with biologists and for parser development with computational biologists. A facility will be provided for remote access to the parser.
7. Distribution and promotion of software and associated libraries. Periodic software releases will be accompanied by full documentation and a reasonable level of support, particularly in developing new grammars. Grammars for use with the parser or other programs will be maintained in a central, publicly accessible repository of biological feature specifications.
Computational Support for the Human Genome Center: Statistical and Mathematical Analysis, Data Processing, and Databasing
GnomeView: A Graphical Interface to the Human Genome
Robust Contig Construction
HGIR: Information Management for a Growing Map
Identification of Genes in Anonymous DNA Sequences
BISP: VLSI Solutions to Sequence-Comparison Problems
Efficient Identification and Analysis of Low- and Medium-Frequency Repeats
A Human Genome Database
Genome Assembly Manager
Laboratory Information Management System (LIMS) for Megabase Sequencing
Database Tools Development
A Computer System for Access to Distributed Genome Mapping Data
Applying Machine Learning Techniques to DNA Sequence Analysis
Computational Analysis and Support for Extensive Physical Mapping of Genomes
Informatics Support for Mapping in Mouse-Human Homology Regions
An Intelligent System for High-Speed DNA Sequence Pattern Analysis and Interpretation
Biopoet: A System for Large-Scale Sequence Analysis
Manfred D. Zorn, Jane Macfarlane,(1) and Robert Armstrong(2)
Human Genome Center and (1)Information and Computing Sciences Division; Lawrence Berkeley Laboratory; Berkeley, CA 94720
510/486-5041, Fax: -4004, Internet: mdzorn@lbl.gov, BITNET: mdzorn@lbl
(2)Sandia National Laboratory; Livermore, CA 94550
Projects Renewed in FY 1993
Efficient Algorithms and Data Structures in Support of DNA Mapping and Sequence Analysis
Eugene Lawler and Daniel Gusfield(1)
Electronics Research Laboratory; University of California; Berkeley, CA 94720
510/642-4019, Fax: -5775, Internet: lawler@arpa.berkeley.edu
(1)Division of Computer Science; University of California; Davis, CA 95616
916/752-7131, Fax: -4767, Internet: gusfield@cs.ucdavis.edu
Foundations for a Syntactic Pattern Recognition System for Genomic DNA Sequences
David B. Searls
Department of Genetics; University of Pennsylvania School of Medicine; Philadelphia, PA 19104-6145
215/573-3107, Fax: -3111, Internet: dsearls@cbil.humgen.upenn.edu
Projects Continuing into FY 1993
Elbert Branscomb, Tom Slezak, David Nelson, and Anthony V. Carrano
Human Genome Center; Biology and Biotechnology Research Program; Lawrence Livermore National Laboratory; Livermore, CA 94551
510/422-5681, Fax: /423-3608, Internet: elbert@alu.llnl.gov
Richard J. Douthart, Joanne E. Pelkey, and David A. Thurman
Life Sciences Center; Pacific Northwest Laboratory; Richland, WA 99352
509/375-2653, Fax: -3649, Internet: dick@gnome.pnl.gov
Michael Cinkosky, Randall Dougherty, Vance Faber, Mark Goldberg,(1) Mark Mundt, Robert Pecherer, Doug Sorenson, and David Torney
Theoretical Biology and Biophysics Group; Los Alamos National Laboratory; Los Alamos, NM 87545
Torney: 505/667-7510, Fax: /665-3493, Internet: dct@life.lanl.gov
(1)Rensselaer Polytechnic Institute; Troy, NY 12181
James W. Fickett, Michael J. Cinkosky, Michael A. Bridgers, Henry T. Brown, Christian Burks, Philip E. Hempfner, Tran N. Lai, Debra Nelson,(1) Robert M. Pecherer, Doug Sorenson, Peichen H. Sgro, Robert D. Sutherland, Charles D. Troup, and Bonnie C. Yantis
Theoretical Biology and Biophysics Group; Los Alamos National Laboratory; Los Alamos, NM 87545
505/665-5340, Fax: -3493, Internet: jwf@life.lanl.gov
(1)Department of Human Genetics; University of Utah; Salt Lake City, UT 84112
Christopher A. Fields and Carol A. Soderlund(1)
The Institute for Genomic Research; Gaithersburg, MD 20878
301/869-9056, Fax: -9423
(1)Sanger Center; Cambridge, U.K.
Tim Hunkapiller, Leroy Hood, Ed Chen,(1) and Michael Waterman(2)
Department of Molecular Biotechnology; University of Washington; Seattle, WA 98195
206/685-7365, Fax: -7302, Internet: tim@mudhoney.mbt.washington.edu
(1)Jet Propulsion Laboratory; Pasadena, CA 91109
(2)University of Southern California; Los Angeles, CA 90089
Jerzy Jurka, Aleksandar Milosavljevic,(1) Jolanta Walichiewicz, and Sherman Yang
Linus Pauling Institute of Science and Medicine; Palo Alto, CA 94306
415/327-4064, Fax: -8564, Internet: jurek@jmullins.stanford.edu
(1)Biological/Medical Research Division; Argonne National Laboratory; Argonne, IL 60439-4833
David Kingsbury, Ken Fasman, and Peter L. Pearson
Genome Data Base; Johns Hopkins University School of Medicine; Baltimore, MD 21205
410/955-7058, Fax: /614-0434, Internet: dkingsbu@gdb.org
Charles B. Lawrence, Eugene W. Myers,(1) and Sandra Honda
Department of Cell Biology; Baylor College of Medicine; Houston, TX 77030-3498
713/798-6226, Fax: /790-1275, Internet: chas@mbir.bcm.tmc.edu
(1)Department of Computer Science; University of Arizona; Tucson, AZ 85721
Victor M. Markowitz
Data Management Group and Human Genome Center; Information and Computing Sciences Division; Lawrence Berkeley Laboratory; Berkeley, CA 94720
510/486-6835, Fax: -4004, Internet: v_markowitz@lbl.gov
Victor M. Markowitz,(1,2) Arie Shoshani,(1) and Ernest Szeto(1)
(1)Data Management Group and (2)Human Genome Center; Information and Computing Sciences Division; Lawrence Berkeley Laboratory; Berkeley, CA 94720
510/486-6835, Fax: -4004, Internet: v_markowitz@lbl.gov
Thomas G. Marr and Andrew Reiner
Cold Spring Harbor Laboratory; Cold Spring Harbor, NY 11724
516/367-8393, Fax: -8416, Internet: marr@cshl.org
Jude W. Shavlik, Michiel O. Noordewier,(1) Geoffrey Towell, Mark Craven, Andrew Whitsitt, Kevin Cherkauer, and Lorien Pratt(1)
Department of Computer Science; University of Wisconsin; Madison, WI 53706
608/262-7784, Fax: -9777, Internet: shavlik@cs.wisc.edu
(1)Department of Computer Science; Rutgers University; New Brunswick, NJ 08903
Tom Blackwell, David Balding, Frederic Fairfield, Jim Fickett, Catherine Macken, Karen Schenk, David Torney, Burton Wendroff, and Clive Whittaker
Los Alamos National Laboratory; Los Alamos, NM 87545
Torney: 505/667-7510, Fax: /665-3493, Internet: dct@life.lanl.gov
Edward Uberbacher, Richard Mural,(1) Eugene Rinchik,(2) and Richard Woychik(1)
Engineering Physics and Mathematics Division and (1)Biology Division; Oak Ridge National Laboratory; Oak Ridge, TN 37831-6364
615/574-6134, Fax: -7860, Internet: ube@ornl.gov or ube@ubersun.epm.ornl.gov
(2)Sarah Lawrence College; Bronxville, NY 10708
Edward Uberbacher, Richard Mural,(1) Ralph Einstein, and Reinhold Mann
Engineering Physics and Mathematics Division and (1)Biology Division; Oak Ridge National Laboratory; Oak Ridge, TN 37831-6364
615/574-6134, Fax: -7860, Internet: ube@ornl.gov or ube@ubersun.epm.ornl.gov