Daniel Davison and Randall Smith
Baylor College of Medicine; Houston, TX 77030
713/798-3738, Fax: -3759, email@example.com
We are providing a variety of molecular biology-related search and analysis services to Genome Program investigators to improve the identification of new genes and their functions. These services are available via the BCM Search Launcher World Wide Web (WWW) pages, which are organized by function and provide a single point of entry for related searches. Pages are included for 1) protein sequence searches, 2) nucleic acid sequence searches, 3) multiple sequence alignments, 4) pairwise sequence alignments, 5) gene feature searches, 6) sequence utilities, and 7) protein secondary structure prediction. The Protein Sequence Search Page, for example, provides a single form for submitting sequences to WWW servers that provide remote access to a variety of protein sequence search tools, including BLAST, FASTA, Smith-Waterman, BEAUTY, BLASTPAT, FASTAPAT, PROSITE, and BLOCKS searches. The BCM Search Launcher extends the functionality of other WWW services by adding hypertext links to results returned by remote servers. For example, links to the NCBI's Entrez database and to the Sequence Retrieval System (SRS) are added to search results returned by the NCBI's WWW BLAST server. These links provide easy access to Medline abstracts, links to related sequences, and additional information that can be extremely helpful when analyzing database search results. For novice or infrequent users of sequence database search tools, we have preset the parameter values to provide the most informative first-pass sequence analysis possible.
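The link-adding step described above can be illustrated with a minimal sketch. It is not the Search Launcher's code: the URL template, the accession pattern, and the function name `add_links` are all assumptions made for illustration, since the text does not give the actual Entrez or SRS link formats.

```python
import html
import re

# Hypothetical URL template; the real Entrez/SRS link formats are not given in the text.
LINK_TEMPLATE = "https://example.org/entry?id={acc}"

# Assumed accession shape: 1-2 uppercase letters followed by 5-6 digits.
ACCESSION = re.compile(r"\b[A-Z]{1,2}\d{5,6}\b")


def add_links(report, template=LINK_TEMPLATE):
    """Escape a plain-text search report and wrap each accession in an HTML anchor."""
    escaped = html.escape(report)

    def link(match):
        acc = match.group(0)
        return '<a href="{}">{}</a>'.format(template.format(acc=acc), acc)

    # Keep the report's fixed-width layout by wrapping it in <pre>.
    return "<pre>" + ACCESSION.sub(link, escaped) + "</pre>"


if __name__ == "__main__":
    print(add_links("P12345  putative kinase  score 212"))
```

Escaping the report before substitution keeps characters such as `<` in alignment output from being interpreted as markup.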
A batch client interface to the BCM Search Launcher for Unix and Macintosh computers has also been developed to allow multiple input sequences to be automatically searched as a background task, with the results returned as individual HTML documents directly on the user's system. The BCM Search Launcher as well as the batch client are available on the WWW at URL http://gc.bcm.tmc.edu:8088/search-launcher/launcher.html.
The BCM/UH Server Core provides the necessary computational resources and continuing support infrastructure for the BCM Search Launcher. The BCM/UH Server Core is composed of three network servers and currently supports electronic mail and WWW-based access; ultimately, specialized client-server access will also be provided. The hardware used includes a 2048-processor MasPar massively parallel MIMD computer, a DEC Alpha AXP/OSF1, a Sun 2-processor SparcCenter 1000 server, and several Sun Sparc workstations.
In addition to grouping services available elsewhere on the WWW and providing access to services developed at BCM and UH, the BCM/UH Server Core will also provide access to services from developers who are unwilling or unable to provide their own Internet network servers.
Grant Nos.: DOE, DEFG039SER62097/A000; National Library of Medicine, R01LM05792; National Science Foundation, BIR 9111695; National Research Service Award, F32HG0013301; NIH, P30HG00210 and R01HG0097301.
A Freely Sharable Database-Management System Designed for Use in Component-Based, Modular Genome Informatics Systems
Steve Rozen,1 Lincoln Stein,1 and Nathan Goodman
The Jackson Laboratory; Bar Harbor, ME 04609
Goodman: 207/288-6158, Fax: -6078, firstname.lastname@example.org
1Whitehead Institute for Biomedical Research; Cambridge, MA 02139
We are constructing a data-management component, built on top of commercial data-management products, tuned to the requirements of genome applications. The core of this genome data manager is designed to:
· support the semantic and object-oriented data models that have been widely embraced for representing genome data,
· provide domain-specific built-in types and operations for storing and querying biomolecular sequences,
· provide built-in support for tracking laboratory work flows, and admit further extensions for other special-purpose types,
· allow core facilities to be readily extended to meet the diverse needs of biological applications.
The core data manager is being constructed on top of Sybase, Oracle, and Informix Universal Server. The software is available free of charge and is freely redistributable.
We will be reporting progress on the core data manager's architecture and interface at the URLs above, and we solicit comments on its design.
DOE Grant No. DE-FG02-95ER62101.
Originally called Database Management Research for the Human Genome Project, this project was initiated in 1995 at the Massachusetts Institute of Technology-Whitehead Institute.
A Software Environment for Large-Scale Sequencing
Department of Cell Biology; Baylor College of Medicine; Houston, TX 77030
713/798-8271, Fax: -3759; email@example.com
Our approach is to implement software systems which manage primary laboratory sequence data and explore and annotate functional information in genome sequence and gene products.
Three software systems have been developed and are being used: two sequence data managers which use different sequence assembly packages, FAK and Phrap, and a series of analysis and annotation tools which are available via the Internet. In addition, we have developed a prototype application for data mining of sequence data as it is related to metabolic pathways.
Products of this project are the following:
1. GRM, a sequence reconstruction manager using the FAK assembly engine (available since October 1995).
2. GFP, a sequence finishing support tool using the Phrap assembly engine (available since March 1996).
3. A series of gene recognition tools (available since early 1996).
4. A tool for visualizing metabolic pathways data and exploring sequence data related to metabolic pathways (prototype available since August 1996).
DOE Grant No. DE-FG03-94ER61618.
Generalized Hidden Markov Models for Genomic Sequence Analysis
David Haussler, Kevin Karplus,1 and Richard Hughey1
Computer Science Department and 1Computer Engineering Department; University of California; Santa Cruz, CA 95064
408/459-2105, Fax: -4829, firstname.lastname@example.org
We have developed an integrated probabilistic method for locating genes in human DNA based on a generalized hidden Markov model (HMM). Each state of a generalized HMM represents a particular kind of region in DNA, such as an initial exon for a gene. The states are connected by transitions that model sites in DNA between adjacent regions, e.g., splice sites. In the full HMM, parametric statistical models are estimated for each of the states and transitions. Generalized HMMs allow a variety of choices for these models, such as neural networks and high-order Markov models. All that is required is that each model return a likelihood for the kind of region or transition it is supposed to model. These likelihoods are then combined by a dynamic programming method to compute the most likely annotation for a given DNA contig. Here the annotation simply consists of the locations of the transitions identified in the DNA and the labeling of the regions between transitions with their corresponding states.
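The dynamic programming step can be sketched as follows. This is a minimal illustration of a generalized-HMM (segment-based) Viterbi recursion, not Genie's implementation: the function names, the callback signatures, and the two-letter toy model are assumptions. Any region model and any site model can be plugged in as long as each returns a log-likelihood.

```python
import math


def annotate(seq, states, region_loglik, site_loglik, start="start", max_len=None):
    """Most likely segmentation of seq into labeled regions.

    region_loglik(state, segment) -> log-likelihood of the segment under the state.
    site_loglik(prev, nxt, seq, pos) -> log-likelihood of the transition site at pos,
        or None if the transition prev -> nxt is not allowed.
    """
    n = len(seq)
    max_len = max_len or n
    # best[j][s] = (score, i, prev): best parse of seq[:j] whose last region
    # has state s, spans seq[i:j], and follows a region with state prev.
    best = [dict() for _ in range(n + 1)]
    best[0][start] = (0.0, None, None)
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            for prev, (prev_score, _, _) in best[i].items():
                for state in states:
                    site = site_loglik(prev, state, seq, i)
                    if site is None:
                        continue
                    score = prev_score + site + region_loglik(state, seq[i:j])
                    if state not in best[j] or score > best[j][state][0]:
                        best[j][state] = (score, i, prev)
    if not best[n]:
        raise ValueError("no valid annotation for this sequence")
    state = max(best[n], key=lambda s: best[n][s][0])
    total = best[n][state][0]
    segments, j = [], n
    while j > 0:  # trace the stored back-pointers to recover the regions
        _, i, prev = best[j][state]
        segments.append((i, j, state))
        j, state = i, prev
    return total, segments[::-1]


# Toy model (illustrative only): "exon" favors G, "intron" favors A.
EMIT = {"exon": {"G": 0.9, "A": 0.1}, "intron": {"G": 0.1, "A": 0.9}}
ALLOWED = {("start", "exon"): 0.0,
           ("exon", "intron"): math.log(0.1),
           ("intron", "exon"): math.log(0.1)}


def toy_region(state, segment):
    return sum(math.log(EMIT[state][c]) for c in segment)


def toy_site(prev, nxt, seq, pos):
    return ALLOWED.get((prev, nxt))


if __name__ == "__main__":
    print(annotate("GGGGAAAAGGGG", ["exon", "intron"], toy_region, toy_site))
```

The recursion is quadratic in sequence length unless `max_len` bounds the region length; a practical gene finder would restrict candidate boundaries to plausible signal positions.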
This method has been implemented in the gene-finding program Genie, in collaboration with Frank Eeckman, Martin Reese, and Nomi Harris at Lawrence Berkeley Labs. David Kulp, at UCSC, has been responsible for the core implementation. Martin Reese developed the splice site models, promoter models, and datasets. You can access Genie at the second WWW address given above, submit sequences, and have them annotated. Nomi Harris has written a display tool called Genotator that displays Genie's annotation along with the annotation of other gene finders, as well as the location of repetitive DNA, BLAST hits to the protein database, and other useful information. Papers and further information about Genie can be found at the first WWW address above. Since the ISMB '96 paper, Genie's exon models have been extended to explicitly incorporate BLAST and BLOCKS database hits into their probabilistic framework. This results in a substantial increase in gene-prediction accuracy. In tests using a standard set of annotated genes, Genie identified 95% of coding nucleotides correctly with a specificity of 88%, and 76% of exons were identified exactly.
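The accuracy figures above can be made concrete with a short sketch. It assumes the convention common in gene-finder evaluations, in which sensitivity is TP/(TP+FN) and specificity is TP/(TP+FP) at the nucleotide level; the abstract does not state which definitions were used, and the function names are illustrative.

```python
def nucleotide_accuracy(true_coding, predicted_coding):
    """Return (sensitivity, specificity) from two equal-length per-base coding flags.

    Assumed convention: sensitivity = TP / (TP + FN), specificity = TP / (TP + FP).
    """
    pairs = list(zip(true_coding, predicted_coding))
    tp = sum(1 for t, p in pairs if t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    fp = sum(1 for t, p in pairs if p and not t)
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tp / (tp + fp) if tp + fp else float("nan")
    return sensitivity, specificity


def exact_exon_fraction(true_exons, predicted_exons):
    """Fraction of true exons, given as (start, end) pairs, predicted with both ends exact."""
    true_set = set(true_exons)
    return len(true_set & set(predicted_exons)) / len(true_set)


if __name__ == "__main__":
    truth = [1, 1, 1, 1, 0, 0, 0, 0]
    guess = [1, 1, 1, 0, 1, 0, 0, 0]
    print(nucleotide_accuracy(truth, guess))
    print(exact_exon_fraction([(0, 10), (20, 30)], [(0, 10), (21, 30)]))
```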
DOE Grant No. DE-FG03-95ER62112.
Identification, Organization, and Analysis of Mammalian Repetitive DNA Information
Genetic Information Research Institute; Palo Alto, CA 94306
415/326-5588, Fax: -2001, email@example.com
There are three major objectives in this project: organization of databases of mammalian repetitive sequences, development of specialized software for analysis of repetitive DNA, and sequence studies of new mammalian repeats.
Our approach is based on extensive use of computer tools to investigate and organize publicly available sequence information. We also pursue collaborative research with experimental laboratories. The results are widely disseminated via the Internet, peer-reviewed scientific publications, and personal interactions. Our most recent research concentrates on mechanisms of retroposon integration in mammals (Jurka, J., PNAS, in press; Jurka, J. and Klonowski, P., J. Mol. Evol. 43:685-689).
We continue to develop reference collections of mammalian repeats, which have become a worldwide resource for annotation and study of newly sequenced DNA. The reference collections are being revised annually as part of a larger database of repetitive DNA, called Repbase. The recent influx of sequence data to public databases has created an unprecedented need for automatic annotation of known repetitive elements. We have designed and implemented a program for identification and elimination of repetitive DNA, known as CENSOR.
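The "elimination" step can be pictured as masking. The sketch below is not CENSOR's algorithm: it assumes that repeat hits have already been found by some similarity search against a reference collection and are supplied as half-open (start, end) intervals; the function name and the choice of `N` as the mask character are illustrative.

```python
def mask_repeats(seq, hits, mask_char="N"):
    """Replace every base covered by a repeat hit with mask_char.

    hits: iterable of half-open (start, end) intervals; overlaps are allowed.
    """
    masked = list(seq)
    for start, end in hits:
        # Clamp each interval to the sequence so stray coordinates are harmless.
        for i in range(max(0, start), min(len(seq), end)):
            masked[i] = mask_char
    return "".join(masked)


if __name__ == "__main__":
    print(mask_repeats("ACGTACGTAC", [(2, 5), (4, 7)]))
```

Masking rather than deleting keeps coordinates in the masked sequence aligned with the original, which simplifies later annotation.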
Reference collections of mammalian repeats and the CENSOR program are available electronically (via anonymous ftp to ncbi.nih.gov; directory repository/repbase). CENSOR can also be run via electronic mail (mail "help" message to firstname.lastname@example.org).
DOE Grant No. DE-FG03-95ER62139.
*TRRD, GERD and COMPEL: Databases on Gene-Expression Regulation as a Tool for Analysis of Functional Genomic Sequences
A.E. Kel, O.A. Podkolodnaya, O.V. Kel, A.G. Romaschenko, E. Wingender,1 G.C. Overton,2 and N.A. Kolchanov
Institute of Cytology and Genetics; Novosibirsk, Russia
Kolchanov: +7-3832/353-335, Fax: -336 or /356-558, email@example.com
1Gesellschaft für Biotechnologische Forschung; Braunschweig, Germany
2Department of Genetics; University of Pennsylvania School of Medicine; Philadelphia, PA 19104-6145
The database on transcription regulatory regions in eukaryotic genomes (TRRD) has been developed. The main principle of data representation in TRRD is the modular structure and hierarchy of transcription regulatory regions. Each TRRD entry corresponds to a gene as an entire unit. Information on gene regulation is provided (cell-cycle and cell-type specificity, developmental-stage specificity, and the influence of various molecular signals on gene expression). TRRD contains information about the structural organization of the gene transcription regulatory region, including descriptions of known promoters and enhancers in 5' and 3' regions and in introns. The description of binding sites for transcription factors includes the nucleotide sequence and precise location, the names of factors that bind to the site, and the experimental evidence for the binding site. We provide cross-references to the TRANSFAC database [2] for both sites and factors as well as for genes. The TRRD 3.3 release includes 340 vertebrate genes.
The Gene Expression Regulation Database (GERD) collects information on features of gene expression as well as information about gene transcription regulation. The current release of GERD contains 75 entries with information on expression regulation of genes expressed in hematopoietic tissues in the course of ontogenesis and blood cell differentiation. The COMPEL database contains information about composite elements, which are functional units essential for highly specific transcription regulation [3]. Direct interactions between transcription factors binding to their target sites within composite elements result in convergence of different signal transduction pathways. Nucleotide sequences and positions of composite elements, binding factors and types of their DNA-binding domains, and experimental evidence confirming synergistic or antagonistic action of factors are registered in COMPEL. Cross-references to the TRANSFAC factors table are given. TRRD and COMPEL are cross-referenced to each other. The COMPEL 2.1 release includes 140 composite elements.
We have developed software for analysis of transcription regulatory region structure. The CompSearch program is based on the oligonucleotide weight matrix method. To collect sets of binding sites for construction of the matrices, we used the TRANSFAC and TRRD databases. The CompSearch program takes into account the fine structure of experimentally confirmed NFATp/AP1 composite elements collected in COMPEL (distances between binding sites in composite elements and their mutual orientation). Using the program, we have found potential composite elements of the NFATp/AP1 type in the regulatory regions of various cytokine genes. Analysis of composite elements could be a first approach to revealing specific patterns of transcription signals encoding the regulatory potential of eukaryotic promoters.
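A weight-matrix search for paired sites can be sketched as below. This is an illustration of the general technique, not CompSearch itself: the aligned sites are toy examples rather than database entries, the log-odds scoring with pseudocounts and a uniform background is an assumption, and only one strand and one relative orientation are handled.

```python
import math


def build_log_odds(sites, pseudocount=1.0, background=0.25):
    """Column-wise log-odds weight matrix from equal-length aligned sites (A, C, G, T only)."""
    matrix = []
    for col in range(len(sites[0])):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[col]] += 1
        total = sum(counts.values())
        matrix.append({b: math.log2(counts[b] / total / background) for b in "ACGT"})
    return matrix


def scan(seq, matrix, threshold):
    """Start positions of windows whose summed matrix score reaches the threshold."""
    width = len(matrix)
    hits = []
    for i in range(len(seq) - width + 1):
        score = sum(matrix[k][seq[i + k]] for k in range(width))
        if score >= threshold:
            hits.append(i)
    return hits


def composite_candidates(hits_a, hits_b, width_a, min_gap, max_gap):
    """Pairs (a, b) in which site B starts min_gap..max_gap bases after the end of site A."""
    return [(a, b) for a in hits_a for b in hits_b
            if min_gap <= b - (a + width_a) <= max_gap]


if __name__ == "__main__":
    # Toy aligned sites, invented for illustration.
    matrix_a = build_log_odds(["GGAAA", "GGAAA", "GGAAT"])
    matrix_b = build_log_odds(["TGAGTCA", "TGACTCA", "TGAGTCA"])
    seq = "TTGGAAACCTGAGTCATT"
    hits_a = scan(seq, matrix_a, 4.0)
    hits_b = scan(seq, matrix_b, 6.0)
    print(composite_candidates(hits_a, hits_b, 5, 0, 5))
```

The thresholds and gap limits here are arbitrary; in practice they would be calibrated against experimentally confirmed composite elements.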
2. Wingender E., Dietze P., Karas H., and Knuppel R. TRANSFAC: a database on transcription factors and their DNA binding sites (1996). Nucl. Acids Res., v. 24, pp. 238-241.
3. Kel O.V., A.G. Romaschenko, A.E. Kel, E. Wingender, N.A. Kolchanov. A compilation of composite regulatory elements affecting gene transcription in vertebrates (1995). Nucl. Acids Res., v. 23, pp. 4097-4103.
Wingender, E., Kel, A. E., Kel, O. V., Karas, H., Heinemeyer, T., Dietze, P., Knueppel, R., Romaschenko, A. G. and Kolchanov, N. A. (1997). TRANSFAC, TRRD and COMPEL: Towards a federated database system on transcriptional regulation. Nucleic Acids Res., in press.
Ananko E.A., Ignatieva E.V., Kel A.E., Kolchanov N.A (1996). WWWTRRD: Hypertext information system on transcription regulation. Computer Science and Biology. Proceedings of the German Conference on Bioinformatics (GCB'96), R. Hofestadt, T. Lengauer, M. Löffler, D. Schomburg (eds.). University of Leipzig, Leipzig 1996, pp. 153-155.
A.E. Kel, O.V. Kel, O.V. Vishnevsky, M.P. Ponomarenko, I.V. Ischenko, H. Karas, N.A. Kolchanov, H. Sklenar, E. Wingender (1997). TRRD and COMPEL databases on transcription linked to TRANSFAC as tools for analysis and recognition of regulatory sequences. Lecture Notes in Computer Science, in press.
Holger Karas, Alexander Kel, Olga Kel, Nikolay Kolchanov, and Edgar Wingender (1997). Integrating knowledge on gene regulation by a federated database approach: TRANSFAC, TRRD and COMPEL. Jurnal Molekularnoy Biologii (Russian), in press.
Kel A.E., Kolchanov N.A., Kel O.V., Romaschenko A.G., Ananko E.A., Ignatyeva E.V., Merkulova T.I., Podkolodnaya O.A., Stepanenko I.L., Kochetov A.V., Kolpakov F.A., Podkolodniy N.L., Naumochkin A.A. (1997). TRRD: A database on transcription regulatory regions of eukaryotic genes. Jurnal Molekularnoy Biologii (Russian) in press.
O.V. Kel, A.E. Kel, A.G. Romaschenko, E. Wingender, N.A. Kolchanov (1997). Composite regulatory elements: classification and description in the COMPEL data base. Jurnal Molekularnoy Biologii (Russian), in press.
Data-Management Tools for Genomic Databases
Victor M. Markowitz and I-Min A. Chen
Information and Computing Sciences Division; Lawrence Berkeley National Laboratory; Berkeley, CA 94720
510/486-6835, Fax: -4004, firstname.lastname@example.org
The Object-Protocol Model (OPM) data management tools provide facilities for efficiently constructing, maintaining, and exploring molecular biology databases. Molecular biology data are currently maintained in numerous molecular biology databases (MBDs), including large archival MBDs such as the Genome Database (GDB) at Johns Hopkins School of Medicine, the Genome Sequence Data Base (GSDB) at the National Center for Genome Resources, and the Protein Data Bank (PDB) at Brookhaven National Laboratory. Constructing, maintaining, and exploring MBDs entail complex and time-consuming processes.
The goal of the Object-Protocol Model (OPM) data management tools is to provide facilities for efficiently constructing, maintaining, and exploring MBDs, using application-specific constructs on top of commercial database management systems (DBMSs). The OPM tools will also provide facilities for reorganizing MBDs and for seamlessly exploring heterogeneous MBDs. The OPM tools and documentation are available on the Web and are developed in close collaboration with groups maintaining MBDs, such as GDB, GSDB, and PDB.
Current work focuses on providing new facilities for constructing and exploring MBDs. The specific aims of this work are:
(1) Extend the OPM query language with additional constructs for expressing complex conditions, and enhance the OPM query optimizer for generating more efficient query plans.
(2) Develop enhanced OPM query interfaces that support MBD-specific data types (e.g., protein data type) and operations (e.g., protein data display and 3D search) and assist users in specifying queries and interpreting query results.
(3) Provide support for customizing MBD interfaces.
(4) Extend the OPM tools with facilities for managing permissions (object ownership) in MBDs, and for physical database design of relational MBDs, including specification of indexes, allocation of segments, and handling of redundant (denormalized) data.
(5) Develop OPM tools for constructing and maintaining multiple OPM views for both relational and nonrelational (e.g., ASN.1, AceDB) MBDs. For a given MBD, these tools will allow customizing different OPM views for different groups of scientists. For heterogeneous MBDs, these tools will allow exploring them using common OPM interfaces.
(6) Develop tools for constructing OPM-based multidatabase systems of heterogeneous MBDs and for exploring and manipulating data in these MBDs via OPM interfaces. As part of this effort, the OPM-based multidatabase system, which currently consists of GDB 6.0 and GSDB 2.0, will be extended to include additional MBDs, primarily GSDB 2.2 (when it becomes available), PDB, and GenBank.
(7) Develop facilities for reorganizing OPM-based MBDs. The database reorganization tools will support automatic generation of procedures for reorganizing MBDs following restructuring (revision) of MBD schemas.
In the past year, the OPM data management tools have been extended in order to address specific requirements of developing MBDs such as GDB 6 and the new version of PDB.
The current version of the OPM data management tools (4.1) was released in June 1996 for Sun/OS, Sun/Solaris and SGI. The following OPM tools are available on the Web at http://gizmo.lbl.gov/opm.html:
(1) an editor for specifying OPM schemas;
(2) a translator of OPM schemas into relational database specifications;
(4) a translator of OPM queries into SQL queries;
(5) a retrofitting tool for constructing OPM schemas (views) for existing relational genomic databases;
(6) a tool for constructing Web-based form interfaces to MBDs that have an OPM schema; this tool was developed by Stan Letovsky at Johns Hopkins School of Medicine, as part of a collaboration.
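The idea behind translating an object schema into relational specifications can be sketched in a few lines. The sketch does not use OPM's schema language, which is not shown here: the dictionary-based class description, the type names, and the mapping rules (one table per class, one side table per multi-valued attribute, references stored as integer identifiers) are all assumptions chosen for illustration.

```python
# Assumed primitive types; anything else is treated as a reference to another class.
TYPE_MAP = {"string": "VARCHAR(255)", "integer": "INTEGER", "real": "FLOAT"}


def class_to_tables(name, attributes):
    """Translate one object class into CREATE TABLE statements.

    attributes: dict mapping attribute name -> (type name, is_multivalued).
    Single-valued attributes become columns of the class table; each
    multi-valued attribute gets its own table keyed by the object identifier.
    """
    table = name.lower()
    columns = ["{}_id INTEGER PRIMARY KEY".format(table)]
    side_tables = []
    for attr, (type_name, multivalued) in attributes.items():
        sql_type = TYPE_MAP.get(type_name, "INTEGER")  # references stored as ids
        if multivalued:
            side_tables.append(
                "CREATE TABLE {t}_{a} ({t}_id INTEGER, {a} {s});".format(
                    t=table, a=attr, s=sql_type))
        else:
            columns.append("{} {}".format(attr, sql_type))
    main = "CREATE TABLE {} ({});".format(table, ", ".join(columns))
    return [main] + side_tables


if __name__ == "__main__":
    for statement in class_to_tables(
            "Gene", {"symbol": ("string", False), "aliases": ("string", True)}):
        print(statement)
```

Generating the relational definition from a higher-level description keeps the two in step when the schema is revised, which is the motivation the list above gives for the schema translator.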
The OPM data management tools have been used successfully to develop new genomic databases, such as GDB 6 (released in January 1996) and the relational version of PDB, and to construct OPM views and interfaces for existing genomic databases such as GSDB 2.0. The OPM data management tools are currently used by over ten groups in the USA and Europe. The research underlying these tools is described in several papers published in scientific journals and presented at database and genome conferences.
In the past year the OPM tools have been presented at database and bioinformatics conferences, including the International Symposium on Theoretical and Computational Genome Research, Heidelberg, Germany, March 1996, the Workshop on Structuring Biological Information, Heidelberg, Germany, March 1996, the Meeting on Genome Mapping and Sequencing, Cold Spring Harbor, May 1996, the International Sybase User Group Conference, May 1996, the Bioinformatics Structure Conference, Jerusalem, November 1996, and the Pacific Symposium on Bioinformatics, January 1997.
The results of the research and development underlying the OPM tools work have been presented in papers published in proceedings of database and bioinformatics conferences; these papers are available at http://gizmo.lbl.gov/opm.html#Publications.
DOE Contract No. DE-AC03-76SF00098.
The Genome Topographer: System Design
S. Cozza, D. Cuddihy, R. Iwasaki, M. Mallison, C. Reed, J. Salit, A. Tracy, and T. Marr
Cold Spring Harbor Laboratory; Cold Spring Harbor, NY 11724
Marr: 516/367-8393, Fax: -8461, email@example.com or firstname.lastname@example.org
Genome Topographer (GT) is an advanced genome informatics system that has received joint funding from DOE and NIH over a number of years. DOE funding has focused on GT tools supporting computational genome analysis, principally sequence analysis. GT is scheduled for public release next spring under the auspices of the Cold Spring Harbor Human Genome Informatics Research Resource. GT has 17 major existing frameworks: 1. Views, including printing, 2. Default manager, 3. Graphical User Interface, 4. Query, 5. Project Manager, 6. Workspace Manager, 7. Asynchronous Process Manager, 8. Study Manager, 9. Help, 10. Application, 11. Notification, 12. Security, 13. World Wide Web Interface, 14. NCBI, 15. Reader, 16. Writer, 17. External Database Interface. GT frameworks are independent sets of VisualWorks (client) or SmallTalkDB (GemStone) classes that interact to carry out the responsibilities of the specific framework. Each framework is clearly defined and has a well-defined interface. These frameworks are used over and over in GT to perform similar duties in different places. GT has basic tools and special tools. Basic tools are used many times in different applications, while special tools tend to be special purpose, designed to do fairly limited things, although the distinction is somewhat arbitrary. Tools typically use several frameworks when they are assembled. Basic Tools: 1. Project Browser, 2. Editor/Viewer, 3. Query, 4. NCBI Entrez, 5. File reader/writer, 6. Map comparison, 7. Database Administrator, 8. Login, 9. Default, 10. Help. Special Tools: 1. Study Manager, 2. Compute Server, 3. Sequence Analysis, 4. Genetic Analysis. These frameworks and tools are combined with a comprehensive database schema of very rich biological expression linked with pluggable computational tools.
Taken together, these features allow users to construct, with relative ease, online databases of the primary data needed to study a genetic disease (or genes and phenotypes in general) from the stage of family collection and diagnostic ascertainment through cloning and functional analysis of candidate genes, including mutational analysis, expression information, and screening for biochemical interactions with candidate molecules. GT was designed on the premise that a highly informative, visual presentation of comprehensive data to a knowledgeable user is essential to their understanding. The advanced software engineering techniques promoted by relatively new object-oriented products have allowed GT to become a highly interactive and visually oriented system that allows the user to concentrate on the problem rather than on the computer. Using the rich data representational features characteristic of this technology, the GT software enables users to construct models of real-world, complex biological phenomena. These unique features of GT are key to the thesis that such a system will allow users to discover otherwise intractable networks of interactions exhibited by complex genetic diseases.
The VisualWorks development environment allows the development of code that runs unchanged across all major workstations and personal computers, including PCs, Macintoshes, and most Unix workstations.
A Flexible Sequence Reconstructor for Large-Scale DNA Sequencing: A Customizable Software System for Fragment Assembly
Gene Myers and Susan Larson
Department of Computer Science; University of Arizona; Tucson, AZ 85721
602/621-6612, Fax: -4246, email@example.com
We have completed the design and begun construction of a software environment in support of DNA sequencing called the "FAKtory". The environment consists of (1) our previously described software library, FAK, for the core combinatorial problem of assembling fragments, (2) a Tcl/Tk-based interface, and (3) a software suite supporting a modest database of fragments and a processing pipeline that includes clipping and vector prescreening modules. A key feature of our system is that it is highly customizable: the structure of the fragment database, the processing pipeline, and the operation of each phase of the pipeline are specifiable by the user. Such customization need only be established once at a given location; subsequently, users see a relatively simple system tailored to their needs. Indeed, one may direct the system to input a raw dataset of, say, ABI trace files, pass them through a customized pipeline, and view the resulting assembly with two button clicks.
The system is built on top of our FAK software library, and as a consequence one receives (a) high-sensitivity overlap detection, (b) correct resolution of large high-fidelity repeats, (c) near-perfect multialignments, and (d) support of constraints that must be satisfied by the resulting assemblies.
The FAKtory assumes a processing pipeline for fragments that consists of an INPUT phase, any number and sequence of CLIP, PRESCREEN, and TAG phases, followed by an OVERLAP and then an ASSEMBLY phase. The sequence of clip, prescreen, and tag phases is customizable, and every phase is controlled by a panel of user-settable preferences, each of which permits setting the phase's mode to AUTO, SUPERVISED, or MANUAL. This setting determines the level of interaction required of the user when the phase is run, ranging from none to hands-on. Any diagnostic situations detected during pipeline processing are organized into a log that permits one to confirm, correct, or undo decisions that might have been made automatically.
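The pipeline structure just described can be written down as a small data model. This is a sketch of the stated ordering rule and mode settings only, not the FAKtory's Tcl/Tk implementation; the names `Mode`, `validate_pipeline`, and `phases_needing_user` are illustrative assumptions.

```python
from enum import Enum


class Mode(Enum):
    AUTO = "auto"              # runs with no user interaction
    SUPERVISED = "supervised"  # runs, but the user reviews decisions
    MANUAL = "manual"          # fully hands-on


# Phases that may appear in any number and order between INPUT and OVERLAP.
MIDDLE_PHASES = {"CLIP", "PRESCREEN", "TAG"}


def validate_pipeline(phases):
    """Check a list of (phase name, Mode) pairs against the ordering rule:
    INPUT, then any CLIP/PRESCREEN/TAG phases, then OVERLAP, then ASSEMBLY."""
    names = [name for name, _ in phases]
    if len(names) < 3 or names[0] != "INPUT" or names[-2:] != ["OVERLAP", "ASSEMBLY"]:
        return False
    if not all(name in MIDDLE_PHASES for name in names[1:-2]):
        return False
    return all(isinstance(mode, Mode) for _, mode in phases)


def phases_needing_user(phases):
    """Names of phases that will require some interaction when run."""
    return [name for name, mode in phases if mode is not Mode.AUTO]


if __name__ == "__main__":
    pipeline = [("INPUT", Mode.AUTO), ("CLIP", Mode.SUPERVISED),
                ("PRESCREEN", Mode.AUTO), ("CLIP", Mode.AUTO),
                ("OVERLAP", Mode.AUTO), ("ASSEMBLY", Mode.MANUAL)]
    print(validate_pipeline(pipeline), phases_needing_user(pipeline))
```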
The system permits one to maintain a collection of alternative assemblies, to compare them to see how they differ, and to directly manipulate assemblies in a fashion consistent with sequence overlaps. The system can be customized so that a priori constraints reflecting a given sequencing protocol (e.g., double-barreled or transposon-mapped) are automatically produced according to the syntax of the names of fragments (e.g., X.f and X.r for any X are mates for double-barreled sequencing). The system presents visualizations of the constraints applied to an assembly, and one may experiment with an assembly by adding and/or removing constraints. Finally, one may edit the multialignment of an assembly while consulting the raw waveforms. Special attention was given to optimizing the ergonomics of this time-intensive task.
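Deriving mate constraints from fragment names, as in the X.f / X.r example, can be sketched as follows. The function name and the configurable suffix arguments are assumptions for illustration; the FAKtory's own naming-syntax rules are customizable and are not reproduced here.

```python
from collections import defaultdict


def mate_pairs(fragment_names, forward_suffix=".f", reverse_suffix=".r"):
    """Pair fragments that share a template name X and end in the two suffixes.

    Returns a sorted list of (forward name, reverse name) tuples; fragments
    without a mate, or without either suffix, produce no constraint.
    """
    by_template = defaultdict(dict)
    for name in fragment_names:
        if name.endswith(forward_suffix):
            by_template[name[:-len(forward_suffix)]]["f"] = name
        elif name.endswith(reverse_suffix):
            by_template[name[:-len(reverse_suffix)]]["r"] = name
    return sorted((ends["f"], ends["r"])
                  for ends in by_template.values()
                  if "f" in ends and "r" in ends)


if __name__ == "__main__":
    print(mate_pairs(["a1.f", "a1.r", "b2.f", "c3.r", "d4"]))
```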
DOE Grant No. DE-FG03-94ER61911.
The Role of Integrated Software and Databases in Genome Sequence Interpretation and Metabolic Reconstruction
Terry Gaasterland, Natalia Maltsev, Ross Overbeek, and Evgeni Selkov
Mathematics and Computer Science Division; Argonne National Laboratory; Argonne, IL 60439
630/252-4171, Fax: -5986, firstname.lastname@example.org
As scientists successfully sequence complete genomes, the issue of how to organize the large quantities of evolving sequence data becomes paramount. Through our work in comparative whole genome analysis (MAGPIE, Gaasterland) and metabolic reconstruction algorithms (WIT, Overbeek, Maltsev, and Selkov), we carry genome interpretation beyond the identification of gene products to customized views of an organism's functional properties.
MAGPIE is a system designed to reside locally at the site of a genome project and actively carry out analysis of genome sequence data as it is generated [1,2]. DNA sequences produced in a sequencing project mature through a series of stages that each require different analysis activities. Even after DNA has been assembled into contiguous fragments and eventually into a single genome, it must be regularly reanalyzed. Any new data in public sequence databases may provide clues to the identity of genes. Over a year, for 2 megabases with 4-fold coverage, MAGPIE will request on the order of 100,000 outputs from remote analysis software, manipulate and manage the output, update the current analysis of the sequence data, and monitor the project sequence data for changes that initiate reanalysis.
Once an automated functional overview has been established, it remains to pinpoint the organism's exact metabolic pathways and establish how they interact. To this end, the WIT (What Is There) system supports efforts to develop metabolic reconstructions. Such reconstructions, or models, are based on sequence data, clearly established biochemistry of specific organisms, and an understanding of the interdependencies of biochemical mechanisms. WIT thus offers a valuable tool for testing current hypotheses about microbial behavior. For example, a reconstruction may begin with a set of established enzymes (enzymes with strong similarities in identified coding regions to existing sequences for which the enzymatic function is known) and putative enzymes (enzymes with weak similarity to sequences of known function). From these initial "hits," within a phylogenetic perspective, we identify an initial set of pathways. This set can be used to generate a set of expected enzymes (enzymes that have not been clearly detected, but that would be expected given the set of hypothesized pathways) and missing enzymes (enzymes that occur in the pathways but for which no sequence has yet been biochemically identified for any organism). Further reasoning identifies tentative connective pathways.
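The four enzyme categories can be expressed with plain set operations. The sketch below is not WIT's procedure: the rule for hypothesizing a pathway (it contains at least one established enzyme), the treatment of weakly detected enzymes as detected, and all identifiers are assumptions made for illustration.

```python
def classify_enzymes(strong_hits, weak_hits, pathways, sequenced_anywhere):
    """Sort enzymes into established, putative, expected, and missing.

    strong_hits, weak_hits: sets of enzyme ids with strong / weak similarity hits.
    pathways: dict mapping pathway name -> set of enzyme ids it requires.
    sequenced_anywhere: enzyme ids with a known sequence in at least one organism.
    """
    established = set(strong_hits)
    putative = set(weak_hits) - established
    # Assumed rule: hypothesize every pathway with at least one established enzyme.
    hypothesized = {name: enzymes for name, enzymes in pathways.items()
                    if enzymes & established}
    required = set().union(*hypothesized.values())
    undetected = required - established - putative
    missing = undetected - set(sequenced_anywhere)
    expected = undetected - missing
    return {"established": established, "putative": putative,
            "pathways": sorted(hypothesized),
            "expected": expected, "missing": missing}


if __name__ == "__main__":
    result = classify_enzymes(
        strong_hits={"E1", "E2"},
        weak_hits={"E3"},
        pathways={"pathway_A": {"E2", "E3", "E4", "E5"}, "pathway_B": {"E6"}},
        sequenced_anywhere={"E1", "E2", "E3", "E4", "E6"},
    )
    print(result)
```

In this toy run, `E4` is expected (required by a hypothesized pathway, sequenced elsewhere, not detected here) and `E5` is missing (required, but with no known sequence in any organism).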
In addition to helping curators develop metabolic reconstructions, WIT
lets users examine models curated by experts, follow connections between
more than two thousand metabolic diagrams, and compare models (e.g., which
of certain genes that are conserved among bacterial genomes are found in
higher life). The objective is to set the stage for meaningful simulations
of microbial behavior and thus to advance our understanding of microbial
biochemistry and genetics.
[1] T. Gaasterland, J. Lobo, N. Maltsev, and G. Chen. Assigning Function to CDS Through Qualified Query Answering. In Proc. 2nd Int. Conf. Intell. Syst. for Mol. Bio., Stanford U. (1994).
[2] T. Gaasterland and E. Selkov. Automatic Reconstruction of Metabolic Structure from Incomplete Genome Sequence Data. In Proc. Int. Conf. Intell. Syst. for Mol. Bio., Cambridge, England (1995).
Database Transformations for Biological Applications
G. Christian Overton, Susan B. Davidson,1 and Peter Buneman1
Department of Genetics and 1Department of Computer and Information Science; University of Pennsylvania;
Philadelphia, PA 19104
Overton: 215/573-3105, Fax: -3111, email@example.com
Davidson: 215/898-3490, Fax: -0587, firstname.lastname@example.org
Buneman: 215/898-7703, Fax: -0587, email@example.com
We have implemented a general-purpose query system, Kleisli, that provides access to a variety of "nonstandard" data sources (e.g., ACeDB, ASN.1, BLAST), as well as to "standard" relational databases. The system represents a major advance in the ability to integrate the growing number and diversity of biology data sources conveniently and efficiently. It features a uniform query interface (the CPL query language) across heterogeneous data sources, a modular and extensible architecture, and, most significantly for dealing with the Internet environment, a programmable optimizer. We have demonstrated the utility of the system in composing and executing queries that were considered difficult, if not unanswerable, without first either building a monolithic database or writing highly application-specific integration code (details and examples available at URL above).
In conjunction with other software developed in our group, we have assembled
a toolset that supports a range of data integration strategies as well
as the ability to create specialized data warehouses initialized from community
databases. Our integration strategy is based upon the concept of "mediators",
which serve a group of related applications by providing a uniform structural
interface to the relevant data sources. This approach is cost-effective
in terms of query development time and maintenance. We have examined in
detail methods for optimizing queries such as "retrieve all known human
sequences containing an Alu repeat in an intragenic region" where the data
sources are heterogeneous and distributed across the Internet.
We have tested Morphase by applying it to a variety of different transformation problems involving Sybase, ACE and ASN.1. For example, we used it to specify a transformation between the Sanger Center's Chromosome 22 ACE database (ACE22DB) and a Chromosome 22 Sybase database (Chr22DB), as well as between a portion of GDB and Chr22DB. Some of these transformations had already been hand-coded without our tools, forming a basis for comparison.
Once the semantic correspondences between objects in the various databases were understood, writing the transformation program in Morphase was easy, even for someone who was not an expert in the system. Furthermore, it was easy to find conceptual errors in the transformation specification. In contrast, the hand-coded programs were opaque, difficult to understand, and even more difficult to debug.
DOE Grant No. DE-FG02-94ER61923.
S.B. Davidson, C. Overton and P. Buneman, "Challenges in Integrating Biological Data Sources," J. Computational Biology 2 (1995), pp 557-572.
A. Kosky, "Transforming Databases with Recursive Data Structures," PhD Thesis, December 1995.
S.B. Davidson and A. Kosky, "Effecting Database Transformations Using Morphase," Technical Report MS-CIS-96-05, University of Pennsylvania.
A. Kosky, S.B. Davidson and P. Buneman, "Semantics of Database Transformations," Technical Report MS-CIS-95-25, University of Pennsylvania, 1995.
K. Hart and L. Wong, "Pruning Nested Data Values Using Branch Expressions With Wildcards," In Abstracts of MIMBD, Cambridge, England, July 1995.
Las Vegas Algorithm for Gene Recognition: Suboptimal and Error-Tolerant Spliced Alignment
Sing-Hoi Sze and Pavel A. Pevzner1
Departments of Computer Science and 1Mathematics;
University of Southern California; Los Angeles, CA 90089
Pevzner: 213/740-2407, Fax: -2424;
Recently, Gelfand, Mironov, and Pevzner (Proc. Natl. Acad. Sci. USA, 1996, 9061-9066) proposed a spliced alignment approach to gene recognition that provides 99% accurate recognition of human genes if a related mammalian protein is available. However, even 99% accurate gene predictions are insufficient for automated sequence annotation in large-scale sequencing projects and therefore have to be complemented by experimental gene verification. 100% accurate gene predictions would lead to a substantial reduction of experimental work on gene identification. Our goal is to develop an algorithm that either predicts an exon assembly with accuracy sufficient for sequence annotation or warns a biologist that the accuracy of a prediction is insufficient and further experimental work is required. We study suboptimal and error-tolerant spliced alignment problems as the first steps towards such an algorithm, and report an algorithm which provides 100% accurate recognition of human genes in 37% of cases (if a related mammalian protein is available). For 52% of genes, the algorithm predicts at least one exon with 100% accuracy.
DOE Grant No. DE-FG03-97ER62383.
Foundations for a Syntactic Pattern-Recognition System for Genomic DNA Sequences: Languages, Automata, Interfaces, and Macromolecules
David B. Searls and G. Christian Overton1
SmithKline Beecham Pharmaceuticals; King of Prussia, PA 19406
610/270-4551, Fax: -5580, firstname.lastname@example.org
1Department of Genetics; University of Pennsylvania; Philadelphia, PA 19104
Viewed as strings of symbols, biological macromolecules can be modelled as elements of formal languages. Generative grammars have been useful in molecular biology for purposes of syntactic pattern recognition, for example in the author's work on the GenLang pattern matching system, which is able to describe and detect patterns that are probably beyond the capability of a regular expression specification. More recently, grammars have been used to capture intramolecular interactions or long-distance dependencies between residues, such as those arising in folded structures. In the work of Haussler and colleagues, for example, stochastic context-free grammars have been used as a framework for "learning" folded RNA structures such as tRNAs, capturing both primary sequence information and secondary structural covariation. Such advances make the study of the formal status of the language of biological macromolecules highly relevant, and in particular the finding that DNA is beyond context-free has already created challenges in algorithm design.
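To illustrate why such patterns go beyond regular expressions, the following minimal Python sketch recognizes a DNA stem-loop using the context-free rule S -> x S x' | loop, where x' is the complement of base x. It is not the GenLang system; the function name and the `min_stem` and `min_loop` parameters are assumptions introduced for this example.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_stem_loop(seq, min_stem=3, min_loop=3):
    """Return True if seq is a stem of >= min_stem complementary pairs
    enclosing an unpaired loop of >= min_loop bases.

    This mirrors the grammar rule S -> x S x' | loop: each iteration
    peels one outer base pair, which is the nested dependency that a
    regular expression cannot express for stems of arbitrary length.
    """
    seq = seq.upper()
    i, j = 0, len(seq) - 1
    stem = 0
    # Peel a pair only if at least min_loop bases would remain inside it.
    while j - i - 1 >= min_loop and COMPLEMENT.get(seq[i]) == seq[j]:
        stem += 1
        i += 1
        j -= 1
    return stem >= min_stem


if __name__ == "__main__":
    print(is_stem_loop("GGGAAAACCC"))  # stem GGG/CCC around loop AAAA
    print(is_stem_loop("GGGAAAAGGG"))  # outer bases are not complementary
```

Peeling pairs greedily is sufficient here because the loop may be any string, so a longer stem never invalidates a match.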
Moreover, to date, such methods have not been able to capture relationships between strings in a collection, such as those that arise via intermolecular interactions, or evolutionary relationships implicit in alignments. Recently we have attempted to remedy this by showing (1) how formal grammars can be extended to describe interacting collections of molecules, such as hybridization products and, potentially, multimeric or physiological protein interactions, and (2) how simple automata can be used to model evolutionary relationships in such a way that complex model-based alignment algorithms can be automatically generated by means of visual programming. These results allow for a useful generalization of the language-theoretic methods now applied to single molecules.
In addition, we describe a new software package, bioWidget, for the rapid development and deployment of graphical user interfaces (GUIs) designed for the scientific visualization of molecular, cellular and genomics information. The overarching philosophy behind bioWidgets is componentry: that is, the creation of adaptable, reusable software, deployed in modules that are easily incorporated in a variety of applications and in such a way as to promote interaction between those applications. This is in sharp distinction to the common practice of developing dedicated applications. The bioWidgets project additionally focuses on the development of specific applications based on bioWidget componentry, including chromosomes, maps, and nucleic acid and peptide sequences.
The current set of bioWidgets has been implemented in Java with the goal of delivering local applications and distributed applets via Intranet/Internet environments as required. The immediate focus is on developing interfaces for information stored in distributed heterogeneous databases such as GDB, GSDB, Entry, and ACeDB. The issues we are addressing are database access, reflecting database schemas in bioWidgets, and performance. We are also directing our efforts into creating a consortium of bioWidget developers and end users. This organization will create standards for and encourage the development of bioWidget components. Primary participants in the consortium include Gerry Rubin (UC Berkeley) and Nat Goodman (Jackson Labs).
DOE Grant No. DE-FG02-92ER61371.
D.B. Searls, "Formal Grammars for Intermolecular Structure," First International Symposium on Intelligence in Neural and Biological Systems, 30-37 (1995).
D.B. Searls and K.P. Murphy, "Automata-Theoretic Models of Mutation and Alignment," Third International Conference on Intelligent Systems for Molecular Biology, 341-349 (1995).
D.B. Searls, "bioTk: Componentry for Genome Informatics Graphical User Interfaces," Gene 163 (2):GC1-16 (1995).
Analysis and Annotation of Nucleic Acid Sequence
David J. States, Ron Cytron, Pankaj Agarwal, and Hugh Chou
Institute for Biomedical Computing; Washington University; St. Louis, MO 63108
314/362-2134, Fax: -0234, email@example.com
Bayesian estimates for sequence similarity: There is an inherent relationship
between the process of pairwise sequence alignment and the estimation of
evolutionary distance. This relationship is explored and made explicit.
Assuming an evolutionary model and given a specific pattern of observed
base mismatches, the relative probabilities of evolution at each evolutionary
distance are computed using a Bayesian framework. The mean or the median
of this probability distribution provides a robust estimate of the central
value. Bayesian estimates of the evolutionary distance incorporate arbitrary
prior information about variable mutation rates both over time and along
sequence position, thus requiring only a weak form of the molecular-clock hypothesis.
These techniques and estimates are used to infer the duplication history of the genomic sequence in C. elegans and in S. cerevisiae. Our results indicate that repeats discovered using a single scoring matrix show a considerable bias in subsequent evolutionary distance estimates.
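The Bayesian estimate described above can be sketched numerically as follows. The abstract does not name the evolutionary model, so this example assumes the Jukes-Cantor model, a grid of candidate distances, and a uniform prior by default; the function names and the grid are illustrative choices, not the authors' implementation.

```python
import math


def jc_mismatch_prob(d):
    """Probability that a site differs after distance d (Jukes-Cantor, assumed)."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


def posterior_distance(mismatches, length, grid=None, prior=None):
    """Posterior over evolutionary distance given observed mismatches.

    grid  : candidate distances (substitutions per site), all > 0
    prior : unnormalized prior weight for each grid point (uniform if None)
    """
    if grid is None:
        grid = [0.005 * i for i in range(1, 601)]  # 0.005 .. 3.0
    if prior is None:
        prior = [1.0] * len(grid)
    log_post = []
    for d, w in zip(grid, prior):
        if w <= 0:
            log_post.append(float("-inf"))
            continue
        p = jc_mismatch_prob(d)
        log_lik = mismatches * math.log(p) + (length - mismatches) * math.log(1.0 - p)
        log_post.append(log_lik + math.log(w))
    top = max(log_post)
    weights = [math.exp(v - top) for v in log_post]  # subtract max for stability
    total = sum(weights)
    return grid, [w / total for w in weights]


def mean_and_median(grid, post):
    """Posterior mean and median, the two central values named in the text."""
    mean = sum(d * p for d, p in zip(grid, post))
    cumulative = 0.0
    for d, p in zip(grid, post):
        cumulative += p
        if cumulative >= 0.5:
            return mean, d
    return mean, grid[-1]


if __name__ == "__main__":
    grid, post = posterior_distance(mismatches=10, length=100)
    print(mean_and_median(grid, post))
```

Passing a non-uniform `prior` is how prior information about mutation rates would enter in this sketch.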
Model-based sequence scoring metrics: The PAM-based DNA comparison metric has been extended to incorporate biases in nucleotide composition and mutation rates, extending earlier work (States, Gish and Altschul, 1993). A codon-based scoring system has been developed that incorporates the effects of biased codon utilization frequencies.
A dynamic programming algorithm has been developed that will optimally align sequences using a choice of comparison measures (noncoding vs. coding, etc.). We are in the process of evaluating this approach as a means for identifying likely coding regions in cDNA sequences.
Efficient sequence similarity search tools: Most sequence search tools have been designed for use with protein sequence queries a few hundred residues long. The analysis of genomic DNA sequence necessitates the use of queries hundreds of kilobases or even megabases in length. A memory- and computation-efficient search tool has been developed for the identification of repeats and sequence similarity in very large segments of nucleic acid sequence. The tool implements optimal encoding of the word table, repeat filters, flexible scoring systems, and analytically parameterized search sensitivity. Output formats are designed for the presentation of genomic sequence searches.
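As a rough illustration of a word table for repeat detection, the sketch below indexes every k-base word of a sequence under a 2-bit-per-base integer key and reports words that occur more than once. The encoding, the function names, and the default word size are assumptions for this example and are not taken from the tool described above.

```python
from collections import defaultdict

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def build_word_table(seq, k=4):
    """Map each k-mer (packed as an integer, 2 bits per base) to its start positions."""
    table = defaultdict(list)
    mask = (1 << (2 * k)) - 1
    word = 0
    valid = 0  # consecutive unambiguous bases ending at the current position
    for pos, base in enumerate(seq.upper()):
        code = CODE.get(base)
        if code is None:  # ambiguous base such as N: restart the window
            word, valid = 0, 0
            continue
        word = ((word << 2) | code) & mask  # rolling update in constant time
        valid += 1
        if valid >= k:
            table[word].append(pos - k + 1)
    return table


def repeated_words(seq, k=4):
    """Return {k-mer: [start positions]} for words seen more than once."""
    seq = seq.upper()
    table = build_word_table(seq, k)
    return {seq[p[0]:p[0] + k]: p for p in table.values() if len(p) > 1}


if __name__ == "__main__":
    print(repeated_words("ACGTTTACGT", k=4))
```

Packing words as integers keeps the table compact and lets each new word be derived from the previous one with a shift and a mask.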
Federated databases: A Sybase server and mirror for GSDB are being developed to facilitate the annotation of repeat sequence elements in public data repositories.
DOE Grant No. DE-FG02-94ER61910.
Gene Recognition, Modeling, and Homology Search in GRAIL and genQuest
Ying Xu, Manesh Shah, J. Ralph Einstein, Sherri Matis, Xiaojun Guan, Sergey Petrov, Loren Hauser,1 Richard J. Mural,1 and Edward C. Uberbacher
Computer Science and Mathematics and 1Biology Divisions; Oak Ridge National Laboratory; Oak Ridge, TN 37831
Uberbacher: 423/574-6134, Fax: -7860, firstname.lastname@example.org
GRAIL is a modular expert system for the analysis and characterization of DNA sequences which facilitates the recognition of gene features and gene modeling. A new version of the system has been created with greater sensitivity for exon prediction (especially in AT-rich regions), more accurate splice site prediction, and robust indel error detection capability. GRAIL 1.3 is available to the user in a Motif graphical client-server system (XGRAIL), through WWW-Netscape, by e-mail server, or callable from other analysis programs using Unix sockets.
In addition to the positions of protein coding regions and gene models, the user can view the positions of a number of other features including polyA addition sites, potential Pol II promoters, CpG islands and both complex and simple repetitive DNA elements using algorithms developed at ORNL. XGRAIL also has a direct link to the genQuest server, allowing characterization of newly obtained sequences by homology-based methods using a number of protein, DNA, and motif databases and comparison methods such as FastA, BLAST, parallel Smith-Waterman, and special algorithms which consider potential frameshifts during sequence comparison.
Following an analysis session, the user can use an annotation tool which is part of the XGRAIL 1.3 system to generate a "feature table" report describing the current sequence and its properties. Links to the GSDB sequence database have been established to record computer-based analysis of sequences during submission to the database or as third-party annotation.
Gene Modeling and Client-Server GRAIL: In addition to the current coding region recognition capabilities based on a multiple sensor-neural network and rule base, modules for the recognition of features such as splice junctions, transcription and translation start and stop, and other control regions have been constructed and incorporated into an expert system (GAP III) for reliable computer-based modeling of genes. Heuristic methods and dynamic programming are used to construct first-pass gene models which include the potential for modification of initially predicted exons. These actions result in a net improvement in gene characterization, particularly in the recognition of very short coding regions. Translation of gene models and database searches are also supported through access to the genQuest server (described below).
Model Organism Systems: A number of model organism systems have been designed and implemented and can be accessed within the XGRAIL 1.3 client including Escherichia coli, Drosophila melanogaster and Arabidopsis thaliana. The performance of these systems is basically equivalent to the Human GRAIL 1.3 system. Additional model organism systems, including several important microorganisms, are in progress.
Error Detection in Coding Sequences: Single-pass DNA sequencing is becoming a widely used technique for gene identification from both cDNA and genomic DNA sequences. An appreciably higher rate of base insertion and deletion errors (indels) in this type of sequence can cause serious problems in the recognition of coding regions, homology search, and other aspects of sequence interpretation. We have developed two error detection and "correction" strategies and systems which make low-redundancy sequence data more informative for gene identification and characterization purposes. The first algorithm detects sequencing errors by finding changes in the statistically preferred reading frame within a possible coding region and then rectifies the frame at the transition point to make the potential exon candidate frame-consistent. We have incorporated this system in GRAIL 1.3 to provide analysis which is very error tolerant. Currently the system can detect about 70% of the indels with an indel rate of 1%, and GRAIL identifies 89% of the coding nucleotides compared to 69% for the system without error correction. The algorithm uses dynamic programming and runs in time and space linear in the size of the input sequence.
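The idea of locating reading-frame transitions with linear-time dynamic programming can be sketched as below. This is not the GRAIL algorithm: it assumes that a per-position coding score is already available for each of the three frames, and the `switch_penalty` parameter and function name are hypothetical.

```python
def best_frame_path(frame_scores, switch_penalty=2.0):
    """Choose the best-scoring reading frame at each position.

    frame_scores   : list of (s0, s1, s2) coding scores per position (assumed input)
    switch_penalty : cost charged whenever the chosen frame changes

    Returns (path, transitions); transitions are the indices where the
    preferred frame changes, i.e. candidate indel locations.
    Time and space are linear in len(frame_scores).
    """
    n = len(frame_scores)
    if n == 0:
        return [], []
    best = list(frame_scores[0])
    back = []
    for scores in frame_scores[1:]:
        new_best, pointers = [], []
        for f in range(3):
            # Best predecessor: stay in frame f for free, or switch at a cost.
            cands = [best[g] - (0.0 if g == f else switch_penalty) for g in range(3)]
            g = max(range(3), key=cands.__getitem__)
            new_best.append(cands[g] + scores[f])
            pointers.append(g)
        best = new_best
        back.append(pointers)
    # Trace back from the best final frame.
    f = max(range(3), key=best.__getitem__)
    path = [f]
    for pointers in reversed(back):
        f = pointers[f]
        path.append(f)
    path.reverse()
    transitions = [i for i in range(1, n) if path[i] != path[i - 1]]
    return path, transitions


if __name__ == "__main__":
    scores = [(1, 0, 0)] * 5 + [(0, 0, 1)] * 5
    print(best_frame_path(scores))
```

A larger `switch_penalty` makes the sketch more conservative about reporting frame changes.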
In the second method, a Smith-Waterman type comparison is facilitated in which the frame of DNA translation to protein sequence can change within the sequence. The transition points in the translation frame are determined during the comparison process and a best match to potential protein homologs is obtained with sections of translations from more than one frame. The algorithm can detect homologies with a sensitivity equivalent to Smith-Waterman in the presence of 5% indel errors.
Detection of Regulatory Regions: An initial Polymerase II promoter detection system has been implemented which combines individual detectors for TATA, CAAT, GC, cap, and translation start elements and distance information using a neural network. This system finds about 67% of TATA-containing promoters with a false positive rate of one per 35 kilobases. Additionally, systems to detect potential polyA addition sites and CpG islands have been incorporated into GRAIL.
The GenQuest Sequence Comparison Server: The genQuest server is an integrated sequence comparison server which can be accessed via email, using Unix sockets from other applications, Netscape, and through a Motif graphical client-server system. The basic purpose of the server system is to facilitate rapid and sensitive comparison of DNA and protein sequences to existing DNA, protein, and motif databases. Databases accessed by this system include the daily updated GSDB DNA sequence database, SwissProt, the dbEST expressed sequence tag database, protein motif libraries and motif analysis systems (Prosite, BLOCKS), a repetitive DNA library (from J. Jurka), Genpept, and sequences in the PDB protein structural database. These options can also be accessed from the XGRAIL graphical client tool.
The genQuest server supports a variety of sequence query types. For searching protein databases, queries may be sent as amino acid or DNA sequence. DNA sequence can be translated in a user-specified frame or in all 6 frames. DNA-DNA searches are also supported. User-selectable methods for comparison include the Smith-Waterman dynamic programming algorithm, FastA, versions of BLAST, and the IBM dFLASH protein sequence comparison algorithm. A variety of options for search can be specified including gap penalties and option switches for Smith-Waterman, FastA, and BLAST, the number of alignments and scores to be reported, desired target databases for query, choice of PAM and Blosum matrices, and an option for masking out repetitive elements. Multiple target databases can be accessed within a single query.
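Translating a DNA query in a chosen frame or in all six frames, as mentioned above, can be sketched in a few lines of Python. The example assumes the standard genetic code and marks any codon containing an ambiguous base as X; the function names are illustrative and are not part of genQuest.

```python
from itertools import product

# Standard genetic code, listed in TCAG order with the first base varying slowest.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Reverse complement of a DNA string (non-ACGT characters are left unchanged)."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq, frame=0):
    """Translate seq starting at offset frame (0, 1, or 2); '*' marks a stop codon."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(frame, len(seq) - 2, 3)
    )


def six_frame(seq):
    """Return translations keyed +1..+3 (forward strand) and -1..-3 (reverse strand)."""
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(seq, offset)
        frames[-(offset + 1)] = translate(rc, offset)
    return frames


if __name__ == "__main__":
    for frame, protein in sorted(six_frame("ATGGCCTAA").items()):
        print(frame, protein)
```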
Additional Interfaces and Access: Batch GRAIL 1.3 is a new "batch" GRAIL client that allows users to analyze groups of short (300-400 bp) sequences for coding character and automates a wide choice of database searches for homology and motifs. A Command Line Sockets Client has been constructed which allows remote programs to call all the basic analysis services provided by the GRAIL-genQuest system without the need to use the XGRAIL interface. This allows convenient integration of selected GRAIL analyses into automated analysis pipelines being constructed at some genome centers. An XGRAIL Motif Graphical Client for the GRAIL release 1.3 has been constructed using Motif with versions for a wide variety of UNIX platforms including Sun, DEC, and SGI. The email version of GRAIL can be accessed at email@example.com and the email version of genQuest can be accessed at Q@ornl.gov. Instructions can be obtained by sending the word "help" to either address. The Motif or Sun versions of XGRAIL, batch GRAIL, and XgenQuest client software are available by anonymous ftp from grailsrv.lsd.ornl.gov (184.108.40.206). Both GRAIL and genQuest are accessible over the World Wide Web (URL http://compbio.ornl.gov). Communications with the GRAIL staff should be addressed to GRAILMAIL@ornl.gov.
DOE Contract No. DE-AC05-84OR21400.
Informatics Support for Mapping in Mouse-Human Homology Regions
Edward Uberbacher, Richard Mural,1 Manesh Shah, Loren Hauser,1 and Sergey Petrov
Computer Science and Mathematics Division and 1Biology Division; Oak Ridge National Laboratory; Oak Ridge, TN 37831
423/574-6134, Fax: -7860, firstname.lastname@example.org
The purpose of this project is to develop databases and tools for the Oak Ridge National Laboratory (ORNL) Mouse-Human Mapping Project, including the construction of a mapping database for the project; tools for managing and archiving cDNAs and other probes used in the laboratory; and analysis tools for mapping, interspecific backcross, and other needs. Our initial effort involved installing and developing a relational SYBASE database for tracking samples and probes, experimental results, and analyses. Recent work has focused on a corresponding ACeDB implementation containing mouse mapping data and providing numerous graphical views of this data. The initial relational database was constructed with SYBASE using a schema modeled on one implemented at the Lawrence Livermore National Laboratory (LLNL) center; this choice was made because of the documentation available for the LLNL system and the opportunity to maximize compatibility with human chromosome 19 mapping. (Major homologies exist between human chromosome 19 and mouse chromosome 7, the initial focus of the ORNL work.)
Our ACeDB implementation was modeled, with some modification, on the Lawrence Berkeley National Laboratory (LBNL) chromosome 21 ACeDB system and designed to contain genetic and physical mouse map data as well as homologous human chromosome data. The usefulness of exchanging map information with LLNL (human chromosome 19) and potentially with other centers has led to the implementation of procedures for data export and the import of human mapping data into ORNL databases.
User access to the system is being provided by workstation forms-based data entry and ACeDB graphical data browsing. We have also implemented the LLNL database browser to view human chromosome 19 data maintained at LLNL, and arrangements are being made to incorporate mouse mapping information into the browser. Other applications such as the Encyclopedia of the Mouse, specific tools for archiving and tracking cDNAs and other mapping probes, and analysis of interspecific backcross data and YAC restriction mapping have been implemented.
We would like to acknowledge use of ideas from the LLNL and LBNL Human Genome Centers.
DOE Contract No. DE-AC05-84OR21400.
SubmitData: Data Submission to Public Genomic Databases
Manfred D. Zorn
Software Technologies and Applications Group; Information and Computing Sciences Division;
Lawrence Berkeley National Laboratory; University of California; Berkeley, CA 94720
510/486-5041, Fax: -4004, email@example.com
Making information generated by the various genome projects available to the community is very important for the researcher submitting data and for the overall project to justify the expenses and resources. Public genome databases generally provide a protocol that defines the required data formats and details how they accept data, e.g., sequences, mapping information. These protocols have to strike a balance between ease of use for the user and operational considerations of the database provider, but are in most cases rather complex and subject to change to accommodate modifications in the database.
SubmitData is a user interface that formats data for submission to GSDB or GDB. The user interface serves data entry purposes, checking each field for data types, allowed ranges and controlled values, and gives the user feedback on any problems. Besides one-time submissions, templates can be created that can later be merged with TAB-delimited data files, e.g., as produced by common spreadsheet programs. Variables in the template are then replaced by values in defined columns of the input data file. Thus submitting large amounts of related data becomes as easy as selecting a format and supplying an input filename. This allows easy integration of data submission into the data generation process.
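The template merge described above can be approximated with Python's standard library. SubmitData itself was written in VisualWorks, so this is only an illustration of the idea: the `$name` placeholder syntax, the column names, and the record layout are all assumptions made for the example.

```python
import csv
import io
from string import Template


def merge_template(template_text, tab_delimited_text):
    """Fill one copy of the template per data row.

    Placeholders such as $locus are replaced by the value in the column
    whose header has the same name (a hypothetical convention).
    """
    template = Template(template_text)
    reader = csv.DictReader(io.StringIO(tab_delimited_text), delimiter="\t")
    # substitute() raises KeyError if a placeholder has no matching column,
    # which surfaces template/data mismatches early.
    return [template.substitute(row) for row in reader]


if __name__ == "__main__":
    template = "LOCUS: $locus\nLENGTH: $length bp\n"
    data = "locus\tlength\nD19S1\t350\nD19S2\t412\n"
    for record in merge_template(template, data):
        print(record)
```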
The interface is generated directly from the protocol specifications. A specific parser/compiler interprets the protocol definitions and creates internal objects that form the basis of the user interface. Thus a working user interface (i.e., the static layout of buttons and fields, plus data validation) is automatically generated from the protocol definitions. Protocol modifications are propagated by simply regenerating the interface.
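Generating validation logic from a protocol definition might look like the following sketch. The dictionary-based protocol format, the field names, and the helper functions are hypothetical; the actual GSDB and GDB protocol syntax is not reproduced here.

```python
def make_validator(name, spec):
    """Build a checker for one field from its (hypothetical) protocol entry.

    spec may contain: "type" ("int" or "str"), "range" (low, high),
    and "allowed" (a list of controlled values).
    """
    def validate(raw):
        kind = spec.get("type", "str")
        try:
            value = int(raw) if kind == "int" else str(raw)
        except ValueError:
            return f"{name}: expected {kind}, got {raw!r}"
        if "range" in spec:
            low, high = spec["range"]
            if not low <= value <= high:
                return f"{name}: {value} outside [{low}, {high}]"
        if "allowed" in spec and value not in spec["allowed"]:
            return f"{name}: {value!r} is not an allowed value"
        return None  # no problem found
    return validate


def build_validators(protocol):
    """Regenerate all field checkers from the protocol definition."""
    return {name: make_validator(name, spec) for name, spec in protocol.items()}


def check_record(validators, record):
    """Return a list of problem messages for one submitted record."""
    problems = []
    for name, validate in validators.items():
        if name not in record:
            problems.append(f"{name}: missing")
            continue
        message = validate(record[name])
        if message:
            problems.append(message)
    return problems


if __name__ == "__main__":
    protocol = {
        "length": {"type": "int", "range": (1, 100000)},
        "strand": {"allowed": ["+", "-"]},
    }
    validators = build_validators(protocol)
    print(check_record(validators, {"length": "350", "strand": "+"}))
    print(check_record(validators, {"length": "abc", "strand": "?"}))
```

Because the checkers are derived entirely from the protocol dictionary, a protocol change only requires calling `build_validators` again.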
The program has been developed using ParcPlace VisualWorks and currently supports GSDB, GDB and RHdb data submissions. The program has been updated to use VisualWorks 2.0.
DOE Contract No. DE-AC03-76SF00098.
Note: The proceedings of the 1997 DOE Human Genome Program Contractor-Grantee Workshop VI, which include updated research abstracts, can be found at: