Transforming Molecular Biology Databases Using Morphase[1]

S.B. Davidson, A. Kosky, C. Overton and P. Buneman

Dept. of Computer and Information Science & Dept. of Genetics, University of Pennsylvania, Philadelphia, PA 19104. Email: {susan, coverton, peter}@cis.upenn.edu

The Human Genome Project (HGP) involves a proliferation of different databases, including both archival (GenBank, GSDB) and local notebook databases. These databases frequently use incompatible structures to represent the same or overlapping data, and may further be implemented in a variety of data-models and database management systems (DBMSs), including object-oriented systems (ACEDB, OPM, Object Store), flat-relational databases (SYBASE), and structured text files (ASN.1). It is frequently necessary to transform data between these incompatible databases and data-models. For example, data stored in a database at one HGP site may have an impact on the experiments being carried out at another site, and therefore needs to be stored in the local laboratory notebook database at the second site. Further, useful tools, such as data browsing or analysis tools, may be implemented for a particular DBMS or database schema, and it is desirable to move data from another database into this system so that these tools can be applied. What is required is more than a uniform user interface to distributed databases: data must be integrated and transformed into the structures required by other databases or applications. The problem is aggravated by the rapid evolution of database schemas that results from constantly changing experimental and analysis techniques; any data transformation system must itself change frequently to reflect this schema evolution. Finally, there is an increasing number of instantiated integrated genomic databases, such as the Integrated Genomic Database (IGD[2]), which require transformations from the member databases into the new structure.

Implementing such transformations by hand on a case-by-case basis is time-consuming and error-prone. Consequently, there is a need for a method of specifying and implementing transformations in a uniform way, one that allows transformations to be specified across a wide variety of different data-models and to be formally analyzed and verified. Morphase is a prototype system for specifying transformations between data sources and targets in an intuitively appealing, declarative language based on Horn clause logic. Transformation specifications are then translated into an underlying database programming language, CPL[3], for implementation. The data-types underlying Morphase include arbitrarily nested records, sets, variants, lists and object identity, thus capturing the types common to most data formats, in particular ASN.1[4] and ACE[5].
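
The idea of a Horn-clause-style transformation over nested data can be illustrated with a small Python sketch. This is not Morphase or CPL syntax, neither of which is shown in this abstract; the Rule class, the field names, and the sample records are all hypothetical. Each rule pairs a body, which enumerates variable bindings found in the nested source, with a head, which builds one flat target tuple per binding.

```python
from dataclasses import dataclass
from typing import Callable, Iterator

# Hypothetical nested source: records containing a list of records,
# each carrying a variant encoded as a (tag, payload) pair.
source = [
    {"clone": "cA1", "markers": [
        {"name": "m1", "kind": ("STS", {"length": 120})},
        {"name": "m2", "kind": ("Probe", {})},
    ]},
    {"clone": "cB2", "markers": []},
]

@dataclass(frozen=True)
class Rule:
    """Reads as: head(binding) holds for every binding yielded by body(db)."""
    target: str                              # name of the target relation
    body: Callable[[list], Iterator[dict]]   # enumerates variable bindings
    head: Callable[[dict], tuple]            # builds one target tuple

def sts_body(db):
    # Body: clone C has marker M, and M's variant is an STS of length L.
    for c in db:
        for m in c["markers"]:
            tag, payload = m["kind"]
            if tag == "STS":
                yield {"C": c["clone"], "M": m["name"], "L": payload["length"]}

rules = [Rule("sts_on_clone", sts_body, lambda b: (b["C"], b["M"], b["L"]))]

def apply_rules(rules, db):
    """Evaluate every rule against db and collect target tuples as sets."""
    out = {}
    for r in rules:
        out.setdefault(r.target, set()).update(r.head(b) for b in r.body(db))
    return out

result = apply_rules(rules, source)
```

Keeping each rule as a separate declarative unit means that a schema change in the source touches only the rule bodies that mention the changed structure, which is the maintenance benefit the paragraph above argues for.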

The CPL implementation of Morphase can be connected to a wide variety of data sources through data drivers, modular interfaces that mediate between the internal language of CPL and distributed data sources. Additional drivers for new data sources can easily be added as they arise. In particular, drivers to connect CPL to ASN.1, ACEDB and SYBASE have been developed; other drivers, for example to OPM, are currently being developed. Each driver can be used to query as well as update any data source of its type; for example, the Sybase driver can be used for our local Sybase database, Chr22DB. In this way, data can be read from multiple heterogeneous data sources, transformed using Morphase according to the desired output format, and inserted into the target data source.
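
The driver arrangement described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the actual CPL driver interface: the Driver and InMemoryDriver classes, the run function, and the sample records are hypothetical, and in-memory lists stand in for real ASN.1, ACEDB or Sybase connections.

```python
from abc import ABC, abstractmethod

class Driver(ABC):
    """Hypothetical uniform interface that hides one kind of data source."""

    @abstractmethod
    def read(self):
        """Return all records from the underlying source."""

    @abstractmethod
    def write(self, records):
        """Insert records into the underlying source."""

class InMemoryDriver(Driver):
    # Stand-in for a real driver; it keeps its records in a Python list.
    def __init__(self, records=()):
        self.records = list(records)

    def read(self):
        return list(self.records)

    def write(self, records):
        self.records.extend(records)

def run(sources, transform, target):
    """Read from every source, transform the merged data, load the target."""
    merged = [rec for driver in sources for rec in driver.read()]
    target.write(transform(merged))

# Two made-up heterogeneous sources and an empty target.
ace_like = InMemoryDriver([{"locus": "locus_a", "chrom": "22"}])
asn_like = InMemoryDriver([{"locus": "locus_b", "chrom": "22"}])
target = InMemoryDriver()

def to_rows(records):
    # Reshape dictionary records into flat (locus, chromosome) rows.
    return [(r["locus"], int(r["chrom"])) for r in records]

run([ace_like, asn_like], to_rows, target)
```

Because the pipeline sees only the read and write methods, supporting a new kind of source in this sketch means writing one more Driver subclass, with no change to the transformation itself.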

We have tested Morphase by applying it to a variety of different transformation problems involving Sybase, ACE and ASN.1. In particular, we used it to specify a transformation between the Sanger Center's Chromosome 22 ACE database (ACE22DB) and the Philadelphia Genome Center's Chromosome 22 Sybase database (Chr22DB), as well as between a portion of GDB and Chr22DB. Some of these transformations had already been hand-coded without our tools, forming a basis for comparison. Once the semantic correspondences between objects in the various databases were understood, writing the transformation program in Morphase was easy, even for a non-expert user of the system. Furthermore, it was easy to find conceptual errors in the transformation specification. In contrast, the hand-coded programs were opaque, difficult to understand, and even more difficult to debug.

[1] This research was supported by a grant from the Director, Health Effects and Life Science Research Division, Office of Health and Environmental Research of the U.S. Department of Energy under contract DOE DE-FG02-94-ER-61923 Sub 1.

[2] O. Ritter et al., Computers and Biomedical Research, 27:97-115 (1994).

[3] P. Buneman et al., Proceedings of the 21st International Conference on Very Large Data Bases (September 1995).

[4] "NCBI ASN.1 Specification", National Library of Medicine, Bethesda, MD (1992).

[5] J. Thierry-Mieg and R. Durbin, "Syntactic Definitions for the ACEDB Data Base Manager", Tech Report, MRC Laboratory for Molecular Biology, Cambridge, CB2 2QH, UK (1992).


Abstracts scanned from text submitted for January 1996 DOE Human Genome Program Contractor-Grantee Workshop.
