Archive Edition
Sponsored by the U.S. Department of Energy Human Genome Program
Santa Fe, New Mexico, November 13-17, 1994
The electronic form of this document may be cited in the following style: Abstracts scanned from text submitted for November 1994 DOE Human Genome Program Contractor-Grantee Workshop. Inaccuracies have not been corrected.
New Techniques for Integration of Biological Data Sources

Peter Buneman [1], Susan Davidson [1], and Chris Overton [2]

Much biological data resides in sources that are not conventional databases. Examples include NCBI's ASN.1 database, ACEDB, numerous text files, and the output of sequence matching algorithms (FASTA and BLAST). Such sources cannot be queried by conventional database languages, nor are there systems that facilitate restructuring data from one of these formats to another. To this end, we are developing two related systems. The first is a general-purpose query language, CPL, that subsumes existing database languages and provides interfaces for these varied sources. The second is a transformation tool, TSL, that allows the declarative specification of how one data source may be mapped onto a second.

CPL is a language for data access that has evolved from some basic ideas in category theory. It allows us to generalize relational query languages to a much wider range of data types, including those used in the sources mentioned above and in object-oriented database management systems. For example, our system can answer queries such as: "For sequences in the interval p11.1-q13.2 of human chromosome 22, find all Alu elements internal to a gene domain". This query is answered by first retrieving map information from the Genome DataBase and sequence annotation from GenBank (relational or ASN.1). The sequence corresponding to the primary transcript of each gene is then extracted by the application program QGB (developed locally). Finally, FASTA is used to compare these sequences with a database of Alu elements. These programs were easily added to our system, and are called in the same way as queries to data sources.

Data restructuring is one of the most common and difficult tasks in the current biological database environment. Databases are usually designed with a "conceptual modeling" tool and then translated into some practical database management system. The problem is that people want to reason about, transform, and query the conceptual structure while actually manipulating the physical representation. Current techniques for doing this are largely based on variants of the entity-relationship model, a model that is open to semantic misinterpretation and that fails to account for an adequate variety of data types. We have developed transformation techniques and a prototype tool, TSL, for schema and data transformation that, like CPL, is based on the natural semantics of the underlying data types.

These tools greatly enhance our ability to transform, integrate, and access heterogeneous data sources. While they may be used for the physical construction of a monolithic database from the existing data sources, we believe this new approach to database languages calls into doubt the advisability of constructing large, inflexible databases.

[1] S.B. Davidson, A.S. Kosky and B. Eckman, "Facilitating Transformations in a Human Genome Project Database". To appear at Conference on Information and Knowledge Management, 1994.
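
As a rough illustration of the chromosome 22 query described in the abstract, the sketch below strings together the same steps over small in-memory stand-ins. It is not CPL and does not reproduce the authors' system: every name and record in it (`GDB_MAP`, `GENBANK`, `ALU_DB`, `genes_in_interval`, `extract_transcript`, `contains_alu`, the integer positions, the toy sequences) is hypothetical, and an exact substring test stands in for FASTA. The only point it tries to convey is that data sources and application programs can be invoked in the same way inside one comprehension-style query over nested records.

```python
# Hypothetical stand-ins for the real sources; none of this is CPL or real data.

# Map information (the role played by the Genome DataBase).
# Cytogenetic bands are replaced by made-up integer positions so that
# interval membership can be tested with simple comparisons.
GDB_MAP = [
    {"gene": "GENE_A", "chrom": "22", "pos": 5},
    {"gene": "GENE_B", "chrom": "22", "pos": 40},
    {"gene": "GENE_C", "chrom": "21", "pos": 7},
]

# Sequence annotation (the role played by GenBank): nested records, not flat rows.
# "transcript" holds assumed (start, end) offsets of the primary transcript.
GENBANK = {
    "GENE_A": {"sequence": "TTGGCCGGGCGCGGTAA", "transcript": (2, 15)},
    "GENE_B": {"sequence": "GGCCGGGCGCGGACGT", "transcript": (0, 16)},
    "GENE_C": {"sequence": "GGCCGGGCGCGG", "transcript": (0, 12)},
}

# Toy "database of Alu elements": a single short fragment, for illustration only.
ALU_DB = ["GGCCGGGCGCGG"]


def genes_in_interval(chrom, lo, hi):
    """Data source call: genes mapped inside [lo, hi] on the given chromosome."""
    return [r["gene"] for r in GDB_MAP if r["chrom"] == chrom and lo <= r["pos"] <= hi]


def extract_transcript(gene):
    """Application program call (the role played by QGB): slice out the transcript."""
    entry = GENBANK[gene]
    start, end = entry["transcript"]
    return entry["sequence"][start:end]


def contains_alu(seq):
    """Sequence comparison (the role played by FASTA); here an exact substring test."""
    return any(alu in seq for alu in ALU_DB)


def alu_genes(chrom, lo, hi):
    """The whole query: sources and programs are composed in one comprehension."""
    return [
        gene
        for gene in genes_in_interval(chrom, lo, hi)
        if contains_alu(extract_transcript(gene))
    ]


print(alu_genes("22", 0, 10))
```

In a real setting each helper would wrap a remote database or an external program rather than a Python literal; the comprehension in `alu_genes` would stay the same, which is the property the abstract attributes to its approach.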
Last modified: Wednesday, October 29, 2003