The electronic form of this document may be cited in the following style:
Human Genome Program, U.S. Department of Energy, DOE Human Genome Program Contractor-Grantee Workshop IV, 1994.
Abstracts scanned from text submitted for November 1994 DOE Human Genome Program Contractor-Grantee Workshop. Inaccuracies have not been corrected.
New Techniques for Integration of Biological Data Sources
Peter Buneman, Susan Davidson, and Chris Overton
Department of Computer and Information Science, University of Pennsylvania; Department of Genetics, University of Pennsylvania School of Medicine
Much biological data resides in sources that are not conventional databases. Examples include NCBI's ASN.1 database, ACEDB, numerous text files, and the output of sequence matching algorithms (FASTA and BLAST). Such sources cannot be queried by conventional database languages, nor are there systems that facilitate restructuring data from one of these formats into another. To this end, we are developing two related systems. The first is a general-purpose query language, CPL, that subsumes existing database languages and provides interfaces to these varied sources. The second is a transformation tool, TSL, that allows the declarative specification of how one data source may be mapped onto a second.
CPL is a language for data access that has evolved from some basic ideas in category theory. It allows us to generalize relational query languages to a much wider range of data types, including those used in the sources mentioned above and in object-oriented database management systems. For example, our system can answer queries such as: "For sequences in the interval p11.1-q13.2 of human chromosome 22, find all Alu elements internal to a gene domain". This query is answered first by retrieving map information from the Genome DataBase and sequence annotation from GenBank (relational or ASN.1). The sequence corresponding to the primary transcript of each gene is then extracted by the application program QGB (developed locally). Finally, FASTA is used to compare these sequences with a database of Alu elements. These programs were easily added to our system, and are called in the same way as queries to data sources.
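The shape of such a query can be sketched in ordinary Python. This is not CPL syntax, which the abstract does not show. It only illustrates the idea of one comprehension ranging over several sources, with an external program called in the same way as a data source. Every function name, gene name, coordinate, and sequence below is a hypothetical placeholder, and the substring test merely stands in for a real similarity search such as FASTA.

```python
# Illustrative sketch only: Python, not CPL. All names and records are
# invented placeholders, not real map or sequence data.

def map_source():
    """Stand-in for a map database: one record per gene with its location."""
    return [
        {"gene": "geneA", "chrom": "22", "start": 100, "end": 200},
        {"gene": "geneB", "chrom": "22", "start": 900, "end": 950},
        {"gene": "geneC", "chrom": "21", "start": 120, "end": 180},
    ]

def annotation_source():
    """Stand-in for a sequence-annotation source, keyed by gene name."""
    return {
        "geneA": "TTTTGGCCGGGCGCGGTGGCTCAAAA",
        "geneB": "ACGTACGTACGT",
        "geneC": "GGCCGGGCGCGGTGGCTC",
    }

# Hypothetical repeat library: name -> motif.
REPEAT_LIBRARY = {"repeat1": "GGCCGGGCGCGGTGGCTC"}

def similarity_search(sequence, library):
    """Stand-in for an external matcher: a naive exact-substring test."""
    return [name for name, motif in library.items() if motif in sequence]

def genes_with_repeats(chrom, lo, hi):
    """Join the map and annotation sources, then call the matcher per gene."""
    seqs = annotation_source()
    return [
        (m["gene"], hit)
        for m in map_source()
        if m["chrom"] == chrom and lo <= m["start"] and m["end"] <= hi
        for hit in similarity_search(seqs[m["gene"]], REPEAT_LIBRARY)
    ]

result = genes_with_repeats("22", 0, 500)
print(result)
```

The external matcher appears inside the comprehension as just another generator, which is the property the paragraph above describes: programs and data sources are invoked uniformly.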
Data restructuring is one of the most common and difficult tasks in the current biological database environment. Databases are usually designed through a "conceptual modeling" tool and then translated into some practical database management system. The problem is that people want to reason about, transform, and query the conceptual structure, while actually manipulating the physical representation. Current techniques for doing this are largely based on variants of the entity-relationship model, a model that is open to semantic misinterpretation and that fails to take account of an adequate variety of data types. We have developed transformation techniques and a prototype tool, TSL, for schema and data transformation that, like CPL, is based on the natural semantics of the underlying data types.
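To make "declarative specification" concrete, here is a minimal sketch in Python of a transformation stated as data rather than as procedural code. It is not TSL syntax, which the abstract does not show. The rule format, field names, and sample record are all assumptions made up for illustration.

```python
# Illustrative sketch only: not TSL. Field names and the rule format are
# hypothetical.

# Each target field names the source field(s) it is derived from and the
# conversion applied to the value(s).
RULES = {
    "locus_name": ("symbol", str.upper),
    "chromosome": ("chrom", str),
    "length": (("start", "end"), lambda s, e: e - s),
}

def transform(record, rules):
    """Apply the rules to one source record, producing one target record."""
    out = {}
    for target, (source, convert) in rules.items():
        if isinstance(source, tuple):
            # Several source fields feed one target field.
            out[target] = convert(*(record[f] for f in source))
        else:
            out[target] = convert(record[source])
    return out

source_records = [{"symbol": "geneA", "chrom": 22, "start": 100, "end": 250}]
target_records = [transform(r, RULES) for r in source_records]
print(target_records)
```

Because the mapping lives in the `RULES` table, it can be inspected or changed without touching the code that applies it.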
These tools greatly enhance our ability to transform, integrate and access heterogeneous data sources. While they may be used for the physical construction of a monolithic database from the existing data sources, we believe this new approach to database languages calls into doubt the advisability of constructing large, inflexible databases.
S.B. Davidson, A.S. Kosky, and B. Eckman, "Facilitating Transformations in a Human Genome Project Database," to appear at Conference on Information and Knowledge Management, 1994.
P. Buneman, L. Libkin, D. Suciu, V. Tannen, and L. Wong, "Comprehension Syntax," SIGMOD Record 23(1):87-96, March 1994.
G.C. Overton, J.S. Aaronson, J. Haas, and J. Adams, "QGB: A System for Querying Sequence Database Fields and Features," Journal of Computational Biology 1(1):3-13, 1994.