On-Demand Integration of Biological Data[1]

J. Crabtree, L. Wong, P. Buneman, S.B. Davidson, and C. Overton

Dept. of Computer and Information Science & Dept. of Genetics, University of Pennsylvania, Philadelphia, PA 19104 Email: {crabtree, limsoon, peter, susan, coverton}@cis.upenn.edu

We have implemented a general-purpose query system, CPL/Kleisli, that provides access to a variety of "unconventional" data sources (e.g., ACeDB, ASN.1, BLAST), as well as to "standard" relational databases. The system represents a major advance in the ability to integrate the growing number and diversity of biology data sources conveniently and efficiently. It features a uniform query interface across heterogeneous data sources, a modular and extensible architecture, and, most significantly for dealing with the Internet environment, a programmable optimizer. We have demonstrated the utility of our system in composing and executing queries that were considered difficult, if not unanswerable, without first either building a monolithic database or writing highly application-specific integration code (details and examples available at http://agave.humgen.upenn.edu/cpl/cplhome.html). In conjunction with other software developed in our group, we have assembled a toolset that supports a range of data integration strategies as well as the ability to create specialized databases initialized from community databases (see abstracts by Buneman et al. and Davidson et al. in this meeting). Our integration strategy is based upon the concept of "mediators", which serve a group of related applications by providing a uniform structural interface to the relevant data sources. This approach is cost-effective in terms of query development time and maintenance. Here we discuss recent results in optimizing queries such as "retrieve all known human sequence containing an Alu repeat in an intragenic region" where the data sources are heterogeneous and distributed across the Internet.

CPL already optimizes queries to minimize response time, and it is open to the addition of both new data sources and algebraic rules governing the use of those data sources. To determine how best to augment CPL's query transformation rules, we have conducted a series of performance tests across different combinations of biological data sources. For the test query "retrieve all sequence entries with a CDS feature located on chromosome A," a CPL query spanning GDB and GSDB approaches the response time of the query executed directly on GSDB, which maintains a local copy of the necessary GDB information. In contrast, the best version of the same query executed across GDB and NCBI-Entrez using our ASN.1 query engine is at least an order of magnitude slower. The tests allowed us to identify optimization rules that apply to large classes of queries and are hence reusable.

To aid in the iterative process of identifying potential bottlenecks and introducing rules to circumvent them, we are developing a set of graphical profiling tools. One such tool displays the alternative query plans generated by the system and a second monitors the actual execution of a query plan, displaying the pattern of data source accesses generated by the system. Profile analysis has enabled us to identify which of the data retrievals were dominating the time spent on our test queries, much as a programming language profiler can reveal how much time is being spent in a specific subroutine or loop. The difference in our case is that the time taken to fetch a particular piece of data is in general dependent on many more variables (network traffic, remote server usage, and so on).
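
The per-source accounting described above can be illustrated with a minimal Python sketch. It is not the CPL/Kleisli profiling tool: the class `AccessProfiler`, the source labels, and the stand-in fetch functions are all hypothetical, chosen only to show how call counts and elapsed time might be attributed to each data source during the execution of a query plan.

```python
import time
from collections import defaultdict


class AccessProfiler:
    """Accumulates call counts and elapsed time per data source (hypothetical helper)."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self.calls = defaultdict(int)
        self.seconds = defaultdict(float)

    def timed(self, source, fetch, *args):
        """Run fetch(*args), charging its elapsed time to `source`."""
        start = self._clock()
        try:
            return fetch(*args)
        finally:
            self.calls[source] += 1
            self.seconds[source] += self._clock() - start

    def report(self):
        """Return (source, calls, seconds) rows, slowest source first."""
        rows = [(s, self.calls[s], self.seconds[s]) for s in self.calls]
        return sorted(rows, key=lambda row: row[2], reverse=True)


# Stand-in fetch functions; a real system would issue remote queries here.
def fetch_loci(chromosome):
    return [f"locus-{chromosome}-{i}" for i in range(3)]


def fetch_sequences(locus):
    return [f"seq-for-{locus}"]


def run_query(profiler, chromosome):
    """One access to the first source, then one access per locus to the second."""
    results = []
    for locus in profiler.timed("source_A", fetch_loci, chromosome):
        results.extend(profiler.timed("source_B", fetch_sequences, locus))
    return results
```

Calling `run_query(AccessProfiler(), "22")` and then inspecting `report()` shows the access pattern (one request to the first source, one per locus to the second). With real remote sources, the recorded times would also reflect network traffic and server load, which is why repeated runs would be needed before drawing conclusions.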

We have found that CPL (or an analogous system) needs access to meta-data about the data sources it queries in order to decide which optimization strategy to employ in a given situation. Extending the current system to be aware of such information where it is available will bring a twofold advantage. On the one hand, such information is almost certain to be essential in arbitrating between different optimization rules and classes of rules. On the other hand, since there is often overlap between what the optimizer needs to know to generate an efficient plan and what a user needs to know to compose a query, an obvious extension is to enable the system to guide a user's query based on its (necessarily) up-to-date knowledge of the data sources. A competent query interface should serve not only to hide irrelevant details, but also to provide relevant details. Thus our two immediate goals, usability and efficiency, are not necessarily orthogonal, as they might at first appear, and we hope to exploit the connection.
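
As a rough illustration of how source meta-data could arbitrate between optimization strategies, the following Python sketch compares two ways of joining local keys against a remote source. It does not describe the CPL optimizer: the `SourceMetadata` fields, the two strategy names, and the linear cost model are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceMetadata:
    """Hypothetical per-source statistics an optimizer might consult."""
    name: str
    row_count: int      # approximate number of entries held by the source
    latency_s: float    # assumed fixed cost of one remote request
    per_row_s: float    # assumed transfer cost per returned row


def cost_lookup_per_key(inner: SourceMetadata, n_keys: int) -> float:
    """One remote request per outer key, each returning roughly one row."""
    return n_keys * (inner.latency_s + inner.per_row_s)


def cost_bulk_fetch(inner: SourceMetadata) -> float:
    """A single request that transfers the whole inner collection for a local join."""
    return inner.latency_s + inner.row_count * inner.per_row_s


def choose_strategy(inner: SourceMetadata, n_keys: int) -> str:
    """Pick the cheaper of the two strategies under the assumed cost model."""
    if cost_lookup_per_key(inner, n_keys) <= cost_bulk_fetch(inner):
        return "lookup_per_key"
    return "bulk_fetch"
```

Under this model a handful of keys favors individual lookups, while a large key set favors one bulk transfer. The same statistics (approximate size, expected latency) are also the kind of information a query interface could show a user before the query is submitted.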

[1] This research was supported by a grant from the Director, Health Effects and Life Science Research Division, Office of Health and Environmental Research of the U.S. Department of Energy under contract DOE DE-FG02-94-ER-61923 Sub 1.


Abstracts scanned from text submitted for January 1996 DOE Human Genome Program Contractor-Grantee Workshop.
