DOE Human Genome Program Contractor-Grantee
92. Refreshing Curated Data Warehouses Using XML
Susan B. Davidson, Hartmut Liefke, and G. Christian Overton
Center for Bioinformatics, University of Pennsylvania, Philadelphia, PA 19104-6389
Building a new database for a field of study in biology involves transforming, integrating, and cleansing multiple external data sources, as well as adding new material and annotations. Such databases are commonly called curated warehouses (or materialized views) because they are derived from other databases with value added. Building them entails two primary problems:
1) specifying and implementing the transformation and integration from the underlying source databases to the view database.
2) automating the refresh process.
Previously, we reported on the development of the Kleisli system for implementing data transformation and integration (the first problem). In this abstract, we focus on how XML can be used to solve the second.
XML is a "self-describing," or semi-structured, data format that is increasingly used for data exchange. More recently, XML query languages and storage techniques have been proposed that enable its use in data warehousing; we study the problem of using XML to detect and propagate updates. Determining how the underlying data sources have changed is complicated because biomedical databases propagate their updates in one of three ways:
a) Producing periodic new versions;
b) Timestamping data entries; and
c) Keeping a list of additions and corrections; each element of the list is a complete entry.
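The three propagation styles above could be reduced to a common "which entries changed" step roughly as follows. This is a minimal sketch, not the authors' implementation: the function names, the dictionary layout (entry id mapped to entry text), and the sample entries are all hypothetical.

```python
from datetime import datetime

def changed_from_versions(old, new):
    """(a) Compare two full releases, each a dict: entry id -> entry text."""
    added = {k for k in new if k not in old}
    removed = {k for k in old if k not in new}
    modified = {k for k in new if k in old and new[k] != old[k]}
    return added, removed, modified

def changed_from_timestamps(entries, last_refresh):
    """(b) entries is a dict: entry id -> (timestamp, entry text)."""
    return {k for k, (ts, _) in entries.items() if ts > last_refresh}

def changed_from_change_list(change_list):
    """(c) The source publishes complete added or corrected entries
    as (entry id, entry text) pairs."""
    return {k for k, _ in change_list}

# Hypothetical sample data, for illustration only.
old = {"E1": "<entry>a</entry>", "E2": "<entry>b</entry>"}
new = {"E1": "<entry>a</entry>", "E2": "<entry>B</entry>", "E3": "<entry>c</entry>"}
added, removed, modified = changed_from_versions(old, new)

stamped = {
    "E1": (datetime(1999, 1, 1), "<entry>a</entry>"),
    "E2": (datetime(1999, 6, 1), "<entry>B</entry>"),
}
recent = changed_from_timestamps(stamped, datetime(1999, 3, 1))
listed = changed_from_change_list([("E2", "<entry>B</entry>")])
```

In all three cases the result identifies changed entries as a whole, not what changed inside them, so each changed entry still has to be compared with its previous version.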
We have developed efficient "diff" techniques that compare the old version of an entry with its updated version and produce the minimal updates in XML. Using these minimal updates, we show that, for a large class of warehouse definitions, the curated warehouse can be updated incrementally rather than recomputed from scratch.