Bioinformatics Section 

DOE Human Genome Program Contractor-Grantee Workshop VIII
February 27-March 2, 2000  Santa Fe, NM



92. Refreshing Curated Data Warehouses Using XML

Susan B. Davidson, Hartmut Liefke, and G. Christian Overton

Center for Bioinformatics, University of Pennsylvania, Philadelphia, PA 19104-6389

susan@cis.upenn.edu

Building a new database for a field of study in biology involves transforming, integrating, and cleansing multiple external data sources, as well as adding new material and annotations. Such databases are commonly called curated warehouses (or materialized views) because they are derived from other databases with value added. Building them entails two primary problems:

1) specifying and implementing the transformation and integration from the underlying source databases to the view database.

2) automating the refresh process.

Previously, we reported on the development of the Kleisli system for implementing data transformation and integration (the first problem). In this abstract, we focus on how XML can be used to solve the second.

XML is a "self-describing," or semi-structured, data format that is increasingly used for data exchange. More recently, XML query languages and storage techniques have been proposed that enable its use in data warehousing; we study the problem of using XML to detect and propagate updates. Determining how the underlying data sources have changed is complicated because biomedical databases propagate their updates in one of three ways:

a) Producing periodic new versions;

b) Timestamping data entries; and

c) Keeping a list of additions and corrections; each element of the list is a complete entry.
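To make the three propagation styles concrete, the following is a minimal sketch of how a refresh process might reduce each of them to a set of changed entries. It is an illustration only, not the method described in this abstract: the `Entry` record, its fields, and the three function names are all hypothetical, and real sources would need format-specific parsing first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical source entry; the field names are assumptions."""
    entry_id: str
    timestamp: int  # assumed last-modified time, in arbitrary units
    body: str       # assumed serialized content of the entry


def changed_by_versions(old, new):
    """(a) Compare two full releases; return (added, removed, modified) ids."""
    old_by_id = {e.entry_id: e for e in old}
    new_by_id = {e.entry_id: e for e in new}
    added = sorted(new_by_id.keys() - old_by_id.keys())
    removed = sorted(old_by_id.keys() - new_by_id.keys())
    modified = sorted(
        i for i in old_by_id.keys() & new_by_id.keys()
        if old_by_id[i].body != new_by_id[i].body
    )
    return added, removed, modified


def changed_by_timestamp(entries, last_refresh):
    """(b) Select ids of entries stamped after the previous refresh."""
    return sorted(e.entry_id for e in entries if e.timestamp > last_refresh)


def changed_by_log(log):
    """(c) Each log element is a complete entry; the latest one per id wins."""
    latest = {}
    for e in log:  # the log is assumed to be in chronological order
        latest[e.entry_id] = e
    return latest
```

Note that style (a) is the only one that reveals deletions directly; under (b) and (c) the old copy of each changed entry must still be fetched from the warehouse side before a comparison can be made.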

We have developed efficient "diff" techniques that compare old versions of entries with their updated versions and produce the minimal updates in XML. Using these minimal updates, we show that, for a large class of warehouse definitions, the curated warehouse can be updated incrementally rather than recomputed from scratch.
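As a rough illustration of the idea, the sketch below computes a field-level diff between two XML versions of an entry and applies it to a warehouse row without recomputing the row from the source. It is a simplified stand-in, not the diff technique developed in this work: it assumes each entry is a flat element whose child tags are unique, it assumes the warehouse view is a simple projection of selected fields, and the names `diff_entry`, `refresh_view`, and the sample tags are hypothetical.

```python
import xml.etree.ElementTree as ET


def diff_entry(old_xml, new_xml):
    """Return a list of (op, tag, value) updates turning old into new.

    Assumes a flat entry whose child tags are unique (a simplification).
    """
    old = {c.tag: (c.text or "").strip() for c in ET.fromstring(old_xml)}
    new = {c.tag: (c.text or "").strip() for c in ET.fromstring(new_xml)}
    updates = []
    for tag in sorted(old.keys() | new.keys()):
        if tag not in new:
            updates.append(("delete", tag, None))
        elif tag not in old:
            updates.append(("insert", tag, new[tag]))
        elif old[tag] != new[tag]:
            updates.append(("replace", tag, new[tag]))
    return updates


def refresh_view(view_row, updates, view_fields):
    """Apply only the updates that touch fields kept by the view."""
    row = dict(view_row)  # copy, so the caller's row is left unchanged
    for op, tag, value in updates:
        if tag not in view_fields:
            continue  # this change is invisible to the view
        if op == "delete":
            row.pop(tag, None)
        else:
            row[tag] = value
    return row


OLD = "<entry><organism>Homo sapiens</organism><length>1200</length></entry>"
NEW = ("<entry><organism>Homo sapiens</organism><length>1250</length>"
       "<gene>BRCA1</gene></entry>")

updates = diff_entry(OLD, NEW)
row = refresh_view({"organism": "Homo sapiens", "length": "1200"},
                   updates, view_fields={"organism", "length"})
print(updates)
print(row)
```

In this example the view keeps only `organism` and `length`, so the inserted `gene` field is ignored and only the changed `length` value is written. Views that join or aggregate across entries would need more than this per-row projection.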

 


The online presentation of this publication is a special feature of the Human Genome Project Information Web site.