The electronic form of this document may be cited in the following style:
Human Genome Program, U.S. Department of Energy, DOE Human Genome Program Contractor-Grantee Workshop IV, 1994.
Abstracts scanned from text submitted for November 1994 DOE Human Genome Program Contractor-Grantee Workshop. Inaccuracies have not been corrected.
Recursive Relational Representation for DNA and Attribute-Value Lists: Techniques for Reducing Schema Modifications
Robert M. Pecherer, Joe M. Gatewood, and Robert D. Sutherland
Theoretical Biology and Biophysics Group; T-10, MS K710; Los Alamos National Laboratory; Los Alamos, New Mexico 87545. Genomics and Structural Biology Group; LS-2, MS 880, LANL.
Database representations that support genome physical mapping and sequencing necessarily change in response to new protocols, new kinds of data, and changing data requirements of the tools developed to analyze these data. Schemas for new entities and changes to existing schemas may force changes to browsers, interfaces, and existing tools if they are to continue to operate correctly. We propose a Recursive Relational representation for DNA Segments that reduces the impact of introducing new DNA Segment types, and an Attribute-Value List field that eliminates schema modification when new data items are needed in existing schemas.
Genome physical mapping and DNA sequencing are typically performed by recursive decomposition. In a typical physical mapping project, chromosomes are cut into clonable units, which are characterized by fingerprinting and/or hybridization probing. These smaller DNA segments (which may themselves be decomposed into even smaller DNA segments) are collectively analyzed and organized by their characteristics into regionalized and/or overlapping sets (contigs). DNA sequencing is similar: a segment of interest is sequenced in a series of experiments that generate nucleotide sequence data for consecutive "steps" obtained by "walking" along the segment with PCR. A consensus sequence is then obtained by assembling the sequences of the individual pieces.
For a database representation that models the physical decomposition and logical reassembly in both physical mapping and sequencing, the traditional approach (used, for example, in the original Los Alamos "Lab NoteBook" and currently in GDB) has been to design one database table for each type of DNA segment: chromosome, YAC, cosmid, contig, restriction fragment, sequencing experiment, etc. These individual tables are then connected by specialized "linking" tables, which associate (for example) the cosmid clones in one table with the containing contig in another. However, except for size ranges, each of these units is fundamentally the same: a segment of DNA. We propose a generalized relational schema for "DNA Segment", illustrate its use for several sequencing applications, and consider the database consequences for users and systems.
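The idea of a single, self-referencing "DNA Segment" table can be illustrated with a minimal sketch. The abstract does not give the authors' schema, so everything below is an assumption made for illustration: the table name `dna_segment`, its columns, the sample rows, and the simplification that each segment has exactly one parent. SQLite is used only because it ships with Python.

```python
import sqlite3


def build_demo():
    """Create an in-memory database with one table for every segment type.

    Table name, columns, and rows are hypothetical, not the authors' schema.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE dna_segment (
            id           INTEGER PRIMARY KEY,
            parent_id    INTEGER REFERENCES dna_segment(id),  -- NULL for a root
            segment_type TEXT NOT NULL,  -- a data value, not a separate table
            name         TEXT NOT NULL
        )
        """
    )
    rows = [
        (1, None, "chromosome", "chrom-1"),
        (2, 1, "contig", "contig-A"),
        (3, 2, "cosmid", "cos-1"),
        (4, 2, "cosmid", "cos-2"),
        (5, 3, "restriction fragment", "frag-1"),
    ]
    conn.executemany("INSERT INTO dna_segment VALUES (?, ?, ?, ?)", rows)
    return conn


def descendants(conn, root_id):
    """Return (id, segment_type, name, depth) for every segment below root_id."""
    sql = """
        WITH RECURSIVE sub(id, segment_type, name, depth) AS (
            SELECT id, segment_type, name, 0
            FROM dna_segment
            WHERE id = ?
            UNION ALL
            SELECT s.id, s.segment_type, s.name, sub.depth + 1
            FROM dna_segment AS s
            JOIN sub ON s.parent_id = sub.id
        )
        SELECT id, segment_type, name, depth
        FROM sub
        WHERE depth > 0
        ORDER BY depth, id
    """
    return conn.execute(sql, (root_id,)).fetchall()
```

In this sketch, introducing a new kind of segment is an ordinary insert, for example `conn.execute("INSERT INTO dna_segment VALUES (6, 1, 'YAC', 'yac-1')")`, and `descendants` keeps working without any schema or query change. A real mapping database would likely need more than a single `parent_id`, since overlapping clones can belong to more than one containing segment.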
A related database problem concerns the introduction of new data items (or the modification or deletion of existing data items) for existing database entities. Schema modification typically has a rippling effect on all software that accesses the affected tables. The affected software may include standardized reports, browsers (those not automatically generated from schemas), and tools developed for data analysis. This can happen even if the software has no "interest" in the new or modified data items. To address this problem and minimize the impact of these relatively minor schema changes, we borrow the technique of "Attribute-Value Lists" and implement them as a schema-independent, extensible table field. We illustrate several examples, describe the implementation, and discuss some implications of the approach.
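An attribute-value list stored in a single field might look like the following sketch. The abstract does not describe the authors' encoding or implementation, so the `clone` table, the `attrs` column, the JSON encoding, and the attribute name `insert_size_kb` are all hypothetical choices made here for illustration.

```python
import json
import sqlite3


def open_demo():
    """Create a table whose last column holds an attribute-value list.

    The table, column names, and JSON encoding are assumptions for this sketch.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE clone (
            id    INTEGER PRIMARY KEY,
            name  TEXT NOT NULL,
            attrs TEXT NOT NULL DEFAULT '{}'  -- serialized attribute-value list
        )
        """
    )
    conn.execute("INSERT INTO clone (id, name) VALUES (1, 'clone-1')")
    return conn


def get_attrs(conn, clone_id):
    """Decode the attribute-value list of one row into a dict."""
    row = conn.execute(
        "SELECT attrs FROM clone WHERE id = ?", (clone_id,)
    ).fetchone()
    return json.loads(row[0])


def set_attr(conn, clone_id, key, value):
    """Add or overwrite one attribute without altering the table schema."""
    attrs = get_attrs(conn, clone_id)
    attrs[key] = value
    conn.execute(
        "UPDATE clone SET attrs = ? WHERE id = ?",
        (json.dumps(attrs), clone_id),
    )


def clone_name(conn, clone_id):
    """Stand-in for an existing tool that has no interest in the new items."""
    row = conn.execute(
        "SELECT name FROM clone WHERE id = ?", (clone_id,)
    ).fetchone()
    return row[0]
```

Calling `set_attr(conn, 1, "insert_size_kb", 40)` records a new data item while the column list of `clone` stays the same, so `clone_name` is unaffected. The trade-off in a design like this is that the database itself can no longer type-check or index the individual attributes.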
Research funded by U.S. Department of Energy under Contract W-7405-ENG-36.