The electronic form of this document may be cited in the following style:
Human Genome Program, U.S. Department of Energy, DOE Human Genome Program Contractor-Grantee Workshop IV, 1994.
Abstracts scanned from text submitted for November 1994 DOE Human Genome Program Contractor-Grantee Workshop. Inaccuracies have not been corrected.
Software Support for Large Scale Sequencing
Joe Gatewood, Robert M. Pecherer, and Elaine Best
 Genomics and Structural Biology Group; LS-2, MS 880; Los Alamos National Laboratory; Los Alamos, New Mexico 87545.  Theoretical Biology and Biophysics Group; T-10, MS K710, LANL.  Applications Programming Group; CIC-12, MS B295; LANL.
Current and projected DNA sequencing rates effectively rule out direct human interaction with experimentally derived raw sequence data (inspection, elimination of cloning artifacts, and editing in general). The investigator-as-data-processor bottleneck is further compounded during sequence analysis, where DNA homology comparisons against public sequence databases return redundant or extraneous information: relevant homology is diluted by the irrelevant.
Our goals in developing software to support large scale sequencing are:
- High speed sequence data entry where routine computer processing is the rule and human intervention the exception;
- Sequence analysis where homology comparison and reporting are interactively customizable to enduser needs;
- Automated, customizable primer selection for STS generation and sequencing;
- Sequence order relationships and base confidence information are captured during sequencing and exploited in consensus sequence assembly.
The first three goals have been addressed. For primer-directed sequencing, we have developed database representations and are exploring assembly algorithms.
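To make the fourth goal concrete, the following is a minimal sketch of how per-base confidence might be used when calling a consensus. It is not the authors' assembly algorithm, which the abstract does not describe. It assumes the reads have already been aligned and padded with gaps to a common length, and that each base carries a non-negative confidence on a hypothetical 0-1 scale. The names `consensus` and `GAP` are illustrative only.

```python
from collections import defaultdict

GAP = "-"  # hypothetical padding symbol for positions a read does not cover


def consensus(aligned_reads):
    """Call a confidence-weighted consensus from pre-aligned reads.

    aligned_reads: list of (bases, confidences) pairs. `bases` is a string
    padded with GAP to a common length; `confidences` is a list of
    non-negative numbers of the same length (assumed scale, not from the source).
    Returns (consensus_string, per_column_support).
    """
    if not aligned_reads:
        return "", []
    length = len(aligned_reads[0][0])
    for bases, conf in aligned_reads:
        if len(bases) != length or len(conf) != length:
            raise ValueError("reads must be padded to a common length")

    called, support = [], []
    for col in range(length):
        # Sum the confidence behind each candidate base in this column.
        votes = defaultdict(float)
        for bases, conf in aligned_reads:
            if bases[col] != GAP:
                votes[bases[col]] += conf[col]
        if not votes:
            called.append("N")  # no read covers this column
            support.append(0.0)
            continue
        total = sum(votes.values())
        # Iterate in sorted order so ties resolve deterministically.
        best = max(sorted(votes), key=votes.get)
        called.append(best)
        support.append(votes[best] / total if total else 0.0)
    return "".join(called), support
```

In this sketch a single high-confidence base call can outvote two low-confidence calls, which is the point of carrying confidence values into assembly rather than taking a simple majority.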
Our system architecture includes four components: enduser interface, database, context management, and analytical tools. The enduser interface is implemented using Gain Momentum (Sybase) and programmed in GEL, a proprietary scripting language specialized for interactive I/O and task management. Database functionality is provided by the Sybase relational DBMS using recursive DNA representations. (See "Recursive Relational Representation for DNA and Attribute-Value Lists: Techniques for Reducing Schema Modifications", Pecherer et al., DOE Human Genome Contractors Workshop IV, Santa Fe, NM, November 13-17, 1994.) Context management and analytical tools are written in C for performance and flexibility. Context management provides data management for the objects and collections obtained from the database and/or operated upon by analysis software. The analytical tools include homology comparison, feature selection, and primer selection algorithms.
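The actual schema is defined in the companion abstract cited above, not here. As one plausible reading of "recursive relational representation" and "attribute-value lists", the sketch below uses a self-referencing table in which each DNA object records its parent and its start position within that parent, plus a generic attribute table so that new properties need no schema change. SQLite stands in for Sybase purely for illustration. All table names, column names, and sample values are hypothetical.

```python
import sqlite3

# Hypothetical schema: every DNA object may sit inside a parent DNA object.
SCHEMA = """
CREATE TABLE dna (
    id              INTEGER PRIMARY KEY,
    name            TEXT NOT NULL,
    parent_id       INTEGER REFERENCES dna(id),  -- NULL for a top-level object
    start_in_parent INTEGER NOT NULL DEFAULT 0,  -- 0-based start within parent
    n_bases         INTEGER NOT NULL
);
CREATE TABLE dna_attr (                          -- attribute-value list
    dna_id INTEGER NOT NULL REFERENCES dna(id),
    attr   TEXT NOT NULL,
    value  TEXT NOT NULL
);
"""

# Walk from an object up to its top-level ancestor, accumulating offsets.
ROOT_QUERY = """
WITH RECURSIVE chain(id, parent_id, pos) AS (
    SELECT id, parent_id, start_in_parent FROM dna WHERE id = ?
  UNION ALL
    SELECT d.id, d.parent_id, chain.pos + d.start_in_parent
    FROM dna AS d JOIN chain ON d.id = chain.parent_id
)
SELECT id, pos FROM chain WHERE parent_id IS NULL
"""


def demo_db():
    """Build an in-memory database with made-up sample rows."""
    con = sqlite3.connect(":memory:")
    con.executescript(SCHEMA)
    con.executemany(
        "INSERT INTO dna VALUES (?, ?, ?, ?, ?)",
        [
            (1, "cosmid", None, 0, 40000),
            (2, "subclone", 1, 12000, 2000),
            (3, "read", 2, 150, 400),
        ],
    )
    # A new kind of property is a new row, not a new column.
    con.execute("INSERT INTO dna_attr VALUES (3, 'strand', 'forward')")
    return con


def root_position(con, dna_id):
    """Return (root_id, start of dna_id in root coordinates), or None."""
    return con.execute(ROOT_QUERY, (dna_id,)).fetchone()
```

With the sample rows, `root_position(demo_db(), 3)` resolves the read through its subclone to a coordinate on the cosmid. The same query works at any nesting depth, which is the property that lets one table cover clones, subclones, and reads.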
All analytical tools are designed as integral system components to avoid file parsing. We have implemented BLAST with a global alignment capability (BLASTga) to avoid segmented DNA homologies, and we have incorporated post-screening of homology results to eliminate redundant or extraneous information resulting from repetitive elements.
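The abstract does not say how the post-screening step works. The sketch below shows one simple way such a filter could be built, under stated assumptions: hits are ranked by score, a hit is dropped if most of its query span lies in a known repeat interval, and a hit is also dropped if most of its span is already covered by a higher-scoring hit that was kept. The `Hit` fields, the `screen_hits` function, and the 0.8 threshold are all hypothetical.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    subject: str   # database sequence that matched
    q_start: int   # query start, 0-based inclusive
    q_end: int     # query end, exclusive
    score: float


def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def screen_hits(hits, repeats=(), max_overlap=0.8):
    """Greedy post-screen of homology hits.

    repeats: non-overlapping (start, end) query intervals annotated as
    repetitive elements (assumed input, not produced by this function).
    max_overlap: fraction of a hit's span above which it counts as
    repeat-derived or redundant (assumed threshold).
    """
    kept = []
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        span = hit.q_end - hit.q_start
        if span <= 0:
            continue  # ignore malformed intervals
        # Drop hits that fall mostly inside annotated repeats.
        in_repeat = sum(overlap(hit.q_start, hit.q_end, s, e) for s, e in repeats)
        if in_repeat / span > max_overlap:
            continue
        # Drop hits mostly covered by a better hit that was already kept.
        if any(
            overlap(hit.q_start, hit.q_end, k.q_start, k.q_end) / span > max_overlap
            for k in kept
        ):
            continue
        kept.append(hit)
    return kept
```

Ranking by score first means the strongest hit for each query region survives, so the report shrinks without losing the best evidence for any region.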
Research funded by U.S. Department of Energy under Contract W-7405-ENG-36.