The electronic form of this document may be cited in the following style:
Human Genome Program, U.S. Department of Energy, DOE Human Genome Program Contractor-Grantee Workshop IV, 1994.
Abstracts scanned from text submitted for November 1994 DOE Human Genome Program Contractor-Grantee Workshop. Inaccuracies have not been corrected.
Software Support for High-Throughput DNA Sequencing
Charles Lawrence[1,2], Victor Solovyev, and Eugene Myers
Departments of Cell Biology and Human and Molecular Genetics, Baylor College of Medicine, Houston, TX 77030. Department of Computer Science, University of Arizona, Tucson, AZ. Corresponding author.
One of the barriers to achieving high-throughput DNA sequencing is the processing and management of the sequence data from the time of its generation by the sequencing hardware to its emergence as finished, edited and annotated DNA sequence. The bottleneck is not due to any major unsolved technical challenge, but to the lack of effective, integrated software support for the process.
Over the past 3 years, we have developed an integrated software system whose main goal is to automate the management of sequence data for high-throughput projects, and to provide effective interactive tools for use when human interaction with the data is necessary. The development of the system has been facilitated by using an object-oriented approach for the design of the system, and using tools that support object-oriented system development for its implementation. The Genome Reconstruction Manager (GRM) provides several advances in software support for high-throughput DNA sequencing: support for random, directed, and mixed sequencing strategies; a novel subsystem for fragment assembly (developed by E. Myers, Univ. Arizona); a commercial object database management system for data storage; a client/server architecture for using network computational servers; and an underlying data model that can evolve to support fully automatic sequence reconstruction.
In a related research project, we have studied the association of error in sequence data with the quality of the underlying primary data. DNA sequence predicted from polyacrylamide gel-based technologies is inaccurate because of variations in the quality of the primary data due to limitations of the technology, and sequence-specific variations due to nucleotide interactions within the DNA molecule and with the gel. The ability to recognize the probability of error in the primary data is useful in reconstructing the target sequence for a DNA sequencing project, and in estimating the accuracy of the final sequence.
We used linear discriminant analysis to assign position-specific probabilities of incorrect prediction, over-prediction, and under-prediction of nucleotides to each predicted nucleotide position in the primary sequence data generated by a gel-based DNA sequencing technology. Using this method, correct base predictions can be separated from incorrect predictions, over-predictions, and under-predictions with Mahalanobis distances of 7.4, 11.8 and 6.7, respectively. By applying the discriminant to a test data set, one can easily calculate a table associating discriminant scores with the probability of a prediction error. This table can then be used to assign a probability of error, for each of the three error types, to each base position in new sequence data, with values ranging from <0.0001 to 0.68.
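The two steps described above (fitting a linear discriminant, then tabulating error probability against discriminant score) can be illustrated with a minimal sketch. This is not the authors' implementation: the two-feature input, the function names, and the score-bin edges are all hypothetical, and the abstract does not say which features of the primary data were used. The sketch fits a two-class Fisher linear discriminant, reports the Mahalanobis distance between the class means, and builds an empirical score-to-error-rate lookup table.

```python
from bisect import bisect_right

def mean_vec(rows):
    """Mean of a list of 2-feature observations."""
    n = len(rows)
    return [sum(r[i] for r in rows) / n for i in range(2)]

def pooled_cov(a, b):
    """Pooled 2x2 within-class covariance of two samples."""
    ma, mb = mean_vec(a), mean_vec(b)
    s = [[0.0, 0.0], [0.0, 0.0]]
    for rows, m in ((a, ma), (b, mb)):
        for r in rows:
            d = [r[0] - m[0], r[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    dof = len(a) + len(b) - 2
    return [[s[i][j] / dof for j in range(2)] for i in range(2)]

def fisher_discriminant(correct, wrong):
    """Return (weights, Mahalanobis distance) separating two classes.

    Assumes two hypothetical features per base call and a
    non-singular pooled covariance matrix.
    """
    mc, mw = mean_vec(correct), mean_vec(wrong)
    s = pooled_cov(correct, wrong)
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    diff = [mw[0] - mc[0], mw[1] - mc[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    d_squared = w[0] * diff[0] + w[1] * diff[1]
    return w, d_squared ** 0.5

def score(w, x):
    """Discriminant score of one observation."""
    return w[0] * x[0] + w[1] * x[1]

def error_table(scores, is_error, edges):
    """Empirical error rate per score bin (bins split at sorted edges)."""
    counts = [[0, 0] for _ in range(len(edges) + 1)]
    for s, e in zip(scores, is_error):
        b = bisect_right(edges, s)
        counts[b][0] += 1
        counts[b][1] += int(e)
    return [err / n if n else None for n, err in counts]

def lookup(table, edges, s):
    """Error probability assigned to a new base call with score s."""
    return table[bisect_right(edges, s)]
```

In this sketch, one discriminant and one table would be built for each of the three error types, and `lookup` would then be applied to every base position in new sequence data. The abstract does not state how its table was binned, so the bin edges here are purely illustrative.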
Measurements of accuracy associated with the primary sequence data can be used as the basis for eliminating or minimizing the human editing of assembled sequence data and for automatically generating consensus sequence with confidence estimates at each position in the final sequence. In future work, we will integrate the results of the error analysis research into GRM with the goal of automating the generation of consensus sequence.
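One way per-base error probabilities could feed a consensus call with a confidence estimate is sketched below. The abstract describes this only as future work, so everything here is an assumption for illustration: reads are taken to be already aligned into columns, errors are treated as independent substitutions spread evenly over the three other bases, the prior over bases is uniform, and over- and under-prediction errors are ignored. The function names are hypothetical.

```python
from math import prod

BASES = "ACGT"

def column_posterior(calls):
    """Posterior probability of each base for one alignment column.

    calls: list of (called_base, p_error) pairs, one per read,
    with 0 < p_error < 1 (assumed so the total is never zero).
    """
    likelihood = {
        b: prod((1 - p) if c == b else p / 3 for c, p in calls)
        for b in BASES
    }
    total = sum(likelihood.values())
    return {b: likelihood[b] / total for b in BASES}

def consensus(columns):
    """Return a list of (consensus_base, confidence), one per column."""
    result = []
    for calls in columns:
        post = column_posterior(calls)
        best = max(BASES, key=post.get)
        result.append((best, post[best]))
    return result

# Two columns: the reads agree in the first and disagree in the second,
# where the low-error call outweighs the high-error one.
columns = [
    [("A", 0.01), ("A", 0.02)],
    [("C", 0.40), ("G", 0.01)],
]
calls = consensus(columns)
```

Positions whose confidence falls below a chosen threshold could be flagged for human review, which would restrict manual editing to the ambiguous parts of the assembly.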
Lawrence, C.B., Honda, S., Parrott, N.W., Flood, T.C., Gu, L., Zhang, L., Jain, M., Larson, S., and Myers, E.W. 1994. The Genome Reconstruction Manager: A software environment for supporting high-throughput DNA sequencing. Genomics (in press).
Lawrence, C.B. and Solovyev, V.V. 1994. Assigning position-specific error probabilities to primary DNA sequence data. Nucl. Acids Res. 22: 1272-1280.