The electronic form of this document may be cited in the following style:
Human Genome Program, U.S. Department of Energy, DOE Human Genome Program Contractor-Grantee Workshop IV, 1994.
Abstracts scanned from text submitted for November 1994 DOE Human Genome Program Contractor-Grantee Workshop. Inaccuracies have not been corrected.
Statistical Methods to Improve DNA Sequencing Accuracy
David O. Nelson [1,2] and Terence P. Speed [2]
[1] Human Genome Center, L-452, Biology and Biotechnology Research Program, Lawrence Livermore National Laboratory, Livermore, California 94550.
[2] Statistics Department, University of California, Berkeley, California 92041.
LLNL is investigating statistical approaches to the problem of determining the DNA sequence underlying data obtained from fluorescence-based gel electrophoresis. Several features of electrophoresis make it interesting to statisticians and probabilists:
- the data consists of samples from one or more continuous, non-stationary signals
- boundaries between segments generated by distinct elements of the underlying sequence are ill-defined or nonexistent in the signal
- the sampling rate of the signal greatly exceeds the transition rate of the underlying discrete sequence
In addition, the data generation process in fluorescence-based sequencing poses interesting statistical problems:
- the physical, chemical, and stochastic behavior of the process is complex and still not completely understood
- the yield of fragments of any given size can be quite small and variable
- the mobility of fragments of a given size can depend in predictable ways on the terminating base
Recently published approaches to base calling, such as Giddings et al. (1993) and Tibbetts et al. (1994), address some of these issues using elementary statistics and heuristic decision procedures. While such approaches do tend to outperform the native software, the level of improvement in four-dye-per-lane systems appears to diminish rapidly beyond 350-400 bases. Further improvements through software will have to come from a more sophisticated approach to recovering sequence from signal.
Our approach to signal recovery and base calling involves combining a stochastic model of the electrophoresis process, which describes the diffusion of DNA through a gel, with adaptive equalization techniques from digital communications theory to recover the underlying sequence. We will present the initial results of our investigation of the extent to which this approach enables us to increase base calling accuracy by providing a rational, statistical foundation to the process of deducing sequence from signal.
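The general idea of adaptive equalization mentioned above can be illustrated with a minimal sketch. It is not the authors' model: it assumes a binary (+1/-1) symbol stream in place of the four bases, a hypothetical three-tap blur `[1.0, 0.6, 0.5]` in place of a diffusion model of the gel, Gaussian noise, and a plain least-mean-squares (LMS) update trained on a known prefix. All function names and parameter values are illustrative assumptions.

```python
import random


def simulate_channel(symbols, channel, noise_sd, rng):
    """Smear each symbol over later samples (FIR blur) and add Gaussian noise."""
    received = []
    for n in range(len(symbols)):
        blurred = sum(h * symbols[n - k] for k, h in enumerate(channel) if n - k >= 0)
        received.append(blurred + rng.gauss(0.0, noise_sd))
    return received


def lms_equalize(received, training, n_taps=15, delay=3, step=0.01):
    """Run an FIR equalizer; adapt its weights by LMS while training symbols last."""
    weights = [0.0] * n_taps
    outputs = []
    for n in range(len(received)):
        # Newest-first window of received samples, zero-padded at the start.
        window = [received[n - i] if n - i >= 0 else 0.0 for i in range(n_taps)]
        y = sum(w * x for w, x in zip(weights, window))
        outputs.append(y)
        target = n - delay  # the output at time n estimates symbol n - delay
        if 0 <= target < len(training):
            error = training[target] - y
            weights = [w + step * error * x for w, x in zip(weights, window)]
    return outputs


def sign(x):
    return 1 if x >= 0 else -1


def run_demo(seed=0, n_train=3000, n_test=2000, delay=3):
    """Return (raw, equalized) symbol error rates on the held-out symbols."""
    rng = random.Random(seed)
    symbols = [rng.choice((-1, 1)) for _ in range(n_train + n_test)]
    # Hypothetical blur and noise level, chosen only for illustration.
    received = simulate_channel(symbols, [1.0, 0.6, 0.5], 0.1, rng)
    outputs = lms_equalize(received, symbols[:n_train], delay=delay)
    test_idx = range(n_train, len(symbols) - delay)
    raw_errors = sum(sign(received[t]) != symbols[t] for t in test_idx)
    eq_errors = sum(sign(outputs[t + delay]) != symbols[t] for t in test_idx)
    return raw_errors / len(test_idx), eq_errors / len(test_idx)


if __name__ == "__main__":
    raw_rate, eq_rate = run_demo()
    print("error rate without equalizer:", raw_rate)
    print("error rate with LMS equalizer:", eq_rate)
```

In this toy setting, thresholding the blurred signal directly misreads symbols whenever the two preceding symbols both oppose the current one, while the adapted filter largely undoes the blur. A base caller would face four signal channels, non-stationary peak shapes, and no long training prefix, none of which this sketch addresses.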
Research by D. O. Nelson was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract no. W-7405-ENG-48, with additional support from NSF grant DMS-91-13527. Research by T. P. Speed was partially supported by NSF grant DMS-91-13527.
Giddings, M.-C., R. L. Brumley, M. Haker, and L. M. Smith (1993). An adaptive, object-oriented strategy for base calling in DNA sequence analysis. Nucleic Acids Research, 21(19), 4530-4540.
Tibbetts, C., J. M. Bowling, and J. B. Golden III (1994). Neural networks for automated base calling of gel-based DNA sequencing ladders. In J. C. Venter (Ed.), Automated DNA Sequencing and Analysis Techniques, Chapter 31, 219-229. Academic Press.