Archive Edition
Sponsored by the U.S. Department of Energy Human Genome Program
Santa Fe, New Mexico, November 13-17, 1994
The electronic form of this document may be cited in the following style: Abstracts scanned from text submitted for November 1994 DOE Human Genome Program Contractor-Grantee Workshop. Inaccuracies have not been corrected.
Estimating Consensus DNA Sequences
Tom Blackwell

Individual DNA sequence reads are commonly truncated after 350-500 base pairs to exclude less accurate data at the tail of each read. A data analysis originally developed in a different context may help take advantage of this lower-quality DNA sequence information. The analysis forms a consensus from any number of inaccurate observations of a single DNA sequence; it allows for insertions and deletions in each observed sequence, as well as substitutions. It is more tolerant of insertions and deletions than algorithms currently used because it averages over all possible multiple alignments of the observed sequence data rather than fixing on one alignment, so the consensus does not depend on the details of any single alignment. The averaging process relies on estimates of the base-calling error rate at each site in the observed sequences; others have recently described several ways to make such estimates. Consensus sequence formation is only one component in an integrated approach to sequence assembly.

The analysis was developed in the context of proposed 'single molecule' DNA sequencing methods. In this context, one may suppose that error rates are constant throughout the observed sequences, and that errors occur independently from one base to another within each observed sequence and independently from one observed sequence to the next. It is of interest to know how much observational error can be tolerated while still producing a consensus that is close to the underlying DNA sequence. Simulations totaling 2.5 Mb show that the consensus formed with this approach reaches an accuracy of 10^-4 errors per nucleotide when it is constructed either from seven observed sequences with a 6% error rate in each (insertions, deletions and substitutions combined) or from twenty observed sequences with a 20% error rate. It reaches an accuracy of 10^-3 errors per nucleotide when constructed from sixty observed sequences at a 50% error rate. These results are obtained with equal proportions of insertions, deletions and substitutions among the base-calling errors, and an equal chance of error at every site.

This work was begun at the Center for Human Genome Studies, Los Alamos National Laboratory, under U.S. DOE grant B 04861/F118 and continued under Office of Naval Research grant N00014-86K-0246 and National Science Foundation grant DMS-91-04990 to Herman Chernoff at Harvard University. Current support includes U.S. DOE grant FG02-87-ER60565 to George Church at Harvard Medical School.
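
The abstract does not spell out the algorithm, so the following Python sketch is only an illustration of the two ideas it names: an error model with a constant rate split equally among insertions, deletions and substitutions, and a score that sums over every alignment instead of committing to one. The function names `corrupt` and `likelihood`, the uniform choice of inserted and substituted bases, and the stopping rule at the end of the sequence are all assumptions made for this example, not details taken from the original work. Plain probabilities are used for readability; long reads would need log-space or scaled arithmetic to avoid underflow.

```python
import math
import random

BASES = "ACGT"


def corrupt(true_seq, error_rate, rng):
    """Return one noisy observation of true_seq.

    Assumed model (illustrative only): at every step an insertion, a
    deletion or a substitution each occurs with probability
    error_rate / 3, independently of every other step.
    """
    p = error_rate / 3.0
    out = []
    i = 0
    while True:
        u = rng.random()
        if u < p:                     # insertion: emit a random base, stay put
            out.append(rng.choice(BASES))
        elif i == len(true_seq):      # nothing left to read: stop
            break
        else:
            base = true_seq[i]
            i += 1
            if u < 2 * p:             # deletion: emit nothing for this base
                continue
            if u < 3 * p:             # substitution: emit a different base
                base = rng.choice([b for b in BASES if b != base])
            out.append(base)
    return "".join(out)


def likelihood(observed, candidate, error_rate):
    """P(observed | candidate) under the model in corrupt().

    The forward recursion adds up the probability of every alignment
    of the two sequences rather than keeping only the best one.
    """
    p = error_rate / 3.0
    n, m = len(candidate), len(observed)
    # f[i][j]: probability of having read i candidate bases
    # while emitting the first j observed bases.
    f = [[0.0] * (m + 1) for _ in range(n + 1)]
    f[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            total = 0.0
            if j > 0:                 # last event was an insertion
                total += f[i][j - 1] * p / 4.0
            if i > 0:                 # last event was a deletion
                total += f[i - 1][j] * p
            if i > 0 and j > 0:       # last event was a copy or substitution
                same = candidate[i - 1] == observed[j - 1]
                total += f[i - 1][j - 1] * ((1.0 - 3.0 * p) if same else p / 3.0)
            f[i][j] = total
    return f[n][m] * (1.0 - p)        # final factor: stop instead of inserting


if __name__ == "__main__":
    rng = random.Random(0)
    truth = "ACGTTGCAAC"
    reads = [corrupt(truth, 0.2, rng) for _ in range(20)]
    # Score two candidate consensus sequences against all twenty reads.
    for cand in (truth, "ACGTAGCAAC"):
        score = sum(math.log(likelihood(r, cand, 0.2)) for r in reads)
        print(cand, round(score, 2))
```

Because each read contributes a sum over all of its alignments, a candidate consensus is scored without any single alignment being chosen, which is the property the abstract emphasizes. Searching over candidates, or computing per-site posterior probabilities, is left out of this sketch.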
Last modified: Wednesday, October 29, 2003
Base URL: www.ornl.gov/hgmis
Site sponsored by the U.S. Department of Energy Office of Science, Office of Biological and Environmental Research, Human Genome Program