DOE Human Genome Program Contractor-Grantee Workshop IV

Santa Fe, New Mexico, November 13-17, 1994

The electronic form of this document may be cited in the following style:
Human Genome Program, U.S. Department of Energy, DOE Human Genome Program Contractor-Grantee Workshop IV, 1994.

Abstracts scanned from text submitted for November 1994 DOE Human Genome Program Contractor-Grantee Workshop. Inaccuracies have not been corrected.

Estimating Consensus DNA Sequences

Tom Blackwell
Howard Hughes Medical Institute and Department of Genetics,
Harvard Medical School, 200 Longwood Ave. Boston, MA 02115
617/432-0503, Fax:-7266, Internet: blackwel@twod.med.harvard.edu

Individual DNA sequence reads are commonly truncated after 350-500 base pairs to exclude the less accurate data at the tail of each read. A data analysis originally developed in a different context may help take advantage of this lower-quality DNA sequence information.

The analysis forms a consensus from any number of inaccurate observations of a single DNA sequence; it allows for insertions and deletions in each observed sequence, as well as substitutions. This analysis is more tolerant of insertions and deletions than algorithms currently used, because it averages over all possible multiple alignments of the observed sequence data, rather than fixing on one alignment. Thus the consensus does not depend on details of one choice of alignment. The averaging process relies on estimates of the base calling error rate at each site in the observed sequences. Others have recently described several ways to make such estimates. Consensus sequence formation is only one component in an integrated approach to sequence assembly.
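The idea of averaging over all alignments, rather than committing to one, can be illustrated with a short dynamic program. The sketch below is not the author's implementation. It assumes a simple, hypothetical error model with constant insertion, deletion and substitution rates (p_ins, p_del, p_sub), where the abstract calls for per-site estimates, and it assumes inserted bases are uniform over A, C, G and T. The function names are invented for illustration. A candidate consensus is scored by the probability of each observed sequence summed over every alignment path.

```python
import math


def summed_alignment_likelihood(candidate, observed, p_ins, p_del, p_sub):
    """P(observed | candidate), summed over every alignment path.

    Assumed generative model (illustrative only): before each step an
    extra random base is inserted with probability p_ins; otherwise the
    next candidate base is deleted (p_del), substituted (p_sub, uniform
    over the three other bases) or copied correctly.
    """
    n, m = len(candidate), len(observed)
    p_match = 1.0 - p_del - p_sub
    keep = 1.0 - p_ins  # probability of not inserting at this step
    # f[i][j]: total probability of consuming i candidate bases while
    # emitting the first j observed bases, over all paths.
    f = [[0.0] * (m + 1) for _ in range(n + 1)]
    f[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            total = 0.0
            if j > 0:  # observed[j-1] is an inserted base
                total += f[i][j - 1] * p_ins * 0.25
            if i > 0:  # candidate[i-1] was deleted
                total += f[i - 1][j] * keep * p_del
            if i > 0 and j > 0:  # match or substitution
                same = candidate[i - 1] == observed[j - 1]
                emit = p_match if same else p_sub / 3.0
                total += f[i - 1][j - 1] * keep * emit
            f[i][j] = total
    return f[n][m] * keep  # no further insertion after the last base


def log_score(candidate, observations, p_ins, p_del, p_sub):
    """Sum of log-likelihoods, treating observations as independent."""
    return sum(
        math.log(summed_alignment_likelihood(candidate, obs, p_ins, p_del, p_sub))
        for obs in observations
    )


observations = ["ACGT", "ACT", "ACGGT"]
best = max(["ACGT", "ACTT"],
           key=lambda c: log_score(c, observations, 0.02, 0.02, 0.02))
print(best)
```

Because every path contributes to the total, no single alignment has to be chosen. For realistic read lengths the products would underflow, so a practical version would work in log space or rescale each row.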

The analysis was developed in the context of proposed 'single molecule' DNA sequencing methods. In this context, one may suppose that error rates are constant throughout the observed sequences, and that errors occur independently from one base to another within each observed sequence, and independently from one observed sequence to the next. It is of interest to know how much observational error can be tolerated and still produce a consensus which is close to the underlying DNA sequence.
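The error assumptions in this paragraph (a constant rate, with errors independent across bases and across observed sequences) can be made concrete with a small simulator. This is a minimal sketch under stated assumptions, not the simulation code behind the reported results: the function name observe is hypothetical, an "error" is taken to be one event per true base, the three error types are drawn with equal probability, and an insertion is modeled as one random base placed before the true base.

```python
import random

BASES = "ACGT"


def observe(truth, error_rate, rng):
    """Return one noisy observation of `truth`.

    Each true base independently suffers an error with probability
    `error_rate`; the error is an insertion, a deletion or a
    substitution with equal chance (an assumption of this sketch).
    """
    out = []
    for base in truth:
        if rng.random() >= error_rate:
            out.append(base)  # base read correctly
            continue
        kind = rng.choice(("insertion", "deletion", "substitution"))
        if kind == "insertion":
            out.append(rng.choice(BASES))  # spurious extra base
            out.append(base)
        elif kind == "substitution":
            out.append(rng.choice([b for b in BASES if b != base]))
        # deletion: emit nothing for this base
    return "".join(out)


rng = random.Random(1994)  # fixed seed so runs are repeatable
truth = "".join(rng.choice(BASES) for _ in range(60))
# Independent observations of the same underlying sequence.
reads = [observe(truth, 0.06, rng) for _ in range(7)]
for read in reads:
    print(read)
```

Sharing one random generator across calls keeps the draws independent from one observed sequence to the next, which matches the independence assumption stated above.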

Simulations totaling 2.5 Mb show that the consensus formed with this approach reaches an accuracy of 10^-4 errors per nucleotide when it is constructed using either seven observed sequences with a 6% error rate in each (insertions, deletions and substitutions combined) or twenty observed sequences with a 20% error rate. It reaches an accuracy of 10^-3 errors per nucleotide when constructed with sixty observed sequences at a 50% error rate. These results are obtained with equal proportions of insertions, deletions and substitutions among the base calling errors, and an equal chance of error at every site.

This work was begun at the Center for Human Genome Studies, Los Alamos National Laboratory under U.S. DOE grant B 04861/F118 and continued under Office of Naval Research grant N00014-86K-0246 and National Science Foundation grant DMS-91-04990 to Herman Chernoff at Harvard University. Current support includes U.S. DOE grant FG02-87-ER60565 to George Church at Harvard Medical School.


Last modified: Wednesday, October 29, 2003
