The electronic form of this document may be cited in the following style:
Human Genome Program, U.S. Department of Energy, DOE Human Genome Program Contractor-Grantee Workshop IV, 1994.
Abstracts scanned from text submitted for November 1994 DOE Human Genome Program Contractor-Grantee Workshop. Inaccuracies have not been corrected.
Estimating Consensus DNA Sequences
Howard Hughes Medical Institute and Department of Genetics,
Harvard Medical School, 200 Longwood Ave., Boston, MA 02115
617/432-0503, Fax:-7266, Internet: email@example.com
Individual DNA sequence reads are commonly truncated after 350-500 base pairs to exclude less accurate data at the tail of each read. A data analysis originally developed in a different context may help make use of this lower-quality sequence information.
The analysis forms a consensus from any number of inaccurate observations of a single DNA sequence; it allows for insertions and deletions in each observed sequence, as well as substitutions. This analysis is more tolerant of insertions and deletions than algorithms currently used, because it averages over all possible multiple alignments of the observed sequence data, rather than fixing on one alignment. Thus the consensus does not depend on details of one choice of alignment. The averaging process relies on estimates of the base calling error rate at each site in the observed sequences. Others have recently described several ways to make such estimates. Consensus sequence formation is only one component in an integrated approach to sequence assembly.
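The difference between averaging over all alignments and fixing on one can be illustrated with a small dynamic program. The sketch below is not the author's analysis; it is a generic forward-style recursion under assumed constant error rates (the abstract's method uses per-site error estimates). The function name alignment_weight and the parameters p_ins, p_del and p_sub are hypothetical, and the returned value is a likelihood only up to normalization.

```python
def alignment_weight(candidate, observed, p_ins, p_del, p_sub, best_only=False):
    """Weight of `observed` given `candidate`, combined over alignments.

    Assumed toy model: each candidate base is deleted with probability
    p_del, substituted with probability p_sub (uniform over the three other
    bases), or copied correctly otherwise; a spurious base is inserted with
    probability p_ins (uniform over four bases).
    best_only=False sums over every alignment; best_only=True keeps only
    the single best alignment, for comparison.
    """
    p_ok = 1.0 - p_ins - p_del - p_sub
    m, n = len(candidate), len(observed)
    combine = max if best_only else sum
    # F[i][j]: combined weight of producing observed[:j] from candidate[:i]
    F = [[0.0] * (n + 1) for _ in range(m + 1)]
    F[0][0] = 1.0
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            terms = []
            if i > 0 and j > 0:  # correct copy or substitution
                same = candidate[i - 1] == observed[j - 1]
                terms.append(F[i - 1][j - 1] * (p_ok if same else p_sub / 3.0))
            if i > 0:            # candidate base deleted
                terms.append(F[i - 1][j] * p_del)
            if j > 0:            # spurious base inserted
                terms.append(F[i][j - 1] * p_ins / 4.0)
            F[i][j] = combine(terms)
    return F[m][n]


def joint_weight(candidate, observations, p_ins, p_del, p_sub):
    """Product of per-observation weights (observations assumed independent)."""
    w = 1.0
    for obs in observations:
        w *= alignment_weight(candidate, obs, p_ins, p_del, p_sub)
    return w


# Example: rank two hypothetical candidates against three noisy observations.
observations = ["ACGAC", "ACGTAC", "ACGTTAC"]
for cand in ("ACGTAC", "ACGAC"):
    print(cand, joint_weight(cand, observations, 0.02, 0.02, 0.02))
```

Because every term is non-negative, the summed weight is never smaller than the best-single-alignment weight; the gap between the two grows when many alignments are nearly equally plausible, which is the situation in which a choice of one alignment matters most.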
The analysis was developed in the context of proposed 'single molecule' DNA sequencing methods. In this context, one may suppose that error rates are constant throughout the observed sequences, and that errors occur independently from one base to another within each observed sequence, and independently from one observed sequence to the next. It is of interest to know how much observational error can be tolerated and still produce a consensus which is close to the underlying DNA sequence.
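The error model described above (a constant error rate, with errors independent across bases and across observed sequences) can be sketched as a small simulator. This is an illustrative assumption about how such observations might be generated, not the author's simulation code; the function name observe and the choice to place an inserted base before the true base are hypothetical.

```python
import random

BASES = "ACGT"


def observe(truth, error_rate, rng):
    """Return one noisy observation of `truth`.

    Assumed model: at each true base an error occurs independently with
    probability `error_rate`, and is an insertion, deletion or substitution
    with equal probability.
    """
    out = []
    for base in truth:
        if rng.random() >= error_rate:
            out.append(base)                 # base read correctly
            continue
        kind = rng.choice(("ins", "del", "sub"))
        if kind == "ins":
            out.append(rng.choice(BASES))    # spurious base, then the true one
            out.append(base)
        elif kind == "sub":
            out.append(rng.choice([b for b in BASES if b != base]))
        # kind == "del": the true base is simply dropped
    return "".join(out)


# Example: seven independent observations at a 6% combined error rate.
rng = random.Random(0)
truth = "".join(rng.choice(BASES) for _ in range(60))
reads = [observe(truth, 0.06, rng) for _ in range(7)]
for r in reads:
    print(r)
```

Each call draws its errors independently, so repeated calls with the same generator give observations whose errors are independent from one observed sequence to the next, matching the stated assumption.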
2.5 Mb of simulations show that the consensus formed with this approach reaches an accuracy of 10^-4 errors per nucleotide when it is constructed using either seven observed sequences with a 6% error rate in each (insertions, deletions and substitutions combined) or twenty observed sequences with a 20% error rate. It reaches an accuracy of 10^-3 errors per nucleotide when constructed with sixty observed sequences at a 50% error rate. These results are obtained with equal proportions of insertions, deletions and substitutions among the base calling errors, and an equal chance of error at every site.
This work was begun at the Center for Human Genome Studies, Los Alamos National Laboratory under U.S. DOE grant B 04861/F118 and continued under Office of Naval Research grant N00014-86K-0246 and National Science Foundation grant DMS-91-04990 to Herman Chernoff at Harvard University. Current support includes U.S. DOE grant FG02-87-ER60565 to George Church at Harvard Medical School.