James N. Labrenz and Tim Hunkapiller
University of Washington, Department of Molecular Biotechnology, Box 357730, Seattle, WA 98195.
As our technical ability to generate DNA sequence data increases at an ever-expanding rate, debate persists over the best strategies to pursue in order to provide both optimal biology and economy within genome-scale projects. Although many technical issues remain to be resolved, it is first critical to define the "end product" that minimally satisfies the community's objective. In other words, to determine the best strategy, we first need to define the model of completeness and accuracy we require of the final or representational sequence for a given portion of the genome. The debate can be intense here as well, for there are many opinions as to just how important the accuracy of the final sequence is in defining its biological 'reality', or at least its usefulness.
Unfortunately, because of the significant subjective component of data analysis in traditional sequencing methods, these examinations are difficult to pursue: so little is understood of the true nature of error in the raw data, let alone its relationship to the accuracy of the final sequence. We are pursuing a rigorous examination of primary DNA sequence data from a large-scale genomic sequencing project, generated with automated sequencing instruments under various protocol regimens. Our objective is to establish the nature of variation between the raw and consensus data in order to (1) establish better rules for translating the raw instrument data into called bases; (2) provide a quantitative basis for developing 'confidence' values for raw base calls; (3) compensate for the impact of error on feature identification (gene finding, database comparisons, etc.); and (4) provide a suite of software and database tools that will allow for the consistent and automated error analysis of large amounts of DNA sequence data. These tools will allow researchers to better evaluate variation in laboratory procedures (which polymerase is best, whether long gels are worth it, etc.), to evaluate the efficacy of different base-calling methods, and to assess the efficiency of various algorithms critical to successful sequence assembly (end clipping, repeat-sequence discrimination, overlap analysis, etc.). In addition, they will provide quality-control evaluation for the day-to-day sequencing effort.
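The notion of 'confidence' values in objective (2) can be made concrete by relating observed raw-versus-consensus discrepancy rates to a log-scaled quality value. The following minimal Python sketch is illustrative only and is not part of the reported software; the function name, the sliding-window scheme, and the Phred-style scale Q = -10 log10(p) are assumptions introduced for this example.

```python
from math import log10

def confidence_values(raw_calls, consensus, window=50):
    """Illustrative sketch: derive per-position confidence values by
    comparing raw base calls against an assembled consensus sequence.

    raw_calls and consensus are equal-length strings of A/C/G/T/N.
    Returns a list of Phred-style quality values, Q = -10*log10(p),
    where p is the observed mismatch rate within a sliding window.
    """
    n = len(raw_calls)
    mismatch = [0 if raw_calls[i] == consensus[i] else 1 for i in range(n)]
    quals = []
    for i in range(n):
        lo = max(0, i - window // 2)
        hi = min(n, i + window // 2 + 1)
        p = sum(mismatch[lo:hi]) / (hi - lo)
        p = max(p, 1.0 / (hi - lo))  # avoid log(0) when no discrepancies are observed
        quals.append(round(-10 * log10(p)))
    return quals

# Example: a read region with one discrepancy relative to the consensus
print(confidence_values("ACGTACGTAC", "ACGTACGAAC"))
```

In practice such values would be calibrated against large volumes of instrument traces rather than a single read, but the same raw-versus-consensus comparison underlies both uses.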
It is expected that these efforts will provide insight into the biology of sequencing (i.e., how polymerases interact with their substrates), help in the development of better tools for data characterization, and provide a quantitative approach to protocol optimization. It is assumed that a better understanding of the raw DNA sequence data will allow the extraction of more useful information from the same amount of laboratory effort, thereby influencing the choice of sequencing strategies and the economies of large-scale projects. We report here on our software development efforts as well as representative analyses of data from a megabase sequencing project.
* Supported by a grant from the Director, Office of Energy Research, Office of Health and Environmental Research of the U.S. Department of Energy under contract DE
 DOE Human Genome Distinguished Postdoctoral Fellow.