Phil Green
Molecular Biotechnology Department, Univ. of Washington, Seattle
The human genome project is moving into its decisive final phase, in which the genome sequence will be determined in large-scale efforts carried out in a number of laboratories. Although current technology appears largely adequate to the task, it will be essential to reduce as much as possible the need for skilled human labor. Editing (correction of base calls and assembly errors) is at present one of the most skill-intensive aspects of genome sequencing, and as such is a bottleneck to increased throughput, a potential source of uneven sequence quality, and an obstacle to more widespread participation in genomic sequencing by the community. We are working towards the long-term goal of completely removing the need for human intervention at this stage, with the short-term goals of improving the accuracy of assembly and base-calling, and of more precisely delineating sequence regions requiring human review.
We have developed a program (phred) for making improved base calls and quality assessments from processed ABI 373A and 377 trace data, and an assembly program (phrap). Overall, phred base calls have approximately 40% fewer errors than ABI base calls. Phred's quality measures, which take into account peak spacing, the location of unresolved peaks, and the size of any uncalled peaks, allow identification of subsets of the read having error rates of specified levels. In typical data sets, about 25% of the usable read length consists of bases that can be identified as having an error rate less than 1 per 10 kb, and about 60% is identifiable as having an error rate less than 1 per kb. This has important implications for the depth of coverage required in shotgun sequencing, since it implies that low error rates may be attainable even when some regions are single-stranded.
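As a rough illustration of how per-base quality measures can be used to pick out subsets of a read at a specified error level, the following Python sketch computes the fraction of bases whose estimated error probability falls below a threshold. The function name `fraction_below` and the error probabilities are hypothetical, chosen only for illustration; they are not phred output, and the sketch does not reproduce how phred derives its quality measures.

```python
def fraction_below(error_probs, max_error_rate):
    """Fraction of bases whose estimated error probability is below the threshold."""
    if not error_probs:
        return 0.0
    return sum(p < max_error_rate for p in error_probs) / len(error_probs)


# Hypothetical per-base error estimates for a 10-base stretch of a read
# (made-up values, not real phred output).
read_error_probs = [5e-2, 8e-3, 9e-4, 5e-5, 2e-5, 6e-5, 4e-4, 7e-4, 3e-3, 2e-2]

# Bases identifiable as having an error rate below 1 per 10 kb.
very_high_quality = fraction_below(read_error_probs, 1e-4)

# Bases identifiable as having an error rate below 1 per kb.
high_quality = fraction_below(read_error_probs, 1e-3)

print(f"< 1 error per 10 kb: {very_high_quality:.0%} of bases")
print(f"< 1 error per kb:    {high_quality:.0%} of bases")
```

The same thresholding could in principle be applied across all reads covering a region to decide whether a stretch covered on only one strand is already accurate enough.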
Phrap uses quality information, both direct (from phred analysis) and indirect (from read comparison), to delineate the likely accurate base calls; this helps distinguish repeats, and permits use of the full (untrimmed) reads in assembly. In outline, the key assembly steps are as follows: (1) Reads are compared pairwise using a fast implementation of the Smith-Waterman algorithm. Alignment scores are then adjusted to reflect the qualities of discrepant bases, and the list of matches is ranked by these adjusted scores. At this stage anomalous reads (e.g. chimeras) are also identified. (2) A greedy assembly algorithm is used to construct a layout of read overlaps, based on the pairwise comparisons. (3) The contig sequence is constructed from the layout as a "mosaic" of the highest quality parts of the reads; this is done by finding an optimal path through an appropriately defined weighted directed graph. (4) The quality of the assembly is analyzed by enumerating discrepancies between reads and the contig sequence, "weak joins" that are potential sites of misassembly, and consistency of forward/reverse read pairs. (5) A probability of error (reflecting the amount and quality of trace data) is computed for each sequence position. This can be used to focus human editing on particular regions, and to automate decision-making about where additional data is needed.
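Step (1) above can be sketched in Python as follows. This is a minimal, unoptimized Smith-Waterman scoring routine followed by a ranking of all read pairs, not phrap's fast implementation. The scoring parameters (`match`, `mismatch`, `gap`), the `min_score` cutoff, the function names, and the toy reads are all assumptions made for illustration, and the quality-based adjustment of scores is omitted.

```python
from itertools import combinations


def smith_waterman_score(a, b, match=2, mismatch=-3, gap=-4):
    """Best local alignment score of a and b, using a linear gap penalty."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming matrix
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def ranked_matches(reads, min_score=8):
    """Compare all read pairs; return those scoring >= min_score, best first."""
    matches = []
    for (name_a, seq_a), (name_b, seq_b) in combinations(reads.items(), 2):
        score = smith_waterman_score(seq_a, seq_b)
        if score >= min_score:
            matches.append((score, name_a, name_b))
    return sorted(matches, reverse=True)


# Toy reads (hypothetical): r1 and r2 overlap, r3 is unrelated.
reads = {
    "r1": "ACGTACGTTGCA",
    "r2": "CGTTGCATTAGC",
    "r3": "GGGGGGGGGGGG",
}

for score, name_a, name_b in ranked_matches(reads):
    print(name_a, name_b, score)
```

A greedy layout step in the spirit of step (2) would then walk this ranked list from the top, accepting each overlap that is consistent with those already accepted.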
In collaboration with L. Rowen and with the St. Louis / Sanger consortium, we have begun systematic studies of the performance of these programs on representative cosmid datasets. For 9 mammalian and 9 C. elegans cosmids, the complete (final, but unedited) sets of ABI traces were analyzed using phred to obtain base calls and quality measures, and the reads were reassembled using phrap. In each case, all reads (apart from chimeras and singlet or doublet "contaminants") assembled into 1 or 2 contigs. There were no false joins. The per-base error rates (relative to the human-edited standard) for the 18 cosmids averaged 1 error per 4 kb, with fewer than 1 error per 20 kb in the phrap "high quality" bases (which constitute 95% of the total sequence).
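The figures above imply that most of the remaining errors are concentrated in the small fraction of bases not flagged as high quality. The short Python calculation below makes this explicit. It treats the reported averages as if they applied to a single pooled data set, which is a simplifying assumption, and because the high-quality rate is quoted only as an upper bound, the results are lower bounds. The function name `low_quality_summary` is hypothetical.

```python
def low_quality_summary(overall_rate, high_quality_rate, high_quality_fraction):
    """Return (error rate in the remaining bases, share of all errors found there)."""
    # Errors per base of total sequence contributed by the high-quality bases.
    high_quality_share = high_quality_fraction * high_quality_rate
    # Whatever is left must come from the remaining (lower-quality) bases.
    low_quality_share = overall_rate - high_quality_share
    low_quality_rate = low_quality_share / (1 - high_quality_fraction)
    return low_quality_rate, low_quality_share / overall_rate


# Values taken from the text: 1 error per 4 kb overall, at most 1 per 20 kb
# in the high-quality bases, which make up 95% of the sequence.
rate, share = low_quality_summary(1 / 4000, 1 / 20000, 0.95)

print(f"error rate in remaining 5% of bases: about 1 per {1 / rate:.0f} bases")
print(f"share of all errors in those bases:  {share:.0%}")
```

Under these assumptions, roughly four fifths of the errors fall in about 5% of the sequence, which is the kind of concentration that allows human review to be focused on particular regions.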
These results suggest that it should be possible to substantially reduce editing labor in the near future without significantly compromising sequence accuracy.