Introduction to the Workshop
URLs Provided by Attendees
- Ethical, Legal, and Social Issues
The electronic form of this document may be cited in the following style:
Human Genome Program, U.S. Department of Energy, DOE Human Genome Program Contractor-Grantee Workshop IV, 1994.
Abstracts scanned from text submitted for November 1994 DOE Human Genome Program Contractor-Grantee Workshop. Inaccuracies have not been corrected.
Towards Simplifying Fragment Assembly
Department of Computer Science, University of Arizona, Tucson, AZ 85721
The key computational problem in assembling fragments in a project involving some amount of shotgun sequencing is to determine which overlaps between fragments to use in forming contigs. Particularly when the underlying target sequence is repetitive, there are often several possible overlaps and different choices lead to distinct and often suboptimal reconstructions of the target. We have discovered a very simple preprocess which identifies and melds all fragments that must overlap in any optimal solution, leaving us with an assembly problem that is small and for which the key combinatorial choices are clear. For example, on sequence that is non-repetitive, the process usually melds all fragments providing the one, unique layout for the fragments. This confirms an observation by many that the simple "greedy" algorithm used in most fragment assembly systems is adequate for such problems. The real potential of our simplification arises in problems involving repetitive sequence. For example, given a sequence with 3 exact copies of a 2kb repeat, our process results in 5 contigs that involve 6 overlaps and for which the distinct arrangements are easy to enumerate and the optimal one is self-evident.
We will present our simplification strategy (easily understood by non-mathematicians) along with empirical results showing the reductions gained for various types of sequence. We will then discuss some interesting aspects of the reduced problems and how we can lever them to rigorously solve for the target sequence. What is particularly exciting here is that the combinatorial reduction raises our expectations of being able to solve problems involving *constraints* and *repetitive* sequence in a rigorous, accurate way. It also permits us to correctly formulate the assembly problem in terms of the most probable solution using the one-sided Kolmogorov-Smimov statistic.
If time permits, we will also report on our progress to assembly 50,000 fragments covering a 700kb stretch of E. Coli with no augmenting or supporting information (data supplied by Fred Blattner's lab). This problem involves 13mb of raw data and is of a scale and order of magnitude larger than any tried before. We are using a combination of supercomputers, state-of-the-art algorithms, and of course, our new simplification technique.