TIGR Assembler: A New Tool for Assembling Large Shotgun Sequencing Projects

Granger G. Sutton, Owen White, Mark D. Adams and Anthony R. Kerlavage

The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850

In large shotgun sequencing projects, DNA fragments are assembled into a consensus sequence. The basic approach is to compare each pair of fragments to find overlaps and use this information to build a consensus sequence. Two obstacles are the large number of pairwise comparisons and the presence of repetitive elements. TIGR Assembler (1) uses a fast initial comparison of fragments (similar to BLAST) to eliminate the need for a more sensitive comparison between most fragment pairs, greatly reducing the computer search time. TIGR Assembler recognizes potential repetitive elements by determining which fragments have more potential overlaps than expected given a random distribution of fragments. Repetitive elements are dealt with in a number of ways: repetitive regions are assembled last so that maximum information from non-repetitive regions can be used, the stringency of match criteria is increased in repetitive regions, and constraints involving fragments sequenced from both ends of a clone are used. Short repetitive elements, less than half the length of the average fragment, are usually not a problem because they are most often spanned by a single fragment. Likewise, repetitive elements whose copies are significantly less similar to each other than the fragment sequencing accuracy (e.g. 94% similar vs. 98% accurate) can be handled by increasing the match stringency. For long, nearly identical repetitive elements, sequencing from both ends of clones of known average length and reasonably small variance is essential. This allows fragments that are totally contained in a repetitive element to be properly placed by TIGR Assembler based on the position of their corresponding clone mate. This technique will not work for repetitive regions longer than the average clone length. For very long, nearly identical repetitive regions, a second library of much longer clones sequenced from both ends is necessary for TIGR Assembler to determine which flanking regions should be joined.
TIGR Assembler can fill the very long repetitive regions with a consensus sequence, or the exact sequence can be determined by walking the repeat-containing clone. The basic steps in the TIGR Assembler algorithm are as follows:

1) Perform pairwise fragment comparisons for the entire data set to generate a list of potential fragment overlaps.
2) Use the distribution of the number of potential overlaps for each fragment to label fragments as repeat or non-repeat.
3) Start with a non-repeat fragment as the initial assembly seed, or a repeat fragment if no non-repeat fragment is left; quit if no fragments remain.
4) Use the potential overlap list to attempt merges between the current assembly and non-repeat fragments.
5) When no potential overlaps with non-repeat fragments remain for the current assembly, increase the stringency of the match criteria and enforce clone length constraints when attempting to merge with repeat fragments.
6) If, due to a merge with a repeat fragment, a non-repeat fragment is added to the potential overlap list, go to step 4.
7) When there are no fragments left on the current potential overlap list, output information about the current assembly and go to step 3.

TIGR Assembler has been used to assemble the complete genomes of H. influenzae and M. genitalium.
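
Steps 1 and 2 of the algorithm can be illustrated with a minimal Python sketch. It is not the TIGR Assembler implementation: the function names, the shared k-mer prefilter standing in for the fast BLAST-like comparison, and the Poisson-style threshold for "more overlaps than expected" are all assumptions made for illustration, as are the parameters `k`, `min_shared`, `min_overlap` and `z`.

```python
import math
from collections import defaultdict
from itertools import combinations


def candidate_overlaps(fragments, k=8, min_shared=2):
    """Step 1 (sketch): list fragment pairs sharing >= min_shared distinct k-mers.

    `fragments` maps a fragment id to its sequence. A shared k-mer filter is
    an assumed stand-in for the fast initial comparison; only the pairs it
    returns would go on to a more sensitive alignment.
    """
    index = defaultdict(set)  # k-mer -> ids of fragments containing it
    for frag_id, seq in fragments.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(frag_id)

    shared = defaultdict(int)  # (id_a, id_b) -> number of shared distinct k-mers
    for ids in index.values():
        for a, b in combinations(sorted(ids), 2):
            shared[(a, b)] += 1

    return {pair for pair, n in shared.items() if n >= min_shared}


def label_repeats(frag_ids, overlaps, frag_len, genome_len, min_overlap=40, z=3.0):
    """Step 2 (sketch): flag fragments with more overlaps than random placement predicts.

    Assumption: fragments of length frag_len start uniformly at random on a
    genome of length genome_len, so another fragment overlaps a given one by
    at least min_overlap bases with probability 2 * (frag_len - min_overlap)
    / genome_len. The cutoff is the expected count plus z Poisson standard
    deviations.
    """
    counts = {frag_id: 0 for frag_id in frag_ids}
    for a, b in overlaps:
        counts[a] += 1
        counts[b] += 1

    n = len(counts)
    expected = (n - 1) * 2 * (frag_len - min_overlap) / genome_len
    threshold = expected + z * math.sqrt(expected)
    return {frag_id: count > threshold for frag_id, count in counts.items()}
```

In this sketch, fragments labeled `False` would be the preferred assembly seeds in step 3, and those labeled `True` would be merged only later under stricter match criteria and clone length constraints.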

1. Sutton G., White O., Adams M. and Kerlavage A. (1995), TIGR Assembler: A New Tool for Assembling Large Shotgun Sequencing Projects, Genome Science & Technology, 1(1), 9-19.


Abstracts scanned from text submitted for January 1996 DOE Human Genome Program Contractor-Grantee Workshop.
