Samuel Pitluck, Arun Aggarwal, Frank Eeckman, Eugene Veklerov
Human Genome Informatics Group, Lawrence Berkeley National Laboratory, Berkeley CA 94720
The Lawrence Berkeley National Laboratory Human Genome Center pioneered the directed sequencing strategy. In this strategy, all sequence initiation sites are mapped using transposons, and the resulting map information can be used to facilitate assembly. We chose the Fragment Assembly Kernel (FAK; Larson, Jain, and Meyers, 1994) for our assembly procedures because it can accept up-front mapping information as constraints. FAK has been incorporated into two C-language programs. The first, SPASS, assembles 300-450 bp fragments into contigs 3000-4500 bp in length. The second finds a tiling path for our double-ended sequencing strategy.
Our assembly program makes use of the information available about fragments that come from transposon pairs. Each transposon site yields two reverse-complemented sequences that overlap by 5 bp. The information for each transposon site is summarized in a constraint file, which is read into the assembly program together with all sequence fragments. We used SPASS to assemble the 3 kb subclones and compared its performance with that of XBAP (Staden, 1992). We report on the results of this comparison. In general, SPASS produced fewer contigs than XBAP.
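The 5 bp overlap between the two reads from one transposon site can be illustrated with a small sketch. This is not the SPASS code or its constraint-file format. The function names, and the assumed geometry (one read runs rightward from the insertion site on the forward strand, the other runs leftward on the reverse strand, and both cover the same 5 bp), are assumptions made for illustration only.

```python
# Hypothetical sketch: names and read geometry are assumptions, not SPASS internals.

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def is_consistent_pair(read_a: str, read_b: str, overlap: int = 5) -> bool:
    """Check the pair constraint under the assumed geometry.

    The first `overlap` bases of read_a must equal the reverse complement
    of the first `overlap` bases of read_b.
    """
    if len(read_a) < overlap or len(read_b) < overlap:
        return False
    return read_a[:overlap] == reverse_complement(read_b[:overlap])


def simulate_pair(template: str, site: int, read_len: int, overlap: int = 5):
    """Toy model of the two reads produced at one transposon site."""
    # Forward read: starts at the site and runs rightward.
    read_a = template[site:site + read_len]
    # Reverse read: covers the same `overlap` bases, then runs leftward.
    start = max(0, site + overlap - read_len)
    read_b = reverse_complement(template[start:site + overlap])
    return read_a, read_b
```

A check of this kind could be used to validate a pair before it is written out as a constraint; how SPASS actually encodes the constraint is not described here.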
We also built an interface to FAK using SPACE, a variant of ACeDB (Durbin and Mieg, 1991). ACeDB has been used mostly as a database program; in SPACE we have added trace editing, assembly, and fragment and contig display. Because FAK is a library of functions, it is easy to customize both the input to and the output from the assembler. The assembler package communicates with SPACE via constraint and fragment files written in .ace format, a process that is transparent to the user. Users can thus select fragments, call our assembler to assemble them, and readily display the results within SPACE.
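As a rough illustration of the file exchange, the sketch below writes records in the general .ace layout: a header line giving the class and object name, one tag per line, and a blank line between objects. The class name `Fragment` and the tags `Length` and `Clone` are hypothetical; the actual SPACE schema is not given here.

```python
# Hypothetical sketch: the class and tag names are invented for illustration.

def ace_record(class_name, object_name, tags):
    """Format one object as a .ace paragraph (header line plus tag lines)."""
    lines = [f'{class_name} : "{object_name}"']
    for tag, value in tags:
        if isinstance(value, str):
            lines.append(f'{tag} "{value}"')  # text values are quoted
        else:
            lines.append(f"{tag} {value}")    # numbers are written bare
    return "\n".join(lines) + "\n"


def ace_file(records):
    """Join records with a blank line between consecutive objects."""
    return "\n".join(ace_record(*record) for record in records)


if __name__ == "__main__":
    print(ace_file([
        ("Fragment", "frag001", [("Length", 412), ("Clone", "sub3kb_17")]),
        ("Fragment", "frag002", [("Length", 388), ("Clone", "sub3kb_17")]),
    ]))
```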
We have also developed an algorithm to find the tiling paths used in our double-ended sequencing strategy. Here, we use the compare function in FAK to find "hits" among the end fragments of all 3 kb subclones in a particular 80 kb P1 clone. We use 192 random clones and generate 384 end fragments from them. The compare function returns a score for each comparison; the score "roughly reflects the length of the overlap with a deduction for mismatches in the alignment." We accept two fragments as overlapping if the score is greater than 15. After all the hits are determined, a separate program extracts all possible tiling paths. We present examples of these comparisons and tiling paths.
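The hit filtering and path extraction can be sketched as follows. This is a minimal illustration, not the program described above. It assumes the comparison scores are already available as tuples, that each end fragment is identified by a hypothetical `(clone, end)` pair, and that a tiling path is a chain of distinct clones in which consecutive clones share at least one hit. For scale, 384 end fragments give 384 × 383 / 2 = 73,536 unordered pairs to compare.

```python
# Hypothetical sketch: data layout and path definition are assumptions.
from collections import defaultdict

SCORE_THRESHOLD = 15  # from the text: a hit requires a score greater than 15


def find_hits(scores, threshold=SCORE_THRESHOLD):
    """Keep fragment pairs whose comparison score exceeds the threshold."""
    return [(a, b) for a, b, score in scores if score > threshold]


def clone_graph(hits):
    """Link two clones whenever any of their end fragments hit each other."""
    graph = defaultdict(set)
    for (clone_a, _end_a), (clone_b, _end_b) in hits:
        if clone_a != clone_b:
            graph[clone_a].add(clone_b)
            graph[clone_b].add(clone_a)
    return graph


def tiling_paths(graph, start, goal):
    """Enumerate every simple path of clones from `start` to `goal` (DFS)."""
    paths = []

    def extend(path):
        last = path[-1]
        if last == goal:
            paths.append(list(path))
            return
        for nxt in sorted(graph.get(last, ())):
            if nxt not in path:  # never revisit a clone
                path.append(nxt)
                extend(path)
                path.pop()

    extend([start])
    return paths


if __name__ == "__main__":
    scores = [
        (("c1", "R"), ("c2", "L"), 40),
        (("c2", "R"), ("c3", "L"), 22),
        (("c1", "R"), ("c3", "L"), 15),  # rejected: not greater than 15
        (("c3", "R"), ("c4", "L"), 31),
        (("c2", "R"), ("c4", "L"), 18),
    ]
    graph = clone_graph(find_hits(scores))
    for path in tiling_paths(graph, "c1", "c4"):
        print(" -> ".join(path))
```

Enumerating all simple paths grows quickly with the number of hits, so a practical tool would likely prune or rank paths; that choice is outside what is described here.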
*This work was supported by the U.S. Department of Energy under Contract Number DE-AC03-76SF00098