Genome Informatics Section
DOE Human Genome Program Contractor-Grantee Workshop
110. Improved Specificity and Sensitivity in Sequence Similarity Search Through the Use of Suboptimal Alignment Based Score Filtering
Lisa Gu and David States
The specificity of molecular sequence similarity searches is often limited by repetitive elements in biological sequences. Both repeat filtering and biased content filtering methods have been proposed to alleviate this problem; however, these methods can mask off large portions of some query sequences, limiting the utility of subsequent searches. We have examined the use of suboptimal alignment to automatically identify robust regions of sequence similarity, and we use this indirectly to filter out repetitive regions whose alignment is not definite. In this algorithm, alignment confidence is assessed by comparing the score of the optimal alignment in which a pair of residues is aligned with the highest score for an alignment in which the two residues are not paired. Varying degrees of stringency can be applied by raising the threshold for accepting an aligned pair. A "confidently aligned residues" (CAR) score is obtained by performing an optimal Smith-Waterman alignment and subtracting the pairwise scores of those residue pairs in that alignment that cannot be confidently aligned.
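The scoring idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the match/mismatch scores, the linear gap penalty, the default threshold, and the function names `smith_waterman` and `car_score` are all assumptions made for illustration. The best alignment that avoids a given residue pair is found here by brute force, rerunning the dynamic program with that pair forbidden, which is simple but slow.

```python
NEG_INF = float("-inf")

def smith_waterman(a, b, match=2, mismatch=-1, gap=-2, forbidden=None):
    """Local alignment with a linear gap penalty (illustrative scoring).

    `forbidden` is an optional 0-based (i, j) residue pair that may not be
    aligned. Returns (best_score, [(i, j, pair_score), ...]).
    """
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Disallow the diagonal move that would pair the forbidden residues.
            diag = NEG_INF if forbidden == (i - 1, j - 1) else H[i - 1][j - 1] + s
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell to recover the aligned residue pairs.
    pairs = []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if forbidden != (i - 1, j - 1) and H[i][j] == H[i - 1][j - 1] + s:
            pairs.append((i - 1, j - 1, s))
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best, pairs

def car_score(a, b, threshold=1, **scoring):
    """Return (optimal_score, car_score) for sequences a and b.

    A pair counts as confidently aligned when the optimal score exceeds the
    best score of any alignment avoiding that pair by at least `threshold`
    (an assumed acceptance rule). Scores of all other pairs are subtracted.
    """
    optimal, pairs = smith_waterman(a, b, **scoring)
    car = optimal
    for i, j, s in pairs:
        alternative, _ = smith_waterman(a, b, forbidden=(i, j), **scoring)
        if optimal - alternative < threshold:
            car -= s  # pair is not confidently aligned
    return optimal, car

print(car_score("ACGTTGCA", "ACGTTGCA"))  # non-repetitive sequences
print(car_score("ATATAT", "ATATATAT"))    # repeat that aligns equally well when shifted
```

In this toy setting, a tandem repeat that aligns equally well in a shifted register loses its whole score, because every aligned pair has an equally good alternative, while a non-repetitive match keeps its score. Raising `threshold` makes the filter more stringent.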
Protein families rich in repetitive sequence were examined, and members within the same family were aligned with each other. The resulting CAR scores were compared to those obtained using the XNU filter as a masking technique and WU-BLASTP (2.0) as the search algorithm. For the collagen family, whose members have extensive and highly repetitive regions, CAR-based scoring is uniformly more sensitive in the detection of family members than XNU + BLASTP. Alignments are missed by XNU + BLASTP as a result of excessive masking by XNU, but large numbers of false positive alignments are seen if BLASTP is run without XNU. On the other hand, XNU + BLASTP is, in some cases, able to detect regions of similarity in the myosin heavy chain family, which has some members with a minimal amount of repetitive region. For non-collagen, non-myosin repetitive sequence proteins, CAR scores detected a significant number of similarities missed by XNU + BLASTP, and in no case was a similarity detected by XNU + BLASTP missed with CAR scoring. Our results can be explained by the fact that the suboptimal alignment algorithm imposes a more stringent constraint on the alignment between two sequences than BLASTP does. Moreover, since the members have minimal repetitive regions, masking by XNU does not cause a tremendous loss of information. CAR scores appear to be a useful tool for enhancing the performance of sequence similarity search in the face of repetitive sequence regions.