Genome Informatics Section
DOE Human Genome Program Contractor-Grantee Workshop
118. Analysis of Ribosomal RNA Sequences by Combinatorial Clustering
Poe Xing, Casimir Kulikowski, Ilya Muchnik, Inna Dubchak, Sylvia Spengler, Manfred Zorn, and Denise Wolf
In the present study, multi-aligned sequences of eukaryotic and prokaryotic small subunit rRNA were analyzed with a novel clustering procedure in an attempt to extract subsets of sequences that share common features. The procedure introduces two new models, data segmentation and core separation, and consists of the following four steps: a) segmentation of the alignment and identification of likely conserved segments according to a specific criterion (e.g., gap frequency); b) clustering of sequences based on each of these segments; c) intersection of the clustering results from all the conserved segments; d) comparison of the results of steps a)-c) with a phylogenetic tree.
Segmentation is the result of global optimization of a new objective function that finds the most homologous partition of a given set of aligned sequences into consecutive segments. It was developed as a simple and very efficient dynamic programming procedure. Segmentation was performed on the multi-alignment of 409 eukaryotic rRNA sequences and, independently, on the multi-alignment of 6205 prokaryotic rRNA sequences. In both cases we tested different levels of segmentation granularity by changing the total number of segments. The positions and lengths of the conserved segments in the multi-alignment were relatively stable. The segment-specific score function discriminated segments composed mostly of gaps from those less frequently interrupted by gaps. Among eukaryotes we found seven conserved segments with less than 20% gaps, and among prokaryotes nine conserved segments with less than 40% gaps.
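The abstract does not give the objective function, so the following is only a minimal sketch of how a dynamic programming segmentation of this kind might look. It assumes, purely for illustration, that each alignment column is summarized by its gap frequency and that the cost of a segment is the sum of squared deviations from the segment's mean gap frequency. The names `segment_columns` and `conserved_segments` are hypothetical, and the input values are toy data, not results from the study.

```python
from typing import List, Tuple


def segment_columns(gap_freq: List[float], k: int) -> List[Tuple[int, int]]:
    """Split columns 0..n-1 into k consecutive segments (half-open intervals).

    Assumed cost (not from the abstract): within-segment sum of squared
    deviations of the per-column gap frequency from the segment mean.
    """
    n = len(gap_freq)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of columns")

    # Prefix sums let us evaluate the cost of any segment in O(1).
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, g in enumerate(gap_freq):
        s1[i + 1] = s1[i] + g
        s2[i + 1] = s2[i] + g * g

    def cost(i: int, j: int) -> float:
        """Sum of squared deviations from the mean over columns i..j-1."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # best[m][j]: minimal cost of splitting the first j columns into m segments.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                c = best[m - 1][i] + cost(i, j)
                if c < best[m][j]:
                    best[m][j] = c
                    cut[m][j] = i

    # Trace the optimal cut points back from the last column.
    bounds = []
    j = n
    for m in range(k, 0, -1):
        i = cut[m][j]
        bounds.append((i, j))
        j = i
    return bounds[::-1]


def conserved_segments(gap_freq: List[float],
                       bounds: List[Tuple[int, int]],
                       max_gap: float) -> List[Tuple[int, int]]:
    """Keep segments whose mean gap frequency is below max_gap."""
    return [(i, j) for i, j in bounds
            if sum(gap_freq[i:j]) / (j - i) < max_gap]


# Toy gap frequencies for eight alignment columns (hypothetical data).
toy = [0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.8, 0.8]
toy_bounds = segment_columns(toy, 3)
print(toy_bounds)                                # [(0, 3), (3, 6), (6, 8)]
print(conserved_segments(toy, toy_bounds, 0.2))  # [(3, 6)]
```

Rerunning `segment_columns` with different values of `k` corresponds to testing different levels of granularity; the 20% threshold passed to `conserved_segments` mirrors the gap limit quoted for eukaryotes.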
Using the novel clustering procedure, we examined those segments of the multi-alignment that are minimally interrupted by gaps. Every segment was analyzed individually by the clustering procedure, which extracted an optimal (exact and unique) subset of 'correlated elements' among all aligned sequences. From each segment we obtained one core cluster and one complementary tail cluster. In the core cluster, all sequences were close to each other and also similar to the consensus sequence of the corresponding segment. For this reason, we call the core cluster a 'homogeneous group' and the tail cluster a 'heterogeneous group'. The sizes of the homogeneous groups derived from the seven segments in eukaryotes were 284, 344, 361, 343, 366, 335, and 317 sequences, respectively. This result shows that rRNA sequences are indeed highly conserved in eukaryotic organisms: of the 409 analyzed sequences, a majority belongs to the homogeneous group of every segment. In prokaryotes, the homogeneous groups derived from the nine segments contained 3838, 3343, 2378, 2447, 4312, 2641, 1491, 837, and 3179 sequences, respectively. Although the relative fraction of sequences in the homogeneous groups is lower than in eukaryotes, it is still significant and reached 69% for one of the segments.
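The relative fractions mentioned above can be checked directly from the reported group sizes. The short sketch below uses only the counts given in the text; the helper name `percent_of_total` is hypothetical.

```python
from typing import List

EUK_TOTAL = 409
PROK_TOTAL = 6205
euk_sizes = [284, 344, 361, 343, 366, 335, 317]
prok_sizes = [3838, 3343, 2378, 2447, 4312, 2641, 1491, 837, 3179]


def percent_of_total(sizes: List[int], total: int) -> List[float]:
    """Homogeneous-group size as a percentage of all analyzed sequences."""
    return [100.0 * s / total for s in sizes]


euk_pct = percent_of_total(euk_sizes, EUK_TOTAL)
prok_pct = percent_of_total(prok_sizes, PROK_TOTAL)

# Every one of the seven groups holds more than half of the 409 sequences.
print(all(p > 50.0 for p in euk_pct))

# Largest group in the second data set: segment number and truncated percentage.
best = max(range(len(prok_pct)), key=lambda i: prok_pct[i])
print(best + 1, int(prok_pct[best]))  # 5 69
```

The largest fraction in the second data set comes from the fifth segment (4312 of 6205 sequences), which matches the 69% quoted in the text.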
Clusters resulting from different conserved segments are fairly consistent. We intersected the clustering results from all segments by labeling each sequence with an occurrence label, i.e. the pattern of conserved segments in whose homogeneous group the sequence appears. Although 2^7, or 128, occurrence patterns are possible among the seven conserved segments of eukaryotes, only 33 patterns were observed, which indicates a significant deviation from a random sequence classification. Furthermore, only 4 of the 33 patterns could be considered significant, because they were shared by a large enough number of sequences. To integrate clustering information from all conserved segments, we ranked each sequence according to its occurrence label and aggregated the sequences by rank. We found that 249 of the 409 rRNA sequences fell into the group with the highest rank, 7, which means they are homologous as determined by clustering on all seven conserved segments. In prokaryotes the distribution of patterns is also non-random, although the clusters resulting from the nine conserved segments are not very consistent. Of 2^9, or 512, possible occurrence patterns, 320 were observed; among those, only 11 were represented by more than 100 sequences and 249 by fewer than 20. Fifty-nine sequences fell into the group with the highest rank, 9, which means they were homologous as determined by clustering on all nine conserved segments. There were 415, 705, and 940 sequences in the clusters of rank 8, 7, and 6, respectively, which also suggests substantial homology among the sequences. There are 470 sequences in the cluster of rank 0, meaning that these sequences share little similarity in any of the nine conserved segments.
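The intersection step can be illustrated with a small sketch. It assumes that an occurrence label is a binary string with one position per conserved segment ('1' if the sequence is in that segment's core cluster) and that the rank is the number of '1' positions, which is consistent with the pattern counts and ranks quoted above. The function names and the toy sequences are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set


def occurrence_labels(sequence_ids: Iterable[str],
                      core_clusters: List[Set[str]]) -> Dict[str, str]:
    """Label each sequence with one bit per segment: '1' if it is in the core."""
    return {
        sid: "".join("1" if sid in core else "0" for core in core_clusters)
        for sid in sequence_ids
    }


def group_by_rank(labels: Dict[str, str]) -> Dict[int, List[str]]:
    """Aggregate sequences by rank, i.e. the number of cores they belong to."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for sid, label in labels.items():
        groups[label.count("1")].append(sid)
    return dict(groups)


# Toy example: five sequences, core clusters from three segments (made up).
sequence_ids = ["a", "b", "c", "d", "e"]
core_clusters = [{"a", "b", "c"}, {"a", "b"}, {"a", "c", "d"}]

labels = occurrence_labels(sequence_ids, core_clusters)
pattern_counts = Counter(labels.values())
possible = 2 ** len(core_clusters)

print(f"{len(pattern_counts)} of {possible} possible patterns observed")
# prints "5 of 8 possible patterns observed"
print(group_by_rank(labels))
```

With seven or nine segments the same counting gives the 2^7 = 128 and 2^9 = 512 possible patterns referred to in the text.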
The prevalence of homologous sequences in all segments indicates that using only conserved sequence segments greatly reduces the effect of random information from non-conserved or nonessential sequence fragments on the evaluation of relationships between sequences. Comparison of the phylogenetic classification of the rRNA sequences with our clustering results showed that each phylum usually corresponds to one or two major clusters that are adjacently ranked in our analysis. The advantages of the presented algorithm are: (1) it avoids the interference of the frequent gaps in the multi-aligned sequences and bases the clustering only on uninterrupted sequence segments that potentially correspond to essential functional units of rRNA molecules; (2) by identifying these conserved segments, it will allow us in the future to develop new procedures for clustering unaligned sequences; (3) it provides the means to apply a polynomial clustering procedure of O(n^2) complexity by using the special properties of the objective function defined on the conserved segments.
Our clustering is based on an objective criterion defined by specific statistical properties of the sequences and uses no prior knowledge of the biological relevance of the sequences being analyzed. The consistency of our clustering results with an independently derived phylogenetic organization of the associated organisms therefore suggests that it is feasible to apply such an objective and stable clustering method to discover phylogenetic correlations among a large number of biological sequences. The method can serve as a framework for organizing these sequences in an efficient and easily searchable manner.