Genome Informatics Section
DOE Human Genome Program Contractor-Grantee Workshop
113. Multi-Resolution Molecular Sequence Classification
David J. States, Zhengyan Kan, and
Classification is the most reliable and widely used basis for inferring macromolecular function from primary sequence. Beginning with the pioneering work of Margaret Dayhoff, a number of sequence classification algorithms have been proposed, including sequence signatures (Prosite), profiles (BLOCKS), HMMs (Pfam), and transitive closure relationships (HHS and others). There are intrinsically conflicting constraints on domain classifications that make it difficult to achieve satisfactory performance in all applications all of the time. A class definition must be general enough to represent all of the members of the class, but this generality limits the information content of any single pattern and reduces the sensitivity with which members can be detected. Further, the stochastic nature of mutations may result in a domain being detected in some sequences but missed in other closely related sequences. In transitive closure methods, where domain structure is inferred from similarity relationships, variations in the extent of sequence covered by the alignments may further confuse matters and result in the failure to consistently recognize a domain. Instead of a single domain, the algorithm then defines several related domains with overlapping membership and sequence extents.
Here we present a novel approach to molecular sequence classification that addresses some of these problems. A multi-resolution approach is employed in which sequences are first classified into transitive closure groups (TCGs) on the basis of high-scoring global sequence alignments. These TCGs are then grouped into superfamilies based on inferred domain content and local sequence similarity relationships. All of the members of a TCG are assumed to have identical domain structure, which provides more redundancy in the data available for domain definition and avoids inconsistent domain annotation between closely related sequences. To date, 14,227 transitive closure groups with more than two members have been defined in a classification of non-redundant protein sequences derived from SwissProt, PIR, OWL, TREMBL, and GenBank. Work on HMM representations for TCGs and on the grouping of TCGs into superfamilies is ongoing. Relating the annotation and literature references accessible through primary sequence classification to the structure-based classification being developed at SDSC is proposed as a goal for the Molecular Sciences Thrust.
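The first stage described above, grouping sequences into TCGs, can be illustrated with a minimal sketch. It assumes the pairwise global alignment scores have already been computed and treats two sequences as linked when their score meets a cutoff; the abstract does not state the scoring scheme, the cutoff, or the implementation, so the function name `transitive_closure_groups`, the union-find data structure, the sequence identifiers, and the scores below are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def transitive_closure_groups(pairs, threshold):
    """Group sequence IDs into transitive closure groups.

    pairs: iterable of (id_a, id_b, score) for pairwise global alignments
           (hypothetical input format, not taken from the abstract).
    threshold: minimum score for two sequences to be linked (assumed cutoff).
    Returns a list of sorted ID lists, largest group first.
    """
    parent = {}

    def find(x):
        # Locate the representative of x's group, with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        # Merge the groups containing a and b.
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for id_a, id_b, score in pairs:
        # Register both sequences even if the pair is below the cutoff,
        # so that unlinked sequences appear as singleton groups.
        find(id_a)
        find(id_b)
        if score >= threshold:
            union(id_a, id_b)

    groups = defaultdict(set)
    for seq_id in list(parent):
        groups[find(seq_id)].add(seq_id)
    return sorted((sorted(g) for g in groups.values()),
                  key=lambda g: (-len(g), g))


# Illustrative data only: A-B and B-C are linked, so A, B, and C fall into
# one group even though A and C were never directly compared.
example_pairs = [
    ("A", "B", 950),
    ("B", "C", 900),
    ("C", "D", 120),
    ("E", "F", 880),
]
example_groups = transitive_closure_groups(example_pairs, threshold=800)
```

Because membership is propagated through chains of links, a single cutoff determines how coarse the groups are; requiring high-scoring global alignments, as the abstract does, keeps each group restricted to closely related sequences that can plausibly share one domain structure.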