Beyond the Identification of Transcribed Sequences:
Functional and Expression Analysis

11th Annual Workshop
November 9-12, 2001
Washington D.C.


Abstracts * Speakers * Organizers * Original Announcement

Annotating Proteomes using Highly Specific Protein Scoring Matrices

Serge Saxonov
Beckman Center, Room B 403
Stanford University School of Medicine
Stanford, CA 94305
telephone: 650-723-5976
fax: 650-725-6044
email: saxonov@stanford.edu
prestype: Platform
presenter: Serge Saxonov

Serge Saxonov, Qiaojuan Su and Douglas L. Brutlag
Department of Biochemistry, Stanford University, Stanford, California 94305-5307

To identify new functions in protein families and superfamilies, we have developed eBLOCKS, a database of ungapped alignments (blocks) of highly conserved protein regions. eBLOCKS automatically builds blocks directly from protein sequence databases such as SWISS-PROT. Each unclassified protein sequence is used as a PSI-BLAST query and compared against the entire sequence database. The resulting PSI-BLAST alignments are analyzed by a modified K-means clustering algorithm to generate protein groups with different levels of similarity, representing protein families and superfamilies. Each group of conserved sequences is aligned heuristically and trimmed into ungapped regions. The current eBLOCKS database contains 81,413 blocks. The completely automated eBLOCKS database has several advantages over the BLOCKS+ database, which is built from protein groups pre-defined in a number of protein family databases. The eBLOCKS database is more comprehensive: it covers the majority of the BLOCKS+ database, and yet 65% of its blocks are novel to eBLOCKS.
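As an illustration of what "trimmed into ungapped regions" can mean, here is a minimal Python sketch that cuts a gapped multiple alignment into column spans in which no sequence has a gap. It is not the eBLOCKS procedure, which the abstract describes only as heuristic; the function names, the `min_width` parameter, and the toy alignment are assumptions made for this example.

```python
def ungapped_spans(aligned, min_width=3):
    """Return (start, end) column spans in which no sequence has a gap.

    `aligned` is a list of equal-length strings using '-' for gaps.
    `min_width` is a hypothetical cutoff, not a value from the abstract.
    """
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")

    spans = []
    start = None  # first column of the gap-free run currently being extended
    for col in range(length):
        gap_free = all(seq[col] != "-" for seq in aligned)
        if gap_free and start is None:
            start = col
        elif not gap_free and start is not None:
            if col - start >= min_width:
                spans.append((start, col))
            start = None
    # Close a run that extends to the end of the alignment.
    if start is not None and length - start >= min_width:
        spans.append((start, length))
    return spans


def cut_blocks(aligned, min_width=3):
    """Slice every sequence at each gap-free span, giving one block per span."""
    return [[seq[a:b] for seq in aligned]
            for a, b in ungapped_spans(aligned, min_width)]


# Toy alignment, invented for illustration only.
alignment = [
    "MKV-LAGHE--WTRP",
    "MKVALAGHEQ-WTRP",
    "MRV-LSGHE--WSRP",
]
blocks = cut_blocks(alignment)
```

For the toy alignment, `blocks` holds three ungapped blocks of widths 3, 5, and 4. A real pipeline would also need to decide how conserved a region must be before it is kept, which this sketch does not attempt.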

Instead of representing a region by only one block as in BLOCKS+, the eBLOCKS database enumerates blocks representing different family levels for each conserved region, and thus maximizes sensitivity and specificity when it is used to search an unknown sequence. Unlike BLOCKS+, eBLOCKS does not require three conserved positions in each block, and can thus incorporate blocks with variability allowed at all positions.

Evaluation tests show that eBLOCKS has greater sensitivity than BLOCKS+. In particular, we have used eBLOCKS and BLOCKS+ as scoring matrices to search the human proteome for significant hits at a specificity of 10^-3. We were thus able to annotate 67% of the human proteins in RefSeq with eBLOCKS, versus 54% with BLOCKS+. This rate approaches what one can achieve with homology searches. Annotating proteins with blocks gives more accurate annotation information than homology-based searches, since the use of blocks focuses precisely on the functionally important regions. The eBLOCKS database is available on the World Wide Web at http://fold.stanford.edu/eblocks/
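Using a block as a scoring matrix can be sketched as follows: each column of the block is converted into log-odds scores, and the resulting matrix is slid along a query sequence. This is a generic position-specific scoring sketch in Python, not the scoring scheme used by eBLOCKS or BLOCKS+. The uniform background frequencies, the pseudocount of 1, and all function names are assumptions, and the score is not calibrated to the 10^-3 threshold mentioned above.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def block_to_pssm(block, pseudocount=1.0):
    """Turn an ungapped block into per-column log2-odds scores.

    Assumes a uniform background (1/20 per residue) and add-one style
    pseudocounts; both are simplifications chosen for this example.
    """
    background = 1.0 / len(AMINO_ACIDS)
    total = len(block) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(len(block[0])):
        counts = Counter(seq[col] for seq in block)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def best_hit(pssm, sequence):
    """Return (start, score) of the highest-scoring window, or None.

    Residues outside the 20 standard amino acids contribute a score of 0.
    """
    width = len(pssm)
    best = None
    for start in range(len(sequence) - width + 1):
        score = sum(pssm[i].get(sequence[start + i], 0.0) for i in range(width))
        if best is None or score > best[1]:
            best = (start, score)
    return best


# Toy block and query, invented for illustration only.
block = ["LAGHE", "LAGHE", "LSGHE"]
pssm = block_to_pssm(block)
hit = best_hit(pssm, "MKVALAGHEQWTRP")
```

Here `hit[0]` is the offset of the best-matching window in the query. Deciding whether a hit is significant requires a score distribution estimated from unrelated sequences, which is outside the scope of this sketch.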



