DOE Human Genome Program Contractor-Grantee
64. An Informatics Framework for Transcriptome Annotation
Brian Brunk1, Jonathan Crabtree1, Mark Gibson1, Chris Overton1, Debra Pinney1, Jonathan Schug1, Chris Stoeckert1, Jian Wang1, Ihor Lemischka2, Kateri Moore2, and Robert Phillips2
1Center for Bioinformatics, University of Pennsylvania, Philadelphia, PA 19104-6021 and 2Princeton University, Princeton, NJ
It is now feasible to define the transcriptional state of a eukaryotic cell with reasonable precision by combining multiple gene expression technologies, e.g., EST analysis with microarrays. However, few of the 10,000 to 20,000 different transcripts expressed in a cell are well characterized in terms of function and cell role. In a collaborative effort, we have begun to identify and characterize the transcripts produced in the mouse hematopoietic stem cell. The Princeton group has enriched for the stem cell from fetal mouse liver by sorting for cells positive for the markers AA4.1, Sca-1 and c-Kit and low in Lin. A normalized cDNA library, subtracted against a stromal cell cDNA library, was generated from these cells. A similar strategy was adopted in the construction of a stromal cell library. ESTs were generated from both libraries and analyzed through an automated computational annotation pipeline followed by expert manual annotation. Currently, approximately 4000 stem cell and 3000 stromal cell ESTs have been carefully annotated, leading to a well-defined "molecular phenotype" of each cell type and opening the way for follow-up analyses of novel genes of interest.
Based on this prototype annotation process, we have developed an integrated informatics framework for the systematic annotation of cell-specific transcriptomes. The system combines data management and visualization facilities with automated and manual data analysis components accessible through a Java servlet-based architecture. Using the K2 technology for accessing distributed databases, it integrates computationally annotated mouse and human genomes (GAIA system), computationally annotated mouse and human transcriptomes built from dbEST ESTs and known mRNAs (DOTS), and protein sequences in SwissProt. The K2 facility also provides access to a number of other remote databases and analysis services.
Computational annotation steps include: clustering and assembly of ESTs/mRNAs to form consensus transcribed sequences (TSs); gene finding by similarity to TSs; detection of similarity between TSs across species and to protein sequences; and assignment of cell roles/functions to TSs using computational and manual analyses. Manual annotation steps include: assessment of consensus sequence quality to identify artifacts; refinement of cell role/function assignments; and characterization of alternative splicing. Results of the characterization of the stem and stromal cell molecular phenotypes will be presented.
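As an illustration only, the first computational step listed above (grouping ESTs/mRNAs before assembly into consensus sequences) can be sketched in Python. This is not the pipeline used in this work: the abstract does not describe the clustering algorithm, so the sketch assumes a simple single-linkage scheme in which two sequences join the same cluster when they share a minimum number of k-mers. The function names (kmers, cluster_ests), the parameters (k, min_shared) and the example sequences are all hypothetical, and reverse-complement matches and sequencing errors are ignored for brevity.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all length-k substrings of seq (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_ests(ests, k=8, min_shared=3):
    """Group sequences that share at least min_shared k-mers (single linkage).

    ests: dict mapping a sequence id to its nucleotide string.
    Returns a sorted list of clusters, each a sorted list of ids.
    Illustrative assumption: k-mer sharing stands in for real overlap detection.
    """
    parent = {name: name for name in ests}

    def find(x):
        # Follow parent links to the cluster representative, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Index every k-mer by the ids of the sequences that contain it.
    index = defaultdict(set)
    for name, seq in ests.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)

    # Count the distinct k-mers shared by each pair of sequences.
    shared = defaultdict(int)
    for names in index.values():
        ordered = sorted(names)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                shared[(a, b)] += 1

    # Merge pairs that pass the threshold; linkage is transitive.
    for (a, b), count in shared.items():
        if count >= min_shared:
            union(a, b)

    clusters = defaultdict(list)
    for name in ests:
        clusters[find(name)].append(name)
    return sorted(sorted(members) for members in clusters.values())


# Made-up sequences: est1 and est2 overlap by 12 bases, est3 is unrelated.
example = {
    "est1": "ACGTACGTTAGCCGATAGCT",
    "est2": "TAGCCGATAGCTGGATCCAA",
    "est3": "TTTTTTTTGGGGGGGGCCCC",
}
print(cluster_ests(example))
```

In this toy input, est1 and est2 fall into one cluster and est3 stays alone. A production pipeline would follow clustering with an assembly step that builds the consensus transcribed sequence for each cluster, which this sketch does not attempt.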
The online presentation of this publication is a special feature of the Human Genome Project Information Web site.