DOE Human Genome
121. The Regulatory Network of
Matthew N. Ashby, Tod Flak, and
Darren H. Wong
Eukaryotic cells possess the ability to
orchestrate the expression of thousands of genes in response to a changing
environment. While numerous genome sequencing projects of eukaryotic model
organisms are currently under way, only that of the yeast Saccharomyces
cerevisiae has been completed. The modest size of the yeast genome,
approximately 6000 hypothetical open reading frames, represents a significant
opportunity to study the organization and inter-relationships of the regulation
of gene expression on a genomic scale. The Genome Reporter Matrix (GRM)
consists of a high density array of yeast colonies each harboring one of
over 6000 yeast promoter-reporter fusions. The GRM can measure patterns
of gene expression in living cells in response to external stimuli or mutations.
The response of yeast exposed to an extensive panel of environmentally
important compounds as well as exposure to ionizing radiation will be examined
at the level of changes in gene expression. Compensatory changes in gene
regulation will also be examined in response to a collection of mutations.
Analyses of the 1300 expression profiles of a set of 864 reporters in response
to pharmaceutical agents revealed the presence of 26 unique regulons. These
analyses will be extended to over 6000 reporters in response to the proposed
environmental stimuli. The generality of the regulons identified from these
experiments will be assessed by a series of directed experiments in human
cells in tissue culture. These experiments will provide a map or framework
for the regulatory circuitry within a eukaryote and help determine the
extent of the evolutionary conservation between yeast and human cells.
122. Genomic Hot Spots for Homologous Recombination
Jerzy Jurka, Jiong Ma, and Sun-Yu
Non-LTR retrotransposons, or retroposons, integrate at short, consensus-defined DNA targets in mammals in a process mediated by L1 elements (1, 2). These targets appear to be hot spots for homologous recombination. We have determined that significant recombination occurs only in cells lacking the p53 tumor suppressor protein, such as the C33A cell line. Co-transfection of the p53 gene into C33A cells inhibited the recombination. We have also studied the recombinogenic effects of different mutations within the targets. The results will be presented. We will discuss the implications of our research for understanding genomic instability in cancer and germ line cells, as well as its potential applications in gene therapy.
1. Jurka, J. Proc. Natl. Acad. Sci. U.S.A. 94: 1872-1877 (1997).
123. Development and Application of Subtractive Hybridization-Based Approaches to Facilitate Gene Discovery
Maria de Fatima Bonaldo, Brian Berger,
and Marcelo Bento Soares
It is widely recognized that the generation
of Expressed Sequence Tags (ESTs) from 3' terminal exons of cDNA clones
randomly picked from libraries constitutes an efficient strategy to identify
genes (Adams et al. 1992; Adams et al. 1991; Adams et al. 1995; Adams et
al. 1993; Houlgatte et al. 1995; Khan et al. 1992; Matsubara and Okubo
1993; Okubo et al. 1992). However, it is important to acknowledge that
despite its advantages, there are several problems associated with the
EST approach. One of the problems commonly observed in large scale EST
programs is the redundant generation of ESTs corresponding to the most
common RNAs (i.e. mRNAs of the super-prevalent and intermediate frequency
classes, mitochondrial RNAs, and rRNAs). This is a problem that can significantly
impair the overall efficiency of a gene discovery program that relies solely
on the generation of ESTs from cDNA clones randomly picked from standard
libraries. The use of normalized cDNA libraries has been shown to expedite
gene discovery in large scale EST programs (Berry et al. 1995; Hillier
et al. 1996). Because in a typical normalized cDNA library the frequency
of all clones is within an order of magnitude range (Soares et al. 1994),
redundant identification of the most common RNAs is greatly minimized.
Normalized libraries can be generated by a number of reassociation-kinetics
based approaches (Bonaldo et al. 1996; Soares and Bonaldo 1996; Soares
et al. 1994). It is noteworthy, however, that the process of normalization
only contributes to minimize redundancies within (not across) libraries.
Redundant identification of ESTs derived from mRNAs that are expressed
in multiple tissues and therefore are represented in multiple libraries
constitutes a major problem at advanced phases of gene discovery programs.
The use of normalized libraries cannot help to solve this problem. Hence,
we have argued that this problem can be more effectively addressed by the
use of subtractive libraries that are progressively enriched for novel
ESTs (Bonaldo et al. 1996). With this support from the U.S. Department
of Energy, we have developed a subtractive hybridization-based gene discovery
strategy, which we named "Serial Subtraction of Normalized Libraries",
which involves the generation of ESTs from subtracted libraries enriched
for novel cDNAs. Serial Subtraction of Normalized Libraries is an iterative
approach whereby all arrayed cDNA clones from a library (which have been
or will be used for generation of ESTs) are pooled and used as a driver
in a subtractive hybridization with the library from which they originated.
Since the representation of the driver population is significantly reduced
in the resulting subtracted library, redundant generation of ESTs, regardless
of abundance, is significantly minimized. Hence, every new library of a
series is enriched for novel ESTs. Most importantly, however, this process
enhances the proportional representation of rare transcripts rather significantly.
It should be emphasized that such transcripts are likely to be missed in
more random sampling approaches, unless very large numbers of ESTs are
generated from a library, which inevitably ends up becoming costly and
inefficient due to the very high redundancy levels that are reached. This
strategy has been successfully applied in the rat gene discovery program
that we are conducting at The University of Iowa with NIH support. We have
been able to minimize redundancies rather significantly and thus maintain
a high frequency of identification of novel ESTs (62% overall average)
after a total of approximately 32,000 ESTs submitted to GenBank since February
1998. Most importantly, we were able to identify approximately 20,000 unique
clusters from a total of 32,000 3' ESTs, a gene discovery efficiency that
is unprecedented in any EST program described to date.
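The two figures quoted above are consistent with each other: about 20,000 unique clusters out of about 32,000 3' ESTs is a novelty rate of 62.5%, which matches the reported 62% overall average. The short Python sketch below checks that arithmetic. The function name `novelty_rate` is illustrative only and is not part of the authors' pipeline.

```python
def novelty_rate(unique_clusters: int, total_ests: int) -> float:
    """Fraction of sequenced ESTs that fall into a distinct (novel) cluster."""
    if total_ests <= 0:
        raise ValueError("total_ests must be positive")
    return unique_clusters / total_ests


# Approximate figures quoted in the abstract.
rate = novelty_rate(20_000, 32_000)
print(f"{rate:.1%}")  # 62.5%
```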
124. Generation of Large-Insert Mouse cDNA Libraries
Lisa Stubbs1, Jimmy
and Xiaojia Ren1
We have developed straightforward, reliable
and efficient protocols for generating large-insert clone libraries from
size-selected cDNA. We have found enzyme combinations and RNA preparation
protocols that routinely produce double-stranded cDNA products with excellent
representation of very large cDNA fragments (5-15 kb). After cDNA synthesis,
the products are fractionated on sucrose gradients, and each size fraction
is cloned and plated as a separate sub-library. Size fractions are chosen
for screening according to information obtained from Northern blot analysis
or from PCR screening of pooled sublibrary clones. Our most recent efforts,
funded as part of the JGI functional genomics pilot program, have focused
upon improvements in methods of library production and screening; we have
also begun to experiment with new methods to normalize large insert pools
before cloning. Using these improved protocols, we have recently created
a series of new mouse cDNA libraries representing brain, thymus, ovary,
and other mouse tissues. We will work with I.M.A.G.E. to share these resource
libraries with the Genome research community as widely as possible, to
serve as a resource for isolation of full-length cDNA clones.
125. The DOTS Resource for Gene Expression Analysis and Genome Annotation
Chris Overton, Brian Brunk, Jonathan
Crabtree, Philip Le, and Jules Milgram
We have created a linking database that integrates
a wide range of high quality, carefully analyzed information on eukaryotic
transcribed sequences. This new resource, termed DOTS (Database of Transcribed
Sequences), builds upon and substantially expands previous work on creating
LENS, a database linking information on ESTs generated in the IMAGE/WashU/Merck
project. The DOTS resource supports gene expression analyses ongoing
at Penn and large-scale genome annotation as part of the DOE sponsored
Genome Annotation Collaboratory. The central organizing concept of the
database is a representation for mature messenger and structural RNAs
and their predicted sequences accompanied by links to genomic sequences
and proteins, and associated information, e.g., gene expression arrays
and gene expression experiments. Construction of DOTS requires ongoing
computational analyses to identify putative transcribed sequences as
determined from databases of experimentally identified mRNAs, ESTs and
genomic sequences. In the long term, however, most of the effort in
building and maintaining DOTS involves integration of data from across
multiple online resources (and to some extent directly from the scientific
literature). For example, DOTS incorporates keywords and functional
taxonomies from GenBank, OMIM, SwissProt, and EGAD among others, enabling
complex queries such as "Display all transcription factors with a greater
than 4-fold difference in mRNA abundance level at day 11 of erythropoiesis
in adult and cord blood." The integration process is facilitated through
the K2 system, developed at Penn, for integration of information in
distributed, heterogeneous databases. Presentation of data is through
the bioWidgets visualization toolkit, also developed at Penn.
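The example query quoted above combines a functional keyword with a fold-difference threshold between two samples at one time point. The Python sketch below shows that filtering logic on a toy in-memory data set. It is not the DOTS schema or the K2 query interface: the `Transcript` record, its fields, and the sample records are all hypothetical and exist only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Transcript:
    """Hypothetical record: a transcript with keywords and abundance data."""
    name: str
    keywords: set[str]                       # functional annotations
    abundance: dict[tuple[int, str], float]  # (day, sample) -> mRNA abundance


def fold_difference(a: float, b: float) -> float:
    """Symmetric fold difference; at least 1 for positive inputs."""
    return max(a, b) / min(a, b)


def query(transcripts, keyword, day, sample_a, sample_b, min_fold):
    """Names of transcripts with the keyword and > min_fold difference."""
    hits = []
    for t in transcripts:
        if keyword not in t.keywords:
            continue
        a = t.abundance.get((day, sample_a))
        b = t.abundance.get((day, sample_b))
        if a is None or b is None or a <= 0 or b <= 0:
            continue  # a fold difference cannot be computed
        if fold_difference(a, b) > min_fold:
            hits.append(t.name)
    return hits


# Invented example records, not real expression data.
records = [
    Transcript("TF_A", {"transcription factor"},
               {(11, "adult"): 50.0, (11, "cord"): 10.0}),  # 5-fold
    Transcript("TF_B", {"transcription factor"},
               {(11, "adult"): 30.0, (11, "cord"): 10.0}),  # 3-fold
    Transcript("KIN_C", {"kinase"},
               {(11, "adult"): 90.0, (11, "cord"): 10.0}),  # wrong keyword
]
print(query(records, "transcription factor", 11, "adult", "cord", 4.0))
```

With these invented records, only `TF_A` passes both the keyword filter and the 4-fold threshold.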
126. Web Based Quality Reporting of Completed DNA Sequencing
Robert D. Sutherland
I have created a Web based mechanism to report quality statistics on completed DNA sequencing projects. The motivation for this project was to streamline the processing of sequence and phrap quality data to the Web in an automated manner for access by the public. This function crosses over facility boundaries and provides a single point of access for all of the JGI sequencing data. Currently, the JGI is required to (1) run quality codes to create the sequencing statistics, (2) transfer summary statistics to an Excel spreadsheet, and (3) convert the spreadsheets to HTML and post them to the Web. This is costly in time and effort, and the data from all sites cannot be viewed in a single report.
This new process automates the above steps in a single Web application that will meet the increased growth of sequencing within the JGI.
The Web implementation of this project is in three parts. The first part connects Perl scripts to the Web to run quality numbers on sequencing projects. This is all done locally at each facility. Once the quality numbers are acceptable, the second Web part gathers more information about the project and posts all the data to one common database. The third Web part is the reporting mechanism, which can give standard reports or create limited ad hoc queries to see only a portion of the sequencing projects.
This application utilizes several integrated technologies including the WWW, CGI, and Database engines using HTML, Perl, SQL, and C-shell.
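The abstract does not say which statistics the quality codes compute. As an illustration only, the Python sketch below summarizes per-base quality values under the assumption that they are phred-scaled, so that a quality Q corresponds to an error probability of 10^(-Q/10). The function name `summarize_quality` and the choice of summary fields are assumptions, not the JGI implementation (which, per the abstract, uses Perl).

```python
def summarize_quality(quals):
    """Summarize phred-scaled per-base quality values for one sequence."""
    if not quals:
        raise ValueError("no quality values given")
    # Each base contributes its own error probability, 10^(-Q/10).
    expected_errors = sum(10 ** (-q / 10) for q in quals)
    n = len(quals)
    return {
        "bases": n,
        "expected_errors": expected_errors,
        "errors_per_10kb": expected_errors / n * 10_000,
        "bases_below_q20": sum(1 for q in quals if q < 20),
    }


# Invented example: 9,000 bases at Q40 and 1,000 bases at Q20.
stats = summarize_quality([40] * 9_000 + [20] * 1_000)
print(stats["bases"], round(stats["expected_errors"], 2))
```

In this invented example the Q20 bases account for almost all of the expected errors, which is why a per-project summary of this kind can be more informative than an average quality alone.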
The reporting part of the application can
be accessed at http://jgi.doe.gov. This
work is funded by the United States Department of Energy.
127. IMAGEne II: EST Clustering and Ranking of I.M.A.G.E. cDNA Clones Corresponding to Known and Unknown Genes
Peg Folta, Tom Kuczmarski, and Christa
With just under 2 million entries in dbEST, selecting the best cDNA clone(s) with which to conduct costly research is becoming increasingly difficult. The I.M.A.G.E. Consortium has developed the IMAGEne II product to increase the value of its cDNA collection by organizing its corresponding dbEST information at a gene level. IMAGEne II first clusters I.M.A.G.E. clones to both known and "unknown" genes. It then ranks the clones within a cluster as to their ability to represent the gene. A Java-based display allows users to query against this database of information and view the alignment of the clusters at the nucleotide level. While the current product deals with the human species only, it will soon be extended to include other species and multi-species clusters.
This work was performed by Lawrence Livermore
National Laboratory under the auspices of the U.S. Department of Energy,
Contract No. W-7405-Eng-48.
128. Screening for Mutant Phenotypes in Mice at ORNL
D.K. Johnson, K.C. Goss, G.S. Sega,
J.C. Schryver, M.J. Paulus, M.N. Ericson, and L.S. Webb
The Life Sciences, Instrumentation and Controls, and Chemical and Analytical Sciences Divisions at Oak Ridge National Laboratory have launched a broad-based, high-throughput primary screening program designed to recover mouse mutations exhibiting subtle phenotypes. We are validating screening tools for behavioral, biochemical, morphological, and physiological changes induced by experimental mutagenesis by screening about 100 test-class mice per week.
Our behavior-testing set currently includes the Porsolt forced swim test, rotorod, Poly-Track open-field activity system, and Photobeam activity monitor. We are introducing modifications into our cued and contextual fear conditioning test for learning and memory deficits and into our startle response tests, which have not proved adequate for reliable mutant identification so far. Biochemical tests include gas chromatography/mass spectrometry analysis of fatty acids, organic acids, and neurotransmitters in blood and tissue, as well as standard package analysis on an Abbott Cell-Dyn 3500 Hematology Analyzer. For urine, we perform standard dipstick and specific gravity tests.
Tool development includes a microCT scanner with image analysis software for mice, and a subdermal microbiosensor for the measurement of activity patterns, heart rate, body temperature, and, eventually, blood pressure. We are organizing joint screening programs with clinical and academic institutions across the state of Tennessee in order to broaden our screening and greatly enhance our expertise. Our goal is to maximize the number of whole-organism mutant phenotypes that we can detect in a high-throughput, broad-based, and cost-effective primary screening effort at ORNL.
[Research sponsored jointly by the Office of Health and Environmental Research, USDOE, under contract DE-AC05-96OR22464 with Lockheed Martin Energy Systems, Inc., and by the National Center for Human Genome Research (HG 00370).]
129. Using Overlapping Deletions in the Analysis of Recessive Phenotypes
Yun You, Hanna Chao, Sarah Mentzer,
Rebecca Bergstrom1, and John Schimenti1
Chromosomal deletions have been exploited to perform a systematic characterization of functional units in Drosophila melanogaster. The Human Genome Project will generate nucleotide sequences of 10^9 base pairs, covering an estimated 80,000 to 100,000 human genes, only a small percentage of which have a known role. As a model system, the mouse is an indispensable tool to decipher mammalian gene function. A high throughput method has recently been developed to induce chromosomal deletions at any region of the mouse genome by radiation in embryonic stem (ES) cells. Lines of mutant mice carrying deletions around the D17Aus9 locus have been generated by this strategy. Deletion analysis of mutant mice called D17Aus9df10J, which carry a small deletion, showed that an early lethal gene is located near the D17Aus9 locus. Early lethality renders further deletion analysis of this region difficult. In our deletion analysis this problem was easily avoided by using Del(17)T7J, another mutant line carrying a deletion, which does not encompass the D17Aus9 locus but overlaps with the deleted region found in D17Aus9df10J. By crossing D17Aus9df10J /+ to Del(17)T7J /+, the heterozygous compound deletions unveiled a late-acting recessive lethal locus. The deletion analysis data and initial characterization of this lethal mutant will be presented.
The above results illustrate the importance of generating sets of overlapping deletion complexes in the mouse chromosome 15 mutagenesis project at Oak Ridge National Laboratory (see abstract of E. Rinchik et al.). The deletions will be used as mapping tools to locate the ENU-induced point mutations, and will also serve as reagents to identify functional units and clone genes important for mouse development along chromosome 15.
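The mapping logic behind overlapping deletions can be sketched as interval arithmetic: in a compound heterozygote carrying two different deletions, only loci inside the overlap of the two deleted intervals are missing from both chromosomes. The Python sketch below is a minimal illustration. All coordinates and the locus labels `early_lethal` and `late_lethal` are hypothetical placeholders and are not measured positions on mouse chromosome 17.

```python
def overlap(a, b):
    """Overlap of two half-open intervals (start, end), or None if disjoint."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start < end else None


def nullizygous_loci(del_a, del_b, loci):
    """Loci absent from both homologs when one carries del_a and one del_b."""
    region = overlap(del_a, del_b)
    if region is None:
        return []
    return sorted(name for name, pos in loci.items()
                  if region[0] <= pos < region[1])


# Hypothetical coordinates in arbitrary units (illustration only).
df10j = (100, 200)   # stands in for the deletion that includes D17Aus9
t7j = (160, 300)     # stands in for the deletion that excludes D17Aus9
loci = {"D17Aus9": 130, "early_lethal": 140, "late_lethal": 180}

# Homozygous for the first deletion: every locus inside it is lost.
print(nullizygous_loci(df10j, df10j, loci))
# Compound heterozygote: only loci in the overlap are lost.
print(nullizygous_loci(df10j, t7j, loci))
```

In this toy layout the compound heterozygote retains one copy of the early-acting locus, so the later-acting locus in the overlap becomes the only one fully uncovered. This is the situation the abstract describes.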
[Research currently sponsored by USDOE, under contract DE-AC05-96OR22464 with Lockheed Martin Energy Research, Inc.]
130. Germline Deletion Complexes in Embryonic Stem Cells for Mapping Gene Function in Mouse-Human Homology Regions
Edward J. Michaud, Irina Khrebtukova,
Carmen M. Foster, and Tuan Vo-Dinh
Rapid progress has been made by the human genome sciences community in the last several years in generating nearly complete physical maps of several human chromosomes. In the very near future, the map positions and DNA sequence of the estimated 100,000 genes that make up a healthy individual will also be known. Sequence information alone, however, is often insufficient to ascertain the biological roles that genes play in normal human development and health. In order to determine the organismal function of every human gene and to understand how specific DNA mutations in genes result in birth defects and disease, strategies will need to be employed that are cost effective, scalable to the entire genome, and complementary to the mapping and sequencing data.
One powerful approach for mapping the biological functions of many human genes that reside along large segments of chromosomes is to generate nested sets of chromosomal deletions in the homologous regions in mice. Deletion complexes at defined loci on mouse chromosomes permit fine-structure gene-function maps to be constructed, based on heritable mutations with specific phenotypes, that are then correlated with the available physical maps. Unfortunately, deletion complexes are currently available for only about 14% of the mouse genome. However, a new method was recently described (You et al., Nature Genet. 15:285-288, 1997; Thomas et al., Proc. Natl. Acad. Sci. USA 95:1114-1119, 1998) that permits deletion complexes to be generated anywhere in the mouse genome in F1 hybrid embryonic stem (ES) cells. The method is rapid and cost effective because the deletions are generated at defined locations in the genome and selected for in the ES cells. Additionally, many different deletions can be generated in one experiment and the extent of the deletion breakpoints can be mapped with available polymorphic markers before producing lines of mice.
The objective of this project is to develop ES-cell reagents to facilitate the generation of functional maps in gene-rich regions that are homologous to portions of human chromosomes being mapped and sequenced by the Joint Genome Institute. The initial focus of this project will be to generate nested sets of chromosomal deletions in ES cells for the proximal 23 cM of mouse Chr 7 (human 19q homology) and a 16 cM region of proximal mouse Chr 11 (human 5q homology). ES-cell clones containing chromosomal deletions are the first reagents that will be generated and made available to the scientific community. As the project progresses, the goal will be to generate lines of mice harboring these chromosomal deletions and to archive these mutations in the form of cryopreserved embryos and spermatozoa. The reagents generated during this project will be advertised on the Joint Genome Institute Functional Genomics web site.
This work is supported by the U.S. Department of Energy FWP ERKP293, in collaboration with the Joint Genome Institute.
131. Mouse Genetics and Mutagenesis for Functional Genomics: The Chromosome 7 and 15 Mutagenesis Programs at the Oak Ridge National Laboratory
E. M. Rinchik, D. A. Carpenter,
E. J. Michaud, Y. You, P. R. Hunsicker, and D. K. Johnson
The development of detailed mutation maps of regions of the mouse genome provides new resources for the study of mammalian biology and serves as an important functional complement to the human genome program. Mouse-human linkage homologies permit a type of "surrogate genetics" to be developed for regions of the human genome that is based on analyzing the molecular and organismal consequences of mutations mapping within the corresponding mouse genomic segment. One of the major goals of the mouse genetics program at ORNL is to apply our experience in chemical germ-cell mutagenesis and mutation recovery and propagation, as well as recently developed and evolving broad-based phenotype screening, for creating a large, user-friendly mouse-mutation resource for use by the functional-genomics and wider biological communities.
For a number of years, we have been molecularly characterizing regions of mouse Chromosome (Chr) 7 while recovering, in parallel, N-ethyl-N-nitrosourea (ENU)-induced, recessive single-gene mutations mapping within those regions by two-generation hemizygosity screens with radiation-induced deletions. Mutagenesis of one 6- to 11-cM region surrounding the albino (c; Tyr) locus has been completed, yielding 31 mutations representing ten complementation groups. An on-going screen of another ~4- to 5-cM Chr-7 region, proximal to the pink-eyed dilution (p) locus (human 11p and 15q homologies), has so far yielded 19 new mutations, representing 8 complementation groups, from a screen of just 1218 gametes. Both of these screens have greatly increased the fine-structure genetic and functional maps of the corresponding regions. In addition to these hemizygosity screens, we shall also describe new work about to get underway that involves three-generation, homozygosity strategies to induce mutations in proximal Chr 7 (human 19q homology), mid Chr 7 (human 15q homology), and mid-to-distal Chr 15 (human 8q, 22q, and 12q homologies), which are large, multi-megabase regions that are currently not covered by complexes of deletions. We shall discuss our emphasis on up-front investment in developing genetic reagents so that any mutation created can be maintained and used by a wide variety of investigators with no molecular genotyping. In addition, we shall discuss the potential value of "parallel processing" in regional mutagenesis of the mouse genome, in which chromosomally "pre-mapped" mutations are recovered by three-generation screens with inversions in parallel to (not following) the development of deletions in embryonic stem cells for use as finer-mapping and gene-identification reagents. Mutant stocks generated in any of our screens will be advertised on the Web (http://lsd.ornl.gov/htmouse/mmdmain.htm) and made available to the scientific community.
[Research currently sponsored by the Office of Biological and Environmental Research, US DOE, under contract DE-AC05-96OR22464 with Lockheed Martin Energy Research, Inc., and in the past by US DOE and the National Human Genome Research Institute (HG 00370).]
132. Comparative Analysis of Structure and Function in an Imprinted Region of Proximal Mouse Chromosome 7 and the Related Region of Human Chromosome 19q13.4
Joomyeong Kim, Anne Bergmann, Xiaochen
Lu, Anne Olsen, Jane Lamerdin, and Lisa Stubbs
Our group is interested in coupling mouse genetics and biology to comparative sequence that is being generated by JGI teams, with the aim of generating in-depth functional annotation of the sequenced human regions. One special target for these studies has been a 2 Mb region of human chromosome 19q13.4 (H19q13.4) and the syntenically homologous region of mouse chromosome 7 (Mmu7). The sequence of the human region is nearly completed by the JGI genome sequencing team, and sequence analysis of the homologous murine region has recently been initiated. Since the murine region is known to be parentally imprinted and imprinting is generally conserved in mammalian species, the structure and function of this region are of special biological interest.
Our group's efforts begin as DNA sequence and basic annotation of specific DNA segments are completed. Our goal is to characterize genes and conserved regulatory sequences predicted to exist in the sequenced regions, especially those that may contribute to parent-of-origin specific functions in humans and mice. The human region is especially rich in clustered zinc finger containing genes (ZNFs); about 90% of the genes found in 19q13.4 appear to be actively expressed Kruppel-type ZNF loci. We have also identified a number of other types of genes in the 2 Mb human interval, including genes encoding an Aurora-related serine/threonine kinase (STK13), a sulfotransferase2 (ST2), and one anonymous gene homologous to a yeast hypothetical protein P38334 (HYP). The mouse region is similar in terms of gene content, but physical mapping studies have also revealed several chromosomal changes that are unique to the mouse. For example, there are at least 5 copies of the STK13-related genes in mouse, and these copies appear to have been duplicated in tandem as part of a large unit that also contains ZNF-related gene sequences. This tandem duplication appears to have occurred in very recent evolutionary history. One gene (Cln4-2), whose homolog is located in the pseudoautosomal region of the human X chromosome and which had previously been mapped to Mmu7, also appears to have been very recently transposed into the mouse zinc-finger gene cluster region. Other orthologs of human 19q13.4 genes, including ST2 and HYP, are present in the related mouse region, although their relative positions within the mouse and human regions have not been strictly conserved. The intronless nature of many of these genes suggests that they were duplicated by retroposition and inserted into this region after the zinc-finger gene clusters were elaborated.
Detailed functional studies have been focused primarily on the imprinted genes that are located in this region. Only one imprinted gene, paternally expressed gene 3 (Peg3), had been identified at the outset of this study, but we have recently identified several new genes located near Peg3. At least one of the new genes, called Zim1, is also imprinted in mouse. Peg3 and Zim1 are located next to each other in both species, and the two genes are reciprocally imprinted: Peg3 is paternally expressed whereas Zim1 is expressed only from the maternal allele and specifically in embryonic tissues. As has been found in other well-known imprinted domains, such as the Prader-Willi/Angelman syndrome region of H15q11-q13/Mmu7 and the Beckwith-Wiedemann region of H11p15.5/Mmu7, the H19q13.4/proximal Mmu7 imprinted domain is expected to contain a number of additional imprinted genes. A comprehensive update on our recent studies, including gene identification, gene expression, and functional analysis of this gene-rich, imprinted region will be presented.
133. Differential Expansion of Homologous Zinc-Finger Gene Families in Human Chromosome 19q13.2 and Mouse Chromosome 7
Mark Shannon1, Elbert
Branscomb1, Loren Hauser2, Anne Olsen1,
Laurie Gordon1, Linda K. Ashworth1, and Lisa
Mapping studies indicate that many of the 600-1000 mammalian zinc-finger (ZNF)-containing genes reside within familial clusters, particularly those genes encoding Kruppel-associated box (KRAB) motifs. However, little is known about family content, organization, or evolutionary conservation. In previous studies, we identified and characterized homologous KRAB-containing ZNF gene families located in human chromosome 19q13.2 and mouse chromosome 7. Here we present details of the construction and characterization of contigs that completely span these families. The human cluster spans 700kb and is comprised of 16 members that are arrayed in tandem. By contrast, the mouse family spans approximately 400kb and contains just 10 genes. We have also identified cDNA clones corresponding to each family member and have analyzed their sequences. The KRAB A domains encoded by the human and mouse genes are highly similar in sequence, but other portions of the predicted proteins encoded by the clustered paralogs may be more divergent in structure. To predict the evolutionary relationships between genes within and between the families, ZNF-containing regions were compared using computational methods. These studies uncovered three pairs of putative orthologs, but also provided evidence for the continued evolution of the families in both species after their divergence from a common ancestor. Recent evolutionary events include intragenic ZNF repeat alterations as well as complete gene duplications. These studies therefore expose complex, yet discernible, histories of sequence duplication and divergence and pave the way for studies of the evolution of gene function within the related families.
134. YAC-ES (Y-ES) Cell Libraries for In Vivo Analysis of JGI Sequences
Yiwen Zhu, Veena Afzal, Jan-Fang
Cheng, and Edward Rubin
Libraries of the human genome, propagated in bacteria and somatic cells, have been an invaluable tool in the identification of genes based on in vitro assays. We have expanded upon this concept by creating a >10 Mb "in vivo" library of regions of the human genome sequenced by the JGI in the form of human YACs propagated in totipotent mouse embryonic stem (ES) cells.
Megabase human YACs from JGI sequenced regions were first characterized for integrity by Southern hybridization and STS content mapping. Appropriate YACs were then retrofitted with selectable markers and introduced individually into germ line transmitting ES cells by yeast spheroplast - ES cell fusion. The content and integrity of the human YACs in the ES cells were assayed, and clones with intact human transgenes without detectable rearrangements have been cryopreserved to serve as publicly available reagents to explore the function of genes contained within the human sequences (for more information, visit our Web site at http://grail.lsd.ornl.gov/projects/jgi/fung.shtml). In our initial experiments, we have fused six 5q31 YACs and two 19q13.4 YACs into ES cells. The 19q13.4 Y-ES clones are being injected into mouse blastocysts to test their capacity to contribute to the germline. We have obtained good chimeras from one of the 19q13.4 Y-ES clones tested. We are now in the process of characterizing another 50 megaYACs in the 5q region.
Possible uses of the Y-ES clones to biological researchers include: 1) ready-made reagents for investigating the expression/function of a human gene of interest, either in tissue culture or in transgenic mice derived from Y-ES clones; 2) reagents for fine mapping of mouse mutations based on functional in vivo complementation of the mutant mouse phenotype by YAC transgenes; and 3) tools for sifting through large candidate regions of the genome identified by human complex trait mapping studies, using gene expression patterns or functional assays in mice propagating these regions.
135. Comparative Functional Genomics
George M. Church, Pam Ralston, Martha
Bulyk, Abby McGuire, Rob Mitra, Saeed Tavazoie, and Jason Hughes
We have developed technologies for annotating genome sequences, including intergenic regions and regulon/operon comparisons. Performing enzymatic reactions on oligonucleotide chips or microarrays (of kbp-sized DNA) allows us to replicate DNA chips using microcontact printing. RNA quantitations from chip, microarray, and SAGE experiments can be merged and clustered, and the motifs mechanistically responsible for the clusters of coregulated RNAs can be determined. Methods for measuring and modeling in vivo concentrations of protein, RNA, and metabolite, as well as protein interactions and mutant growth rates in response to diverse environments, provide the foundations for a genome sequence function database.
Nature Biotech. 16:566-571; Nature Biotech. 16: 939-945; J. Molec. Biol. 284: 241-254. (http://arep.med.harvard.edu)
136. A Targeted 450 Kb Deletion in Mouse Chromosome 11 Identifies a Novel Gene Dramatically Impacting on VLDL Triglyceride Production
Yiwen Zhu, Miek Jong, Elaine Gong,
Kelly Frazer, Jan-Fang Cheng, and Eddy Rubin
In order to multiplex the examination of the function of genes present within JGI sequenced regions of the human genome, the Cre Lox system was employed to delete several genes at a time within the human 5q31 / mouse chromosome 11 syntenic region. A 450 Kb stretch between the IRF1 and CSF-GM genes, containing nine genes all of no known function, was deleted in ES cells. Mice homozygous for the deletion, though viable prenatally, demonstrated premature morbidity, with approximately 75% dying before 100 days of age. The other significant finding in these animals was massive enlargement of the liver owing to the engorgement of hepatocytes with triglycerides. Plasma triglyceride levels were approximately ten-fold greater than in control animals.
Due to the importance of triglyceride metabolism as an atherosclerotic risk factor we investigated the mechanism underlying the hypertriglyceridemia in these mice. Of the three major factors affecting plasma triglyceride levels (synthesis, lipolysis in the periphery and clearance via hepatic uptake) abnormalities were only present with regard to synthesis. Homozygous animals exhibited a four-fold increase in hepatic VLDL triglyceride production while animals heterozygous for the deletion had a two-fold increase in triglyceride production compared to control mice. These deletion mice represent a unique model of enhanced hepatic triglyceride production coupled with increased hepatic fat accumulation.
To identify the gene responsible for this phenotype through in vivo complementation, the 450 Kb deletion mice were crossed with mice containing human YAC and mouse BAC transgenes covering the entire deleted region. Animals homozygous for the deletion and hemizygous for the transgenes were analyzed with regard to the phenotypes associated with the deletion. A human YAC containing approximately 120 Kb of the deleted region successfully corrected the hepatic fat accumulation, hypertriglyceridemia and premature lethality associated with the homozygous deletion. Three potential candidate genes are present in this YAC: two identified as EST hits with no homology with known genes and one with homology to a rat liver specific transporter-like protein.
137. Identification and Functional Analysis of Evolutionarily Conserved Non-Coding Sequences in the Human 5q31 Cytokine Cluster Region
Gabriela Cretu, Webb Miller, Catherine
M. Brion, Jan-Fang Cheng, Christopher H. Martin, William Kimberly, Edward
M. Rubin, and Kelly A. Frazer
The human 5q31 region chosen by the JGI for large scale sequencing is biologically interesting because it harbors a family of cytokine genes which are important regulators of the immune response. We previously annotated the 1 Mb Cytokine Gene Cluster Region on human 5q31 computationally and biologically resulting in the identification of 23 genes and the determination of their expression patterns. To further annotate this 1 Mb region we are currently identifying evolutionarily conserved non-coding sequences and attempting to determine their biological function. Since the interleukin loci in this region are likely to have arisen by ancient duplications of an ancestral gene, several intervals of this region may be paralogous with each other. To identify evolutionarily conserved non-coding elements we compared the potential human paralogous sequences in this 1 Mb region to each other as well as with their orthologous mouse chromosome 11 sequences. Several highly conserved non-coding sequences that potentially have biological function were identified. We have started functionally analyzing two of these non-coding elements: an 86 bp element located 5' of the human interleukin 13 (IL 13) gene which is 91% identical with another non-coding element in the human 5q31 region, and a 400 bp element located between the IL 4 and IL 13 genes which is 85% identical in humans and mice.
To investigate the biological function of the 86 bp element upstream of IL 13 we have deleted it in a human 5q31 YAC and have used this manipulated YAC to generate transgenic mice. Production of human IL 4 and IL 13 is markedly reduced in mice harboring the mutated YAC lacking the 86 bp conserved element compared with the production of these human proteins in mice harboring a wild type YAC. These data suggest that the 86 bp conserved element is involved in regulating the expression of the human IL 4 and IL 13 genes.
The biological role of the non-coding 400 bp element located between IL 4 and IL 13 is being investigated by several strategies. We determined by PCR amplification and sequence analysis that this 400 bp element is also highly conserved in the cow, dog, rabbit, pig, and rat which provides additional evidence indicating that this human-mouse conserved non-coding element is likely to be functionally important. To determine the biological role of this conserved element we are in the process of homozygously deleting it in mice and will determine how its absence affects the expression of the murine IL 4 and IL 13 genes. We are also performing comparative studies of human IL 4 and IL 13 expression in mice harboring a 5q31 YAC lacking the 400 bp conserved element with mice harboring a wild type 5q31 YAC. To be able to assay for small changes in the expression of the human IL 4 and IL 13 genes it is important to eliminate variation in their expression due to integration site of the mutated and wild type YAC in the mouse genome. To accomplish this we have surrounded the 400 bp conserved element on the human YAC with lox P sites and are using this YAC to generate transgenic mice. The founder mice will be mated with wild type mice and with mice expressing CRE recombinase. In this manner we plan to generate lines of mice that harbor at the same site of integration the human YAC lacking and containing the 400 bp element. By assaying for changes in the expression of the human IL 4 and IL 13 genes in these mice we hope to gain insight into the function of this highly conserved 400 bp non-coding element.
138. Discovering the Genes Affected by Schizophrenia Using DNA Micro-Array
Yang Qiu, Edward M. Rubin, and Jan-Fang
Schizophrenia is a devastating psychiatric disorder that affects 1% of the population. Genetic factors make important contributions to the etiologies of this disease. It is highly likely that multiple genes and environmental factors are involved. Chromosome 6p has been shown to have linkage with schizophrenia in several independent studies. The current drugs treating schizophrenia including clozapine, risperidone and olanzapine are all far from perfect with substantial side effects. It is thus important to be able to identify the genes affected by schizophrenia, which would greatly enhance the drug discovery leading to a better treatment.
We are taking advantage of the DNA micro-array technology at Lawrence Berkeley National Lab, which can hold thousands of genes on a single glass slide, and of the human and mouse Unigen sets (uniquely expressed sequences) developed through the effort of the genome community. The expression of thousands of genes under different physiological conditions can be analyzed in parallel. New genes can be identified and biological functions of the genes can be further studied. The DNAs to be spotted on the DNA micro-array are as follows:
(1) 10,000 human Unigen clones representing ~20% of expressed human genes.
(2) 309 BAC clones available from the physical mapping project covering 90% of the schizophrenia candidate region at chromosome 6p.
(3) 269 genes individually selected through a thorough literature search, which include neurotransmitter receptors (dopamine receptor, glutamate receptor, serotonin receptor, acetylcholine receptor, etc.), brain function related genes and other genes possibly involved in schizophrenia.
(4) ~ 40 clones identified to be differentially expressed in neuropsychiatric disorders by Stanley Neurovirology Laboratory at the Johns Hopkins University, School of Medicine.
The postmortem brain tissue from individuals with schizophrenia and normal controls will be obtained from the Stanley Foundation Neuropathology Consortium. Total RNAs are to be extracted from the different brain tissues and hybridized with the DNA micro-array. The genes that are affected in schizophrenia can be identified when using large sample sets to minimize the individual variations in gene expression.
Meanwhile, a mouse model for schizophrenia is being developed for this study. It has been established that mice treated with the psychotomimetic drug PCP (angel dust) mimic some symptoms of schizophrenia, including the diminished prepulse inhibition seen in schizophrenia patients. We are treating mice with PCP as well as antipsychotic drugs including clozapine and risperidone. The gene expression patterns in the mouse brain will be followed at different times after each treatment using the DNA micro-array. The genes that are affected by these drugs can then be considered candidate genes for schizophrenia.
139. Gene Expression in Cardiac Hypertrophy as Measured by cDNA Microarrays
Carl Friddle, Teiichiro Koga, James
Bristow, and Edward M. Rubin
The mouse heart is an ideal model because it offers the benefits of a whole animal system in an organ comprised of relatively few cell types. We are studying the changes in heart gene expression that correspond to the onset and progression of cardiac hypertrophy. We wish to know the normal distribution of gene transcripts in the heart chambers in contrast to the distribution found in the hypertrophic state. Much has been learned about the expression profile of specific genes that play a role in cardiac hypertrophy (e.g.: ANF, MLC-2, c-Fos, c-Jun, Egr1). However, the discovery of new pathways involved in hypertrophy, and even the rapid identification of genes in known pathways, would benefit from an approach that analyzes thousands of genes in parallel. One could then perform a more detailed analysis of those genes that show a change in expression levels in conjunction with the onset of cardiac hypertrophy.
We have chosen to apply cDNA microarray technology to these questions. An array of over 3000 mouse EST clones was generated from the sequenced libraries of the IMAGE consortium. Included are ESTs from heart, embryonic, liver and brain libraries. This array allows us to generate an expression profile for both the normal mouse heart and for hypertrophic tissues.
Hypertrophy was induced in vivo by treating mice with isoproterenol. Heart weight was increased by 50% over the course of a week. Mice were sacrificed daily to generate a time course of hypertrophy induction. We then monitored the expression of the genes represented by our 3000 clones and used this information to identify classes of genes that are regulated in coordination with the onset and progression of cardiac hypertrophy.
140. Genetic Factors Affecting Globin Switching
Sluan D. Lin, Phil Cooper, Mary
E. Stevens, and Edward M. Rubin
Low level expression of fetal gamma globin inhibits red cell sickling and its pathological consequences in individuals homozygous for the Beta-S allele. However, expression switches from fetal gamma globin to adult beta globin shortly after birth. To further our understanding of the genetic factors affecting sickle cell disease, we are studying the "transacting" modifier gene(s) that impact the switching of human gamma to beta globin in transgenic mice.
Creation of transgenic mice that persistently express human gamma globin: We have made transgenic mice using a YAC containing the entire human beta-cluster with a -117 Agm mutation. The mutation causes the Greek form of hereditary persistence of fetal hemoglobin (HPFH) in humans. We found that the HPFH mutation can also cause the human gamma chains to be expressed postnatally in the transgenic mice. This feature has greatly facilitated the study of the globin switching parameters and the level of gamma globin expression after birth.
Different genetic backgrounds affect the globin switching parameter among the F1 animals: The heterozygous FVB transgenic animals were crossed with different inbred strains including DBA/2N, Balb/C, 129/SvJ and SWR/J. Blood from YAC-positive animals, identified by PCR screening, was collected at 10, 15, 30 and 60 days after birth. The human gamma/beta globin ratio of the DBA/2N-derived F1 hybrids was consistently the highest throughout the sampling period. Comparing the DBA/2N-derived F1 with the transgenic FVB, the p-values are 1x10^-7 and 3x10^-10 on day 30 and day 60, respectively. Thus, we have verified the hypothesis that the different genetic backgrounds of F1 hybrid mice, derived from crossing the transgenic with other inbred strains, can affect the level of gamma globin expression.
Backcross suggests more than one genetic locus contributes to regulating gamma globin expression: We first generated 70 backcross transgenic animals as a pilot study of the possible number of genetic factors involved in up-regulating gamma expression. The lack of a bimodal distribution among the backcross animals suggests that more than one genetic locus contributes to regulating the expression of human gamma globin in the transgenic mice. By applying the classical formula of Wright (1968), we estimate that the number of QTLs controlling gamma expression is 2.4 and that the genetic contribution to the phenotypic variance is 48%. By comparison with other published studies of similar situations, we estimate that we would need approximately 200 backcross transgenic animals in total to map the QTLs. To test for polymorphism between FVB and DBA/2N genomic DNA, we are screening 83 SSLP markers across the mouse genome at an average genetic interval of 17 cM. Using these markers, we are performing a genome scan starting with the 20% of backcross animals at the two phenotypic extremes. Once a significant lod score of 3.3 is achieved, we will pick more markers around the particular locus to fine map the modifier gene.
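The Wright estimate and the selective genotyping step described above can be illustrated with a short calculation. The abstract does not say which form of the formula was applied, so this sketch assumes the textbook Castle-Wright form for an F2-type cross, in which the effective number of loci is the squared difference between the parental means divided by eight times the segregation variance. It also assumes that the 20% is the total fraction of animals genotyped, split evenly between the two tails. The function names and all numbers are hypothetical and are not data from this study.

```python
def castle_wright_loci(mean_parent_1, mean_parent_2, segregation_variance):
    """Effective number of loci, Castle-Wright F2 form (an assumed form).

    n = (mean_P1 - mean_P2)^2 / (8 * segregation variance)
    """
    if segregation_variance <= 0:
        raise ValueError("segregation variance must be positive")
    diff = mean_parent_1 - mean_parent_2
    return diff ** 2 / (8.0 * segregation_variance)


def select_extremes(phenotypes, fraction=0.2):
    """Pick animals at the two phenotypic extremes for a first genome scan.

    Assumption: `fraction` is the total share of animals genotyped,
    half taken from the low tail and half from the high tail.
    Returns two lists of indices into `phenotypes`.
    """
    n_each = max(1, round(len(phenotypes) * fraction / 2))
    order = sorted(range(len(phenotypes)), key=lambda i: phenotypes[i])
    return order[:n_each], order[-n_each:]


if __name__ == "__main__":
    # Hypothetical parental means and segregation variance.
    print(castle_wright_loci(10.0, 2.0, 8.0))
    # Hypothetical gamma/beta ratios for ten backcross animals.
    low, high = select_extremes([5, 1, 9, 3, 7, 2, 8, 4, 6, 10])
    print(low, high)
```

Genotyping only the tails concentrates effort on the animals that carry the most linkage information, which is why a scan can start with a fraction of the backcross panel.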
141. Resources for Functional Genomics in Drosophila
Gerald Rubin, Suzanna Lewis, Ling
Hong, Damon Harvey, E. Jay Rehm, Amy Beaton, Peter Brokstein, Guochun Liao,
Erwin Frise, and Allan Spradling
A major goal of the Drosophila Genome Project is to biologically annotate the DNA sequence of the Drosophila melanogaster genome as it emerges and to provide community resources for functional genomics. To this end, we are carrying out several related projects: (1) insertional mutagenesis using P transposable elements that have been engineered to allow controlled misexpression, in addition to insertional inactivation, of the gene at the site of insertion; (2) generation of a "unigene" set of arrayed, full-length Drosophila cDNAs; (3) highly accurate DNA sequencing of a selected subset of these cDNAs; (4) single-pass sequencing of the remainder of the cDNAs as the corresponding genomic sequence becomes available to generate a transcript map of the genome; and (5) determination of the expression pattern of individual genes by tissue in situ hybridization to embryos at a variety of developmental stages and by hybridization to gene microarrays.
142. Isolation of Drosophila DNA Repair Genes
R. Scott Hawley, Kenneth C. Burtis,
and Gerald M. Rubin1
The ultimate goal of this project is to complete the identification and mapping of all of the genes involved in repair of genome damage in Drosophila melanogaster and to initiate their functional characterization. The substantial quantity of Drosophila genomic and cDNA sequences obtained to date, in combination with the genetic information available for this organism, provide a powerful base from which to begin a comprehensive description of the DNA repair genes operating in Drosophila. We have now initiated a multi-faceted approach to complete this process, using genetic, molecular and bioinformatic approaches.
We have initiated genetic screens to extend previous, non-saturating screens for mutagen-sensitive mutations. These genetic screens have only just begun, and no new mutagen-sensitive loci have yet been isolated. However, using a combined molecular and genetic approach, we have made some progress in identifying two repair genes previously uncharacterized in Drosophila: the Drosophila homolog of the yeast RAD10 gene (now designated mei-10), and the Drosophila homolog of XPG. We have shown that the fly MEI-10 protein physically interacts with the MEI-9 (dm Rad1) protein by yeast two-hybrid studies. We are now in the process of creating mutants in the mei-10 gene. Preliminary mapping suggests that the mei-10 gene may correspond to the previously identified mus210 locus. We have also obtained strong evidence that the Drosophila XPG homolog corresponds to the mus201 locus. The two extant mus201 alleles display phenotypes expected for an NER defect, but no discernible meiotic defect.
We are also continuing genetic studies of several known repair-deficient loci. Most notably we have identified a null allele of the repair/checkpoint gene mei-41 and demonstrated intermediate levels of repair competency and checkpoint function in heterozygotes for this mutation. We have used this and other such mutants as substrates in screens for dominant enhancer or suppressor mutations. These screens are allowing us to identify mutations that might be lethal or sterile when homozygous and thus be missed in more conventional screens for mutagen-sensitive mutations.
A second approach to identifying genes involved in the Drosophila response to genome damage will involve the use of DNA microarrays. We have recently completed construction of an arraying robot using the design developed by Pat Brown's lab at Stanford, and have initiated the production of arrayed collections of EST-characterized Drosophila cDNAs produced by the Berkeley Drosophila Genome Project. These arrays will be used to identify transcription units whose expression is regulated in response to DNA damage. The genes thus identified will be further characterized molecularly and genetically, and correlated to the extent possible with repair genes identified by other means.
Finally, we will present an up-to-date summary of the Drosophila DNA repair genes identified through genomic and EST sequences obtained to date by the Berkeley Drosophila Genome Project.
143. Ribozyme Gene Delivery for Gene Target Discovery and Functional Validation
Xinqiang Li, Peter J. Welch, Mark C. Leavitt,
Flossie Wong-Staal, and Jack R. Barber
Ribozymes (Rzs) are RNA molecules that can be engineered to cleave and inactivate other RNA molecules in a sequence-specific fashion. Thus, Rzs can be designed to selectively inactivate the expression of any target gene ("gene knockdown") and its corresponding protein. We have used Rz genes, delivered with viral vectors, as a tool for gene functional validation and discovery. We have used hairpin ribozyme gene delivery to rapidly and effectively inhibit expression of a number of viral and cellular genes.
To expedite the process of associating genes with cellular function, we have developed Rz gene vector libraries. Retroviral and adeno-associated viral vectors have been generated that efficiently deliver and express Rz genes whose target recognition sequences have been randomized, generating a library of Rz genes capable of recognizing a total of more than 1x10^7 possible gene target sequences. The library of Rz genes is delivered into large numbers of tissue culture cells (one Rz gene per cell for each Rz gene in the library), followed by selection for individual cells that have lost a particular function. The sequence of the Rz target recognition domain thus identified allows the identification and cloning of genes that are necessary for a given cellular function. As an example of the power of the technology, we will present data demonstrating its use to identify a novel tumor suppressor gene, without prior sequence information.
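The size of a randomized library follows from simple counting, which the sketch below works through. It assumes that each randomized position can be any of the four nucleotides and ignores any fixed positions a hairpin ribozyme may require in its target. The abstract does not give the number of randomized positions, so the function names and inputs are hypothetical.

```python
def library_size(randomized_positions, alphabet_size=4):
    """Number of distinct recognition sequences for a fully randomized stretch."""
    return alphabet_size ** randomized_positions


def min_positions_for(target_sequences, alphabet_size=4):
    """Smallest number of randomized positions giving at least target_sequences."""
    n = 0
    while library_size(n, alphabet_size) < target_sequences:
        n += 1
    return n


if __name__ == "__main__":
    # How many fully randomized positions exceed ten million sequences?
    n = min_positions_for(10 ** 7)
    print(n, library_size(n))
```

Under these assumptions, a complexity above ten million sequences needs at least twelve randomized positions, since eleven positions give only about 4.2 million combinations.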
144. Microfabricated Microfluidic Devices for Proteome Mapping
R.S. Ramsey, R.S. Foote, R.D. Rocklin,
M.I. Lazar, Y. Liu, and J. M. Ramsey
Miniaturized chemical instruments, "Lab-on-a-Chip" technologies, are being developed for rapid, comprehensive analysis of cellular proteins, as an alternative to the slow and labor-intensive 2D gel methods currently used for protein mapping. The microfabricated devices will integrate on a single structure, elements that enable multidimensional separations of protein mixtures and electrospray ionization of the analytes for direct, on-line interfacing with mass spectrometry. The platform exploits the many advantages of Lab-on-a-Chip devices, including small size, inexpensive fabrication, high speed, low volume materials consumption, high throughput, and automated operation. Potential applications of the technology include quantification of gene product levels in specific cell types, comparative analysis of patterns of gene expression in different tissues at different stages of development, analysis of structural and/or expression level changes resulting from mutagenesis or genetic disease, and identification of specific protein markers of disease.
The conventional 2D PAGE method for resolving cellular proteins is not only laborious but also has poor reproducibility, sensitivity, and sample recovery. Individual spots may be identified off-line using mass spectrometry (MS) but sample extraction and transfer processes are inefficient. Column liquid chromatography or capillary electrophoresis (CE), which are more easily coupled with MS using electrospray ionization (ES), in general, lack the resolution required for the analysis of complex biological samples. Two-dimensional separations greatly increase the resolving power, provided the individual methods are orthogonal, and when combined with MS result in a powerful technique, given the multiplicative effect of joining different separation mechanisms. We have designed and demonstrated an integrated device combining micellar electrokinetic chromatography and high-speed free zone electrophoresis. The orthogonality of these techniques, an important factor for maximizing peak capacity or resolution elements, was verified by examining each technique independently for peptide separations. The two dimensional separation strategy was found to greatly increase the resolving power over that obtained for either dimension alone. The integrated device operates by rapidly sampling and analyzing effluent in the second dimension from the first dimension. Second dimension analyses are performed and completed every few seconds. Total analysis times are less than 10 min and the peak capacity has been estimated to be in the 500 to 1000 range. The operation of the device is completely automated. Microchips have also been interfaced to a time-of-flight mass spectrometer that has acquisition rates necessary to capture mass spectra from rapidly eluting components. The electrospray element will eventually be integrated with the two dimensional separations to allow on-chip MS analysis for ultra-high throughput protein mapping. 
Increases in sample throughput are anticipated to be greater than two orders of magnitude as compared to 2D PAGE.
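The multiplicative effect of combining orthogonal separations can be shown with a small calculation. The sketch assumes the ideal product rule, in which the two-dimensional peak capacity is the product of the peak capacities of the two dimensions, and a fixed cycle time for the second dimension. The per-dimension values and the cycle time are hypothetical, chosen only so that the result falls in the 500 to 1000 range quoted above.

```python
def peak_capacity_2d(first_dimension, second_dimension):
    """Ideal 2D peak capacity for fully orthogonal separations (product rule)."""
    return first_dimension * second_dimension


def second_dimension_runs(total_seconds, seconds_per_run):
    """How many second-dimension analyses fit into one first-dimension run."""
    return int(total_seconds // seconds_per_run)


if __name__ == "__main__":
    # Hypothetical values: 50 peaks in dimension one, 15 in dimension two.
    print(peak_capacity_2d(50, 15))
    # Hypothetical timing: a 10 min run sampled every 3 s.
    print(second_dimension_runs(600, 3))
```

Real devices fall short of the ideal product when the two mechanisms are partly correlated or when the first dimension is sampled too slowly, which is why orthogonality was checked for each technique independently.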
145. Using Phage Display in Functional Genomics
Peter Pavlik, Rob Segal, Daniele Sblattero,
Vittorio Verzillo, Roberto Marzari, and Andrew Bradbury
Phage display offers the possibility of selecting polypeptides (and the genes which encode them) from libraries of 1e10 or more different polypeptides on the basis of their abilities to bind target proteins and subdomains. This diversity far surpasses the estimated number of total genes in the human genome. The application of this technology to the Human Genome Project will powerfully accomplish a central goal: the derivation of ligands that recognize protein products of all human genes, such ligands being either antibodies, or protein fragments.
Where the recognition ligands derived from this relatively new technology are antibody binding regions (single chain Fv) they can be employed in the same way as traditional antibodies. As such, they can play essential roles in assigning gene function, including the characterization of spatiotemporal patterns of protein expression and the elucidation of protein-protein interactions. Where the recognition ligands are protein fragments, they can be considered to be potential protein-interaction partners for the immobilized polypeptide and so a starting point for further biochemical studies. This project has concentrated on trying to find a general way to isolate antibodies against gene products, preferably starting from gene sequence and using peptides to avoid the need for cloning and expression. A new method to make phage antibody libraries has been developed and a new large library using this method is presently under construction. A library provided by Jim Marks, UCSF, has been used to select antibodies against a number of cell cycle and DNA repair proteins. We have succeeded in miniaturising selection on proteins to a 96 pin format. Should gene products be available this is a very efficient way to select antibodies in a high throughput format. We have also used scanned peptides (180 in total) derived from five different proteins (ubiquitin, cdk2, human serum albumin, cyclin D, transglutaminase) to select antibodies. Some of the antibodies selected are able to recognise the native protein. We are attempting to derive rules, based on physicochemical characteristics and other predictive algorithms, which predict which peptide sequences will select antibodies recognising the full length protein.
146. One Gene - How Many Proteins?
Raymond F. Gesteland, Chad Nelson,
Mike Giddings, Norma Wills, Jiadong Zhou, Barry Moore, Mike Howard, and
This is a new pilot project to use mass spectrometry methods to determine the multiplicity and character of proteins coming from individual mRNAs. Many processes contribute to the complexity of gene products that come from one gene. In addition to alternate splicing and RNA editing that increase mRNA complexity, protein modification and alternative translation can both expand the population of proteins that come from one mRNA species. Although we know a good deal about protein modifications on a protein by protein basis, we know little in a genome-wide sense. We know even less about the frequency of occurrence of unusual translation events. Alternative translations, or recoding, include programmed frameshifts, bypassing of mRNA regions, and redefinition of stop codons to encode one of the twenty amino acids or selenocysteine, the 21st amino acid.
We are developing technology to ask how many different protein products come from each mRNA species. We are using Electrospray Liquid Chromatography Mass Spectrometry. With a genome of known sequence, such as yeast, we can fractionate proteins and by accurately determining their masses, see if they can be accounted for by predicted molecular weights from the known open reading frames. Identification with genes can be verified by mass analysis of tryptic digests. If initial molecular weights do not conform to any known ORF, alternate origins must be considered. Protein modifications will add predictable masses up to a few hundred Daltons, and again confirmation can be obtained by analysis of tryptic peptides. Recoding events such as frameshifting or bypassing will often result in more drastic changes in mass. Tryptic peptide analysis will identify the genome origin from which a limited number of possible masses due to recoding events can be predicted. Again, analysis of tryptic peptides should allow identification of the specific recoding event.
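The mass-matching logic described above can be sketched as follows. This is not the project's software: the function name, the ORF names, the mass tolerance and the short table of modifications are assumptions made for illustration. The mass shifts are approximate average values for common modifications. A measured mass is first compared with the predicted ORF masses, then with each predicted mass plus a modification, and an empty result signals that alternate origins such as recoding should be considered.

```python
# Approximate average mass shifts in Daltons (illustrative subset only).
MODIFICATION_SHIFTS = {
    "phosphorylation": 79.98,
    "acetylation": 42.04,
    "methylation": 14.03,
}


def explain_mass(measured, predicted_orf_masses, tolerance=1.0):
    """List (orf, modification) pairs consistent with a measured protein mass.

    modification is None when the unmodified predicted mass matches.
    An empty list means no ORF explains the mass within the tolerance.
    """
    hits = []
    for orf, mass in predicted_orf_masses.items():
        if abs(measured - mass) <= tolerance:
            hits.append((orf, None))
        for name, shift in MODIFICATION_SHIFTS.items():
            if abs(measured - (mass + shift)) <= tolerance:
                hits.append((orf, name))
    return hits


if __name__ == "__main__":
    # Hypothetical predicted masses for two open reading frames.
    predicted = {"ORF_A": 12000.0, "ORF_B": 15500.0}
    for observed in (15500.3, 12079.9, 20000.0):
        print(observed, explain_mass(observed, predicted))
```

In practice each candidate assignment would still need confirmation by mass analysis of tryptic peptides, as the text notes.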
We are initially analyzing mitochondria of the yeast Saccharomyces cerevisiae since this will limit complexity to a fraction of the whole yeast genome - perhaps 500 genes out of 6,500. We are also pursuing tagging methods that are suited for examining one gene at a time and that will be more suited for analysis of the complexity of proteins coming from human genes. From this approach we hope to define the real complexity of the genome products.
147. ASDB: Database of Alternatively Spliced Genes
M. S. Gelfand, I. Dubchak, I. Dralyuk,
and M. Zorn
Alternative splicing is an important regulatory mechanism in higher eukaryotes (1). By recent estimates, at least 30% of human genes are spliced alternatively (Mironov, A.A. and Gelfand, M.S. Proc. 1st Int. Conf. on Bioinformatics of Genome Regulation, 1998. v. 2, p. 249). Alternative splicing plays a major role in sex determination in Drosophila, antibody response in humans and other tissue or developmental stage specific processes (2-5). The database of alternatively spliced genes can be of potential use for molecular biologists studying splicing, developmental biologists, geneticists, and cell biologists. Version 1.1 of ASDB contains information about protein products of alternatively spliced genes. Selecting all SwissProt entries containing the words "alternative splicing" has generated 1663 proteins. Then clusters of proteins that could arise by alternative splicing of the same gene were created. Two proteins from the same species belong to a cluster if they have common fragments not shorter than 20 amino acids. Each cluster is represented in the database by the multiple global alignment of its members, allowing for easy identification of regions produced by alternative splicing. The database contains 241 clusters with more than one member. The database can be searched using Medline, SwissProt, and GenBank identifiers and accession numbers. Standard context search can be performed over SwissProt keyword, description, taxonomy, and comment fields and feature tables. ASDB contains internal links between entries and/or clusters, as well as external links to Medline, GenBank and SwissProt entries. Next steps of ASDB development will be incorporation of DNA data, classification of main types of alternative splicing, incorporation of data on aberrant splicing and splicing mutations. Automated processing of existing databases with minimum manual curation produced the current version of the database.
In the future we plan to add manual curation of the database, including addition of splicing variants described in the literature but not annotated in GenBank.
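The clustering rule stated above (two proteins from the same species share a fragment of at least 20 amino acids) can be sketched with a fragment index and a union-find structure. This is an illustrative reconstruction, not the ASDB code. It assumes that cluster membership is transitive, so that A and C land in one cluster when each shares a fragment with B. The function name and the toy sequences are hypothetical.

```python
from collections import defaultdict


def cluster_proteins(proteins, min_len=20):
    """Group proteins of one species that share a fragment of >= min_len residues.

    proteins: list of (name, species, sequence) tuples.
    Sharing a fragment of at least min_len residues is equivalent to
    sharing at least one exact substring of length min_len.
    """
    parent = list(range(len(proteins)))

    def find(i):
        # Find the cluster representative, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # (species, fragment) -> index of first protein with it
    for idx, (_, species, seq) in enumerate(proteins):
        for start in range(len(seq) - min_len + 1):
            key = (species, seq[start:start + min_len])
            if key in first_seen:
                parent[find(idx)] = find(first_seen[key])
            else:
                first_seen[key] = idx

    clusters = defaultdict(list)
    for idx, (name, _, _) in enumerate(proteins):
        clusters[find(idx)].append(name)
    return sorted(clusters.values())


if __name__ == "__main__":
    shared = "ACDEFGHIKLMNPQRSTVWY"  # toy 20-residue fragment
    toy = [
        ("iso1", "human", shared + "AAAA"),
        ("iso2", "human", "GGGG" + shared),
        ("mouse1", "mouse", shared + "CC"),
        ("short", "human", "MKT"),
    ]
    print(cluster_proteins(toy))
```

Keying the index on species as well as fragment keeps orthologs from different organisms in separate clusters, matching the same-species condition in the text.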
1. Sharp, P.A. (1994) Cell. 77,
148. Prediction of Protein Structural Domains
Robert Miller1, Winston A.
and David C. Torney2
New challenges of sequence analysis have arisen with the advent of functional genomics. In particular, there is a premium on being able to make good use of small collections of example sequences, of known function, for classifying and predicting the functions of new sequences. Established techniques of classification have thus far not performed as well as needed, even with relatively abundant data, as in the case of exon prediction. We have therefore developed new example-based Bayesian statistical techniques for classification. These approaches can use conserved sequence motifs when these are present, but such overt similarities are not required because our techniques capture and employ all the statistical properties exhibited by a collection of example sequences. Thus, the likelihood for any sequence being a member of a given functional class is derived based on examples from the class. As many classes of structurally or functionally related biological sequences have only a relatively small number of examples, the prior specification of "what the statistical properties of a class might comprise" is critical. Our techniques include judicious choices for this prior, using insights about the statistical and physical properties of the sequences. One promising application of our techniques is the development of automatic clustering methods for use with a class of sequences. This will enable the discovery of heterogeneity within a class, improving the ability to predict class membership and deriving new classes.
To establish and refine our techniques, as well as provide the basis for predicting structural and functional aspects of new protein sequences, we created datasets of sequence-dissimilar examples of known secondary structures, using DSSP applied to Brookhaven PDB files. We obtained 64,775 residues of alpha-helix, 47,304 residues of beta-sheet, and 45,549 residues of coil, exhibiting recognized structural features such as helix capping mechanisms. The application of our techniques classifies regions of novel protein sequences into these three categories. We will report the details of the implementation and performance, making comparisons with established approaches. Data may be submitted for analysis by our methods via the World Wide Web (http://www.sanbi.ac.za/karoo). Supported by the U.S. D.O.E. Office of Biological and Environmental Research under contract W-7405-ENG-36.
149. Rapid and Sensitive Characterization of Proteomes; an Adjunct to the Genome
Richard D. Smith, Ljiljana Pasa
Tolic, Mary S. Lipton, Pamela K. Jensen, Gordon A. Anderson, and James
In contrast to an organism's virtually static and well defined genome, the proteome continually changes in response to external and internal events. The patterns of gene expression, protein post-translational modifications, covalent and non-covalent associations, and how these may be affected by changes in the environment, cannot be accurately predicted from DNA sequences. In addition, direct protein measurements now constitute the most effective method for determining open reading frames for small proteins. Therefore, proteome characterization is increasingly viewed as a necessary complement to complete sequencing of the genome. Approaches for proteome characterization are increasingly based upon mass spectrometric analysis of in-gel digested electrophoretically separated proteins, allowing relatively rapid protein identification compared to conventional approaches. However, this technique remains constrained by the speed of the 2-D gel separations, the sensitivity needed for protein visualization, and the speed and sensitivity of subsequent mass spectrometric analyses for identification.
Our objective is to circumvent the limitations of this approach by directly characterizing the cell's polypeptide constituents, combining the speed of capillary isoelectric focusing (CIEF) with the mass accuracy and sensitivity obtainable with Fourier transform ion cyclotron resonance (FTICR) mass spectrometry. CIEF-FTICR MS studies require sample sizes orders of magnitude smaller than those required by 2-D PAGE technology, and initial efforts have demonstrated sensitivities well into the attomole range, as well as the potential for further significant improvements. A key attraction of FTICR is the enhanced facility for protein identification based upon the use of genome sequence data. Isotopically depleted growth media allow highly accurate molecular mass determinations for larger proteins than otherwise possible, and further improve achievable sensitivity and detection limits. We describe our efforts aimed at developing on-line CIEF-FTICR techniques, their comparison with conventional methodologies, and their initial application to several prokaryotes for which complete genome sequences are available. We will also describe new approaches for the determination of precise expression levels for large numbers of proteins in the same measurement.
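Identifying a protein from an accurate mass and genome sequence data can be sketched as a lookup: compute the theoretical mass of each predicted open reading frame and keep those within a tolerance of the measured value. The sketch below is an illustration under stated assumptions, not the authors' procedure. It uses standard monoisotopic residue masses, assumes an unmodified full-length protein (no post-translational modifications or N-terminal processing), and takes an already deconvoluted neutral mass as input. The function names and the 10 ppm tolerance are hypothetical.

```python
# Standard monoisotopic residue masses in daltons.
MONOISOTOPIC = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added for the free N- and C-termini


def monoisotopic_mass(seq):
    """Neutral monoisotopic mass of an unmodified polypeptide."""
    return sum(MONOISOTOPIC[r] for r in seq) + WATER


def match_mass(measured, orfs, tol_ppm=10.0):
    """Return (name, theoretical_mass, error_ppm) for ORFs within tolerance.

    `orfs` maps an ORF name to its translated sequence. `tol_ppm` is a
    hypothetical tolerance chosen for this sketch. Hits are sorted by
    absolute mass error, best match first.
    """
    hits = []
    for name, seq in orfs.items():
        theoretical = monoisotopic_mass(seq)
        error_ppm = (measured - theoretical) / theoretical * 1e6
        if abs(error_ppm) <= tol_ppm:
            hits.append((name, theoretical, error_ppm))
    return sorted(hits, key=lambda hit: abs(hit[2]))
```

Monoisotopic masses are used here because isotopic depletion concentrates the signal in the monoisotopic peak. The tighter the mass tolerance an instrument supports, the fewer candidate ORFs survive the filter.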
We thank the Office of Biological and Environmental Research, U.S. Department of Energy, for support of this research under contract DE-AC06-76RLO 1830.