Bioinformatics and Computational Biology
Predicting Genes in Prokaryotic Genomes: Are Atypical Genes Derived from Lateral Gene Transfer?
Department of Biology, Emory University, Atlanta, Georgia; and Schools of Biology and Biomedical Engineering, Georgia Institute of Technology, Atlanta, Georgia
Algorithmic methods for gene prediction have been developed and successfully applied to many different prokaryotic genome sequences. As the set of genes in a particular genome is not homogeneous with respect to DNA sequence composition features, the GeneMark.hmm program utilizes two Markov models representing distinct classes of protein coding genes denoted typical and atypical. Atypical genes are those whose DNA features deviate significantly from those classified as typical and they represent approximately 10% of any given genome. In addition to the inherent interest of more accurately predicting genes, the atypical status of these genes may also reflect their separate evolutionary ancestry from other genes in that genome. We hypothesize that atypical genes largely comprise genes that have been relatively recently acquired through lateral gene transfer (LGT). If so, what fraction of atypical genes are such bona fide LGTs? We have made atypical gene predictions for all fully completed prokaryotic genomes and have been able to compare these results to other surrogate methods of LGT prediction. In order to validate the use of atypical genes for LGT detection, we are building a bioinformatic analysis pipeline to rigorously test each of the gene candidates within an explicit phylogenetic framework. This process starts with gene predictions and ends with a phylogenetic reconstruction of each candidate. From the set of bona fide LGTs that we have identified, we will be able to determine the LGT parameters to which our gene finding programs are most sensitive (e.g., time scale of transfers and phylogenetic distance from transfer source). We are utilizing this pipeline to estimate the extent and pattern of LGT in a selection of genomes, both complete and nearly complete, with the long term goal of analyzing all such sequences.
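The typical/atypical distinction can be illustrated with a minimal sketch. It assumes two first-order Markov models trained on example sequences and assigns a candidate gene to whichever model gives the higher log-likelihood. This is a deliberate simplification and not the GeneMark.hmm implementation; the function names and the toy training sequences are hypothetical.

```python
import math

ALPHABET = "ACGT"

def train_markov(seqs, pseudocount=1.0):
    """Estimate first-order transition log-probabilities from DNA sequences."""
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for s in seqs:
        for a, b in zip(s, s[1:]):
            counts[a][b] += 1
    model = {}
    for a in ALPHABET:
        total = sum(counts[a].values())
        model[a] = {b: math.log(counts[a][b] / total) for b in ALPHABET}
    return model

def log_likelihood(seq, model):
    """Sum of transition log-probabilities along the sequence."""
    return sum(model[a][b] for a, b in zip(seq, seq[1:]))

def classify(seq, typical, atypical):
    """Label a sequence by the model under which it is more likely."""
    if log_likelihood(seq, atypical) > log_likelihood(seq, typical):
        return "atypical"
    return "typical"

# Toy training data (hypothetical): GC-rich "typical", AT-rich "atypical".
typical_model = train_markov(["GCGCGGCCGCGC", "CCGGCGCGCC"])
atypical_model = train_markov(["ATATTAATAT", "TTAATATAAT"])
```

With these toy models, `classify("ATTATAAT", typical_model, atypical_model)` yields the atypical label, while a GC-rich sequence such as `"GCGCCGCG"` is labeled typical.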
VISTA Comparative Genomics at LBNL
Genome Sciences Department, Lawrence Berkeley National Laboratory; and University of California, Berkeley
The VISTA Web server (http://www-gsd.lbl.gov/vista) is an integrated set of software tools for comparing two or more genomic sequences. The server consists of two autonomous modules: one for alignment of long genomic sequences, and one for the visualization and identification of conserved elements (Dubchak et al., 2000; Mayor et al., 2000). The VISTA server currently uses AVID, a global alignment program (Bray et al., 2003) that works by first finding maximal exact matches between two sequences using a suffix tree, and then recursively identifying the best anchor points based on the length of the exact matches and the similarity in their flanking regions.
High quality draft human and mouse genomic sequences have been aligned using a computational strategy where mouse sequence contigs are anchored on the human genome by local alignment matches and then globally extended (Couronne et al., 2003). Alignments on the whole-genome scale can be visualized using an interactive tool, the Vista Genome Browser, accessible at the gateway web site http://pipeline.lbl.gov. The Vista Genome Browser is an applet that displays the results of comparative sequence analysis in VISTA format on the scale of whole chromosomes.
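The first step described above, finding maximal exact matches, can be sketched as follows. This is only an illustration: it swaps the suffix tree for a simple k-mer index with left and right extension, and it does not attempt the recursive anchor selection. The function name and the `min_len` parameter are hypothetical and are not part of AVID.

```python
from collections import defaultdict

def maximal_exact_matches(a, b, min_len=4):
    """Return (start_a, start_b, length) for maximal exact matches >= min_len.

    Uses a k-mer index on `a` (k = min_len) rather than a suffix tree.
    Results are sorted by decreasing length.
    """
    index = defaultdict(list)
    for i in range(len(a) - min_len + 1):
        index[a[i:i + min_len]].append(i)

    matches = set()
    for j in range(len(b) - min_len + 1):
        for i in index.get(b[j:j + min_len], ()):
            # Extend the seed to the left while characters agree.
            si, sj = i, j
            while si > 0 and sj > 0 and a[si - 1] == b[sj - 1]:
                si -= 1
                sj -= 1
            # Extend the seed to the right while characters agree.
            ei, ej = i + min_len, j + min_len
            while ei < len(a) and ej < len(b) and a[ei] == b[ej]:
                ei += 1
                ej += 1
            matches.add((si, sj, ei - si))
    return sorted(matches, key=lambda m: (-m[2], m))
```

An anchoring procedure in the spirit of the description would then rank these matches by length and flanking similarity; that ranking is left out of this sketch.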
The computational strategy of anchoring sequence contigs from one species onto a base genome sequence assembly of a second species by local alignment matches and then globally aligning these contigs to candidate regions is also implemented for user-submitted sequences at another VISTA server http://pipeline.lbl.gov/cgi-bin/GenomeVista. This server assists in finding candidate orthologous regions for a submitted sequence from any species on either the human or mouse genome sequence assembly, and provides detailed comparative analysis.
Bray, N., Dubchak, I., and Pachter, L. (2003) AVID: A Global Alignment Program. Genome Res., 13: 97
Couronne, O., Poliakov, A., Bray, N., Ishkhanov, T., Ryaboy, D., Rubin, E., Pachter, L., Dubchak, I. (2003) Strategies and Tools for Whole Genome Alignments. Genome Res., 13: 73
Mayor, C., Brudno, M., Schwartz, J.R., Poliakov, A., Rubin, E.M., Frazer, K.A., Pachter, L.S., Dubchak, I. (2000) VISTA: Visualizing global DNA sequence alignments of arbitrary length. Bioinformatics, 16: 1046
Dubchak, I., Brudno, M., Pachter, L.S., Loots, G.G., Mayor, C., Rubin, E.M., Frazer, K.A. (2000) Active conservation of noncoding sequences revealed by 3-way species comparisons. Genome Res., 10: 1304
Cell Cycle Regulation Model Construction Using Trainable Neural Networks
Institute for Genomics and Bioinformatics, University of California, Irvine; and Caltech Division of Biology
We use trainable neural network models to combine the yeast ChIP-chip transcription factor binding data of Lee et al. with the yeast cell cycle microarray expression data of Cho et al., arriving at hypotheses for a small core network involved in transcriptional regulation in the cell cycle. The stages of analysis may be outlined as (a) finding a reliable clustering of the expression time course data, (b) finding a robust set of genes whose expression class is predictable based on binding data, (c) finding a robust set of regulators most involved in this prediction for each class, and (d) optimizing a small, trainable neural network model of transcriptional regulation using the foregoing steps. The number of free parameters of the model is almost as great as the amount of data available to constrain it, so the method is near the boundary of current feasibility. However, network structures emerge robustly and a modification to the form of the assumed dynamics can be proposed based on the fits to existing data.
Fast Alignment & Analysis of Multiple Genomes
Southwest Parallel Software Thoughtware
We present a program which uses the sensitive Smith-Waterman alignment algorithm, but which is faster than BLAST, to align multiple genomes. We also present a novel viewer which allows the user to select and view multiple genome alignments of highly-conserved regions. We show performance comparisons between BLAST and our high-speed Smith-Waterman core.
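For readers unfamiliar with the algorithm named above, a minimal reference sketch of Smith-Waterman local alignment with a linear gap penalty follows. It is a plain textbook version for illustration, not the high-speed core described in the abstract; the scoring values are arbitrary assumptions.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the dynamic programming matrix; scores never drop below zero.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

The quadratic time and memory cost of this straightforward form is what makes optimized implementations of the kind described here attractive for whole-genome work.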
Engineering Tools to Characterize the Coding Regions of the Genome
DOE Joint Genome Institute, Walnut Creek, CA 94598
With genome sequencing efforts producing vast amounts of data, attention is now turning towards unraveling the complexities encoded in the genome: the protein products and the cis-regulatory sequences that govern their expression. Understanding the spatial and temporal patterns of protein expression as well as their functional characteristics on a genomic scale will foster a better understanding of biological processes from protein pathways to development at a systems level. Currently, the main bottlenecks in many proteomics initiatives, such as the development of protein microarrays, remain the production of sufficient quantities of purified protein and affinity molecules or probes that specifically recognize them. Methods that facilitate the production of proteins and high affinity probes in a high-throughput manner are vital to the success of these initiatives. We have developed a system for high-throughput subcloning, protein expression and purification that is simple, fast and inexpensive. We utilized ligation-independent cloning with a custom-designed vector and developed an expression screen to test multiple parameters for optimal protein production in E. coli. A 96-well format purification protocol was also developed that produced microgram quantities of pure protein. These proteins were used to optimize SELEX (Systematic Evolution of Ligands by Exponential Enrichment) protocols that use a library of DNA oligonucleotides containing a degenerate 40-mer sequence to identify single-stranded DNA molecules (aptamers) that bind their target protein specifically and with high affinity (low nanomolar range). Aptamers offer advantages over traditional antibody-based affinity molecules in their ease of production, regeneration, and stability, largely due to the chemical properties of DNA versus proteins.
These aptamers were characterized by surface plasmon resonance (SPR) and were shown to be useful in a number of assays, such as western blots, enzyme-linked assays, and affinity purification of native proteins.
This work was performed under the auspices of the U.S. Department of Energy, Office of Biological and Environmental Research, by the University of California, under Contracts No. W-7405-Eng-48, No. DE-AC03-76SF00098, and No. W-7405-ENG-36.
Computational Analysis of Gene Deserts in the Human Genome
Life Sciences Division, Lawrence Berkeley National Laboratory, MS 84-171, Berkeley, CA 94720; Genomics Division, Lawrence Livermore National Laboratory, 7000 East Ave., Livermore CA 94550; and DOE Joint Genome Institute, 2800 Mitchell Drive, Walnut Creek, CA 94598
The sequencing and annotation of the human genome revealed a non-uniform gene distribution across different chromosomes, implying the existence of vast genomic segments devoid of identifiable genes, or gene deserts. As an initial step to characterize these noncoding sequences and derive insights into their evolution and possible biological function, we computationally identified all the gene deserts present in the human genome and compared them with other average-sized intergenic regions as well as with homologous genomic segments from the mouse and the pufferfish Fugu rubripes, organisms whose genomes have recently been sequenced to completion. Our analysis revealed that gene deserts correspond to ~10% of the total size of the human genome, ranging in size between 600kb and ~3Mb. Using various computational approaches we compared the density of repetitive elements, GC content, SNP density and human-mouse conservation between gene deserts and non-desert intergenic regions. We observed a wide distribution for each measured parameter among different gene deserts, without a distinct signature shared by the majority of these noncoding regions. On average, we found the repetitive elements in gene deserts to be younger than those in any other genomic fraction that we analyzed, possibly reflecting a higher incidence of deletions that wipe out older repeats in gene deserts than in other parts of the genome. The vast majority of human gene deserts are represented by corresponding gene deserts in the mouse genome and about half of these gene deserts carry sequences that are conserved in the Fugu rubripes genome. Interestingly, genes involved in various aspects of embryonic development flank most of the gene deserts containing sequences conserved in Fugu, suggesting that these regions are embedded with transcriptional regulatory elements.
Here, we will depict a functional gene desert, carrying several gene regulatory elements that could only be identified by comparing human, mouse and fugu sequences. We also outline an ongoing strategy for generating genetically engineered mice carrying deletions of gene deserts that will test the in vivo function of these large segments of noncoding DNA.
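The computational identification of gene deserts can be sketched as a scan for long gaps between consecutive annotated genes. The abstract does not state the exact criterion used, so the 600 kb cutoff below is an assumption taken from the lower end of the reported size range, and the function name and coordinate convention (0-based, half-open) are hypothetical.

```python
def find_gene_deserts(genes, min_len=600_000):
    """Return (start, end) intervals between consecutive genes that are
    at least `min_len` bases long.

    `genes` is a list of (start, end) coordinates on one chromosome.
    Overlapping or nested genes are handled by tracking the furthest
    gene end seen so far.
    """
    deserts = []
    prev_end = None
    for start, end in sorted(genes):
        if prev_end is not None and start - prev_end >= min_len:
            deserts.append((prev_end, start))
        prev_end = end if prev_end is None else max(prev_end, end)
    return deserts
```

Each interval returned this way could then be profiled for repeat density, GC content, SNP density, and cross-species conservation, as in the comparison described above.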
Decoding Transcriptional Regulation in the Human Genome
Life Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720; International Computer Science Institute, 1947 Center St., Suite 600, Berkeley, CA 94704; Department of Biochemistry, B400 Beckman Center, Stanford University, CA 94305; and DOE Joint Genome Institute, 2800 Mitchell Drive, Walnut Creek, CA 94598
Transcriptional gene regulation in the human genome is complex in nature. Spatial and temporal gene expression patterns are defined by a combinatorial interplay of several transcription factors binding to the promoter region of a gene. While microarray experiments have unveiled tissue- and condition-specific patterns of gene expression for many genes, there is still a lack of knowledge of the underlying sequence motifs that induce the observed gene expression patterns. We present a novel method for detecting cis-regulatory modules in the promoter regions of human genes using genome-scale alignment of the human and mouse genomes. Initially, transcription factor binding sites (TFBS) were identified in the promoters of all annotated RefSeq human transcripts based on more than 400 TFBS profiles catalogued in the TRANSFAC database. From an overwhelming number of predicted TFBS, the majority of which are false positives, we extracted only those that are aligned and conserved in human and mouse, using the rVista tool (http://nemo.lbl.gov/rvista/index.html). New statistical measures were developed for pinpointing TFBS that are enriched in the promoters of a group of genes of interest compared to the background set. A novel hashing algorithm and appropriate statistical tests were devised to identify groups of TFBS that tend to co-occur in the promoters of interest. We applied our method to find regulatory modules related to cell cycle and stress response. On the cell cycle data our algorithm identified several relevant TFs, including E2F, and seven cis-regulatory modules that are statistically significant. The sets of genes containing each of the modules were verified by checking for coherence of their expression patterns. Roughly half of the identified sets of genes were found to be significantly coherently expressed. On the stress response data about half of the detected gene sets fell predominantly into well-defined functional sub-categories.
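The abstract does not specify its enrichment statistics, so the sketch below substitutes a standard one-sided hypergeometric test as an illustration of how enrichment of a TFBS in a gene group relative to a background set might be scored. The function name and argument names are hypothetical.

```python
from math import comb

def tfbs_enrichment_pvalue(k, K, n, N):
    """One-sided hypergeometric p-value, P(X >= k).

    N: promoters in the background set
    K: background promoters containing the TFBS
    n: promoters in the group of interest (drawn from the background)
    k: promoters in the group of interest containing the TFBS
    """
    upper = min(K, n)
    total = comb(N, n)
    # comb() returns 0 when more items are requested than are available,
    # so impossible terms drop out of the sum.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total
```

A small p-value suggests the site occurs in the group more often than expected by chance; when many TFBS profiles are tested, as with the several hundred profiles mentioned above, some correction for multiple testing would normally be applied.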
Mining the Frequency Distribution of Transcription Factor (TF) Binding Sites in Promoters of Suppressed and Enhanced Genes During Human Adaptive Response to Ionizing Radiation
Biology & Biotechnology Research Program, L-452, Lawrence Livermore National Laboratory, Livermore CA, 94551; Department of Medicine, Baylor College of Medicine, Houston, TX. 77030; and San Diego Supercomputer Center, University of California, San Diego, La Jolla, CA 92093
Motivation: Through the Human Genome Sequencing Project a wealth of information has been gained at the nucleotide level. With the advent of DNA-based microarrays the amount of data available for interpretation is quickly becoming daunting. A starting point for discovery is to better link genomic biology approaches with bioinformatics to identify and characterize eukaryotic promoters. For example, microarray experiments use various cluster analysis algorithms to identify genes that share similar expression profiles and are therefore predicted to be co-regulated as part of an interactive biochemical pathway. Further identification and characterization of the DNA consensus sequences and regulatory elements that control the responsive genes could provide a valuable understanding of the genetic and biochemical mechanisms of the cell and should provide powerful biological indicators of genetic susceptibilities for tissue and genetic damage. To clearly identify these co-regulated groups of genes, we describe scalable computational workflow approaches that use web-based molecular biology tools and schemas for carrying out a variety of tasks such as hierarchical clustering, comparison of DNA sequences, identification of transcription factor binding sites, comparison of clustered promoters, and visualization of compiled data.
Results: We start an example workflow using cluster analysis of Affymetrix U95A array results for human adaptive response to ionizing radiation. Replicate gene expression data were based on lymphoblastoid cell lines derived from a radiosensitive non-adapter and two adapters. Exon mapping was performed to extract promoters (3kb upstream) for a cluster of mRNAs enhanced in only the non-adapter and a cluster of mRNAs enhanced in only the adapters. A search for transcription factor (TF) binding sites in the extracted promoters was then run, resulting in a frequency distribution of all identified TF binding sites across all promoters. We then clustered the promoters based on frequency of TF binding sites. The resulting cluster image display shows a discernible pattern in TF binding site frequency that can be further mined for relevance to regulatory control of expression. There are many other possible ways to use microarray data for exploring relationships between co-expressed genes within and between species in order to infer co-regulation, and some of these will be discussed.
A Scalable Visual Data Analysis Pipeline Framework Supporting Large-Scale Bioinformatics Research
Computer Science and Engineering Department, University of Connecticut, Storrs, CT 06269; and CyberConnect EZ, LLC, Storrs, CT 06268
One key challenge in supporting large scale bioinformatics research is developing a computational environment in which scientists can easily use various types of bioinformatics resources available in diversified platforms and locations. Such resources include databases, data available through third party web sites, files downloadable from ftp sites, analysis programs, format conversion routines, commonly usable scripts, high performance computing facilities, etc. A novel software framework has been developed that effectively harnesses all these diversified resources for ease of use by bioinformaticists. This framework establishes a clear division of labor between the support core, whose primary function is to develop and maintain computational resources, and the scientists who use such resources to conduct various bioinformatics analysis tasks. The support core configures the environment by interlinking various available resources. The scientists reap the benefits of the support core's resource integration. This framework relies on a distributed architecture allowing computational resources to be scattered around LAN, WAN and Internet. This framework also liberally adopts a visual iconic solution in user interface design, and this easy-to-use feature has great potential for significantly improving scientists' research productivity.
This novel software framework has been deployed and is currently in use for supporting a large scale bioinformatics activity. It is being used to develop a set of semi-automated pipelines designed to aid scientists in selecting non-redundant clones from the NIH cDNAs consortia library. The analysis steps of this pipeline include use of the publicly available EST repository, dbEST, elimination of redundant EST sequences by using pre-built UniGene information, conducting pair-wise Blast sequence comparisons, and finally use of non-redundant clone selection heuristics.
Another use of the framework is to do microarray data analyses. A series of modular pipelines have been developed. One module addresses the quality control issue using global and intensity methods. One module is responsible for normalization of raw data produced from image analysis. Several modules are responsible for conducting expression level analysis including clustering and promoter analyses. The first two modules encapsulate use of conventional statistical packages in the pipeline. The expression level analysis modules include use of public/commercially available microarray data analysis packages as well as custom developed visualization programs.
We have also tested the feasibility of using the framework for high speed sequence assembly work at JGI. This pipeline module includes use of RepeatMasker and Blast (running on a Linux cluster using MPI) in tandem. In an attempt to further demonstrate the scalability of this visual pipeline framework, we are planning a closer collaboration with JGI and LLNL in which LLNL's supercomputers are fully accessed through the easy-to-use visual analysis pipeline interface.
This work was supported in part by DOE SBIR Phase II Grant No. DE-FG02-99ER82773.
JGI Human Chromosome 19 Annotation
Astrid Terry, Laurie Gordon, Ivan Ovcharenko, Andrea Aerts, Uffe Helsten, Wayne Huang, Isaac Ho, Victor Solovyev, Duncan Scott, Steve Lowry, Olivier Couronne, Sam Rash, Paramvir Dehal, Inna Dubchak, Lisa Stubbs, and Dan Rokhsar
Computational Genomics, DOE Joint Genome Institute, 2800 Mitchell Dr, B400, Walnut Creek, CA 94598
The JGI has been mandated to finish and annotate human Chrs 5, 16, and 19. The final finished sequence for Chromosome 19 (Chr 19) was received in the middle of February and an automated pipeline for generating annotation has been developed. Gene models are built in many different ways: using experimentally known human mRNAs, EST/protein-seeded GeneWise, GenomeScan, FgenesH and GrailEXP. Annotators choose the best automated models using a hierarchy of evidence. Additionally, finished sequence is reviewed for evidence of any single base indels (we have corrected multiple finishing errors this way) and 5'/3' UTRs are extended by spliced ESTs or compatible redundant mRNAs. The syntenic mouse mRNA libraries augment the human mRNA libraries, using human exons when available. Alternative splicing is only reported if supported by at least nearly complete mRNA. High quality evidence is manually reviewed if no models can be created in the automated system. So far there are roughly 100 problematic loci, with various issues. Some of the challenges on Chr 19 include large gene families with known gene structure lacking extensive human mRNA/EST evidence and tandemly duplicated genes. Using expected gene structure and known properties, custom models are built for genes and pseudogenes, which are common in gene families. Web based interfaces allow annotators to view a predicted peptide's properties and aid in putative function assignment based on pre-computed alignments of homology and domains. Using Chr 19 as a model system, the other chromosomes are expected to follow shortly.
Request Handling Web Application Using JAVA Struts: Separation of Presentation and Transaction/Data Layer
DOE Joint Genome Institute, Walnut Creek, CA 94598
The DOE Joint Genome Institute (JGI) is currently soliciting requests to sequence genomic regions of strong scientific value. To qualify for this program, the sequencing regions need to be contained in individual BACs, cosmids, or fosmids (from any organism). The applicant must also provide the clone for sequencing; however, JGI may screen available libraries to identify appropriate clones. The goal of this program is to focus on issues requiring long stretches of genomic sequence, and not to sequence small DNA fragments. The reviews will be conducted every two months by a panel of biologists who are very familiar with the DOE's missions and research programs. All approved regions will be listed on the status page. All sequence reads will be generated using either a MegaBACE or an ABI3730 instrument. The raw/assembled sequences and analysis results will be provided on our ftp site.
Requests can be submitted through our web application, which is developed with the Jakarta Struts framework. Struts provides an open-source framework for creating web applications that easily separate the presentation layer and allow it to be abstracted from the transaction/data layers. Here we present a step-by-step illustration of the request handling web application.
Target Selection in Ciona Whole Genome Enhancer Screening: Algorithm and Visualization
DOE Joint Genome Institute, Walnut Creek, CA 94598; and Department of Molecular and Cellular Biology, University of California at Berkeley, Berkeley, CA 94720
To characterize gene regulatory networks, we used electroporation assays to screen genomic DNA fragments for tissue specific enhancer activities in Ciona intestinalis. The Ciona genome is one of the smallest of all chordate genomes and the Ciona tadpole represents the most simplified chordate body plan.
We designed a methodology for selecting targets for Ciona enhancer screening. In this computational approach, the forward and reverse sequencing reads are connected to create virtual clone sequences. This large pool of virtual clone sequences is then used to generate a blastable database. BLAST analysis aligns these clone sequences to large genomic DNA segments. We designed the algorithm to select the minimum tiling path clones to cover these genomic segments. Gaps are considered and artificial clones for filling gaps are suggested by the algorithm. A web application is set up for running the selection behind the scenes. An index page is automatically updated once the tiling path clones are calculated. The web application is also set up for checking clone coverage, displaying tiling path clones graphically, proposing gap-filling clone sequences, and dumping the selected tiling path clones into the next step of the workflow system.
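The tiling path selection can be sketched as a greedy interval cover over clone alignments. This is an illustration under stated assumptions and not the algorithm used in the screen: clones are taken as already aligned to a segment with half-open coordinates, and the function name and tuple layout are hypothetical.

```python
def minimum_tiling_path(clones, seg_start, seg_end):
    """Greedily choose clones covering [seg_start, seg_end).

    `clones` is a list of (name, start, end) alignments on the segment.
    Returns (path, gaps): the chosen clone names in order, and the
    uncovered intervals where a gap-filling clone would be needed.
    """
    clones = sorted(clones, key=lambda c: c[1])
    path, gaps = [], []
    pos, i, n = seg_start, 0, len(clones)
    while pos < seg_end:
        # Among clones starting at or before `pos`, keep the one that
        # reaches furthest to the right.
        best = None
        while i < n and clones[i][1] <= pos:
            if best is None or clones[i][2] > best[2]:
                best = clones[i]
            i += 1
        if best is not None and best[2] > pos:
            path.append(best[0])
            pos = best[2]
        else:
            # No clone covers `pos`: record a gap up to the next clone
            # start, or to the end of the segment if none remain.
            next_start = clones[i][1] if i < n else seg_end
            next_start = min(next_start, seg_end)
            gaps.append((pos, next_start))
            pos = next_start
    return path, gaps
```

For example, with clones `c1` (0 to 40), `c2` (10 to 60), `c3` (30 to 90), `c4` (50 to 70) and `c5` (120 to 200) on a segment of length 200, the sketch selects `c1`, `c3` and `c5` and reports the interval 90 to 120 as a gap.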
The Commercial Viability of EXCAVATOR: A Software Tool for Gene Expression Data Clustering
ApoCom Genomics and Oak Ridge National Laboratory
ApoCom Genomics, in collaboration with Oak Ridge National Laboratory, is being funded under a DOE Phase I SBIR Grant (DE-FG02-02ER83365) to assess the commercial viability of a novel data clustering tool developed by Drs. Ying Xu, Victor Olman and Dong Xu (Xu et al., 2001). As we enter into an era of advanced expression studies and concomitant voluminous databases, there is a growing need to rapidly analyze and cluster data into common expression and functionality groupings. To date, the most prevalent approaches for gene and/or protein clustering have been hierarchical clustering (Eisen et al., 1998), K-means clustering (Herwig et al., 1999), and clustering through Self-Organizing Maps (SOMs) (Tamayo et al., 1999). While these approaches have all clearly demonstrated their usefulness, they all have inherent weaknesses. First, none of these algorithms can, in general, rigorously guarantee a globally optimal clustering for any non-trivial objective function. Moreover, K-means and SOMs heavily depend upon the regularity of the geometric shape of cluster boundaries, and they generally do not work well when the clusters cannot be contained in some non-overlapping convex sets.
For cases where boundaries between clusters may not be clear, an objective function addressing more global properties of a cluster is needed. Three clustering algorithms, along with a minimum spanning tree (MST) representation, have been implemented within a computer program called EXpression data Clustering Analysis and VisualizATiOn Resource (EXCAVATOR). Our research team has conducted a comparison between the EXCAVATOR clustering algorithms and the widely used K-means clustering algorithm using rat central nervous system (CNS) data. Two criteria were employed for the comparison. The first was based on the jackknife approach to assess the predictive power of the clustering algorithm, and the second was based on the separability quality of clusters. All three of the EXCAVATOR algorithms (MST-hierarchical, MST-iterative, and MST-global optimal) outperformed the K-means algorithm relative to predictive power and separability quality.
In addition to comparative studies to assess the usefulness of EXCAVATOR, the team has developed an advanced graphical user interface (GUI). The GUI has been designed to afford maximum flexibility, incorporating multi-clustering data visualization as well as user-driven comparison and editing capabilities. EXCAVATOR's data visualization component is based on a modular/flexible approach so as to extend its capability to other clustering/classification areas, such as phylogeny, sequence motif recognition, and protein family recognition.
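The MST representation mentioned above can be illustrated with one simple MST-based strategy: build the tree over the data points, then cut the k-1 longest edges so that k connected components remain. This sketch is not claimed to match any of the three EXCAVATOR algorithms; the function name and the use of Euclidean distance are assumptions.

```python
import math

def mst_clusters(points, k):
    """Cluster points by cutting the k-1 longest edges of their MST.

    Returns a sorted list of clusters, each a sorted list of point indices.
    """
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")

    # Prim's algorithm on the complete Euclidean graph.
    in_tree = [False] * n
    best_dist = [math.inf] * n
    parent = [-1] * n
    best_dist[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]),
                key=lambda i: best_dist[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best_dist[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v], parent[v] = d, u

    # Keep the n-k shortest of the n-1 tree edges, dropping the k-1 longest.
    edges.sort()
    root = list(range(n))

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]
            x = root[x]
        return x

    for _, u, v in edges[:n - k]:
        root[find(u)] = find(v)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Because the tree only records nearest connections, clusters found this way need not be convex, which is the limitation of K-means and SOMs noted above.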
Eisen, M.B., Spellman, P.T., Brown, P.O. and Botstein, D. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA, 95, 14863-14868.
Herwig, R., Poustka, A.J., Müller, C., Bull, C., Lehrach, H. and O'Brien, J. (1999) Large-scale clustering of cDNA-fingerprinting data. Genome Res., 9, 1093-1105.
Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E.S. and Golub, T.R. (1999) Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc. Natl Acad. Sci. USA, 96, 2907-2912.
Xu, Y., Olman, V. and Xu, D. (2001) Clustering Gene Expression Data Using A Graph-Theoretic Approach: An Application of Minimum Spanning Trees. Bioinformatics. Vol.18, no.2002.