Genome Informatics Abstracts
DOE Human Genome Program
78. The Genome Annotation Collaboration
Jay R. Snoddy, Morey Parang, Sergey Petrov,
Richard Mural, Manesh Shah, Ying Xu, Sheryl Martin, Phil LoCascio, Kim
Worley1, Manfred Zorn2, Sylvia Spengler2,
Donn Davy2, Chris Overton3, Edward C. Uberbacher,
and the Genome Annotation Consortium
The Genome Annotation Consortium is organizing software and database development projects toward a common goal of providing as much value-added annotation as possible on a genome sequence framework. The consortium is applying computational analysis modules and information technologies to the output of genome sequencers. We have developed a prototype system and process that will be presented at the Oakland workshop. We are also interested in forging new collaborations to add value to the genome sequence and annotation framework. Desired collaborations should improve the analysis process or the underlying technologies that are required for this analysis. This basic annotation process includes the following steps:
1. Acquisition of genome sequence data
and other data that can be readily attached to genome sequences;
The outputs of our desired process include:
1. An assembled genome sequence framework;
Our current prototype is being applied to the output of all the large-scale genome sequencing centers for human sequences. We are adding mouse genome and microbial sequences to our prototype (see abstract of Larimer et al. for microbial analysis). As part of the initial prototype, we have established a data-acquisition component that retrieves data from genome center web sites and GenBank. This acquired data includes, for example, clone-contig overlap information that is not always in the GenBank/EMBL/DDBJ entry. We have established a sequence-assembly component that creates a consensus genome sequence framework by assembling the different clone sequences. In addition, we acquire other experimental observations that can be linked to that genome-sequence framework during annotation (e.g., ESTs, STSs, cDNAs).
We have developed a number of analysis modules, including GRAIL-EXP modules (see abstract of Xu et al.). We have integrated these analysis modules in a data-analysis process that creates a comprehensive genome-wide analysis (see abstract of Shah et al.). This comprehensive analysis process will be updated to ensure that new data can be added to the genome sequence framework. We have made progress in adding navigation and summary reports (see abstract of Snoddy et al.).
We also have made progress on the difficult issue of data storage and management that can organize this diverse experimental and computational data (see abstract by Petrov et al.). We have produced different catalogs of genes and proteins including (1) GenBank annotated genes, (2) Genscan-predicted genes, and (3) GRAIL-EXP-predicted genes (including a subset of genes that have some EST evidence for expression). We have produced a Java-based interface (the Genome Channel Browser v. 2.0) and an HTML-based data-access method. These interfaces, other planned interfaces, and other progress will be presented at the Oakland meeting.
The analysis modules used in the comprehensive
genome-analysis processes also will be available as public servers (see
abstract of LoCascio et al.). These servers would permit users to analyze
their new data or subsets of public data. Some of these analysis modules
also will be portable and could be applied at a number of sites beyond
the consortium member sites, including genome centers. We expect that our
data-analysis process and computational infrastructure will also foster
other genome-based, large-scale computational biology, including prediction
of protein structure and modeling of biological systems.
78a. Our Vision for a New Macromolecular Structure Database -- The New Protein Data Bank
Helen M. Berman1, Gary Gilliland2, Peter Arzberger3, Phil Bourne4, John Westbrook5, Phoebe Fagan6
The rate of growth in the number of structures experimentally determined by X-ray crystallography and NMR methods promises to increase dramatically as the focus continues to shift toward understanding sequence-structure-function relationships. Structure data will fully enable our understanding of function only if they are well annotated, integrated at all levels of detail with related sources of biological information, and widely disseminated to the growing user community. These data must be consistent so that the full potential of discovery through comparative analysis is available.
To this end, groups from Rutgers, the State University of New Jersey, the San Diego Supercomputer Center (SDSC) of the University of California, San Diego (UCSD), and the National Institute of Standards and Technology (NIST) have formed the Research Collaboratory for Structural Bioinformatics (RCSB). The combined experience of members of the RCSB in structure data processing and analysis covers data validation, data modeling, database development, query languages, and visualization tool development. The RCSB has developed and currently maintains nine publicly available structural biology databases.
For its first collaborative project, the RCSB has created the new Protein Data Bank, or Macromolecular Structure Database (MSD).
The Data Uniformity process for the new Protein Data Bank will be presented, user input sought, and plans for the future discussed, with emphasis on enabling the field of structural bioinformatics.
79. Visualization, Navigation, and Query of Genomes: The Genome Channel and Beyond
Morey Parang, Richard Mural, Manesh
Shah, Doug Hyatt, Miriam Land, Jay Snoddy, Edward C. Uberbacher, and the
Genome Annotation Consortium
We are developing and deploying a series of interface tools for visualizing and querying the reference human genome and other genomes assembled and annotated by the Genome Annotation Consortium (see related abstracts). The Genome Channel Browser is a Java viewer capable of representing a wide variety of genomic-sequence annotation and links to a large number of related information and data resources. It relies on a number of underlying data resources, analysis tools, and data-retrieval agents to provide an up-to-date view of genomic sequences as well as computational and experimental annotation.
The current version of the Genome Channel Browser displays a diverse set of functional features in a DNA sequence, including polyA sites, CpG islands, repetitive DNA, simple repeats, STSs, and GRAIL2 and Genscan exons, as well as GRAIL-EXP and Genscan gene models and their respective protein translations. The underlying information and evidence for genes and other features is presented in a variety of text windows, graphics windows, and summary reports.
The new version of the Genome Channel Browser (v2.0) offers an improved user interface and additional capabilities.
We are researching the feasibility of providing interfaces to additional types of analysis results, such as protein threading and structural classification that might provide clues to the functions of predicted genes. Other features being studied for future implementation and visualization include polymorphisms and mutations.
80. Genome Annotation Data Management and Data Administration: Developing Summary Results for User Navigation, Genome Research, Improved Data Processing, and Quality Metrics
Jay R. Snoddy, Miriam Land, Sheryl
Martin, Morey Parang, Inna Volker, Denise Schmoyer, Manesh Shah, Sergey
Petrov, Edward C. Uberbacher, and the Genome Annotation Consortium
Summary reports of the genome annotation data and the underlying data management required to generate them are being constructed. A goal is to create these reports from a robust, queryable, and scalable data management system (see abstract of Petrov et al.). Some of these summary reports will be available as online HTML documents. These summaries can help improve four primary areas: user navigation, genome research, data processing, and quality metrics.
Several general observations can be made now from the current snapshot of the data, and details from a later snapshot will be presented at the Oakland workshop. There are 7 to 10 times more predicted gene models (both GRAIL-EXP and Genscan gene models) than gene models annotated in GenBank. The majority of the predicted GRAIL-EXP genes do have one or more ESTs that are used in the gene modeling. A third to half of the gene models that predict putative protein sequences have a reasonable BLAST hit to known proteins in Swiss-Prot (BLAST with an E-value <= 1.0e-4). By this BLAST hit criterion, about 3 times more predicted genes appear to have good homolog candidates than there are annotated genes in the GenBank archival record.
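The homolog-candidate criterion is easy to state in code. In the short Python sketch below the gene-model IDs and E-values are invented for illustration; the real bookkeeping lives in the consortium's data management system:

    # Applying the homolog-candidate criterion: a predicted protein counts as
    # having a good homolog candidate if its best Swiss-Prot BLAST hit has an
    # E-value <= 1.0e-4. All data below are hypothetical.
    E_CUTOFF = 1.0e-4

    # (gene model ID, best Swiss-Prot E-value, or None if no hit at all)
    blast_hits = [
        ("GRAILEXP.1", 3.2e-40),
        ("GRAILEXP.2", 0.15),
        ("GENSCAN.7", 8.9e-6),
        ("GRAILEXP.9", None),
    ]

    with_homolog = [g for g, e in blast_hits if e is not None and e <= E_CUTOFF]
    print(f"{len(with_homolog)} of {len(blast_hits)} models have homolog candidates")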
By the time of the Oakland workshop, we hope to display several online summary reports that can demonstrate the current state of genome annotation for genomes, chromosomes, contigs, and sequenced clones. This should provide users with the results of the different but integrated data-management and processing steps that we employ in genome annotation. We would be interested in suggestions for other reports or queries that others may find useful.
81. Data Management for Genome Analysis and Annotation: Engineering a Fundamental Infrastructure for Data that Supports Collaboration in Genome Annotation
Sergey Petrov, Jay R. Snoddy, Michael
D. Galloway, Sheryl Martin, Miriam Land, Morey Parang, Tom Rowan, Denise
D. Schmoyer, Manesh Shah, Inna E. Vokler, Edward C. Uberbacher, and the
Genome Annotation Consortium
The GenomeDataWarehouse (GDW) is a heterogeneous information system created to store and support data from multiple sources. GDW currently is being developed and filled with data. The purpose of GDW is to provide data management for highly diverse and distributed sets of data. One goal is to create and support a new form of on-line analytical processing (OLAP) that is suitable for the complex world of genome research data. Another goal is to provide the right kind of data management to encourage several groups in the Genome Annotation Consortium to collaborate in adding experimental and computational data to a developing genome sequence framework. The data management will need to provide data to the different data-analysis modules that ORNL and other collaborators are creating and to support the linkages among the underlying experimental data and the computational data produced from those different modules. This data warehouse will assist in the production of several user interfaces that are described in other abstracts or are under development; there will not be one monolithic interface to GDW.
Conceptually, GDW consists of three parts: archival data sources, the kernel, and data marts. Currently, GDW is based on two Sybase servers, an SRS server, and data files running on networked Sun workstations. One Sybase server is dedicated to a copy of the Genome Database and the second to kernel databases and a developing Genome Channel database. The archival data sources include sets of data from community databases (e.g., GenBank, SwissProt, Prosite, and GDB). The GDW kernel is a set of databases used to store identification data on biologically meaningful objects and cross-references regardless of object origin, structure, and representation. The data marts are precompiled data sets reflecting the logic of a particular interface.
For archival data sources, we occasionally must internalize and manage community data within ORNL computers for a variety of performance, update, and querying reasons. These archival data are attached to the evolving genome sequence framework. We are using an SRS server (a product of the EMBL/EBI) for archiving and maintaining much of these data. Our implementation of the SRS server at ORNL provides access to 31 community flatfile databases. We are evaluating the use of SRS and other mechanisms for serving annotation data that we are creating.
The data warehouse kernel provides the underlying mechanism that manages data in this complex area, where we cannot enforce transactional control and data integrity over some underlying archival data. Given the difficulties and constraints in technology and available resources, our warehouse does not require integration of all data from different sources and does not completely enforce global integrity; data are stored "as is". At the same time, a mechanism is needed to provide cross-references between information on the same objects in different data marts and the relationships that originate from the data sources and our analyses. The kernel consists of several databases storing IDs of objects found in archival data sources, their classifications, and relationships, including cross-references. The structure of the kernel databases itself does not depend on the structure of the objects and relationships found in the original sources. All databases in the kernel have an almost-identical logical structure; data were divided among several databases only to improve kernel performance. Each database represents relationships between objects and their classes in a meta-closed way; every class and relationship is represented as an object, and therefore information expressible in the database can include relationships between classes and relationships, as well as classification of relationships. The flexibility of the chosen data representation allows us to include new data sources on the fly and to represent new classes of objects and new relationships found in genomic data. This approach comes at the cost of performance, but these databases are not meant to be routinely accessed by users.
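A minimal sketch of the "meta-closed" kernel idea, assuming an invented two-table schema (the actual GDW kernel schema is not given in this abstract): objects, classes, and relationship types are all rows in one object table, so new relationship types require no schema change:

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
    CREATE TABLE object (id INTEGER PRIMARY KEY, source TEXT, label TEXT);
    CREATE TABLE link   (type_id INT, subj_id INT, obj_id INT);
    """)

    def new_object(source, label):
        cur = db.execute("INSERT INTO object (source, label) VALUES (?, ?)",
                         (source, label))
        return cur.lastrowid

    # Relationship types are ordinary objects; the IDs below are hypothetical.
    xref   = new_object("kernel", "cross-reference")
    clone  = new_object("GenBank", "AC000123")
    marker = new_object("GDB", "D16S342")
    db.execute("INSERT INTO link VALUES (?, ?, ?)", (xref, clone, marker))

    # Because types are objects too, classifying relationships themselves
    # needs no schema change -- only new rows.
    for row in db.execute("""
            SELECT t.label, s.label, o.label
            FROM link JOIN object t ON t.id = link.type_id
                      JOIN object s ON s.id = link.subj_id
                      JOIN object o ON o.id = link.obj_id"""):
        print(row)   # ('cross-reference', 'AC000123', 'D16S342')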
The data marts are the read-only data sources that users routinely access to analyze and navigate the data. Data marts are compiled under the control of the GDW kernel. Each data mart reflects the internal logic of user interfaces and software systems that are focused around one aspect of genome data. Our current Genome Channel data system is the first example of these data marts. The Genome Channel data mart organizes data around genome structures and features on the assembled genome sequence framework. In the future, we will be developing other data marts, including a gene and protein catalog. Although this may contain data similar to parts of Genome Channel, the data system and interface are to be organized around genes, proteins, and the relationships among genes and proteins (including what is known and what we can predict about homology, phylogeny trees, protein families, and function).
We are developing several interfaces that will allow users access to data marts and mechanisms to query and navigate among data marts and other data sources. We are trying to give the user some flexibility in altering the view of the data. We are also exploring the application of the Internet standard, XML, as a method of expressing some annotation data; this should allow the user a lot of flexibility in altering data presentation at the client browser without going back to the data mart and altering the underlying content. We anticipate that a number of initial HTML prototypes will be available by the time of the Oakland meeting and hope to acquire more feedback on our efforts.
82. Genome Channel Analysis Engine: A System for Automated Analysis of Genome Channel Data
Manesh Shah, Morey Parang, Doug
Hyatt, Michael Galloway, Richard Mural, Kim Worley1, Edward
C. Uberbacher, and the Genome Annotation Consortium
The Genome Channel Annotation Toolkit currently incorporates several exon- and gene-prediction programs as well as other kinds of feature-recognition systems and database homology search systems. Most of these analysis systems have been developed by Genome Annotation Consortium collaborators, while some systems have been obtained from other researchers who make their code available but are not currently consortium members. The exon- and gene-recognition systems include GRAIL, GRAIL-EXP, Genscan, and Genie. Feature-recognition systems include the GRAIL suite of tools: CpG islands, polyA sites, simple repeats, and repetitive DNA elements. Database homology systems include NCBI BLAST and Beauty postprocessing.
The Genome Channel Analysis Engine is an automated system that facilitates the analysis of contig sequences contained in the Genome Channel repository. It schedules and distributes the various processing tasks on several networked computer systems in a concurrent, pipelined mode to best utilize the available computer resources and to achieve optimal throughput. The scheduling is organized in terms of analysis cycles. At the start of each cycle, a data-refresh procedure is executed to detect and compile a list of all new and updated contig sequences that the sequence data-retrieval engine has incorporated in the Genome Channel staging area since the previous cycle's data refresh. It also checks the database source ftp sites for updated versions of all databases required by various analysis tools and updates local copies as necessary.
A master process then starts up servers on the available machines, including a PVM process for GRAIL analysis modules. Using a combination of Perl scripts and C programs, the analysis engine automatically runs the analysis tools on new contigs. The master process distributes the tasks as required to servers running on other machines. Some tasks are performed in parallel, including the GRAIL analysis tools (PVM) and the GRAIL-EXP Blast search (MPI). In addition, once protein translations have been obtained for the predicted genes and exons in a contig, they are immediately piped to a Beauty postprocessing server for a detailed homology search. The ultimate goal of this scheduling and distribution scheme is to reduce the time required to process 100 Mb of data to under 24 hours. Using the currently available resources, the analysis engine can process 100 Mb in about 72 hours.
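The concurrent, pipelined flavor of this scheduling can be sketched in a few lines. The sketch below is illustrative only: the function names and contig IDs are invented placeholders, not the engine's actual Perl and C components.

    from concurrent.futures import ThreadPoolExecutor, as_completed

    def run_grail_exp(contig):      # stand-in for the real gene-prediction step
        return contig, f"proteins({contig})"

    def run_beauty(translations):   # stand-in for Beauty homology post-processing
        return f"homology report for {translations}"

    new_contigs = ["contig_0001", "contig_0002", "contig_0003"]

    with ThreadPoolExecutor(max_workers=4) as pool:
        gene_jobs = [pool.submit(run_grail_exp, c) for c in new_contigs]
        beauty_jobs = []
        for job in as_completed(gene_jobs):
            contig, translations = job.result()
            # Pipelining: each homology search starts as soon as its contig's
            # predictions are ready, not when the whole batch finishes.
            beauty_jobs.append(pool.submit(run_beauty, translations))
        for job in beauty_jobs:
            print(job.result())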
Analysis processing is currently being performed at ORNL and is also being deployed at Lawrence Berkeley National Laboratory (LBNL). The computational infrastructure at ORNL consists of a cluster of 15 DEC Alpha workstations, 2 Sun HPC 450 UltraSparc servers, and a 200 GB Network Appliance RAID disk storage unit. These resources are barely adequate for handling the current rate of growth of sequence data. We are pursuing several strategies to deal with the anticipated rate of sequence generation and the consequent growth in compute and storage requirements. Some of the most compute-intensive tasks are being ported to the Paragon supercomputer at ORNL and to the supercomputers at LBNL. We also plan to evaluate the High Performance Storage System (HPSS) at ORNL for data storage.
83. GRAIL-EXP: Multiple Gene Modeling Using Pattern Recognition and Homology
Ying Xu, Manesh Shah, Doug Hyatt, Richard
Mural, Edward C. Uberbacher, and the Genome Annotation Consortium
GRAIL-EXP is a multiple gene-modeling system that combines information from the analysis of EST homology with pattern recognition to construct accurate gene models. We believe that these current improvements in GRAIL-EXP represent fundamental advances in gene-modeling accuracy and computational performance. Currently, the system is being used extensively by the Genome Annotation Consortium to provide comprehensive genome-wide annotation for genomic DNA sequence from human, mouse, and other model organisms as well as several microbial organisms. GRAIL-EXP is used in this context to analyze long stretches of human and mouse DNA sequences (contigs that span tens of thousands to more than a million bases) to correctly identify and characterize the large numbers of genes found in such sequences.
Computational methods for gene identification in human genomic sequences typically consist of two phases: coding-region recognition and gene modeling. Although several effective methods for coding-region recognition are available, parsing the recognized coding regions into appropriate gene structures remains a difficult problem. GRAIL-EXP addresses the problem of multiple gene identification, using a set of biological heuristics and information available from sequence homology with available EST and mRNA sequences.
GRAIL-EXP uses GRAIL for predicting exons in a sequence. GRAIL evaluates all possible exon candidates in a DNA sequence and groups the high-scoring candidates into overlapping clusters. Those containing repetitive DNA elements are filtered out based on BLAST alignments of the exon candidates with a repetitive DNA database.
In the next phase, the system uses BLAST to identify all EST and mRNA sequences (obtained from GenBank dbEST and TIGR's human transcript sequence database) that have a sufficiently high BLAST alignment score with the candidate exons. The system also extracts information useful for the subsequent gene-modeling phase from each matched entry in the database. This results in a set of alignments for each exon candidate.
In the gene-modeling phase, an optimal gene model is constructed from the predicted exon candidates and the alignment information using dynamic programming. A set of nodes, one for each exon candidate or aligned-EST-sequence pair, is created. Each node is assigned a score based on its GRAIL score and the BLAST score for that alignment. The best-scoring gene model ending at each node is calculated using a recursive algorithm. Each exon is examined in three possible roles (as the initial, middle, or terminating exon of a gene model). The algorithm assigns penalties and rewards at each step based on reading-frame mismatch, the existence of an in-frame stop codon, and a terminating exon that does not end in a stop codon. A node that uses the same EST as the previous node is assigned a reward that significantly outweighs the penalties. This guarantees that an EST that matches multiple exon candidates will have an overriding influence on the gene model.
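A minimal sketch of this dynamic program, with toy scores and a deliberately simplified reading-frame test (the real GRAIL-EXP weights, frame bookkeeping, and exon roles are richer than shown here):

    SAME_EST_REWARD = 100.0   # outweighs any penalty, so shared ESTs dominate
    FRAME_PENALTY = 10.0

    # (start, end, node score, supporting EST ID or None), sorted by position;
    # all values invented for illustration.
    nodes = [
        (100, 250, 8.0, "EST17"),
        (400, 520, 5.0, "EST17"),
        (480, 600, 9.0, None),
        (700, 830, 6.0, "EST17"),
    ]

    best = []   # best[i] = (best score of a model ending at node i, back-pointer)
    for i, (s, e, score, est) in enumerate(nodes):
        best_i, back = score, None
        for j, (_ps, pe, _pscore, pest) in enumerate(nodes[:i]):
            if pe >= s:                    # overlapping candidates cannot chain
                continue
            bonus = SAME_EST_REWARD if est is not None and est == pest else 0.0
            penalty = FRAME_PENALTY if (s - pe - 1) % 3 else 0.0  # toy frame rule
            cand = best[j][0] + score + bonus - penalty
            if cand > best_i:
                best_i, back = cand, j
        best.append((best_i, back))

    # Trace back from the best-scoring end node to recover the gene model:
    # the EST17-supported chain wins over the higher-scoring unsupported exon.
    i = max(range(len(nodes)), key=lambda k: best[k][0])
    model = []
    while i is not None:
        model.append(nodes[i][:2])
        i = best[i][1]
    print("exons:", model[::-1])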
If the optimal gene model incorporates a set of one or more matched ESTs, the system determines whether any regions of those ESTs were not covered by the gene model's exons. If so, the system tries to locate the missing EST fragments in the appropriate intervals of the genomic sequence. If located, each such region is added to the gene model as an exon.
GRAIL-EXP, a complex system with several logical components and numerous subcomponents, has been designed and implemented as a modular system. This is convenient for distributing various analysis tasks on multiple computers to achieve higher throughput. The system currently runs on a cluster of 10 DEC Alpha workstations and is able to analyze around 1 Mb of genomic sequence in about 15 minutes. Work is under way to achieve significant speedup by porting the most computationally intensive modules to the Paragon supercomputer in the Center for Computational Sciences at ORNL and to similar platforms at Lawrence Berkeley National Laboratory. A Java-based graphical user interface has been developed to provide an interactive environment for the analysis of user-supplied DNA sequences. The system will be made available to the genome community via the public GRAIL server at ORNL in early 1999.
84. High-Performance Computing Servers
Phil LoCascio, Doug Hyatt, Manesh
Shah, Al Geist, Bill Shelton, Ray Flannery, Jay Snoddy, Edward Uberbacher,
and the Genome Annotation Consortium
Advances and fundamental changes in experimental genomics brought about by an avalanche of genome sequences will pose challenges to the current methods used to computationally analyze biological data. We are constructing a computational infrastructure to meet these new demands for processing sequence and other biological data within the Genome Annotation Consortium project, for genome centers, and for the biological community at large. To cope with this anticipated 20-fold increase in data, we have been developing the necessary high-performance computing tools to address this scaling challenge. As part of the Department of Energy's Grand Challenge in computational genomics, we have developed a number of applications to form part of a high-performance toolkit for the analysis of sequence data.
The initial tools we wish to include in the toolkit are high-performance biological application servers that include BLAST codes (versions of BLASTN, BLASTP, and BLASTX), and codes for sequence assembly, gene modeling (e.g., GRAIL-EXP), multiple sequence alignment, protein classification, protein threading, and phylogeny reconstruction (for both gene trees and species trees).
The tools and servers will be transparent to the user but able to manage the large amounts of processing and data produced in the various stages of enriching experimental biological information with computational analysis. The goal of this high-performance toolkit is not only to provide one-stop shopping for a genome sequence-data framework and interoperable tools but also to run the codes in the toolkit on platforms where the kinds of questions that the GAC and our users can ask are not greatly affected by hardware limitations.
The system's logical structure can be thought of as having three overall components: client, administrator, and server. All components share a common infrastructure consisting of a naming service and query agent, with the administrator having policy control over agent behavior and the namespace profile.
At the atomic transaction level of detail, clients and servers behave as expected, with clients issuing requests and servers responding. A higher level of transaction detail permits a much more complex model of operation where clients can be operated from within servers, and servers can be directed to propagate replies. This nested transaction model is very powerful for developing decoupled calculation and query facilities. The complex interaction is completely transparent to the user because all transactions are controlled by a query agent.
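A minimal sketch of the nested transaction idea, assuming a toy naming service and invented server names (the production query agent, naming service, and transport are of course far more involved):

    class QueryAgent:
        """Toy naming service plus query agent: routes requests by name."""
        def __init__(self):
            self.servers = {}

        def register(self, name, handler):
            self.servers[name] = handler

        def request(self, name, payload):
            return self.servers[name](payload)

    agent = QueryAgent()
    agent.register("blast", lambda seq: f"alignments({seq})")

    def grail_exp_server(seq):
        # The gene-recognition server is itself a client: it derives its
        # alignment evidence from whatever BLAST server is registered,
        # and the nesting is invisible to the original caller.
        alignments = agent.request("blast", seq)
        return f"gene models for {seq} using {alignments}"

    agent.register("grail-exp", grail_exp_server)
    print(agent.request("grail-exp", "ACGT..."))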
The GRAIL-EXP gene-recognition application will be deployed as a server that uses this model. The application derives alignment services from BLAST servers elsewhere. Internally, the GRAIL-EXP server is composed of a number of independent components that interact as a nested set of transactions. The ability to assign different resources to different components is an extremely important feature for maintaining a credible load-balancing scheme.
Due to the logical decoupling of the query infrastructure, we are able to produce a model with both excellent scaling abilities and fault-tolerant characteristics. In testing the ability to run multiple instances of GRAIL-EXP and BLAST we have demonstrated that the removal of any dependent services does not cause loss of data. Instead, where processing power is removed, we observe a graceful degradation of services as long as there is some instantiation of service available.
The overall software engineering design has been constructed very carefully to provide nonspecific use of distributed resources through the neutral application programming interface (NAPI) layer. NAPI is used to encapsulate the functionality required for distributed operation while utilizing the currently available resources. The underlying infrastructure is subsumed using PVM (parallel virtual machine) for robust heterogeneous operation and, optionally where available, MPI (message passing interface) for homogeneous application development. Other infrastructure ports can be accommodated (e.g., Java RMI), but the focus is now on the design of the high-level component model functionality and semantics.
Located at Oak Ridge National Laboratory within both the Center for Computational Sciences and the Computational Biosciences section, the development testbed consists of three supercomputers (Intel Paragons), some SGI SMP machines, and a DEC Alpha workstation cluster. We are rapidly approaching alpha-stage deployment testing; after testing performance and stability, we can deploy the framework to NERSC, other high-performance computing sites, and other collaborators.
85. DOE Joint Genome Institute Public WWW Site
Robert D. Sutherland and Linda Ashworth
The Joint Genome Institute (JGI) has rebuilt its public WWW site to improve presentation of sequence and mapping data and to make data access more efficient. This site includes: (1) combined access to data from LANL, LBNL, LLNL, and the new Production Sequencing Facility (PSF), (2) information about the JGI and its member institutions, (3) links to member institutions' WWW pages and other relevant WWW sites, (4) physical maps for regions being sequenced by the JGI, (5) links to entries submitted to public databases, (6) links to the JGI FTP server, (7) finished sequence data with associated quality information, and (8) information and WWW links to promote public education and understanding of genetics and the Human Genome Program. We have also upgraded our extremely fluid internal site, which supports data sharing and communication among the four sites. This work is funded by the United States Department of Energy. URL: http://jgi.doe.gov
86. JGI Informatics and the PSF Network
Tom Slezak, Mark Wagner, Lisa Corsetti,
Sam Pitluck, Arthur Kobayashi, Mimi Yeh, Brian Yumae, and Peg Folta
The component labs of the JGI sometimes appear to have little more than their funding source in common, both in biology and informatics. Meeting the stiff FY98 production goals was more of a miracle of elasticity of existing systems and people than any sort of fusion of component parts or philosophies. It is only now that we are actually working together in the PSF that we are being truly compelled to develop a new common culture. In some senses, informatics is being used as both the carrot and stick with which to overcome many of the differences, gratuitous or not, that have evolved at the JGI member labs.
Consolidation in informatics can only occur if there is corresponding consolidation or similarity in the underlying biological methods. Systems developed for transposon-based sequencing are highly dissimilar from those developed for shotgun strategies. Similar differences occur in mapping systems, which in the JGI have ranged from Claris Draw to Sybase. It is not feasible to make dramatic changes to processes that are under extreme production pressures; we are challenged to provide gradual, seductive improvements that bring about unification without derailing production ramps.
The long-term vision for JGI informatics is that information should be available in a JGI-centric view throughout the entire process: from mapping (via several methods), to production sequencing (allowing for multiple methods), to annotation and submission (allowing for several styles of manual and semi-automated annotation), and eventually to a range of functional genomics and structural biology. We are in the very earliest days of implementing this dream, as we struggle with production ramp rates unmatched anywhere that demand full effort from our best people who would otherwise be building our new systems. We will discuss our early efforts and our aspirations for this important first year of true "Joint-ness".
The PSF network has been designed to accommodate growth and flexibility in light of uncertain future demands. It features a cable plan that allows any jack to be a network, digital voice, or analog fax line as needed. The network is segmented into a high-reliability subnet for DNA sequencer data acquisition and primary server/storage functions, and another subnet for office and lab computers. We acknowledge the excellent work and contributions of the LBNL Communications (Sig Rogers), LBLnet (Ted Sopher), and ESnet (Jim Leighton) staffs in making the PSF network happen on time and on budget.
This work was performed by Lawrence Livermore National Laboratory under the auspices of the U.S. Department of Energy, Contract No. W-7405-Eng-48.
87. Verification of Finished Sequence at JGI-LLNL
Karolyn J. Burkhart-Schultz, Amy
M. Brower, Arthur Kobayashi, Matt Nolan, Melissa Ramirez, and Jane E. Lamerdin
The JGI-LLNL sequencing group submitted 8.6 Mb of genomic sequence to the NCBI database in the 1997-1998 fiscal year. This accomplishment represents an 8-fold increase over our submissions for the previous year. An integral part of our finishing process is verification. The verification process allows an independent assessment of the validity of the finished assembly and final consensus sequence of each large-insert clone (i.e., cosmid or BAC) project. Verification involves: 1) re-checking the finisher's validation of the assembled clone/project; 2) independent re-assembly of all the reads in the project with all finisher edits removed and comparison of this "no-edits" consensus to that submitted by the finisher; and 3) analysis of the extent and quality of the overlap of the finished clone with adjacent clones in the sequencing tiling path.
A finished project is submitted for verification along with a validation report prepared by the finisher. The verifier uses this report, as well as Consed and LLNL-developed tools, to identify regions where strict standards for quality and double stranding may not be met. The verifier re-checks any problematic or difficult regions encountered during the assembly process. In addition, the final consensus is "digested" in silico and the fragment sizes compared to those obtained from the restriction mapping data compiled for each cosmid or BAC. At least three digests (e.g., BamHI, BglII, EcoRI, EcoRI/BglII, or XhoI) are used in these comparisons, and any significant deviations between the map and sequence data are flagged.
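The digest comparison lends itself to a short sketch. Everything below is hypothetical: the consensus, the observed fragment sizes, and the tolerance are invented, and LLNL's actual tools and thresholds are not described in this abstract (GGATCC is the BamHI recognition site):

    import re

    def digest(seq, site):
        """Fragment lengths from cutting seq at every occurrence of site."""
        cuts = [m.start() for m in re.finditer(site, seq)]
        edges = [0] + cuts + [len(seq)]
        return sorted(b - a for a, b in zip(edges, edges[1:]))

    consensus = "A" * 3000 + "GGATCC" + "A" * 5000 + "GGATCC" + "A" * 2000
    predicted = digest(consensus, "GGATCC")     # in-silico BamHI digest
    observed = [2000, 3000, 5010]               # invented gel-based map sizes

    TOLERANCE = 50  # base pairs; flag anything beyond this as a deviation
    for p, o in zip(predicted, sorted(observed)):
        flag = "" if abs(p - o) <= TOLERANCE else "  <-- check"
        print(f"predicted {p:>6}  observed {o:>6}{flag}")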
The purpose of the "no-edits" re-assembly of a project is to remove any possible biases introduced by the finisher in the process of obtaining contiguous sequence. Currently this assembly is performed using an earlier version of the Phrap assembly engine. If the product of the "no-edits" re-assembly is not contiguous, the reasons for any breaks are examined and explained. Similarly, base discrepancies between the "no-edits" and finished consensus sequences are examined to determine which contains the valid basecall. If the contig breaks or sequence discrepancies cannot be resolved, the verifier may request that further data be generated.
Completion of the verification process requires resolution of all issues discussed above and that the final assembly and consensus sequence are supported by the data. A verification report with explanations of various issues is added to the validation report in the project directory. Rigorous verification of finished sequence assures the integrity and the quality of the final submitted sequence in the public database.
This work was performed by Lawrence Livermore National Laboratory under the auspices of the U.S. Department of Energy, Contract No. W-7405-Eng-48.
88. Informatics for Production Sequencing at LLNL
Arthur Kobayashi, David J. Ow, Matt
P. Nolan, Stephan Trong, Tory Bobo, Tom Slezak, Mark C. Wagner, T. Mimi
Yeh, Lisa Corsetti, Jane Lamerdin, Paula McCready, Evan W. Skowronski,
and Anthony V. Carrano
This past year, LLNL contributed over 8.6 Mb of high-quality finished sequence to the Joint Genome Institute total of 20.9 Mb. This represents an increase of more than 500% over the amount finished by LLNL last year (1.5 Mb). We have managed to support this increased throughput through incremental improvements to our existing informatics infrastructure.
Our current system features extensive sample tracking, quality-control checks and reporting on every sequence run on our ABI sequencers, automated prefinishing, and an integrated suite of robotic workstations that perform rearraying, sample preps, etc. Most of our interfaces have been converted to WWW-based forms, and all of our sample information is stored in a Sybase relational database. Data is transferred through an automated sample sorting system over a dedicated 100 Mb/s Ethernet local area network segment.
Over the next few years, our throughput will continue to increase at an aggressive rate. We are also faced with the challenge of relocating our core sequencing facility to Walnut Creek while developing our finishing capabilities at LLNL. We are currently working on improving our existing system through increased automation, barcode labeling of samples, and a new database schema. Some of these projects are described in more detail in other posters.
This work was performed by Lawrence Livermore National Laboratory under the auspices of the U.S. Department of Energy, Contract No. W-7405-Eng-48.
89. A Workflow-Based LIMS for High-Throughput Sequencing, Genotyping, and Genetic Diagnostic Environments
The widespread application of new genetic and DNA-based technologies and techniques to important medical and biological problems has resulted in an explosion of data and information. Advanced information systems are needed to comprehensively collect, analyze, and manage these critically important data.
This project has developed, marketed, and supported a family of workflow-based software products that addresses the specific challenges and needs of managing genetic and DNA-based information for the gene and drug discovery processes. The Cimarron Workflow System will accelerate the creation, modification, automation, and re-use of laboratory data collection and analysis activities.
Cimarron has augmented its existing Activity Model and LIMS tool kit with a central, well-defined Workflow Modeling and Management capability. The resulting Workflow Model and augmented tool kit have been used successfully in customer systems within both academia and industry.
Cimarron has continued to augment the Cimarron Workflow Model and the LIMS tool kit with advanced routing and queue management capabilities, a graphical workflow modeling application with which a domain expert can design a lab system, and other advanced information system capabilities, and it will continue to market this tool kit to customers within genome labs and production facilities. This project will result in a commercial tool for building workflow-based laboratory information management systems. A large market exists for such a tool within the domain of high-throughput facilities for sequencing, genotyping, and genetic diagnostics.
90. A Simulation Extension of a Workflow-Based LIMS
The Human Genome Project is a complex scientific enterprise of national significance whose success will be greatly accelerated by effective project management and planning tools. This SBIR project is delivering such a tool by integrating simulation (computer modeling) software with laboratory information management databases. Central to the HGP are high-throughput molecular biology laboratories, which critically depend on cost-effective management of complex experimental and production workflows. This project is developing software to simulate laboratory workflows under real and "what-if" scenarios. This software is unique in that it derives its workflow model and configuration parameters from the real laboratory workflow, as stored in its operational laboratory information management system.
Phase I of this project was devoted to feasibility evaluation, design finalization, and technology assessment for an integrated laboratory information management / simulation facility. The findings and results confirmed that the integration was feasible, powerful, and conceptually elegant. Customers of Cimarron Software have helped shape and evaluate a prototype system, and confirmed the marketability of a fully engineered product.
Phase II of this project is building a fully engineered product for specifying, launching, monitoring, saving, and comparing simulation runs. The interactive system is being packaged as a Java applet and is hence Web-enabled. This simulation facility is a natural and commercially valuable extension of Cimarron's core technology. Essentially all of Cimarron's current customers have indicated they would purchase such an extension if it were well integrated into their systems. Longer term, it is expected that this simulation/database combination will have similar benefits for the broader workflow management market.
91. A Graphical Work-Flow Environment Seamlessly Integrating Database Querying and Data Analysis
Dong-Guk Shin1, Lung-Yung
Chu1, Lei Liu1, Nori Ravi1, Joseph Leone2,
Rich Landers2, and Wally Grajewski2
In the past, we have been very successful in developing a graphical ad hoc query interface capable of accessing heterogeneous public genome databases. This project aimed at developing a suite of user-friendly software designed to aid computational biologists in accessing various independently managed genome databases. This software makes the SQL query syntax manageable for the novice user and makes unfamiliar, complex genome database schemas quickly understandable for less experienced persons. Furthermore, this software aids users in quickly expressing semantically correct ad hoc queries. The impact of wide distribution of this software is expected to be significant. Computational biologists who have been reluctant to use genome databases would begin to query the databases themselves, thanks to the numerous user-friendly features built into the easy-to-use graphical interfaces. Most distinctively, computational biologists will be able to ask cross-database queries against the multiple genome databases that are springing up within the genome community.
We are currently investigating ways of embedding this user-friendly database access tool into a graphical work-flow management environment. Although being able to query various genome databases easily and to make associations between remotely located data is essential, we consider it imperative to produce an integrative environment in which both database querying and data analysis activities can be carried out seamlessly in a cohesive manner. This requirement is critical because many biologically significant questions center around running analysis programs. In the proposed scenario, a computational biologist should be able to store the results of a BLAST search persistently in a database and subsequently query and cross-link the filtered BLAST results with existing genome databases. Similarly, a computational biologist should be able to perform a database query and funnel the query results into a subsequent BLAST or FASTA search. Furthermore, the user should also be able to conveniently convert analysis or query results into the data formats required by alignment programs, like CLUSTALW, or tree-building programs, like Phylip and Puzzle, for the final stages of analysis and visualization. The ultimate goal of this project is to produce an easy-to-use work-flow editing environment in which the user can easily specify data flow involving both database querying and data analysis. This project is being pursued in collaboration with JGI.
92. Data Visualization for Distributed Bioinformatics
Gregg Helt, Suzanna Lewis, Nomi
Harris, and Gerald M. Rubin
A significant challenge for genome centers is to make the data being generated available to biologists in a succinct and meaningful way. We are addressing this problem by creating extensible, reusable graphical components specifically designed for developing genome visualization applications. With careful planning and design, this toolkit enhances the ability of others and ourselves to rapidly develop genome visualization applications for the Internet and as editing applications.
The visualization toolkit is written in the Java programming language. Our applications are being designed to read XML files. We will describe our component-based approach and demonstrate a variety of visually distinct applications that are all based upon the same underlying components. These range from whole-genome views of chromosomes to multiple-alignment views of sequence data. The different views of the data are interconnected via shared, common data models that underlie the various displays. The views are also linked to external databases to retrieve and display textual data on selected features. Different types of analysis can be dynamically performed on the data and the results displayed on the maps. This analysis code is dynamically loaded when requested, to minimize the initial loading time.
Other groups can reuse this work in various ways: genome centers can reuse large parts of the genome browser with minor modifications, bioinformatics groups working on sequence analysis can reuse components to build front ends for analysis programs, and biology labs can reuse components to publish results as dynamic Web documents.
The BDGP has established a collaborative agreement with a small startup, Neomorphic, to develop improved versions of the widgets initially begun at the BDGP. This will provide the additional resources necessary to allow us to provide commercial-grade, thoroughly documented products. Under the terms of this collaboration, all the products of the collaboration will be made available to academic and government institutions for a nominal fee.
93. A Figure of Merit for DNA Sequence Data
Mark O. Mundt, Allon G. Percus, and David
We have implemented a new measure of the quality of sequence data. Given a sample of sequence data and its phred scores, our figure is the predicted net error rate for finished sequence that would be generated from a given coverage of sequences of comparable quality. It is reasonable to use our figure of merit for assessing the quality of batches of sequence data for continuous quality control in sequencing factories.
This figure of merit avoids the complexities of fragment assembly. It assumes that the sequence reads occur at random positions, uniformly across the target sequence, and with each orientation being equally likely. The figure of merit is then the expected composite rate of erroneous basecalls, for a given coverage in sequences comparable in quality to those of the sample dataset.
Thus, an average is taken over the different ways in which the bases and their associated phred scores "align" on the different base positions.
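A Monte Carlo approximation of this figure of merit can be sketched briefly. This is our simplification, not the authors' Java program: it draws coverage-many phred scores from the sample, treats a column as wrong when at least half of the simulated basecalls err, and ignores the fact that erroneous calls need not agree on the same wrong base:

    import random

    def expected_error_rate(phred_scores, coverage, trials=20000):
        """Estimated net error rate for a consensus built at the given coverage."""
        wrong = 0
        for _ in range(trials):
            # Draw `coverage` quality values at random from the sample and
            # simulate whether each basecall errs (phred Q -> p = 10**(-Q/10)).
            errors = sum(random.random() < 10 ** (-random.choice(phred_scores) / 10)
                         for _ in range(coverage))
            if 2 * errors >= coverage:   # "majority rule"; ties count as wrong
                wrong += 1
        return wrong / trials

    sample = [10, 20, 20, 30, 30, 30, 40]   # invented phred output for one read
    for c in (3, 5, 8):
        print(c, expected_error_rate(sample, c))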
We implemented this figure of merit in an executable Java computer program, available by anonymous ftp from cell.lanl.gov, in the directory pub/fom. The inputs to the program are the standard phred output for the sample of sequences whose quality is to be assessed and the desired coverage to be used. There are essentially no restrictions upon the size of the dataset: it could consist of the phred scores for one read or for multiple reads. The program can trim off specified sequences, such as vector sequences. We illustrate the result of using different coverage parameters for three datasets, one of which has noticeably lower quality. Surprisingly, the expected net error rates show no trace of the dependence on the parity of the number of times a position is sequenced that "majority rule" statistics would suggest.
94. Probabilistic Basecalling
Terry Speed, Lei Li, Dave Nelson, and Simon
Basecalling is the process of converting raw data from automated DNA sequencing machines into a sequence of bases. The process is typically subdivided into the tasks of color separation, mobility shift correction, deconvolution, and decoding. A probabilistic model of the process is presented, at the center of which lies a hidden Markov model (HMM). The class of HMMs is chosen for its flexibility and for the availability of efficient algorithms for training and decoding. The performance of this approach to basecalling is compared with that of the standard available basecalling algorithms.
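For readers unfamiliar with HMM decoding, the sketch below shows a toy Viterbi decoder, the standard algorithm for recovering the most probable hidden state path. The two-state model and discretized observations are invented for illustration and bear no relation to the authors' actual model:

    import math

    states = ["A", "C"]                   # toy: two bases as hidden states
    trans = {"A": {"A": 0.6, "C": 0.4}, "C": {"A": 0.4, "C": 0.6}}
    emit = {"A": {"hi_A": 0.9, "hi_C": 0.1}, "C": {"hi_A": 0.2, "hi_C": 0.8}}
    obs = ["hi_A", "hi_C", "hi_C"]        # discretized trace observations

    # Forward pass: best log-probability of any path ending in each state.
    V = [{s: math.log(0.5 * emit[s][obs[0]]) for s in states}]
    back = []
    for o in obs[1:]:
        col, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] + math.log(trans[p][s]))
            col[s] = V[-1][prev] + math.log(trans[prev][s]) + math.log(emit[s][o])
            ptr[s] = prev
        V.append(col)
        back.append(ptr)

    # Trace back the best path -- the called base sequence.
    path = [max(states, key=lambda s: V[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    print("".join(reversed(path)))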
95. The FAKtory Sequence Assembly System
Susan J. Miller, Eugene W. Myers,
Kedarnath A. Dubhashi, and Daniel E. Garrison
The FAKtory system facilitates sophisticated prescreening, assembly, contig layout manipulation and finishing for DNA sequencing projects. The system allows specialized database configuration, definition of operations performed on sequences, constraint-based assemblies, and convenient linking to post-assembly analysis software. Each user can configure various FAKtory displays and can select the degree of control desired over each of the operations. Retaining our original design goals of customizability and sound ergonomics, we have continued to add features to FAKtory.
Among our recent additions to the system are support for input of pre-trimmed sequence data and of SCF-format data. We have made a dramatic speed improvement and have increased the sensitivity of overlap detection in the underlying FAKII assembly kernel. A report generator has been added to FAKtory, and we have developed a separate graphical viewer for comparing CAF- or ace-format assemblies. The finishing editor now has the capability of running the GENSCAN program and displaying any exons found in the six reading frames. Optionally, the consensus sequence can be filtered through the CENSOR or RepeatMasker programs before searching for exons.
Currently under development are improved overlap detection using quality values and enhancements to the Finishing editor. In addition, fragment prescreening based on quality numbers will be incorporated into the system. We also plan to add the capability of importing GAF format assemblies into FAKtory and to integrate the graphical comparison tool for more convenient comparisons of alternate assemblies generated by FAKII or by other assemblers.
96. Hidden Markov Models in Biosequence Analysis: Recent Results and New Methods
Christian Barrett, Mark Diekhans, Richard
Hughey, Tommi Jaakkola, Kevin Karplus, David Kulp, Stephen Winters-Hilt,
and David Haussler
Currently there is an acute need for effective methods for locating genes in DNA sequences, along with their splice sites and regulatory binding sites, and for classifying new proteins by their predicted structure or function. Hidden Markov models (HMMs) have proven to be useful tools for these tasks. We have recently extended the HMM-based genefinding system Genie so that it can simultaneously incorporate protein homology and EST information to improve gene finding. We have also built a new library of HMMs for protein families and tested our methods against other methods for the detection of remote homologies between proteins in a large-scale experiment conducted at the Laboratory of Molecular Biology in Cambridge. Results showed the method to be superior to other methods, including PSI-BLAST, the nearest competitor. Finally, we have developed a new method of biosequence classification called the Fisher kernel method. Here an HMM (or any parametric generative model for a family of biosequences) is used to embed the sequences into a linear space with a natural inner product defined using the Fisher information matrix. One can then employ a variety of classification methods to discriminate members of the family from nonmembers, for example, support vector machines. We present experiments for the protein superfamily classification problem that show the Fisher kernel method is superior to existing HMM approaches and to simpler methods such as BLAST. In particular, the method is better at finding remote homologs in nearly all of the 33 protein families we tested, including G proteins, retroviral proteases, interferons, and many others.
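The Fisher kernel idea can be sketched with a toy generative model standing in for an HMM. Here the family model is a one-parameter Bernoulli model, the Fisher information matrix is approximated by the identity (a common simplification), and the sequences are invented:

    def fisher_score(seq, theta=0.6):
        """d/d(theta) of log P(seq | theta) for a Bernoulli model of '1' symbols."""
        ones = seq.count("1")
        zeros = len(seq) - ones
        return ones / theta - zeros / (1.0 - theta)

    def fisher_kernel(x, y):
        # Inner product of Fisher scores; with a richer model this would be a
        # vector dot product weighted by the inverse Fisher information matrix.
        return fisher_score(x) * fisher_score(y)

    family_member, candidate, outsider = "111011", "110111", "000100"
    print(fisher_kernel(family_member, candidate))  # large positive: similar
    print(fisher_kernel(family_member, outsider))   # negative: dissimilar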
97. Java Based Restriction Map Display
Mark C. Wagner, Jan-Fang Cheng,
Steve Lowry, Robert Sutherland, Norman Doggett, Laurie A. Gordon, and Anne Olsen
The Clone Resources Task of the Joint Genome Institute generates large quantities of physical mapping data. These data require the use of a graphical display for ease of understanding. The restriction mapping data generated by the JGI (chromosome 5 data by Steve Lowry and Jan-Fang Cheng, chromosome 16 data by Robert Sutherland and Norman Doggett, and chromosome 19 data by Laurie Gordon and Anne Olsen) have been made available through the use of a Java-based graphical interface.
This display enables both the biologists of the JGI and the public at large to view the current restriction maps of the JGI. We have recently upgraded the display to permit users to select a specific area of the chromosome, a gene/marker of interest, a particular map of interest, or a particular clone. These methods permit the user to get directly to the area to be studied. We have also added the display of selected genes and markers to the map, so that relationships between various biological entities are more readily apparent. We have made our sequencing information available directly from the mapping display.
98. Mapping Data
Lixin Tang, Jeremy Boulton1,
Benjamin Liau1, Hui Zhang2, Wei Qin3,
Sung Ha Huh1, Yicheng Cao, Robert Xuequn Xu, Glen George1,
and Ung-Jin Kim
In building the physical contig maps of human and other chromosomes, we found that a high degree of automation and a high degree of flexibility are usually desired at the same time. Despite remarkable progress in physical mapping projects and biocomputing, a software tool that allows both is not yet available, although there are some mapping tools that rely heavily on automation and allow limited human intervention.
AceDraw is a graphical software tool that we have developed to facilitate drawing, updating, and database entry of physical mapping data. It is capable of reading the contents of an ACeDB database, allows graphic display and freehand editing of physical maps, and dumps a physical map to a file that can be parsed by ACeDB. The program was written in C++ and uses a freely available relational database, MySQL, as a backend. It helps manage the mapping data so that the location and order of clones and landmarks, and the overlaps between clones, are displayed along the length of chromosomal regions. It is similar to ACeDB in this respect, but it can take in ".ace" files and draw more presentable and intuitive contig maps using a large number of colors, with each assigned color associated with certain traits of the data, such as "sequenced", "fingerprinted", and/or "end sequenced". It can save the mapping data back into ACeDB format after drawing, modification, and updating, thus allowing use of the special functionality provided in ACeDB.
An important feature of AceDraw is that it is a freehand drawing tool that enables easy human intervention to resolve conflicts in the contig map through direct manipulations of clones and markers, such as clone searching, moving, resizing, and color changing. AceDraw also allows easy creation, modification, and deletion of clones and landmarks without low-level editing of database files, and thus facilitates map construction. Since AceDraw is backed by a relational database, querying the database for an object of interest can be done easily. Moreover, AceDraw supports map output to high-resolution PostScript files for the printing of hardcopy maps.
99. A Relational Database and Web/CGI Approach in the Analysis and Data Presentation of Large-Scale BAC-EST Hybridization Screens
Robert Xuequn Xu, Chang-Su Lim,
Bum-Chan Park, Mei Wang, Jonghyeob Lee, Aaron Rosin, Eunpyo Moon, Melvin
Simon, and Ung-Jin Kim
Large-scale hybridization, in which both probes and targets are present in huge numbers, is frequently used in genome projects to mass-produce positive screening results. In our project, "Construction of a genome-wide human BAC-Unigene resource", probes (ESTs) are pooled in groups of 20 according to a pre-designed 20x20 matrix, and the pooled and labeled probes are applied to BAC library filters in hybridization.
We have developed a complete data management, analysis, and presentation system for the deconvolution results of our massive BAC-EST screening, built with a relational database tool combined with a web server and several Perl scripts. It features screen-by-screen progress reporting, detailed descriptions of each probe, automatic statistics report generation, and some quality-control functions (http://www.tree.caltech.edu/lib_D_Unigene.html). The methodology can be easily adapted to any other large-scale hybridization project using a similar probe-pooling strategy. With the help of this tool, the probe-pooling matrix is virtually unlimited in size, and therefore the efficiency of hybridization screening can be greatly improved; for example, 10,000 probes organized into a 100x100 matrix require only 200 hybridizations (100 row pools and 100 column pools) instead of 10,000 individual hybridizations, as the short sketch below illustrates.
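A quick sketch of that pooling arithmetic, assuming an idealized screen with no ambiguous or false-positive pools:

    # An r x c matrix of probes needs only r + c pooled hybridizations
    # instead of r * c individual ones.
    def hybridizations(rows, cols):
        return rows + cols, rows * cols

    for rows, cols in [(20, 20), (100, 100)]:
        pooled, individual = hybridizations(rows, cols)
        print(f"{rows}x{cols}: {pooled} pooled vs {individual} individual")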
The initial hybridization results are collections of BACs that are positive for particular probe pools (row and column pools). Since there are 20 row pools and 20 column pools, 40 hybridizations are performed. To resolve the individual probe-BAC relationships, the pooled screening results are fed into a relational database scheme. First, a probe table PROBE_RC is constructed with fields PROBE, ROW, and COL (PROBE is the IMAGE clone ID; ROW is the row number of the probe in the 20x20 matrix; and COL is the probe's column number). In a 20x20 matrix, ROW ranges from 1-20, and COL ranges from 21-40. Then, when the pool-wise hybridization results are available, they are entered into the positive BAC table as (BAC, POS), in which BAC is the positive BAC address identified from the filter screen and POS is the number of the probe pool to which that BAC is positive. Thus, POS values range from 1 to 40. Finally, after data entry, a relational database tool is used to create views that split the positive BAC table into a BAC_ROW view and a BAC_COL view, and these two views are joined on the BAC field (i.e., BAC_ROW.BAC = BAC_COL.BAC) to create a new view, BAC_ROW_COL, which contains BACs that appear on both row-pool screens and column-pool screens.
The resulting view BAC_ROW_COL (with fields BAC, BAC_ROW.POS, and BAC_COL.POS) is joined further with the probe table under the join condition "PROBE_RC.ROW = BAC_ROW.POS AND PROBE_RC.COL = BAC_COL.POS", and the final "deconvolution" table is generated by selecting BAC, ROW, COL, and PROBE. All possible individual positive BAC-probe relations are revealed in this table, which can be grouped, sorted, and reported through the relational database tool's internal reporting functions and published to the Web, or through custom-designed Perl/CGI scripts.
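The sketch below reproduces this deconvolution scheme, with Python's sqlite3 standing in for the relational database tool. The table contents are invented examples, and the column names are adjusted slightly (row_num, col_num) for clarity.

import sqlite3

# Deconvolution of pooled hybridization results, as described above.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE probe_rc (probe TEXT, row_num INTEGER, col_num INTEGER);
CREATE TABLE positive (bac TEXT, pos INTEGER);
-- a probe at matrix position (row 3, column 25):
INSERT INTO probe_rc VALUES ('IMAGE:12345', 3, 25);
-- one BAC positive to row pool 3 and column pool 25:
INSERT INTO positive VALUES ('B7F10', 3);
INSERT INTO positive VALUES ('B7F10', 25);
-- split the positive-BAC table into row-pool and column-pool views:
CREATE VIEW bac_row AS SELECT bac, pos FROM positive WHERE pos BETWEEN 1 AND 20;
CREATE VIEW bac_col AS SELECT bac, pos FROM positive WHERE pos BETWEEN 21 AND 40;
""")
# Join the two views on BAC, then join with the probe table to recover
# individual BAC-probe relationships.
rows = con.execute("""
    SELECT r.bac, p.row_num, p.col_num, p.probe
    FROM bac_row r JOIN bac_col c ON r.bac = c.bac
    JOIN probe_rc p ON p.row_num = r.pos AND p.col_num = c.pos
""").fetchall()
print(rows)  # -> [('B7F10', 3, 25, 'IMAGE:12345')]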
100. A Distributed Object System for Automated Processing and Tracking of Fluorescence Based DNA Sequence Data
Michael C. Giddings, Jessica M. Severin,
Michael Westphall, and Lloyd M. Smith
We have been developing a system that allows rapid, easy construction of data-handling and analysis pipelines for DNA sequencing, using modular componentware that can be assembled into a working analysis system through an easy-to-use graphical interface. The system uses network-distributed object communications for data transfer and process synchronization, allowing the integration of processing programs and components located on different hardware and software platforms. Our analysis system focuses on the automated processing of raw data collected on four-color fluorescence-based DNA sequencing instruments, up to the point of base calling. It employs a distributed object framework that allows easy integration of new components into the analysis system; a new component can even be an existing analysis program available as a precompiled command-line tool, such as Phred or Phrap. We currently have five individual component servers implemented using this framework, corresponding to the steps of input of new gel data into the system, automatic lane tracking, manual checking of lane tracks, lane extraction into trace data, and finally trace preprocessing and base calling by BaseFinder. The last element is the "System Controller," which handles system setup, data flow, information tracking, configuration of individual processing steps, handling and recovery of server failures, and parallel distributed load balancing to facilitate the use of multiple servers of the same type distributed across multiple machines. Parallel distributed load balancing allows the system to scale with sequence-processing demands through the simple addition of more processing computers and duplicate servers.
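The load-balancing idea can be illustrated with a minimal sketch: a controller feeds work items through a queue to interchangeable workers, so throughput scales by adding duplicate servers. The names and the trivial processing step below are ours for illustration; the actual system uses network-distributed object communication rather than local threads.

import queue, threading

jobs = queue.Queue()
results = queue.Queue()

def component_server(server_id):
    # Stand-in for one processing step (e.g. lane tracking or base calling).
    while True:
        gel = jobs.get()
        if gel is None:          # shutdown signal from the controller
            break
        results.put((server_id, "processed " + gel))

workers = [threading.Thread(target=component_server, args=(i,)) for i in range(3)]
for w in workers:
    w.start()
for gel in ["gel001", "gel002", "gel003", "gel004"]:
    jobs.put(gel)                # controller dispatches; idle servers pick up work
for _ in workers:
    jobs.put(None)
for w in workers:
    w.join()
while not results.empty():
    print(results.get())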
100a. Recent advances in high-throughput genomic sequencing: Magnetic Capture of Plasmids
Kevin McKernan, Paul McEwan, Will Morris, Nicole Stange-Thomann,
Imani Torruella-Miller, Andrew Sheridan, Alan Wagner, Dudley Wyman, Boris
Pavlin, James Benn, Eric S. Lander, Lauren Linton
In an attempt to build a genome-center production facility capable of 200 Mb of finished sequence per year, it is important to streamline and simplify operations to allow scalability. At the Whitehead Institute we have pursued this goal by designing biochemical protocols that are amenable to automation while constructing the automation as needed. Recently, we have designed and constructed automation for M13 and plasmid purification.
Before the invention of the protocol described below, plasmid purification protocols were labor intensive and expensive. Current commercially available protocols allow one technician to process 1,000 clones per 8-hour work day at an average cost of $1.20/well (Qiagen).
M13 purification protocols, on the other hand, have been readily automated in a few laboratories (Stanford, WI) and can produce 12,000 clones per 8-hour day (15 plates/hour) at a cost of $0.07/well, while also offering the quality control of full automation. Automating M13 purification has thus proven cost effective, and it generates high-quality data owing to single-stranded sequencing.
Because of these advantages, most genome centers have chosen M13 (Wash U, Whitehead '97, Baylor, Stanford) as their primary vector, supplementing shotgun projects either with 1-2x tiling paths of plasmids to obtain a forward- and reverse-linked map of the BAC or cosmid, or by resorting to directed reverse sequencing of PCR products of their M13 templates. These low-coverage double-stranded supplements minimize the cost and labor of purifying an entire 10x plasmid subclone library. Unfortunately, in Whitehead's experience, PCR-based approaches to obtaining the forward and reverse linked map with M13 have proven problematic because of the difficulty modern polymerases have amplifying repetitive regions and homopolymer stretches.
A more severe drawback of M13 as a primary vector is its tendency to delete tandem Alu repeats in human inserts. These deletions occur even in host cells with recA- genotypes (DH5 alpha LacIq RecA-). RF dsDNA isolated from the deleted clones also harbored the deletions, suggesting that the deletion events may occur during rolling-circle replication, when the RF plasmid is temporarily single stranded.
Plasmids with the pUC ori appear to be immune to these deletion events and show a different cloning bias, as assessed by the standard deviation of coverage across all plasmid projects. If the pass rates, quality, and costs of M13 and plasmids were equivalent, human genome sequencers would ideally choose a double-stranded vector system for two primary reasons.
1) Forward and reverse linkage information provided by the double stranded
vectors is invaluable in assembling repetitive genomic DNA.
In conclusion, the two vector systems complement each other. However, supporting both purification pipelines in a high-throughput genome center can be Pyrrhic, owing to the severe differences between the conventional protocols needed to isolate M13 and plasmid DNAs. Here we describe a novel, automatable plasmid purification protocol that is also compatible with automated M13 DNA purification within the same robotic workstation, enabling a center to move freely between M13 and plasmid DNA template coverage.
The evolution of the protocol and the status of the robotic workstations will be discussed.
101. Arraydb: CGH-Array Tracking Database
Donn Davy, Daniel Pinkel, Donna
Albertson, Steve Clark, Joel Palmer, Don Uber, Arthur Jones, Joe Gray,
and Manfred Zorn
In collaboration with the UCSF Cancer Center, we have developed a database to track data from all stages of the production and use of the CGH (Comparative Genome Hybridization) expression array slides produced robotically by a separate LBNL-Cancer Center collaboration. The system tracks clones or DNA and their sources as they are selected for use in the array, the DNA preparation, the microtiter plates, print-run specifications, slide-printing runs, and the slides printed. Researchers then log experiments performed on slides, with the resulting slide images and analyses written back to the database, providing results that link back to the original clone or DNA and its source.
The system is implemented in an Oracle 8 database, served on the Web by a NetDynamics application server, providing a highly scalable, flexible, and responsive solution. It is accessible from Java-compatible web browsers and can provide fine-grained control of security and accessibility.
102. BCM Search Launcher -- Analysis of the Genome Sequence
Kim C. Worley and Pamela A. Culpepper
We provide web access to a variety of enhanced sequence analysis search tools via the BCM Search Launcher. The BCM Search Launcher is an enhanced, integrated, and easy-to-use interface that organizes sequence analysis servers on the WWW by function, and provides a single point of entry for related searches. This organization makes it easier for individual researchers to access a wide variety of sequence analysis tools. The Search Launcher extends the functionality of other WWW services by adding hypertext links to additional information that can be extremely helpful when analyzing database search results.
The BCM Search Launcher Batch Client provides access to all of the searches available from the Search Launcher web pages in a convenient interface. The Batch Client application automatically 1) reads sequences from one or more input files, 2) runs a specified search in the background for each sequence, and 3) stores each of the search output files as an individual document directly on the user's system. The HTML-formatted result files can be browsed at any later date, and retrieved sequences can be used directly in further sequence analysis. For users who wish to perform a particular search on a number of sequences at a time, the Batch Client provides complete access to the Search Launcher with the convenience of batch submission and background operation, greatly simplifying and expediting the search process.
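The batch workflow is simple enough to sketch. The skeleton below shows the general pattern (read FASTA records, submit each sequence to a search URL, save each HTML result); the endpoint and form field name are placeholders of ours, not the actual Search Launcher interface.

import pathlib, time, urllib.parse, urllib.request

SEARCH_URL = "http://example.org/cgi-bin/search"   # hypothetical endpoint

def read_fasta(path):
    # Yield (name, sequence) pairs from a FASTA file.
    header, seq = None, []
    for line in pathlib.Path(path).read_text().splitlines():
        if line.startswith(">"):
            if header:
                yield header, "".join(seq)
            header, seq = line[1:].split()[0], []
        else:
            seq.append(line.strip())
    if header:
        yield header, "".join(seq)

def run_batch(fasta_files):
    for fasta in fasta_files:
        for name, seq in read_fasta(fasta):
            data = urllib.parse.urlencode({"sequence": seq}).encode()
            with urllib.request.urlopen(SEARCH_URL, data) as reply:
                # one HTML result document per input sequence
                pathlib.Path(name + ".html").write_bytes(reply.read())
            time.sleep(1)   # pause between submissions to be polite

# run_batch(["batch1.fa"])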
BEAUTY, our BLAST Enhanced Alignment Utility, makes it much easier to identify weak but functionally significant matches in BLAST protein database searches. BEAUTY is available for DNA queries (BEAUTY-X) and for gapped-alignment searches. Up-to-date versions of the Annotated Domains database provide annotation information. Our collaboration with the Genome Annotation Consortium (http://compbio.ornl.gov/tools/channel) provides BEAUTY search results for all of the predicted protein sequences found in the human genomic sequences produced by the large-scale sequencing centers.
Support provided by the DOE (DE-FG03-95ER62097/A000).
103. Profile Search
Protein Sequence & Profile Search Software Product
The ultimate goal of this project is to produce a software product that facilitates querying many protein sequence and profile databases. Platform portability has been tested on Solaris, Linux, and Windows 95. The software builds an optimized database from the protein sequence and profile database files for maximum performance.
The current features of this software system include:
Michael Giddings, Olga Gurvich,
Marla Berry, John Atkins, and Raymond Gesteland
The selenocysteine insertion sequence (SECIS) has been found in a number of organisms. In eukaryotes it consists of a particular structure in the 3' untranslated region that modifies the behavior of ribosomal translation within the coding region, causing selenocysteine insertion at a UGA codon instead of the translational stop that would normally occur. SECIS elements vary in structure, but all contain several common elements. Known examples contain a core group of 4-5 nucleotides that do not conform to the usual Watson-Crick pairing rules, as well as a stem-loop structure that contains a group of 2-3 adenosine nucleotides as either a bulge or part of the upper loop.
Locating new SECIS elements through pattern searches in DNA databases poses some significant challenges. We are using a three-pronged approach to this problem. The first phase utilizes a fast, rough scan of large databases with Ross Overbeek's "patscan" software to narrow the field of possibilities. The second phase, currently being implemented, performs a refined analysis of the candidates, scoring and ranking them with a new algorithm in which neural network and fuzzy logic approaches are being explored. The third phase then performs visualization of the candidates, based on an algorithm that utilizes rules regarding the formation of SECIS elements to fold and display them for human analysis.
We present the implementation details of this system in its current form, as well as initial results for scans of several genomic databases using this system.
We would like to acknowledge the following supporters of this work:
DOE Grant #DE-FG03-94ER61817, "Advanced Sequencing Technology"
DOE Grant (no # assigned yet), "Genomic Analysis of the Multiplicity of Protein Products from Genes"
NIH Genome Training Grant #T32HG00042
105. Sequence Landscapes
Gary D. Stormo, Samuel Levy, and
Sequence landscapes are a graphical display of the word frequencies from a database (DB) for every word of every length in a target sequence (TS) [see Levy et al., Bioinformatics 14:74-80, 1998]. If the TS and the DB are the same sequence, this is a convenient method to detect all of the repeated sequences, of any length. However, we have been exploring the use of this approach for classifying regions of DNA sequence into functional domains, such as exons, introns, and promoters. Using a DB from each class, the landscapes can be used to derive likelihoods that every region of the sequence belongs to each possible class. We think this information can be combined with other types of information to help provide improved recognition algorithms. We are especially interested now in improving methods for determining promoter regions and transcription initiation sites. The information in the landscape can also be very useful for determining the best oligos to use on DNA chips. One criterion for choosing the best oligos is that they be as specific as possible for the gene being assayed; therefore one would like to pick, for each gene, the oligo with the most mismatches to the most similar other sites in the genome. This can be accomplished easily and efficiently with the landscape information. We return a list of candidate oligos which can then be ranked by other criteria, including hybridization energy and Tm.
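To make the landscape idea concrete, the toy sketch below counts database occurrences of every word of length 1..K starting at each target position. The sequences and the cap K are invented, and a real implementation would use suffix structures rather than this quadratic scan.

from collections import defaultdict

def landscape(target, database, K=8):
    counts = defaultdict(int)
    for i in range(len(database)):                 # index every database word
        for k in range(1, K + 1):
            if i + k <= len(database):
                counts[database[i:i + k]] += 1
    # For each target position, occurrence counts of the words of
    # increasing length that start there.
    return [[counts[target[i:i + k]]
             for k in range(1, min(K, len(target) - i) + 1)]
            for i in range(len(target))]

db = "ACGTACGTTTACGT"
ls = landscape("ACGT", db)
print(ls[0])  # counts for A, AC, ACG, ACGT in the database -> [3, 3, 3, 3]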
106. Protein Fold Prediction in the Context of Fine-Grained Classifications
Inna Dubchak, Chris Mayor, Sylvia
Spengler, and Manfred Zorn
Predicting a protein fold, and the function it implies, from the amino acid sequence is a problem of great interest. We have developed a neural network (NN)-based expert system which, given a classification of protein folds, can assign a protein to a folding class using primary sequence data. It addresses the inverse protein folding problem from a taxonometric rather than a threading perspective. Recent classifications suggest the existence of ~300-500 different folds. The occurrence of several representatives of each fold allows extraction of the common features of its members. Our method (i) provides a global description of a protein sequence in terms of the biochemical and structural properties of the constituent amino acids, (ii) combines the descriptors using NNs, allowing discrimination of members of a given folding class from members of all other folding classes, and (iii) uses a voting procedure among predictions based on different descriptors to decide on the final assignment. The level of generalization in this method is higher than in direct sequence-sequence and sequence-structure comparison approaches. Two sequences belonging to the same folding class can differ significantly at the amino acid level, but the vectors of their global descriptors will lie very close in parameter space. Thus, utilizing these aggregate properties for fold recognition has an advantage over using detailed sequence comparisons. The prediction procedure is simple, efficient, and incorporated into easy-to-use software. It was applied to fold prediction in the context of the fine-grained classifications 3D_ALI1 and the Structural Classification of Proteins, SCOP2. In an attempt to simplify the fold recognition problem and increase the reliability of predictions, we also approached a reduced fold recognition problem, in which the choice is limited to two folds. Our prediction scheme demonstrated high accuracy in extensive testing on independent sets of proteins.
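As a hedged illustration of what one such global descriptor can look like, the sketch below computes the composition of a sequence over three physicochemical groups of amino acids. The grouping shown is a common hydrophobicity partition chosen by us for illustration, not necessarily the authors' exact descriptor set.

GROUPS = {"polar": set("RKEDQN"), "neutral": set("GASTPHY"),
          "hydrophobic": set("CLVIMFW")}

def composition_descriptor(seq):
    # Fraction of residues falling in each physicochemical group.
    n = len(seq)
    return [sum(aa in members for aa in seq) / n
            for members in GROUPS.values()]

# Two distantly related sequences can differ at most positions yet have
# nearby descriptor vectors, which is what makes fold-level assignment
# possible with such aggregate features.
print(composition_descriptor("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))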
A WWW page for predicting protein folds is available at URL http://cbcg.nersc.gov
1. Pascarella, S., & Argos, P. (1992). Prot. Engng. 5:121-137.
2. Murzin, A.G., Brenner, S.E., Hubbard, T., & Chothia, C. (1995). J. Mol. Biol. 247:536-540.
107. Comparative Analyses of Syntenic Blocks
Jonathan E. Moore and James A. Lake
The comparative analysis of syntenic blocks common to the genomes of closely related organisms, such as those found in humans and mice, appears to have enormous potential to aid in the identification of gene boundaries and open reading frames (ORFs) and in the interpretation of gene organization. Recently, pattern filtering, a new genome analysis tool, has been developed. Pattern filtering methods appear able to obtain an optimal signal-to-noise ratio when used to search for ORFs, and they also simplify the analysis of codon periodicities. In initial studies, pattern filtering appears to be a sensitive and robust indicator of ORFs and of gene structure and organization. Our major goal is to develop rapid, simple, and effective methods for analyzing syntenic blocks from the human, mouse, Drosophila, and Caenorhabditis genomes, using pattern filtering to optimally determine rates of evolution and thereby map ORFs, gene boundaries, regulatory regions, and introns. Preliminary experiments with syntenic blocks in human chromosome 12p13 and the corresponding region in mouse, and also experiments with mammalian mitochondrial DNAs, will be used to illustrate the potential of the method.
108. Sensitive Detection of Distant Protein Relationships Using Hidden Markov Model Alignment
Xiaobing Shi and David J. States
Hidden Markov models are statistical models of the primary structure of a sequence family. In this poster, an algorithm to align hidden Markov models (HMMs) of protein sequences is presented along with its software implementation. Aligning HMMs provides a way to compare sequence families. Compared to pair-wise sequence alignment, HMM alignment is more sensitive in identifying relationships between sequence families and requires less computation. Our algorithm uses dynamic programming to identify similarities between two HMMs. Two scoring algorithms are used: the local alignment algorithm, which identifies the most similar segments of two HMMs, and the "glocal" alignment algorithm, which aligns the entire length of one HMM to a similar segment of the other model.
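A minimal sketch of such a dynamic program is shown below, with each HMM reduced to its match-state emission distributions and column pairs scored by the log-odds of their co-emission probability. This scoring choice and the toy alphabet are our assumptions for illustration, not the scoring functions actually used.

import math

def coemission_score(p, q):
    # p, q: emission probability dicts over the same alphabet
    s = sum(p[a] * q[a] for a in p)
    bg = 1.0 / len(p)            # co-emission of two uniform columns
    return math.log(s / bg)

def local_align(hmm1, hmm2, gap=-1.0):
    # Smith-Waterman-style local alignment over two lists of emission columns.
    n, m = len(hmm1), len(hmm2)
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = F[i-1][j-1] + coemission_score(hmm1[i-1], hmm2[j-1])
            F[i][j] = max(0.0, match, F[i-1][j] + gap, F[i][j-1] + gap)
            best = max(best, F[i][j])
    return best

# Toy models over a 4-letter alphabet sharing one conserved column.
col = {"A": 0.85, "C": 0.05, "D": 0.05, "E": 0.05}
flat = {"A": 0.25, "C": 0.25, "D": 0.25, "E": 0.25}
print(local_align([col, flat], [flat, col]))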
We have developed software to perform the alignment and have set up a website allowing users to perform alignments over the Internet. Besides allowing users to input or upload HMMs, the website can build HMMs from user-supplied raw sequences or multiple alignments. All HMMs in the Pfam database are also available for alignment on the website. We also provide a method to generate and then align two random HMMs, whose score can be used to assess the significance of an HMM alignment score.
We have used this software to align all pairs of HMMs in the Pfam database, and the results have revealed interesting relationships between existing protein families that had not previously been recognized. For example, the high HMM local alignment score between the Sodium:solute symporter family (SSF) and the amino acid permease family suggests that these two families are closely related. Other examples include the tropomyosin and filament families, and the GerE and sigma70 families.
109. Multiple Sequence Alignment with Confidence Estimates
David J. States
Multiple sequence alignment (MSA) is the basis for many aspects of molecular sequence analysis, including phylogenetics, motif detection, and molecular modeling. Because the space of possible multiple sequence alignments is very large and the information accessible through sequence data is limited, there are often regions of a multiple sequence alignment that are not well determined. Here we develop a theory for assessing the confidence of a multiple sequence alignment, describe software that implements this algorithm, and discuss the application of these methods.
A hierarchical approach to MSA is used in which each constituent sequence is related to the full alignment as a leaf in a tree of nearest-neighbor relationships. The algorithm uses a progressive strategy for building the multiple alignment. Hidden Markov models (HMMs) are used to describe each sequence or collection of sequences. At each phase in the alignment calculation, all current models are compared with each other using a dynamic programming calculation to find the maximum-scoring local alignment. A new HMM is derived from the pair of models with the highest alignment score, and this new model replaces both of the previous models. The iteration is repeated until only a single HMM remains. A site-specific confidence estimate, C, for pairwise alignments is calculated by comparing the likelihood for the optimal alignment passing through a pair of residues with the sum of the likelihoods for all alternative pairings of either the query or target residue:
C_{ij} = \frac{S_{ij}}{\sum_{(i',j') \in A(i,j)} S_{i'j'}}

where S_{ij} is the optimal score for an alignment passing through the pair of residues i and j, calculated using a forward and backward dynamic programming algorithm [Vingron and Argos, Bishop and Thompson], and A(i,j) is the set of alternative pairings involving residue i or residue j. Note that the alternatives include the possibility that the site is deleted or inserted, as well as being a matched pair of residues. C has the form of a probability and is bounded by 0 < C < 1. The overall confidence for a site in the multiple sequence alignment is calculated as the product of the confidences of all of the pairwise alignments making up the full alignment.
The algorithm provides an efficient way to build HMMs for large families of unaligned sequences. A web site providing access to this tool is available at http://www.ibc.wustl.edu/service/msa.
110. Improved Specificity and Sensitivity in Sequence Similarity Search Through the Use of Suboptimal Alignment Based Score Filtering
Lisa Gu and David States
The specificity of molecular sequence similarity search is often limited by the repetitive elements present in biological sequences. Both repeat filtering and biased-content filtering methods have been proposed to alleviate these problems; however, these methods can mask off large portions of some query sequences, limiting the utility of subsequent searches. We have examined the use of suboptimal alignment to automatically identify robust regions of sequence similarity and, indirectly, to filter out repetitive regions whose alignment is not definite. In this algorithm, the alignment confidence is assessed by comparing the score of the optimal alignment in which a pair of residues is aligned with the highest score for an alignment in which the two residues are not paired. Varying degrees of stringency can be applied by raising the threshold for accepting an aligned pair. A "confidently aligned residues" (CAR) score is obtained by performing an optimal Smith-Waterman alignment and subtracting the pairwise scores for those residue pairs in the alignment that cannot be confidently aligned.
Protein families rich in repetitive sequence were examined, and members of the same family were aligned with each other. The resulting CAR scores were compared to those obtained using the XNU filter as a masking technique and WU-BLASTP (2.0) as the search algorithm. For the collagen family, whose members have extensive and highly repetitive regions, CAR-based scoring is uniformly more sensitive in detecting family members than XNU + BLASTP. Alignments are missed by XNU + BLASTP as a result of excessive masking by XNU, but large numbers of false-positive alignments are seen if BLASTP is run without XNU. On the other hand, XNU + BLASTP is, in some cases, able to detect regions of similarity in the myosin heavy chain family, which has some members with a minimal amount of repetitive sequence. For non-collagen, non-myosin repetitive-sequence proteins, CAR scores detected a significant number of similarities missed by XNU + BLASTP, and in no case was a similarity detected by XNU + BLASTP missed with CAR scoring. These results can be explained by the fact that the suboptimal alignment algorithm imposes a more stringent constraint on the alignment between two sequences than BLASTP, and that, where members have minimal repetitive regions, masking by XNU does not cause a great loss of information. CAR scores appear to be a useful tool for enhancing the performance of sequence similarity search in the face of repetitive sequence regions.
110a. Expert System for Long-Read Base-Calling in DNA Sequencing by Capillary Electrophoresis
Arthur W. Miller and Barry L. Karger
We have recently reported the routine sequencing of 1000 bases in less than one hour by capillary electrophoresis (CE) with replaceable linear polyacrylamide solutions (Salas-Solano et al., Anal. Chem. 1998, 70, 3996-4003). One factor contributing to this result was a base-calling expert system, ABC. Compared to our earlier base-calling approaches, the principal benefit of this base-caller has been a reduction in errors at read lengths above 800 bases, where peaks may be too poorly resolved to determine precise base positions. A more flexible and robust version of ABC has now been developed, which begins by performing color separation and baseline subtraction. It then divides the electropherogram into short sections, which are analyzed independently to estimate noise, peak width, and other parameters. This initial analysis is used to select basecalling rules for each region of the data, which are applied to determine the final DNA sequence. Base confidences are assigned using decision trees. ABC works with four-dye CE or slab gel data acquired using four or more raw spectral channels, and requires no user configuration.
This work is being supported by DOE grant DE-FG02-98ER 69895.
111. Screening for Large-Scale Variations in Human Genome Structure
S. MacMillan1, C. Hott1,
D. Anderson1, E. C. Rouchka2, B. D. Dunford-Shore,
B. Brownstein1, R. Mazzarella2, V. Nowotny2,
and D. J. States2
The human genome is polymorphic at all scales, ranging from single nucleotide polymorphisms to cytogenetically visible translocations spanning tens of megabases, but it remains difficult to characterize variation between these scales. We have proposed a method for screening for the presence of large-scale structural variants in the human genome. To demonstrate the feasibility of our strategy, STS markers derived from regions of finished genomic sequence are used to screen BAC and PAC libraries derived from 9 individuals, with coverage in excess of 20-fold, using a hierarchical multiplex hybridization and PCR approach. Recovered clones are subjected to both end-sequence analysis and four-enzyme restriction-digest (RE) fingerprinting. End-sequence reads are aligned with the reference genomic sequence, and their separation is compared with the molecular size of the clone as determined by the sum of the RE fragment sizes. The set of restriction digests predicted from the region spanned by the end-sequence alignments is compared with the experimental digests. Our method was validated by applying it to the verification of three BAC sequencing projects from the Chen laboratory, demonstrating a fingerprint sizing accuracy of better than 1% for bands with molecular weight between 1.2 and 15 kb. Successful fingerprints and end-sequences were generated for all clones, and no false positive or false negative calls were identified in 302 bands scored for comparison. To date, 61 markers spanning 2 megabases of sequence (color vision, BRCA2, and TCR beta) have been screened, retrieving 249 clones. Fifteen sites have been identified where multiple clones demonstrate a consistent pattern of deviation from the predicted digest pattern, including the presence of novel bands as well as the absence of predicted bands. To validate these variations, a second tier of RE digests has been implemented to further characterize the variants, and PCR assays are being developed to test for the presence of these variants in uncloned genomic DNA.
112. Probabilistic Physical Map Assembly
David J. States, Thomas W. Blackwell,
John McCrow, and Volker Nowotny
Physical map assembly is the inference of genome structure from experimental data derived from clones and markers, and map assembly is central to genome analysis. Map assembly depends on the integration of diverse data, including sequence tagged site (STS) marker content, clone sizing, and restriction digest fingerprints (RDF). Like any experimental data, these data are uncertain and error prone. Physical map assembly from error-free data is algorithmically straightforward and can be accomplished in time linear in the number of clones. However, the assembly of an optimal map from error-prone data is an NP-hard problem [Turner, Shamir]. In this abstract we present an approach to physical map assembly that is based on a probabilistic view of the data and seeks to identify those features of the map that can be reliably inferred from the available data. This approach achieves several goals: the use of multiple data sources, appropriate representation of uncertainties in the underlying data, the use of clone length information in fingerprint map assembly, and the use of higher-order information in map assembly. By higher-order information we mean relationships that are not expressible in terms of neighboring-clone relationships; these include triplet and higher-order constraints (a+, b, c+ => b likely to be +), the uniqueness of STS position, and fingerprint marker locations.

Probabilistic descriptions of the map provide an alternative approach to the problem of physical mapping. In this view, we assert that it is impossible to know which of the many possible map assemblies is correct; we can only state which assemblies are more likely than others given the available experimental observations. Parameters of interest are then derived as likelihood-weighted averages over map assemblies. Ideally these averages would be sums or integrals over all possible map assemblies, but this is not computationally feasible for real-world map assembly problems. Instead, Gibbs sampling is used to approach the desired parameters asymptotically. Software implementing our probabilistic approach to mapping has been written; assembly of mixed RDF and STS maps containing up to 60 clones can be accomplished on a desktop PC with run times under an hour. A Java-based physical map viewing tool has also been written to display the results of these calculations.
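The sampling idea can be sketched compactly. The toy below uses Metropolis updates (in place of the Gibbs sampler described) over clone orders, with a stand-in likelihood that merely rewards adjacent clones sharing STS markers; all data and the likelihood itself are invented for illustration.

import math, random

clone_sts = {"c1": {"s1", "s2"}, "c2": {"s2", "s3"},
             "c3": {"s3", "s4"}, "c4": {"s4", "s5"}}

def log_likelihood(order):
    # Stand-in score: shared STS content between adjacent clones.
    return sum(len(clone_sts[a] & clone_sts[b])
               for a, b in zip(order, order[1:]))

def sample_orders(clones, n_steps=10000, seed=0):
    rng = random.Random(seed)
    order = clones[:]
    counts = {}
    for _ in range(n_steps):
        i, j = rng.sample(range(len(order)), 2)   # propose swapping two clones
        proposal = order[:]
        proposal[i], proposal[j] = proposal[j], proposal[i]
        # Metropolis acceptance keeps the chain visiting assemblies in
        # proportion to their likelihood.
        if math.log(rng.random()) < log_likelihood(proposal) - log_likelihood(order):
            order = proposal
        key = tuple(order)
        counts[key] = counts.get(key, 0) + 1
    return counts

counts = sample_orders(list(clone_sts))
best = max(counts, key=counts.get)
print(best, counts[best])   # the most frequently visited assembly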
113. Multi-Resolution Molecular Sequence Classification
David J. States, Zhengyan Kan, and
Classification is the most reliable and widely used basis for inferring macromolecular function from primary sequence. Beginning with the pioneering work of Margaret Dayhoff, a number of sequence classification algorithms have been proposed, based on sequence signatures (Prosite), profiles (Blocks), HMMs (Pfam), and transitive closure relationships (HHS and others). There are intrinsically conflicting constraints on domain classifications that make it difficult to achieve satisfactory performance in all applications all of the time. Classes must be general enough to represent all of the members of a class, but this generality limits the information content of any single pattern and reduces the sensitivity with which members can be detected. Further, the stochastic nature of mutations may result in domain detection in some sequences and failure to detect domains in other, closely related sequences. In transitive closure methods, where we are attempting to infer domain structure from similarity relationships, variations in the extent of sequence covered by alignments may further confuse matters and result in the failure to recognize a domain consistently; instead, the algorithm defines several related domains with overlapping membership and sequence extents.
Here we present a novel approach to molecular sequence classification that addresses some of these problems. A multi-resolution approach is employed in which sequences are first classified into transitive closure groups (TCGs) on the basis of high-scoring global sequence alignments. These TCGs are then grouped into superfamilies based on inferred domain content and local sequence similarity relationships. All of the members of a TCG are assumed to have identical domain structure, providing more redundancy in the data available for domain definition and avoiding inconsistent domain annotation between closely related sequences. To date, 14,227 transitive closure groups with more than two members have been defined in a classification of non-redundant protein sequences derived from SwissProt, PIR, OWL, TREMBL, and GenBank. Work on HMM representations of TCGs and on the grouping of TCGs into superfamilies is ongoing. Relating the annotation and literature references accessible through primary sequence classification to the structure-based classification being developed at SDSC is proposed as a goal for the Molecular Sciences Thrust.
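Transitive closure grouping itself is straightforward to implement with a union-find structure, as in the minimal sketch below; the sequence names and pair list are invented, standing in for all-vs-all alignment scores above a threshold.

# Any chain of high-scoring global alignments places sequences in the
# same transitive closure group.
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path halving keeps trees shallow
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

high_scoring_pairs = [("seqA", "seqB"), ("seqB", "seqC"), ("seqD", "seqE")]
for a, b in high_scoring_pairs:
    union(a, b)

groups = {}
for s in parent:
    groups.setdefault(find(s), []).append(s)
print(list(groups.values()))  # -> [['seqA', 'seqB', 'seqC'], ['seqD', 'seqE']]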
114. PQ Edit--A Web-Based Database Table Editor and the Relational Database Abstraction Layer
Brian H. Dunford-Shore and David J. States
High-throughput genome sequencing necessitates the production and use of large amounts of information. To make such information usable, it must be easy to enter, edit, search, and manipulate, and the information systems must evolve with changes in experimental design. Relational database servers and tools such as form generators or Microsoft Access provide part of the solution but do not offer client-independent, flexible, instantly usable data entry and editing. The PQ Edit program and the RDBAL (Relational Database Abstraction Layer) Perl 5 modules were written to fill this gap. PQ Edit is a Perl CGI script that provides general-purpose, client-independent, web-based table editing for relational database tables using automatically generated CGI forms. PQ Edit allows the editing of any database table as it currently exists, despite any changes made to the definition (schema) of the tables, and provides a reasonable entry form so that the (re)writing of data entry forms for relational databases is unnecessary most of the time.

PQ Edit is based on the RDBAL Schema Object, a general-purpose Perl library for retrieving database definitions and for searching or manipulating data. RDBAL is an abstraction layer for relational (SQL) databases that allows middleware-independent SQL execution and retrieval of database schema (catalog) information. RDBAL tries to hide implementation details from scripts so that they need no changes to run on different platforms, such as Linux, Solaris, or Windows NT, and Sybase/MS SQL Server or Oracle. The RDBAL Perl library uses Perl 5 objects to make it easy to retrieve information about a particular database's schema. The database connection is cached in the schema object. Database entities (tables, views, and procedures), their fields' properties, and their index information are retrieved when the schema object is created. Primary- and foreign-key relationship information is also retrieved for all tables in a database. Currently PQ Edit and RDBAL support Transact-SQL (Sybase and MS SQL) and Oracle relational database servers via SybaseDBlib, ODBC, or DBI/DBD drivers; other types of data sources, such as AceDB, are possible. PQ Edit and RDBAL have been tested and used on the Apache and MS IIS web servers and on Solaris and Windows NT.
115. Allele Frequency Estimation from Sequence Trace Data
David G. Politte, David R. Maffitt, and
David J. States
Parametric model fitting of unprocessed sequencing-gel trace data combined with a least-squares optimization algorithm provides a method for accurately determining the allele frequencies of a single nucleotide polymorphism in a population. The method uses trace data from one or two homozygous individuals as a reference to estimate the allele frequencies present in DNA derived from a pooled population. A parametric model is fit to each of the traces to estimate the amount of each of the four fluorescent dyes present at each site. The parameters estimated from each trace are then normalized to account for scalar variations due to differences in the amount of template or sample loaded. The parameters estimated from the trace of the heterozygous individual or from the mixture are viewed as a weighted sum of the parameters estimated from the traces of the homozygous individuals. The weights, or allele frequencies, are estimated by minimizing the sum of squared errors between the linear combination of homozygous traces and the mixed trace. Comparison of the allele frequencies estimated by our method to known frequencies at polymorphic sites in three pools of CEPH individuals shows that our method is accurate to ~10% even when only a single homozygous reference is available. The allele frequency estimator is accessed via a portable Java-based interface that reads ABD or SCF format trace files and allows the user to interactively select sites of interest. When a site has been identified, allele frequency estimation calculations are performed remotely using HTTP-mediated requests. Our method is automatic and much less labor intensive than previous approaches. Software is available at http://www.ibc.wustl.edu/software/allele-estimation.
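The core estimation step reduces to ordinary least squares, as in the toy sketch below; the vectors are invented stand-ins for the per-site, per-dye amounts estimated by the parametric model fit.

import numpy as np

# Treat the pooled trace parameters as a weighted sum of the two
# homozygous reference traces and solve for the weights (allele
# frequencies) by least squares.
h1 = np.array([0.9, 0.1, 0.8, 0.2])   # homozygous allele-1 reference
h2 = np.array([0.1, 0.9, 0.2, 0.8])   # homozygous allele-2 reference
mix = 0.3 * h1 + 0.7 * h2 + np.random.default_rng(0).normal(0, 0.02, 4)

A = np.column_stack([h1, h2])
freqs, *_ = np.linalg.lstsq(A, mix, rcond=None)
freqs = np.clip(freqs, 0, None)
freqs /= freqs.sum()                  # frequencies must sum to one
print(freqs)                          # approximately [0.3, 0.7]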
116. Improved Detection of Single Nucleotide Polymorphisms (SNPs)
Scott L. Taylor, Natali Kolker,
and Deborah A. Nickerson
Single nucleotide substitutions and single-base insertions and deletions are the most common forms of polymorphism and of disease-causing mutation. Given the natural frequency of these variants, they are likely to be the underlying cause of most phenotypic differences among humans. Because of their functional importance, their frequency, and their amenability to automated genotyping, large-scale mapping of single nucleotide polymorphisms (SNPs) is now underway for the human genome. We have developed a computer program known as PolyPhred which, together with Phred, Phrap, and Consed, automatically identifies single nucleotide substitutions in fluorescence-based sequencing. Over the past year, we have evaluated several approaches to increasing the accuracy and selectivity of PolyPhred. We will present information on a binning process that greatly improves SNP identification by PolyPhred and speeds the analysis of sequence diversity in human genes.
117. The Genome Sequence DataBase (GSDB): Advances in Data Access, Analysis, and Quality
C.A. Harger, M. Booker, A. Farmer,
W. Huang, J. Inman, D. Kipart, C. Kodira, S. Root, F. Schilkey, J. Schwertfeger,
A. Siepel, M.P. Skupski, D. Stamper, N. Thayer, R. Thompson, J. Wortman,
J.J. Zhuang, and M.M. Harpold
Two primary foci of GSDB (www.ncgr.org/gsdb), located at the National Center for Genome Resources (NCGR) in Santa Fe, NM, are to expand the data access and analysis capabilities provided to researchers and to continue to improve and automate data quality assurance procedures. Substantial progress has been made in both of these areas during the last 18 months.
Recently NCGR has launched two data utilization tools which provide significant enhancements in data access and analysis capabilities. First, NCGR has begun implementation of sequence similarity searching by making the BLAST suite of algorithms available for researchers to search sequences in GSDB. The addition of sequence similarity searching complements the gene localization capabilities, e.g., MarFinder, already provided by NCGR. NCGR is planning to expand this analysis capability by making Frame Search, Clustalw, and Smith-Waterman publicly available.
Second, NCGR has introduced Sequence Viewer, a platform-independent graphical viewer for sequence data in GSDB. This tool provides easy visualization of sequence and associated annotation together with simple text presentation of non-graphical data. The benefits of Sequence Viewer are augmented by its integration with other GSDB data access tools, such as Maestro, a web-based database query tool. The availability of Sequence Viewer provides a significant improvement in the ability to retrieve and review sequences and associated annotation from GSDB.
During the last year NCGR has also made important advances in data quality assurance procedures. First, NCGR has improved the suite of programs that automatically acquire data from the International Nucleotide Sequence Database Collaboration (IC) databases. These improvements have significantly reduced the amount of manual curation necessary to ensure the quality and completeness of data acquired from the IC. Second, NCGR has implemented daily curation of several database fields, including source molecule, chromosome, and taxonomic information. The increased data consistency resulting from these efforts allows NCGR to provide researchers with flexibility in selecting BLAST search sets. For example, these search sets could range from the entire database to a variety of taxonomy-based subsets or to individual human chromosome sets.
These enhancements and improvements are designed to make GSDB more accessible to researchers, to extend the rich searching capability already present in GSDB, and to facilitate the integration of sequence data with additional types of biological data.
118. Analysis of Ribosomal RNA Sequences by Combinatorial Clustering
Poe Xing, Casimir Kulikowski, Ilya
Muchnik, Inna Dubchak, Sylvia Spengler, Manfred Zorn, and Denise Wolf
In the present study, multi-aligned sequences of eukaryotic and prokaryotic small-subunit rRNA were analyzed using a novel clustering procedure in an attempt to extract subsets of sequences sharing common features. This procedure includes two new models, data segmentation and core separation, and consists of the following four steps: a) sequence segmentation and identification of likely conserved segments according to a specific criterion (i.e., gap frequency); b) clustering of sequences based on each of these segments; c) intersection of the clustering results from all the conserved segments; d) comparison of the results of steps a)-c) with a phylogenetic tree.
Segmentation is the result of global optimization of a new objective function that finds the most homogeneous contiguous partition of a given set of aligned sequences; it is implemented as a simple and very efficient dynamic programming procedure. Segmentation was performed on the multi-alignment of 409 eukaryotic rRNA sequences and, independently, on the multi-alignment of 6205 prokaryotic rRNA sequences. In both cases we tested different levels of granularity of segmentation by changing the total number of segments. The positions and lengths of the conserved segments in the multi-alignment were relatively stable. A segment-specific score function discriminated sequence segments mostly composed of gaps from those less frequently interrupted by gaps. Among eukaryotes we found seven conserved segments with less than 20% gaps, and among prokaryotes nine conserved segments with less than 40% gaps.
Using the novel clustering procedure, we examined these minimally gap-interrupted segments of the multi-alignment. Every segment was analyzed individually by the clustering procedure, which extracts an optimal (exact and unique) subset of 'correlated elements' among all aligned sequences. From each segment we obtained one core cluster and one complementary tail cluster. In the core cluster, all sequences were close to each other and also similar to the consensus sequence of the corresponding segment. For this reason, we call the core cluster a 'homogeneous group' and the tail cluster a 'heterogeneous group'. The sizes of the homogeneous groups derived from each segment in eukaryotes were 284, 344, 361, 343, 366, 335, and 317 sequences, respectively. From this result we can see that rRNA sequences are indeed highly conserved in eukaryotic organisms, since a majority of the 409 analyzed sequences belong to the homogeneous groups. In prokaryotes, the homogeneous groups derived from each segment contained 3838, 3343, 2378, 2447, 4312, 2641, 1491, 837, and 3179 sequences, respectively. Although the relative fraction of sequences in the homogeneous groups is lower than in eukaryotes, it is still significant and reached 69% for one of the segments.
Clusters resulting from different conserved segments are fairly consistent. We performed the intersection of all clustering results on all segments by labeling each sequence with an occurrence label. Although there are 2^7, or 128, possible occurrence patterns among the seven conserved segments of eukaryotes, only 33 patterns were observed, which indicates a significant deviation from a random sequence classification. Furthermore, of the 33 patterns, only 4 could be considered significant, because they were shared by a large enough number of sequences. To integrate the clustering information from all conserved segments, we ranked each sequence according to its occurrence label and aggregated the sequences by rank. We found that 249 of the 409 rRNA sequences fell into the group with the highest rank, 7, meaning that they belong to the core cluster for all seven conserved sequence segments. In prokaryotes the distribution of patterns is also non-random, although the clusters resulting from the 9 different conserved segments are not very consistent. Among 2^9, or 512, possible occurrence patterns, 320 were observed; of those, only 11 were represented by more than 100 sequences, and 249 by fewer than 20. Fifty-nine sequences fell into the group with the highest rank, 9, meaning they belong to the core cluster for all nine conserved sequence segments. There were 415, 705, and 940 sequences in the clusters of ranks 8, 7, and 6, respectively, which also suggests substantial homology among the sequences. There are 470 sequences in the cluster of rank 0, meaning that these sequences share little similarity across all nine conserved segments.
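The occurrence-pattern bookkeeping is simple to sketch: each sequence gets one bit per conserved segment (1 = member of that segment's core cluster), and its rank is the number of cores it belongs to. The membership data below are invented for illustration.

core_membership = {
    "seq1": [1, 1, 1, 1, 1, 1, 1],   # in the core cluster of all 7 segments
    "seq2": [1, 0, 1, 1, 0, 1, 1],
    "seq3": [0, 0, 0, 0, 0, 0, 0],
}

patterns = {}
for name, bits in core_membership.items():
    rank = sum(bits)                     # rank = number of core memberships
    patterns.setdefault(tuple(bits), []).append(name)
    print(name, "pattern", "".join(map(str, bits)), "rank", rank)

# With 7 segments there are 2**7 = 128 possible patterns; far fewer are
# observed in practice, indicating non-random agreement between clusters.
print(len(patterns), "distinct patterns observed")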
The prevalence of core-cluster sequences in all segments indicates that using only conserved sequence segments greatly reduces the effect of random information from non-conserved or nonessential sequence fragments on the evaluation of relationships between sequences. Comparison of the phylogenetic classification of the rRNA sequences with our clustering results showed that each phylum usually corresponds to one or two major clusters that are adjacently ranked in our analysis. The advantages of the presented algorithm are: (1) we avoid the interference of the frequent gaps that exist in the multi-aligned sequences, and base our clustering only on uninterrupted sequence segments potentially corresponding to essential functional units of rRNA molecules; (2) by identifying these conserved segments, we will in future be able to develop new procedures to cluster unaligned sequences; (3) the algorithm provides the means to apply a polynomial clustering procedure of O(n^2) by using the special properties of the objective function defined on the conserved segments.
Since our clustering is based on an objective criterion defined by specific statistical properties of the sequences, and uses no prior knowledge of the biological relevance of the sequences being analyzed, the consistency of our clustering results with an independently derived phylogenetic organization of the associated organisms suggests that it is feasible to apply such an objective and stable clustering method to discover phylogenetic correlations among large numbers of biological sequences. It can serve as a framework to organize these sequences in an efficient and easily searchable manner.
119. Ribosomal RNA Alignment Using Stochastic Context Free Grammars
Michael P.S. Brown
I present a method for aligning ribosomal RNA using a well-principled probabilistic model, the stochastic context-free grammar (SCFG), which models pairwise interactions in a computationally efficient manner. I show that this method has superior performance characteristics relative to several other alignment methods, with applications in areas such as phylogenetic tree reconstruction. A web server is located at http://www.cse.ucsc.edu/research/compbio/ssurrna.html.
SCFGs have been used previously for modeling structures such as tRNA (Sakakibara94, Eddy+Durbin94) and have been demonstrated to have the highest specificity of any method (Lowe97). This performance comes from the SCFG's ability to model pairwise interactions, as well as its probabilistic foundations, which allow specific estimation of parameters such as gap and mutation costs. Unfortunately, SCFGs carry a relatively high computational cost, O(n^3), where n is the length of the sequence. Previous work reduced this cost by preprocessing databases with a fast approximate method and presenting only likely strings to the SCFG for further processing (Lowe97). I extend this idea in a new direction using hidden Markov models (HMMs).
HMMs are used not only to preprocess the database but also to constrain the SCFG computation in a principled way using posterior decodings. These constraints allow large molecules such as rRNA to be analyzed with the full power of complex SCFG models in a reasonable amount of time. I analyze several methods for RNA structure prediction and show that SCFGs have the highest specificity and generalization capability, using the Ribosomal Database Project alignment of small-subunit rRNA as a gauge (Maidak97).
Alignment of ribosomal RNA is important for several reasons. Historically, rRNA was used by Carl Woese to relate all organisms and reconstruct the tree of life (Woese77). Recently, Norman Pace pointed to an opportunity for an environmental genome survey in which rRNA is gathered from the environment to provide a sequence based snapshot of the microbial biodiversity (Pace97).
In order to relate organisms based on their biosequence identity, a multiple sequence alignment is necessary. Indeed, alignment is a very important process in correct phylogenetic tree reconstruction (Morrison97). Current methods of computing this alignment involve a combination of computer alignment with human fine tuning (O'Brien98). This leads to a computational bottleneck as evidenced by the large number of unaligned rRNA sequences in the Ribosomal Database Project. Full analysis of widescale environmental biodiversity projects will exacerbate this problem.
Stochastic context-free grammars are an automatic method of determining RNA alignments using a well-principled probabilistic model that accounts for pairwise interactions in a computationally efficient manner. SCFGs have superior performance properties relative to other methods and have several important application areas, including phylogenetic tree reconstruction.
(Sakakibara94) Y. Sakakibara et al., Nucleic Acids Research 22:5112-5120 (1994).
120. Ribosomal Database Project II
James R. Cole, B. Maidak, T.G. Lilburn,
B. Li, C.T. Parker, S. Pramanik, G.M. Garrity, T.M. Schmidt, and Jim Tiedje
The Ribosomal Database Project II (RDP-II) provides rRNA-related data and tools important to researchers in a number of fields. These RDP-II products have great potential value for functional genomics. In addition, they are widely used in molecular phylogeny and evolutionary biology, microbial ecology, organism identification, characterization of microbial populations, and understanding the diversity of life. RDP-II is a value-added database that offers aligned and annotated rRNA sequence data, analysis services, and phylogenetic inferences derived from these data. These services are available to the research community through the RDP-II website (http://rdp.cme.msu.edu/html/).
In December 1997, the RDP officially moved to the Center for Microbial Ecology at Michigan State University from its previous home at the University of Illinois. A new, greatly enhanced website and a major data update (version 7) were released on July 31, 1998. The new data release, the first since June 1997, contains 9835 aligned sequences, an increase of 66% over the previous release. In addition, this is the first release to be generated from a new custom DBMS. Generating the release from the DBMS provides the user with better, more consistent formatting of the data within sequence records, and consistent formatting of shared data (e.g., reference data) between records.
The new RDP-II website offers a significant improvement over the older website, with a clean, easy-to-understand user interface. Most of the functions have been enhanced with easier user data input and improved, more informative output. In addition, we offer several new functions, including a similarity matrix generator, a T-RFLP analyzer, and a Java-based phylogenetic tree browser. In the first full month of operation (August 1998) the website handled 23,032 requests from 1399 distinct hosts in 40 different countries.
We are currently focused on reducing the delay between the time rRNA sequence data becomes available in the primary sequence repository (GenBank) and the time these sequences are available in annotated and aligned format through RDP-II. To that end, we are working on further automation of the sequence harvesting, alignment, and annotation procedures. In addition, we are working on procedures to enhance our phylogenetic tree building capability and to simplify user sequence submission. Our goal is to have data available in RDP-II within three months of its GenBank release.