Notes from DOE-NHGRI
Informatics Workshop, April 2-3, 1998
By Dan Drell, DOE Office of Biological and Environmental Research
On April 2 and 3, 1998, DOE's OBER and NIH's NHGRI convened a workshop to identify informatics needs and goals that could form part of the next genome 5-year plan (currently being developed) and to begin crafting a vision for genome informatics over the next 5 years and beyond. In particular, following the announcement of the impending closure of the Genome Data Base (GDB) at Johns Hopkins, OBER wants to clarify its future role in genome informatics in support of its part in the HGP. In attendance were 46 invited informatics and genomics experts and staff from DOE (6), NHGRI (8), NIGMS (2), and NSF (1). The meeting was held at the Dulles Hilton on Rte. 28 in Herndon, VA.
Since the beginning of the Human Genome Project, informatics has been widely regarded as one of the most important elements of the HGP. The overall quantity of information, the mass and varying types of experimental raw data being generated, the spectrum of data from ABI traces to DNA sequences, to map positions of markers, to identified genes, ultimately to intelligent predictions of future genes (open reading frames) and their hypothetical functions, all absolutely require computational collection, management, storage, organization, access, and analysis. Not surprisingly, given the wide diversity of sponsoring agencies, participating institutions, and scientists who are involved in genomics, the resulting data are highly heterogeneous in terms of format, organization, quality, and content. Furthermore, not all uses for these data can be anticipated today; this implies a need for structural flexibility in the database(s) that support the genome project. Additionally, knowledge improves over time, which implies that curation of the data (correcting it, annotating it, and adding to its functional and useful links) must be done on a continuous basis.
Although universally regarded as critical to the success of the HGP, informatics is done by computer scientists, not biologists. This has led to some communication difficulties that have not been fully resolved. By and large, those doing informatics have not had practical biology backgrounds (there are, of course, exceptions to this), and biologists, to a large extent, have used computers only for word processing and e-mail. This situation is changing rapidly but still has a way to go. Additionally, the expectations from genome informatics are not uniform; biologists have a set of expectations that can vary from those of the computational scientists. Importantly, computational analyses of genomic data are not meant to generate "revealed truth"; rather, they are best understood as serving to generate testable hypotheses that must then be taken to a lab bench somewhere for critical testing. Both NHGRI and OBER took the starting position that it is the needs of the users that matter the most and which must drive the goals of genome informatics over the next 5 years. To this end, most of the invitees were, broadly defined, "users" of informatics services, and only a minority of invitees were "producers."
Prior to the workshop, the ORISE contractor E-mailed to all the invitees 4 broad questions to serve as a framework for the workshop. These four questions were:
- Queries: What scientific questions will you want to answer? What types of data will you need to answer these questions? Which of these data types are permanent, which are temporary but important, and which will need to be regularly updated? What uses will you have for genomic sequence data in the next 5 years?
- Tools: What protocols and tools for data submission, viewing, analysis, annotation, curation, comparison, and manipulation will you need to make maximal use of the data? What sorts of links among datasets will be useful?
- Infrastructure: What critical infrastructures will be needed to support the queries you want to perform and what attributes should these infrastructures have? In what ways should they be flexible, and how should they stay current? How should they be maintained?
- Standards: What kind of community-agreed standards are needed, e.g. controlled vocabularies, datatypes, annotations, and structures? How should these be defined and established?
The agenda consisted of 6 "user" talks the first morning, followed by breakout groups the first afternoon. The 4 breakout groups were (the bolded name was the breakout group chair):
- Sequencing, mapping for sequencing, gene maps: Raju Kucherlapati, LaDeana Hillier, Eric Green, David Lipman, Takashi Gojobori, Peter Schad, Elbert Branscomb, Ray Gesteland, David Smith, Peter Cartwright, Rainer Fuchs, Peter Weinberger
- Gene finding, OMIM, variation: Ken Buetow, David Nelson, Anne Spence, Jim Ostell, Bob Robbins, Aravinda Chakravarti, David Valle, Bob Cottingham, Bruce Weir, Deborah Nickerson, Chuck Langley, Stan Letovsky
- Annotation, function: Brian Chait, Roger Brent, Martin Ringwald, Joanna Amberger, Mark Boguski, Manfred Zorn, Ed Uberbacher, Chris Overton, Temple Smith, Richard Mural, David Balaban, Dixon Butler, Nat Goodman, Barbara Wold (may attend), Randall Smith
- Comparative genomics: Carol Bult, Michael Cherry, Tony Kerlavage, Jean-Francois Tomb, Terry Gaasterland, Frederique Galisson, Reinhold Mann, Janan Eppig, Bill Gelbart, Katie Thompson, Paul Gilna.
The Informatics Workshop began with welcoming comments from John Wooley of DOE and Francis Collins of NHGRI. Both noted the importance of informatics to the success of the HGP, and both noted that now was the time to begin thinking about the biology that could follow the completion of the first human sequence. An acute question for today is what tools the post-HGP biologist will need to do the work s/he wants to do. Collins noted the initiative on Single Nucleotide Polymorphisms (SNPs) that 17 NIH institutes are joining. He also noted that with the Genome Data Base closing this summer, "re-parking" of that data was important so it wouldn't be lost. He closed by saying that the genome programs were listening, since the next HGP 5-Year plan was being developed and this workshop would be important in defining the informatics goals that would appear in it.
Aravinda Chakravarti (Case Western Reserve University) and David Thomassen (DOE OBER) discussed the planning process for the 5-year plan. Thomassen, speaking for Ray Gesteland (University of Utah) who could not be present, noted the priority areas for the DOE: high-throughput sequencing at the Joint Genome Institute (and its Production Sequencing Facility), technology development (including improvements in current technologies, "hardening" of developing technologies, and keeping an eye open to future technologies such as leveraged sequencing), informatics, functional genomics, and ELSI. Chakravarti noted that the priority areas for NHGRI included sequencing, genetic variation, functional genomics, informatics, and ELSI. At Airlie House in Warrenton, VA, on May 28-29, the principals of both the DOE and NIH genome programs, along with invited outside scientists, will review a joint 5-Year plan for the HGP. This plan should be ready for publication in an October 1998 issue of Science. Chakravarti closed by emphasizing the need for a concrete, tangible, implementable plan that focused heavily on the next 5 years.
The morning proceeded with talks from various genome project users, each representing a different perspective. LaDeana Hillier (Washington University in St. Louis) listed the informatics needs typical of a large sequencing center. Her comprehensive list included data tracking, physical mapping support (e.g. band calling, map assembly tools, and map publication and dissemination tools), data collection and analysis (e.g. lane tracking, image analysis, base calling, confidence estimations, and interfaces), data processing (e.g. vector clipping, sequence assembly, data management and reporting tools, QA/QC tools), finishing tools (editing, problem solving, etc.), technology development (rearraying, colony picking, data collection support), LIMS (lab information management systems), gene prediction aids, gene identification tools (naming conventions, map integration, graphical representation tools), annotation representation tools (data mining and analysis tools), and databases (for public tools, both phase 1 and phase 2 data). Additionally, the required fields that must be filled for an entry to be accepted by a given database need to be standardized and agreed upon. Stable identifiers are an absolute requirement so that data isn't lost when different things are done with it and the intellectual spoor trail can be followed back if desired. Hillier closed on data access, asserting that complete sharing between public databases using standardized, well-documented, and centralized formats should be enforced, with libraries of routines callable from Java.
Takashi Gojobori (National Institute of Genetics, Japan) discussed the DNA Database of Japan, DDBJ, and noted it is one of the three corners of the sequence database triangle (the others are NCBI and EMBL). DDBJ has a total staff of about 65 people (including post-docs and graduate students) and is roughly comparable to GenBank at NCBI. The four themes of his presentation were:
- genomic diversity in human populations (encompassing identification of disease gene candidates, measurements of genetic variation, elucidation of evolutionary processes, and detection of new mutations);
- comparative genomics (including comparisons of whole genomes from different species, elucidation of the evolutionary process of genomic structure, and studies of biological relationships between different species);
- genomic engineering (identification of essential genomic regions, identification of minimum genomic sequences for function, and the elucidation of the ancestral genome at the "origin of life"); and
- cDNA expression profile database (from specific target regions e.g. the brain, selected model organisms, determination of types of genes expressed where and when, and prototype evolutionary models.)
Gojobori closed with an appeal for what he termed a "Humanity Genome Project" involving a search for genes for human psychology, emotion and behavior.
Anne Spence (University of California, Irvine) represented the perspective of the medical geneticist user. She gave a forthright, blunt talk about the need for data resources that a medical geneticist could use to answer, efficiently, questions about genes and their medical implications. A typical query might be "tell me everything about gene X." Today, this query involves interrogating several web sites, not always interlinked, and often with uncurated data of varying veracity and reliability. Spence gave a dramatic example involving a query about a gene implicated in attention deficit hyperactivity disorder (a gene associated with ADHD has been located to the same spot as DRD4, which has been linked to a plethora of syndromes such as schizophrenia, Alzheimer's, depression, novelty seeking, obsessive compulsive disorder, and the list goes on). She noted 3 fundamental issues in informatics that complicated the medical geneticist's life: 1) genetics vs. computers or the challenge of capturing all the data vs. intelligently using the data; 2) the data volume problem (which is getting worse as more sequencing is done); and 3) the issue of data accuracy vs. completeness. What the medical geneticist needs is a user-friendly disease/gene entry, in a database with links to other resources, regular rapid updates, with accurate curated and annotated information, and population data. GDB had been a bridge to much of this data, between OMIM and GenBank, but GDB had been hard to use. There is an acute need to capture discovered knowledge and make it easily available, and this is not being done now.
Debbie Nickerson (University of Washington, Seattle) talked about genetic variation. To maximally utilize the expected flood of human variation data that the SNP efforts will generate, SNP data needs to be integrated into existing maps. There are plenty of maps out there now, but they are on "boutique" web sites and are difficult to find, and virtually impossible to add to. DNA variation data could encompass type (e.g. substitutions, indels, repeats), location, discovery, mode (inherited or acquired), frequency (population, haplotype, linkages), method of genotyping, and phenotype/function. It is sobering to realize that the human genome might vary by as much as 6% in size (the genome could be 3 x 10^9 base pairs, plus or minus 9 x 10^7). Some mechanism for external annotation so that other biologists could (using a simple format) add to the value of the data is desirable.
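The 6% figure follows from the size range quoted in Nickerson's talk. The short Python sketch below works through the arithmetic, reading the figures as 3 x 10^9 base pairs plus or minus 9 x 10^7; the constant and function names are illustrative only and do not come from the workshop.

```python
# Assumed reading of the figures in the text (exponents restored).
NOMINAL_BP = 3e9      # 3 x 10^9 base pairs
DEVIATION_BP = 9e7    # plus or minus 9 x 10^7 base pairs


def size_spread(nominal, deviation):
    """Return (smallest, largest, full spread as a fraction of nominal)."""
    smallest = nominal - deviation
    largest = nominal + deviation
    return smallest, largest, (largest - smallest) / nominal


smallest, largest, spread = size_spread(NOMINAL_BP, DEVIATION_BP)
print(f"{smallest:.3g} to {largest:.3g} bp, spread {spread:.0%}")
```

Each side of the range is 3% of the nominal size, so the distance between the smallest and largest genome is 6%.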
Roger Brent (Molecular Sciences Institute, Berkeley, CA) gave a provocative talk comparing functional genomics (as a set of interpretations and derivations from sequence data) to the information one might need to "guess the plot" of Shakespeare's Othello. For biologists, this information might comprise measurements of protein concentrations, states, and subcellular locations, as well as temporal changes as a function of cellular conditions and activities; for Othello, one would want to know who is in the room, what else is in the room, what is in the room that a character can form a complex with (e.g. a knife or pillow), and what is in the room that a given character does form a complex with. Brent used this analogy to point out that biological informatics needs to deal with "fuzzier" data and more tentative inferences, that queries phrased closer to "natural language" are needed, as are "canned" queries (e.g. most plots are the same or similar.) Brent's talk provoked some discussion that ranged from the sharp ("complete crap") to the diplomatic ("lots of problems with communications" between biologists and computer scientists.)
Rainer Fuchs (Ariad Pharmaceuticals) talked from the perspective of the biotechnology industry user. He noted that industry wasn't monolithic, that it is wide ranging in character and needs. Common to many is the need for more potential targets for pharmaceutical generation. This implies better ways of identifying those that are worth investing resources in developing. The hopes that industry has for genomics include better target identification, target validation, and target prioritization; the informatics challenges include data analysis (knowledge discovery), establishment of standards, and training new young scientists for the future. New data types can easily be expected, including gene expression (at both the nucleic acid and protein levels), molecular interactions, gene regulation, and genetic variation (including polymorphisms, post-translational modifications, and splice variants). "Tools for the rest of us" (as opposed to the high end, large scale sequencers) are also needed. This should involve tools that are easier to use, that are available, that are robust and of commercial quality, and that are supported. Fuchs noted that although no one had mentioned it explicitly, the idea of "federated database systems," in which a query could cross from one database to others and return relevant information obtained from several of them, was still highly sought after. Industry also (along with medical geneticists) wants to be able to ask "tell me everything about this gene." To do this, Fuchs passionately argued for standards across the bioinformatics landscape. Today, industry standards are worlds apart from those considered in the genomic bioinformatics field. A group exists (the OMG, Object Management Group) that currently is an industry group but which could involve academic and government representatives if they showed interest.
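The "federated database" idea Fuchs described can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the `GeneSource` class, the source names, and the toy records are invented for illustration and do not represent any real database or its interface. The point is only that one query fans out to several independent sources and the answers come back merged.

```python
from dataclasses import dataclass, field


@dataclass
class GeneSource:
    """One hypothetical member database, keyed by gene symbol."""
    name: str
    records: dict = field(default_factory=dict)

    def lookup(self, symbol):
        # Return this source's record for the gene, or None if it has none.
        return self.records.get(symbol)


def federated_lookup(symbol, sources):
    """Ask every source about one gene and collect the hits by source name."""
    report = {}
    for source in sources:
        hit = source.lookup(symbol)
        if hit is not None:
            report[source.name] = hit
    return report


# Toy data, invented for illustration only.
sources = [
    GeneSource("sequence_db", {"GENE_X": {"accession": "EX000001"}}),
    GeneSource("map_db", {"GENE_X": {"map_position": "example band"}}),
    GeneSource("disease_db", {"GENE_Y": {"phenotype": "example trait"}}),
]

everything = federated_lookup("GENE_X", sources)
print(everything)
```

A real federation would also need the shared identifiers and object definitions that Fuchs argued for; here the shared gene symbol stands in for them.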
Fuchs noted that standards were critically important because in an era of industrial-scale sequencing, it made little sense to "let 1000 flowers bloom;" striving for perfection was laudable in principle, but not reasonable in practice. There was no need to reinvent the wheel. Core databases were important, with centralized data management, explicit object definitions and access methods, and better financial support not dependent on research grants (a bad mechanism for supporting infrastructure) but subject to rigorous review for both technical practice and continuing need and utility. Component-oriented software standards (e.g. CORBA) would promote systems integration, interoperability, flexibility, and responsiveness to change. Annotation was critically important so that the who, what, where, when, and why of genome sequence products could be built up. Automated analyses using clearly defined standard operating procedures, consistent application, and sufficient documentation would help a lot. Finally, Fuchs mentioned the acute need for training of additional scientists (not exclusively biologists) in these technologies.
Bettie Graham of NHGRI concluded the morning with a short description of several NHGRI training programs that could help with the dearth of bioinformaticists in the public sector genome field.
The afternoon of the first day was devoted to 4 breakout groups; the results of those groups were presented the next morning. I visited each of the breakout groups to get a sense of how the discussions were going and took some notes while in each one, but the summaries below are based on the final products of each group.
Sequencing, mapping for sequencing, gene maps: Raju Kucherlapati, LaDeana Hillier, Eric Green, David Lipman, Takashi Gojobori, Peter Schad, Elbert Branscomb, Ray Gesteland, David Smith, Peter Cartwright, Rainer Fuchs, Peter Weinberger.
PHYSICAL MAPS, GENE MAPS: develop integrated databases where identical sequence markers in different maps are in synonyms database; all markers should be located in a central database (e.g. NCBI); queries to maps: what are the markers, clones, and genes in a spatial interval; what are the genes/ESTs location and clones?
SEQUENCE READY MAPS: all data (full contig depth) should be accessible; assembly criteria (e.g. STS, fingerprints) should be included; data must contain: interval anchored to best maps, all clone addresses for members of the contig, members of tiling path clones that are (or will be) sequenced, clone id coupled to library information, links to additional information; it would be desirable to have STS content and fingerprint of each clone. All the data must be prepared and presented in a standard fashion.
SEQUENCE: the sequence data must contain the following: source of the sequence (the clone id), the sequence anchored to clones, the STS location and confirmation by electronic PCR, quality scores for each base (probability of error) for large genomic sequence, biological attribution as annotation. Contiguous genomic sequences should be assembled.
TOOLS: support for distribution and maintenance of tools for general use should be promoted; this includes tools for map and sequence assembly, new tools for interoperable systems, and (especially) robust tools for sequence finishing.
LINKAGES AND STANDARDS: clear definition of objects in databases including their behavior and semantics; standard interfaces for WWW and for systems communications, standards for sequence accuracy and which data needs to be captured, international genomic standards for objects to be represented in databases, establishing (in one year) a working group to develop, periodically review, modify these standards and to so advise the funding agencies who should then enforce the resulting standards.
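The SEQUENCE item above asks for a quality score for each base, expressed as a probability of error. The group's summary does not name a scale, so the Python sketch below assumes the Phred-style logarithmic convention, Q = -10 log10(P), purely as an illustration; the function names are hypothetical.

```python
import math


def error_probability(quality):
    """Convert a Phred-style quality score to a probability of error.

    Assumption: Q = -10 * log10(P_error); the summary does not specify a scale.
    """
    return 10 ** (-quality / 10)


def quality_score(p_error):
    """Inverse conversion: probability of error to a Phred-style score."""
    return -10 * math.log10(p_error)


def expected_errors(qualities):
    """Expected number of miscalled bases in a read, given per-base scores."""
    return sum(error_probability(q) for q in qualities)


# Under this convention Q20 is a 1-in-100 chance of error, Q30 is 1-in-1000.
for q in (10, 20, 30):
    print(q, error_probability(q))
```

Summing the per-base probabilities gives an expected error count for a stretch of sequence, which is one way a per-base score could feed a standard for sequence accuracy.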
Gene finding, OMIM, variation: Ken Buetow, David Nelson, Anne Spence, Jim Ostell, Bob Robbins, Aravinda Chakravarti, David Valle, Bob Cottingham, Bruce Weir, Deborah Nickerson, Chuck Langley, Stan Letovsky
QUERIES: what is known about this gene? What is known about this region (marker delimited, cytogenetic location)? Where does this sequence go (what is its genomic context)? Does this gene vary? From where did this information come (cell, tissue, population, species, ethnicity, environment)? What are the genetic characteristics of this population (geography, origin, sample clinical diagnosis, phenotype)? What analysis reagents should be used?
TOOLS AND INFRASTRUCTURE: For individual DNA variants [note: highest priority]: (raw genotype data, raw haplotype data, population description, sample relationship to other individual samples, e.g. family structure), sequence context, derived data (haplotype, frequency, linkage disequilibrium), (capture of old data?)
For individual phenotype (human and non-human): sex, phenotype information required to obtain published results, gene expression profiles, detailed clinical descriptions [NOTE: Serious ELSI issues could arise here!!]
OMIM: complex traits, modifier genes, gene interactions (links to epistasis)
TOOLS AND DATA SOURCES FACILITATING MAP INTEGRATION: reference sequence, genetic map, radiation hybrid map, cytogenetic map, transcript map, STS map, SNP map, BAC map; tools that facilitate navigation of multiple data sources, open access to data sources
COMPREHENSIVE PHYSICAL CLONE DATA RESOURCES:
Genomic sequences: (raw data, e.g. traces, source data, procedure for generating data, annotation (intron, exon, motifs), derived data, quality/confidence/coverage). cDNA sequence: raw data, analytic tools for cluster analysis. Methods: annotated repository of analytic tools, annotated repository of information on, and availability of, reagents.
HISTORICAL DATA/ANNOTATION REPOSITORIES:
orthologs, paralogs resource
STANDARDS: availability of raw data that support conclusions, tools used to generate data need to be well described and available, standard nomenclature/vocabulary (required use in public databases), standard formats for entry of data of the same type, data to support genetic conclusions submitted, methods used to generate data need to be specified.
Who sets standards? Editors, community, funding agencies
How? User-friendly mechanisms for feedback to databases (with response of receipt), curation by domain experts, formal evaluation/validation experiments
Implementation thoughts: multiple databases (with different approaches/models) maintained by experts, need to conduct research in areas of integration tools, need to have data be open and accessible by many parties.
Annotation, function: Brian Chait, Roger Brent, Martin Ringwald, Joanna Amberger, Mark Boguski, Manfred Zorn, Ed Uberbacher, Chris Overton, Temple Smith, Richard Mural, David Balaban, Dixon Butler, Nat Goodman, Barbara Wold (may attend), Randall Smith
One issue that was mentioned was the incorporation of data and information generated by both large and small groups; this is based on a sense that the rules are different, e.g., large groups are expected to put sequence on their web sites each evening, while smaller labs can pretty much do what they want. Temple Smith suggested a Swiss-Prot Blocks-like data structure for genome sequence data that would not replace GenBank but improve on it (in a Lego block fashion). Ed Uberbacher noted the importance of comprehensive annotation based on assembled genomes (not the fragments often found in GenBank) and on comprehensive complete information. The database must be queryable in a reasonable way (another criticism of GenBank) and all data, whether primary or derived, needs to be sourced. It was noted that annotation is not gospel, only testable hypotheses. Curation remains a touchy issue, as it needs to be done, but it isn't clear who should do it. Expert curation by selected editors is difficult and expensive, but would not be impossible if suitable incentives were used. It was noted that most of OMIM's budget goes to curation.
- Show us the money: robust software engineering for genomics is expensive and requires a stable infrastructure for R&D and deployment. Adequate funding must also be provided for innovative research in genome informatics to address grand challenge problems in data management analysis and visualization.
- Implement fully automated genome annotation systems that keep pace with world-wide sequencing output. Must be capable of initial annotation and ongoing re-annotation. Must include visible policies, protocols, and evidence.
- Implement data management, analysis, and visualization tools for functional genomics, including mRNA expression data and protein interaction data. Make sure they interact seamlessly with other genomic data.
- Expand development effort of user-friendly tools for data access and analysis. Different user communities require different solutions: System developers need powerful ad hoc query languages, while clinicians and casual users need query support systems.
- Explore new models for generation of publicly available data and informatics tools, including work done by private companies under contract to the government which then freely distributes the data and tools.
- Establish 3-5 academic centers for genome informatics leading to the critical mass necessary for sustained R&D and deployment, and training programs. These will be the major centers for training genome informatics specialists.
Comparative genomics: Carol Bult, Michael Cherry, Tony Kerlavage, Jean-Francois Tomb, Terry Gaasterland, Frederique Galisson, Reinhold Mann, Janan Eppig, Bill Gelbart, Katie Thompson, Paul Gilna.
- What is it? Inter- and intragenomic comparisons about structure (gene organization/architecture, sequences and subsequences), gene products (function, role, system, expression), temporal processes (embryonic development, aging, evolution, population (allelic) dynamics). Need for prediction models.
- Move from canned queries to open systems
- Establish standards to enable links leading to comparisons leading to new tools and testable hypotheses for prediction models in the long term.
- Standards: Stable, unique, non-recyclable identifiers. Synchronization of updates. Version/history tracking of database objects. Semantic mapping between databases. Metadata (queryable) (e.g. who, what, when, where of the data) Controlled vocabularies (systems, roles, functions, etc.) Gene name/symbol nomenclature consistency. Aim: referential integrity.
- Curated and "archival" data resources (sequence, taxonomy, model organism, protein family databases) [These resources exist - although in various states of "readiness" for comparative genomics.]
- Database interconnection tools/methods/protocols for retrieving relevant individual records in batch mode.
- Repository of tools and information about tools.
- Mechanisms for lowering the "energy of activation" for submitting raw and interpreted data (electronic lab notebooks).
- Mechanisms/database structures for capturing comparative genomics results. (Core phylogenetic domain databases)
- Central resource of databases and links (highly curated!)
- Where to start:
- model organism databases
- sequence (nucleotide/protein) databases
- structure databases
- taxonomic databases
- Regular "summit" meetings to establish and maintain controlled vocabularies
- Cooperation of journals
- Rewards from academia/agencies for electronic publishing
- Consortia/cooperative agreements among databases
- New initiatives
- Comparative genomics requires a high level of human intervention and curation to interpret and synthesize the data. Emphasis should be on increasing productivity not solely on scalability.
- As the data increase in magnitude and complexity, more human resources will be required for curation.
- You get what you pay for.
- Without an effective bioinformatics infrastructure, the promise of the HGP will not be realized. Funding levels need to be consistent with the critical nature of informatics.
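The standards item in the comparative genomics list (stable, unique, non-recyclable identifiers with version/history tracking of database objects) can be made concrete with a small Python sketch. The `IdentifierRegistry` class, its method names, and the identifier format are all hypothetical, chosen only to show the properties the group asked for; they are not drawn from any existing database.

```python
class IdentifierRegistry:
    """Hypothetical registry: identifiers are issued once and never reused,
    and every revision of an object is kept."""

    def __init__(self, prefix="OBJ"):
        self._prefix = prefix
        self._next_number = 1
        self._history = {}      # identifier -> list of versions, oldest first
        self._retired = set()

    def register(self, record):
        """Issue a new identifier for a record and store it as version 1."""
        identifier = f"{self._prefix}{self._next_number:07d}"
        self._next_number += 1  # the counter only goes up, so nothing is recycled
        self._history[identifier] = [dict(record)]
        return identifier

    def update(self, identifier, record):
        """Append a new version; return the new version number."""
        if identifier in self._retired:
            raise KeyError(f"{identifier} is retired and cannot be updated")
        self._history[identifier].append(dict(record))
        return len(self._history[identifier])

    def retire(self, identifier):
        """Mark an identifier as retired; its history stays readable."""
        if identifier not in self._history:
            raise KeyError(identifier)
        self._retired.add(identifier)

    def get(self, identifier, version=None):
        """Return the latest version, or a specific 1-based version."""
        versions = self._history[identifier]
        if version is None:
            return versions[-1]
        if not 1 <= version <= len(versions):
            raise IndexError(f"{identifier} has no version {version}")
        return versions[version - 1]


registry = IdentifierRegistry(prefix="GEN")
gene_id = registry.register({"symbol": "geneA"})
registry.update(gene_id, {"symbol": "geneA", "note": "revised annotation"})
print(gene_id, registry.get(gene_id, version=1), registry.get(gene_id))
```

Because old versions remain retrievable under the same identifier, a link made to an object from another database keeps resolving after the object is revised or retired, which is the referential integrity the group named as the aim.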
Each breakout group reported on its conclusions and recommendations.
Raju Kucherlapati (sequencing) noted the issues in the summary above. In the discussion, it came out that at Wash U (St. Louis), the Waterston group is currently sequencing about 100 Mb/yr (all sequences combined, e.g. human and C. elegans) but can finish only 60 Mb/yr, so finishing remains a major bottleneck in sequencing. It was suggested that the OMG standard setting working group be supported and that academic/government participants be encouraged. David Lipman (NCBI) said that they were working to hire more staff expressly to work with genome centers on sequence data submission.
Ken Buetow (gene finding/OMIM/variation) noted (based on a sketch of Jim Ostell's) the various gaps in the flow of mapping information. There are numerous gaps in morbid maps, many gaps in the map positions of physical reagents, clones, many deficiencies in raw data (ABI traces, etc.), gaps in the annotation and description of complex traits, huge gaps in knowledge about gene interactions and modifier genes, and little in the way of repositories on DNA variation, linked phenotypes, methods and reagents, homologies and orthologies, and historical data and annotations. Better integration tools were desperately needed. David Lipman noted that there was an overarching need to understand the data from the perspective of its utility. Others wanted to make sure it was all captured first.
Chris Overton (annotation/function): There are several models for high quality, curated databases out there, e.g. FlyBase, AceDB, OMIM. 3rd party annotation was, by and large, not a successful approach. Whatever was done, working closely with GenBank was important. Functional genomics data was a research issue since it wasn't at all clear what data needed to be collected. "User-friendly tools" is easy to say, hard to accomplish, very expensive, and means different things to different communities. There is a need for 3-5 centers of excellence (including UPenn?) where a critical mass of informatics together with biology can be accumulated. One of those centers now is NCBI.
Carol Bult (Comparative Genomics): The need here is ways to traverse across many resources to answer complex queries. A major part of comparative genomics that cannot be readily automated is homology comparisons. Computerized annotation can only do so much and needs to be viewed as a tool for hypothesis generation. Controlled vocabularies need to be constructed, but it is recognized that the slope between controlled vocabularies and a comprehensive (and complex) knowledge base is a slippery one.
David Lipman noted that the meeting was a useful one for him. NCBI has more than 60,000 users per day (some 2 million per month). MedLine, which used to be available for a fee, is now free on the Web. PubMed is used by a wide audience: some 40% are researchers, 10% are MDs, and the rest are the "public." NCBI is interested in expanding library functions, is talking with textbook publishers, and hopes to strengthen connections to the literature and to tap into the education market. NCBI is growing 15% in usage every 45 days, but remains a small division within NLM. They are trying to make Entrez more robust and might find ways to export or disseminate it to others.
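The quoted usage growth (15% every 45 days) is easier to appreciate when compounded. The sketch below assumes the rate holds steadily over a full year, which is an extrapolation rather than anything stated at the workshop; the function names are illustrative.

```python
import math

GROWTH_PER_PERIOD = 0.15   # 15% usage growth per period, as quoted
PERIOD_DAYS = 45           # length of one period in days, as quoted


def growth_factor(days: float) -> float:
    """Compound usage multiple after `days`, assuming a steady rate."""
    return (1 + GROWTH_PER_PERIOD) ** (days / PERIOD_DAYS)


def doubling_time_days() -> float:
    """Days needed for usage to double at the quoted rate."""
    return PERIOD_DAYS * math.log(2) / math.log(1 + GROWTH_PER_PERIOD)


print(f"one-year multiple: {growth_factor(365):.2f}")
print(f"doubling time: {doubling_time_days():.0f} days")
```

If the rate were sustained, usage would roughly triple in a year and double about every seven months.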
Ed Uberbacher gave a brief overview of the Annotation Consortium at ORNL and described its overall schema of Data Acquisition, Data Analysis, Data Storage, and Data Access (via the Genome Channel). Several people (Jean-Francois Tomb [Dupont] and Carol Bult [U Maine]) expressed strong praise for Ed's efforts.
Wrap up (Eric Green and Elbert Branscomb): [This is the hardest part of the meeting to summarize; Francis Collins asked for priorities, estimated costs, and a timetable which did not map easily onto the earlier Queries, Tools, Infrastructure, and Standards pattern that the breakout groups had been asked to respond to.]
- DNA variation, individual based
- phenotypic variation, individual based
- functional genomics
- repository for informatics tools and information
- Whenever possible, existing database frameworks should be used.
- A curated reference genome database should be created, involving "high level community" curation, e.g. editors
- Comprehensive data capture in standard formats, well structured (using controlled vocabularies) should be a goal.
- Pathway/regulatory databases (e.g. WIT, KEGG, Eco Cyc) should be encouraged.
- curated structured reference genome (map and sequence) database
- integrated and linked databases
- variation database
- functional/expression database
- raw data capture
- finishing tools
- production tools
- research tools (analysis, visualization, etc.)
- access tools (visualizing data objects, pulling objects from different databases, extraction tools)
- annotation tools
- data capture tools
- functional genomics tools
- data mining tools
- A tool capture site
- development, hardening
- tool quality assessment
- map integration
- outreach tools
NSF S&T centers were suggested as a model for the needed genome informatics center, perhaps on a scale of $12 million per year.
Overall Policy Recommendations:
- there should be open competition for supplying database/informatics needs
- existing frameworks should be used where possible
- standard data object definitions should be realized
- continued support for model organism databases should be effected
- raw data should be captured to the maximum extent possible
- there should be investments made in hardening and exporting software tools from genome centers.
This was a useful and rewarding meeting. While some consensus recommendations can be identified, there is still much vagueness among the informatics communities, mostly users, represented at this workshop. Those who generate the data have different concerns from those who want to use it. There is still some hesitation between the biologists who aren't conversant in the technical issues of informatics and the informatics scientists who aren't fully conversant in the biology. The presence of NCBI was a strong positive from this meeting. There was a general air of amity and agreement in the various breakout groups. It seems that the genome project still has many unmet informatics needs and there was, to my mind, remarkable concordance on what the "wish list" should still have on it. From a DOE-specific perspective, the importance of annotation efforts (highlighted by the work of Ed Uberbacher's group at ORNL) was underscored.
Queries: everything conceivable about sequences, genes, markers, relationships, maps, proteins, functions, interactions, regulatory pathways, variation, phenotypes, inter-species comparisons. How data was derived, where it came from, under what experimental circumstances, by whom, what the raw data (ABI traces, gel lanes, etc) were, what methods were used to process the raw data into database entries (e.g. sequence), QA/QC measures -- everything!
Tools: finishing tools, production tools, research tools (for analysis, for visualization, etc.), access tools (for visualizing data objects, for extracting objects from different databases, etc.), annotation tools, data capture tools, functional genomics tools, data mining tools, and a dedicated tool capture site. Also wanted were development and hardening of tools to promote easier dissemination, finishing, and exporting; QA/QC of the different tools; tools that are interoperable; map integration tools; and outreach tools.
Infrastructure: This principally means databases and the workshop suggested a pile of them. These include:
- curated structured reference genome (map and sequence) database,
- integrated and linked databases,
- variation database,
- functional/expression database, and
- an informatics tools and information database.
Standards: There was near unanimity on the need for intelligent standards that the various constituencies of the genome project (academic, government, and industry) could join in defining and implementing. These include a variety of controlled vocabularies for the objects that would be entered into appropriate databases. Today, industry standards are very distinct from those (few) that exist in the HGP (e.g. Phred/phrap for sequence QA/QC). A group exists (the OMG, Object Management Group) that currently is largely composed of industry representatives but should involve academic and government representatives. Explicit object definitions and access methods are desperately needed. Component-oriented software standards (e.g. CORBA) would promote systems integration, interoperability, flexibility, and responsiveness to change. Automated analyses (annotation) using clearly defined standard operating procedures, consistent application, and sufficient documentation would help a lot.
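As a rough illustration of what an "explicit object definition" backed by a controlled vocabulary might look like, here is a minimal sketch. Everything in it is hypothetical: the vocabulary terms, the class name GenomeFeature, and its fields were invented for this example and do not come from the workshop, the OMG, or any existing database.

```python
from dataclasses import dataclass

# Hypothetical controlled vocabulary; a real one would be community-defined.
FEATURE_TYPES = frozenset({"gene", "marker", "clone", "sts"})


@dataclass(frozen=True)
class GenomeFeature:
    """Illustrative standard data object for a mapped genome feature."""
    identifier: str
    feature_type: str   # must be a term from FEATURE_TYPES
    chromosome: str
    start: int          # 1-based, inclusive
    end: int            # 1-based, inclusive

    def __post_init__(self) -> None:
        # Reject terms outside the controlled vocabulary at creation time.
        if self.feature_type not in FEATURE_TYPES:
            raise ValueError(f"unknown feature type: {self.feature_type!r}")
        if not 0 < self.start <= self.end:
            raise ValueError("coordinates must satisfy 0 < start <= end")

    def length(self) -> int:
        """Access method: feature length in bases."""
        return self.end - self.start + 1


feature = GenomeFeature("EX0001", "gene", "7", 1_000, 2_499)
print(feature.feature_type, feature.length())
```

The point of the sketch is that validation happens where the object is defined, so every database or tool exchanging such objects enforces the same vocabulary.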
The workshop closed with some policy recommendations (slightly expanded from above):
- There should be open competition for supplying most database/informatics needs, but support for any large databases needs to be provided outside of the regular grant mechanism (but NOT outside of periodic technical and mission relevance reviews).
- No one database can be expected to do everything for everybody; however, the user needs to "feel" that s/he is interacting with only one entity.
- Existing frameworks (database schema, submission tools, etc.) should be used where possible to save money. Contracting out certain tasks to the private sector should be explored.
- Standard data object definitions should be developed and promulgated in the near future and enforced by the agencies.
- There should be continued support for model organism databases.
- Raw data should be captured to the maximum extent possible before it is irretrievably lost.
- There should be investments made in hardening and exporting software tools from genome centers.
The electronic form of the newsletter may be cited in the following style:
Human Genome Program, U.S. Department of Energy,
Human Genome News (v9n3).