4. GENOME INFORMATICS
In a statement of research goals of the US Human Genome Project [F. Collins and D. Galas, "A new five-year plan for the US Human Genome Project," Science 262: 43-46 (1993)], the Project's leaders define "informatics" as:
Their goals for the current 5-year period are:
While similar in purpose and style to other major scientific cataloging efforts of the past and present--for example, Handbook of Chemistry and Physics, Chart of the Nuclides, to name two familiar resources--the Human Genome Project's informatics task is strikingly unified in that its focus is solely on translating and disseminating the information coded in human chromosomes. Genome informatics differs from earlier scientific catalogs also because it is a "child" of the information age, which brings clear advantages and new challenges, some of which are related to the following:
Within the Human Genome Program, technical challenges in the informatics area span a broad range. Genome informatics can be divided into a few large categories: data acquisition and sequence assembly, database management, and genome analysis tools. Examples of software applications within the three categories include:
Data acquisition and sequence assembly:
Managing such a diverse informatics effort is a considerable challenge for both DOE and NIH. The infrastructure supporting the above software tools ranges from small research groups (e.g. for local special-purpose databases) to large Genome Centers (e.g. for process management and robotic control systems) to community database centers (e.g. for GenBank and GDB). The resources that these different groups are able to put into software sophistication, ease of use, and quality control vary widely. In those informatics areas requiring new research (e.g. gene finding), "letting a thousand flowers bloom" is DOE's most appropriate approach. At the other end of the spectrum, DOE and NIH must face up to imposing community-wide standards for software consistency and quality in those informatics areas where a large user community will be accessing major genome databases.
The need for genome quality assurance enters the informatics field at several different levels. At the earliest level, both policies and tracking software are needed that will preserve information about the pedigree (origin and processing history) of data input to the sequencing process. This potentially includes information on the origins of clones and libraries, image data of gel runs, and raw data of ABI-machine traces. Policies need to be developed concerning minimum standards for archiving the raw data itself, as well as for the index that will allow future users to find raw data corresponding to the heritage of a specific DNA sequence.
At the level of sequencing and assembly, DOE and NIH should decide upon standards for the inclusion of quality metrics along with every database entry submitted (for example PHRED and PHRAP quality metrics, or improvements thereon).
At the level of database quality control, software development is needed to enhance the ability of database centers to perform quality checks of submitted sequence data prior to its inclusion in the database. In addition, thought needs to be given towards instituting an ongoing software quality assurance program for the large community databases, with advice from appropriate commercial and academic experts on software engineering and quality control. It is appropriate for DOE to insist on a consistent level of documentation, both in the published literature and in user manuals, of the methods and structures used in the database centers which it supports.
At the level of genome analysis software, quality assurance issues are not yet well posed. Many of the current algorithms are highly experimental and will be improved significantly over the next five years. Tools for genome analysis will evolve rapidly. Premature imposition of software standards could have a stifling effect on the development and implementation of new ideas. For genome analysis software, a more measured approach would be to identify a few of the most promising emerging analysis tools, and to provide funding incentives to make the best of these tools into robust, well-documented, user-friendly packages that could then be widely distributed to the user community.
Currently, there are many, diverse resources for genomic information, essentially all of which are accessible from the World Wide Web. Generally, these include cross references to other principal databases, help-files, software resources, and educational materials. The overall impression one gets after a few hours of browsing through these web sites is that of witnessing an extraordinarily exciting and dynamic scientific quest being carried out in what is literally becoming a world-wide laboratory.
Web tools and the databases are also changing how the biology community conducts its business. For example, most journals now require a "receipt" from one of the standard databases indicating that reported sequence data have been filed before a paper is published. The databases are finding ways to hold new entries private pending review and publication. The databases contain explicit reference to contributors--there is probably no better way to exercise real quality control than the threat of exposure of incorrect results. We view all these developments as being very positive.
With so much information becoming available, considerable effort goes into staying current. Many institutions conduct daily updates of information from the database centers. This works because such updates can be performed automatically outside peak working hours. The resources needed to update and circulate information are likely to increase as volume increases. The effort in learning how to use relevant database tools represents an important investment for individual scientists and group leaders.
Maintenance of databases is an important resource question for the Project. Currently, DOE supports two major efforts:
GenBank (www.ncbi.nlm.nih.gov/Web/Genbank/index.html) is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. There are approximately 967,000,000 bases in 1,491,000 sequence records as of June 1997. GenBank is part of the International Nucleotide Sequence Database Collaboration, which also includes the DNA Data Bank of Japan (DDBJ) and the European Molecular Biology Laboratory (EMBL/EBI) Nucleotide Sequence Database.
4.2.1 User issues
The major genomic databases serve broad communities, whose users have vastly differing needs. In this situation several levels of user input and management review are called for.
To assure that all the database centers are "customer oriented" and that they are providing services that are genuinely useful to the genome community, each database center should be required to establish its own "Users Group" (as is done by facilities as diverse as NSF's Supercomputer Centers and NASA's Hubble Space Telescope). Membership in these "Users Groups" should be on a rotating basis, and should represent the full cross-section of database applications (small academic groups, large genome centers, pharmaceutical companies, independent academic laboratories, etc.). The "Users Groups" should be convened by each Center Director and should meet several times a year, with written reports going to the Center Directors as well as to the sponsoring Federal agencies.
Several briefers from database centers expressed concern that the "average user" was not well-informed about appropriate ways to query the databases, and that search tools (e.g. BLAST) frequently were not being used in a sound fashion. To address this type of issue, DOE should encourage the database centers in consultation with their "Users Groups" to organize appropriate tutorials and workshops, to develop "crib sheets" and other instructional documentation, and to take further steps to educate targeted user communities in techniques for sound database use appropriate to their applications.
At a higher management level, DOE and NIH should continue the process of constituting independent panels every few years, to review the health of the entire suite of genomic database centers. These panels should provide independent peer review of every community database, including input from "Users Groups" as well as technical and management review of center operations. Inclusion of Computer Science database experts on the review panels will help facilitate exchange of information with the Computer Science community.
4.2.2 Modularity and standards
Too often database efforts attempt to "do it all"; i.e., they attempt to archive the data, provide mechanisms for cataloging and locating data, and develop tools for data manipulation. It is rare that a single database effort is outstanding in all three areas, and linking the data too closely to the access and analysis methods can lead to premature obsolescence. For reference, the following functions can be identified:
Authoring: A group produces some set of data, e.g. sequence or map data.
Publishing and archiving: The data developed by individual authors is "published" electronically (i.e. put into some standard format) and accumulated in a network accessible location. This also involves some amount of "curation", i.e. maintenance and editing of the data to preserve its accessibility and accuracy.
Cataloging (metadata): This is the "librarian" function. The primary function of a library is not to store information but rather to enable the user to determine what data is available and where to find it. The librarian's primary function is to generate and provide "metadata" about what data sets exist and how they are accessed (the electronic analog of the card catalogue). Other critical functions include querying, cross-referencing, and indexing.
Data access and manipulation: This is the "user interface". Because the data volumes are typically large, computerized methods for data access and manipulation must be provided, including graphical user interfaces (GUIs).
The key point is that the various functions should be modularized, rather than tangled together in a single monolithic effort. The reason is obvious: computer technology, storage technology, database technology, networks, and GUIs are evolving on a time scale much shorter than the projected lifetime of the data. Each technology evolves on its own time scale and schedule. Therefore, the functions must be modularized to allow separate upgrading. Modularization also allows multiple approaches, e.g. to user access: simple, intuitive GUIs for some users, powerful search and combinatoric engines for others.
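The separation of functions described above can be illustrated with a minimal sketch in Python. All names here (Archive, Catalog, InMemoryArchive, KeywordCatalog, retrieve) are hypothetical and chosen only for illustration; no existing genome database exposes this interface. The point is that the archive and the catalog sit behind narrow interfaces, so either can be replaced without touching the other or the access layer.

```python
from typing import Protocol


class Archive(Protocol):
    """Publishing/archiving function: returns stored data for an accession."""

    def fetch(self, accession: str) -> str: ...


class Catalog(Protocol):
    """Librarian function: says which accessions exist for a keyword."""

    def find(self, keyword: str) -> list[str]: ...


class InMemoryArchive:
    """Toy archive; a real one might sit on disk or behind a network service."""

    def __init__(self, records: dict[str, str]):
        self._records = dict(records)

    def fetch(self, accession: str) -> str:
        return self._records[accession]


class KeywordCatalog:
    """Toy catalog holding metadata only (keywords per accession), no sequence."""

    def __init__(self, metadata: dict[str, set[str]]):
        self._metadata = metadata

    def find(self, keyword: str) -> list[str]:
        return sorted(acc for acc, words in self._metadata.items() if keyword in words)


def retrieve(catalog: Catalog, archive: Archive, keyword: str) -> dict[str, str]:
    """Access layer: depends only on the two interfaces, not on their internals."""
    return {acc: archive.fetch(acc) for acc in catalog.find(keyword)}


archive = InMemoryArchive({"A1": "ACGT", "B2": "TTGA"})
catalog = KeywordCatalog({"A1": {"cosmid", "chr7"}, "B2": {"chr7"}})
print(retrieve(catalog, archive, "chr7"))
```

Swapping InMemoryArchive for a different storage technology would require no change to KeywordCatalog or retrieve, which is the kind of separate upgrading the text argues for.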
Data format standards are a key to successful modularity. The community should invest in developing a common "language" which includes definition of certain basic data types (e.g., "classes" or "objects" in object-oriented terminology). Data format conventions should be defined for sequence data, map data, etc. Where multiple standards already exist, investment should be made in translators. Some standardization of methods to operate on data objects is also desirable, particularly for the most frequent operations and queries. However, the user should be able to develop powerful customized methods and manipulation techniques.
Currently, neither standards nor modularity are very much in evidence in the Project. The DOE could contribute significantly by encouraging standards. Database groups should be encouraged to concentrate on the "librarian" functions, and leave the publishing and archival functions to other groups. Development of user interfaces and manipulation tools may also be tackled by database efforts, but it is not obvious that the best librarians are also the best GUI developers.
As part of the librarian function, investment should be made in acquiring automatic engines that produce metadata and catalogues. With the explosive growth of web-accessible information, it is unlikely that human librarians will be able to keep pace with the ancillary information on the genome, e.g. publications and web-sites. The technology for such search engines is well-developed for the web and needs to be applied specifically to genomic information for specificity, completeness, and efficiency.
Indexing and cross-referencing are critical database functions. It is often the case that the indexes which encapsulate the relationships in and between databases constitute a far larger data set than the original data. Significant computer resources should go into pre-computation of the indexes that support the most frequent queries.
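As one illustration of a pre-computed index, the sketch below builds a k-mer index over a small set of sequence records, so that a substring query becomes a dictionary lookup instead of a scan. The k-mer scheme, the function names, and the toy records are assumptions made for this example; the text does not prescribe any particular index structure.

```python
from collections import defaultdict


def build_kmer_index(sequences: dict[str, str], k: int) -> dict[str, list[tuple[str, int]]]:
    """Map every length-k substring to the (record id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return dict(index)


def lookup(index: dict[str, list[tuple[str, int]]], kmer: str) -> list[tuple[str, int]]:
    """Answer a query from the pre-computed index without touching the sequences."""
    return index.get(kmer, [])


# Hypothetical toy records, 14 bases in total.
records = {"rec1": "ACGTACGT", "rec2": "TTACGA"}
index = build_kmer_index(records, k=4)

print(lookup(index, "ACGT"))
print(lookup(index, "TACG"))
print(sum(len(hits) for hits in index.values()), "index entries")
```

Even in this toy case the index stores one (record, offset) entry for nearly every base, each alongside a 4-character key, which hints at why indexes can outgrow the data they describe.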
Consideration should be given by the database efforts to development of shell programs for genome database queries and manipulation. A shell is a simple interactive command-line interface that allows the user to invoke a set of standard methods on defined objects, and lists of objects. In the numerical world, Mathematica, Maple, and IDL are examples of such approaches. The shell typically has a simple syntax with standard if-then constructs, etc.
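A toy version of such a shell can be sketched with the cmd module from the Python standard library. The command set (load, length, revcomp, list, quit) and the class name GenomeShell are hypothetical, and the sketch omits the if-then constructs mentioned above; it is meant only to show the shape of "standard methods invoked on named objects".

```python
import cmd
import io

COMPLEMENT = str.maketrans("ACGT", "TGCA")


class GenomeShell(cmd.Cmd):
    """Hypothetical command-line shell holding named sequence objects."""

    prompt = "genome> "

    def __init__(self, stdout=None):
        super().__init__(stdout=stdout)
        self.objects = {}  # object name -> sequence string

    def _get(self, name):
        """Return the named sequence, or report an error and return None."""
        seq = self.objects.get(name)
        if seq is None:
            self.stdout.write(f"unknown object: {name}\n")
        return seq

    def do_load(self, arg):
        """load NAME SEQUENCE : define a sequence object."""
        parts = arg.split()
        if len(parts) != 2:
            self.stdout.write("usage: load NAME SEQUENCE\n")
            return
        self.objects[parts[0]] = parts[1].upper()

    def do_length(self, arg):
        """length NAME : print the number of bases."""
        seq = self._get(arg.strip())
        if seq is not None:
            self.stdout.write(f"{len(seq)}\n")

    def do_revcomp(self, arg):
        """revcomp NAME : print the reverse complement."""
        seq = self._get(arg.strip())
        if seq is not None:
            self.stdout.write(seq.translate(COMPLEMENT)[::-1] + "\n")

    def do_list(self, arg):
        """list : print the names of all defined objects."""
        for name in sorted(self.objects):
            self.stdout.write(name + "\n")

    def do_quit(self, arg):
        """quit : leave the shell."""
        return True


def run_script(lines):
    """Run commands non-interactively and return everything the shell printed."""
    buffer = io.StringIO()
    shell = GenomeShell(stdout=buffer)
    for line in lines:
        if shell.onecmd(line):
            break
    return buffer.getvalue()


print(run_script(["load x aacg", "length x", "revcomp x"]), end="")
```

For interactive use, GenomeShell().cmdloop() would start a prompt that reads commands from the terminal.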
4.2.3 Scaling and storage
As noted in Section 1.2.3, about 40 Mb of human sequence data in contigs longer than 10 kb exists in the genome databases today, using a storage capacity of 60 GB (NCGR). By the time the Human Genome Project is complete, these databases can be expected to hold at least 3 Gb of sequence, along with annotations, links, and other information. If today's ratio of 1.5 KB per sequence-base is maintained, 4.5 TB of storage will be required. At the very least, a comparable 100-fold increase in submission/transaction rates will occur, but we expect the transaction rates to grow even faster as genomic data are more complete and searches become more sophisticated. While these capacities and transaction rates are well within the bounds of current database technology, careful planning is required to ensure the databases are prepared for the coming deluge.
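The arithmetic behind this projection can be checked with a short Python sketch. It uses only the figures quoted above and assumes decimal units (1 KB = 1000 bytes, 1 TB = 10^12 bytes); the variable and function names are illustrative.

```python
# Back-of-envelope storage projection from the figures quoted in the text.
CURRENT_BASES = 40e6           # 40 Mb of human sequence in long contigs today
CURRENT_STORAGE_BYTES = 60e9   # 60 GB of database storage today
TARGET_BASES = 3e9             # about 3 Gb when the Project is complete


def bytes_per_base(storage_bytes: float, bases: float) -> float:
    """Average storage per sequenced base, annotations and links included."""
    return storage_bytes / bases


def projected_storage_bytes(target_bases: float, ratio: float) -> float:
    """Storage required if the bytes-per-base ratio stays constant."""
    return target_bases * ratio


ratio = bytes_per_base(CURRENT_STORAGE_BYTES, CURRENT_BASES)
total = projected_storage_bytes(TARGET_BASES, ratio)
growth = TARGET_BASES / CURRENT_BASES

print(f"{ratio / 1e3:.1f} KB per base")
print(f"{total / 1e12:.1f} TB projected")
print(f"{growth:.0f}-fold growth in sequence volume")
```

With these inputs the sequence volume grows 75-fold, which is the order-of-magnitude increase the text rounds to about 100-fold.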
4.2.4 Archiving raw data
As the Project proceeds, it is reasonable to expect improvements in the analysis of the raw data. A posteriori reprocessing could therefore be quite valuable, provided that the trace data are archived.
One of the algorithms used currently has been developed by P. Green. His base calling algorithm, PHRED, takes as input the trace data produced by the ABI instrument (chromatogram files). Quality parameters are developed based on qualitative features of the trace. Currently 4 such (trace) parameters are used. These are converted to quality thresholds through calibration on known sequence data.
Experiments conducted by Green, involving 17,259 reads in 18 cosmids, yielded the following results, comparing the error rates of the actual ABI software calling package to those of PHRED.
Of course, the distribution of errors should also be compared: error clusters have potentially serious implications for the assembly problem, more so than well-isolated errors. Another potentially important consideration is the location of errors within the read.
It is not unreasonable to expect that the actual conversions, used in the PHRED algorithm, might be improved as the library of known sequence increases. Further, more than one conversion table might be required, depending on the general region of the genome one is attempting to sequence.
C. Tibbetts of George Mason University has developed a base-calling algorithm based upon a neural network architecture. He has also worked to maximize the quality of the base calls through an engineering analysis of, for example, the ABI PRISM 377.
Whatever algorithms are used, it is important that the called sequence of bases have associated confidence values, together with an interpretation of what these values are supposed to mean. For example, confidence values could be pairs of numbers, the first representing the confidence that the base call is correct and the second representing the confidence that the base called is the next base. One might also consider adding a third coordinate representing the confidence that the called base corresponds to one base as opposed to more than one. These values should continually be checked for internal consistency; every read should be compared to the assembled sequence. This comparison involves the alignment of the read against the assembled sequence, minimizing an adjusted error score.
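One way to picture such a comparison is a dynamic-programming alignment in which the read may start and end anywhere in the assembled sequence. The text does not define the "adjusted error score", so the scoring used below is an assumption made for illustration: a mismatch costs the confidence attached to that base call (a disagreement at a low-confidence call is penalized less), and each gap costs a fixed amount. The function name align_read and the gap_cost parameter are likewise hypothetical.

```python
def align_read(read, confidences, assembly, gap_cost=1.0):
    """Align a whole read against any stretch of the assembly.

    Returns (score, end), where score is the minimum total cost and end is
    the assembly offset just past the last aligned base.
    Assumed scoring: mismatch cost = confidence of that call; gap cost is fixed.
    """
    m = len(assembly)
    # prev[j] = best cost of aligning the read prefix so far, ending at assembly[:j].
    # Row 0 is all zeros: the read may begin at any assembly position for free.
    prev = [0.0] * (m + 1)
    for i in range(1, len(read) + 1):
        curr = [prev[0] + gap_cost] + [0.0] * m
        for j in range(1, m + 1):
            mismatch = 0.0 if read[i - 1] == assembly[j - 1] else confidences[i - 1]
            curr[j] = min(
                prev[j - 1] + mismatch,  # read base paired with assembly base
                prev[j] + gap_cost,      # extra base in the read
                curr[j - 1] + gap_cost,  # base missing from the read
            )
        prev = curr
    # The read may also end anywhere, so take the best cell of the last row.
    best_end = min(range(m + 1), key=prev.__getitem__)
    return prev[best_end], best_end


assembly = "TTACGTTT"
print(align_read("ACGT", [0.9, 0.9, 0.9, 0.9], assembly))
print(align_read("ACCT", [0.9, 0.9, 0.2, 0.9], assembly))
```

In the second call the read disagrees with the assembly at a base whose confidence is only 0.2, so the total cost stays low; the same disagreement at a high-confidence call would be a stronger sign of an inconsistency worth examining.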
Finally, there are currently several degrees of freedom in sequencing. Two that could yield different (and, one hopes, independent) processes are:
Correlated errors define an upper bound on the accuracy of base calling algorithms that cannot be surmounted by repeated sequencing using the same chemistry. Ideally the confidence values assigned to individual base calls would closely correspond to these intrinsic errors. This can (and should) be tested experimentally.
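A simple form of such a test is a calibration table: group base calls by their stated confidence and compare each group's mean confidence with the fraction of calls that actually agree with trusted sequence. The sketch below assumes confidences are probabilities in [0, 1] and that the correctness of each call is already known; the function name, bin width, and sample data are hypothetical.

```python
from collections import defaultdict


def calibration_table(calls, n_bins=10):
    """Summarize stated confidence against observed accuracy.

    calls: iterable of (confidence, is_correct) pairs, confidence in [0, 1].
    Returns {bin index: (count, mean confidence, observed accuracy)}.
    """
    bins = defaultdict(list)
    for confidence, is_correct in calls:
        # A confidence of exactly 1.0 is placed in the top bin.
        idx = min(int(confidence * n_bins), n_bins - 1)
        bins[idx].append((confidence, is_correct))

    table = {}
    for idx in sorted(bins):
        items = bins[idx]
        count = len(items)
        mean_conf = sum(c for c, _ in items) / count
        accuracy = sum(1 for _, ok in items if ok) / count
        table[idx] = (count, mean_conf, accuracy)
    return table


# Made-up calls: ten stated at 0.95 (nine correct), two stated at 0.55 (one correct).
sample = [(0.95, True)] * 9 + [(0.95, False), (0.55, True), (0.55, False)]
for idx, (count, mean_conf, accuracy) in calibration_table(sample).items():
    print(f"bin {idx}: n={count} stated={mean_conf:.2f} observed={accuracy:.2f}")
```

Well-calibrated confidence values would show stated and observed figures tracking each other across bins; a persistent gap in one direction would suggest that the conversion from trace parameters to confidence needs recalibration.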
There are two final points on the issue of archiving the raw data. More powerful algorithms (enhanced by either a growing body of knowledge about the genome or by better platforms) could improve the reads, and hence enhance overall accuracy. Such developments could also enable re-assembly in some regions (if they exist) where errors have occurred.
4.2.5 Measures of success
Databases are crucial tools needed for progress in the Human Genome Project, but represent large direct costs in capital equipment and operations and potentially large hidden costs in duplication of effort and training. We believe the only true measure of success will be whether or not these tools are used by researchers making scientific discoveries of the first rank. That a given database installation is "better" than another in some theoretical sense is not sufficient. There are examples in consumer electronics where the "best" technology is not the one chosen by the majority--a similar situation could easily occur with databases in the Human Genome Project. We urge DOE to critically evaluate the "market impact" of the database efforts it supports by regularly surveying users and comparing with other efforts, supported outside DOE. Fundamentally, the operation of a major database is a service role--of very great importance and with real technical challenges--that may not be in the long-term interests of DOE, assuming other satisfactory database tools are available to its researchers at reasonable cost.
4.3 Sociological issues
Until recently, the biological sciences have been based upon relatively free-standing bench-top experimental stations, each with its own desk-top computer and local database. However, a "sequencing factory" with high throughput faces new informatics needs: inventory management, a coordinated distributed computing environment (e.g. EPICS), automated tools for sequence annotation and database submission, and tools for sequence analysis. In addition, the national and international Human Genome Projects must integrate the genomic information into a common and accessible data structure.
The broadly distributed nature of the Project presents a challenge for management of the informatics effort. In particular, across-the-board imposition of standards for software engineering and data quality will be difficult. The best course is for DOE to "choose its battles", emphasizing the development of common standards in areas of highest priority such as database centers, while tolerating a diversity of approaches in areas such as advanced algorithm development for genomic analysis. In addition to standards, consistent "User Group" input and peer review are needed for all of the genome database centers.
It will be helpful to increase the level of participation of the Computer Science community in genome-related informatics activities. While the human genome sequence database is not among the largest databases being developed today, the diverse nature of genome applications and the need to combine information from several different database sources provide real Computer Science challenges. Reaching out to the academic Computer Science community to engage the interest of graduate students and faculty members has not been easy to date. The genome community continues to debate whether it might be more fruitful to educate biologists in computer sciences, rather than educating computer scientists in biology. In our view, both approaches should continue to be pursued. DOE's informatics program should include outreach activities such as workshops, short courses, and other support which will familiarize Computer Scientists with the challenges of the genome program, and which will educate young biologists in those areas of Computer Science which are of importance to the genome effort.