Informatics: Data Collection and Interpretation
The reference map and sequence generated by genome research will be used as a primary information source for human biology and medicine far into the future. The vast amount of data produced will first need to be collected, stored, and distributed. If compiled in books, the data would fill an estimated 200 volumes the size of a Manhattan telephone book (at 1000 pages each), and reading it would require 26 years working around the clock (Fig. 14: Magnitude of Genome Data).
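The scale comparison above can be checked with rough arithmetic. The sketch below assumes a haploid genome of about 3 billion bases, a figure not stated in this section, and derives what the book and reading-time estimates would then imply. The variable names are illustrative only.

```python
# Assumption: about 3 billion bases in the haploid human genome
# (this figure is not given in the text above).
GENOME_BASES = 3_000_000_000

VOLUMES = 200             # from the text
PAGES_PER_VOLUME = 1000   # from the text
READING_YEARS = 26        # from the text, reading around the clock

# Bases that each printed page would have to hold.
bases_per_page = GENOME_BASES / (VOLUMES * PAGES_PER_VOLUME)

# Reading rate implied by finishing in 26 years without stopping.
seconds_reading = READING_YEARS * 365.25 * 24 * 3600
bases_per_second = GENOME_BASES / seconds_reading

print(f"Implied bases per page: {bases_per_page:,.0f}")
print(f"Implied reading rate: {bases_per_second:.1f} bases per second")
```

Under that assumption, each page would carry 15,000 bases and the reader would need to sustain a little under 4 bases per second.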
Because handling this amount of data will require extensive use of computers, database development will be a major focus of the Human Genome Project. The present challenge is to improve database design, software for database access and manipulation, and data-entry procedures to compensate for the varied computer procedures and systems used in different laboratories. Databases need to be designed that will accurately represent map information (linkage, STSs, physical location, disease loci) and sequences (genomic, cDNAs, proteins) and link them to each other and to bibliographic text databases of the scientific and medical literature.
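One way to picture the linking requirement described above is a small relational layout in which map entries, sequences, and literature citations share a common key. The following is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and all inserted values are hypothetical placeholders and do not reflect the design of any actual genome database.

```python
import sqlite3

# In-memory database; every name and value below is a made-up placeholder.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE locus (
    locus_id   INTEGER PRIMARY KEY,
    symbol     TEXT NOT NULL,
    chromosome TEXT NOT NULL
);
CREATE TABLE sequence (
    sequence_id INTEGER PRIMARY KEY,
    locus_id    INTEGER NOT NULL REFERENCES locus(locus_id),
    kind        TEXT NOT NULL CHECK (kind IN ('genomic', 'cDNA', 'protein')),
    residues    TEXT NOT NULL
);
CREATE TABLE citation (
    citation_id INTEGER PRIMARY KEY,
    locus_id    INTEGER NOT NULL REFERENCES locus(locus_id),
    reference   TEXT NOT NULL
);
""")

conn.execute("INSERT INTO locus VALUES (1, 'EXAMPLE1', '7')")
conn.executemany(
    "INSERT INTO sequence (locus_id, kind, residues) VALUES (?, ?, ?)",
    [(1, "cDNA", "ATGGCCAAGTGGTAA"), (1, "protein", "MAKW")],
)
conn.execute(
    "INSERT INTO citation (locus_id, reference) VALUES (1, 'placeholder citation')"
)

# Follow the shared key from a map entry to its sequences and literature.
rows = conn.execute("""
    SELECT l.symbol, l.chromosome, s.kind, c.reference
    FROM locus AS l
    JOIN sequence AS s ON s.locus_id = l.locus_id
    JOIN citation AS c ON c.locus_id = l.locus_id
    ORDER BY s.kind
""").fetchall()

for row in rows:
    print(row)
```

A single query can then move from a mapped locus to its sequences and to the associated literature, which is the kind of cross-linking the paragraph calls for.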
New tools will also be needed for analyzing the data from genome maps and sequences. Recognizing where genes begin and end and identifying their exons, introns, and regulatory sequences may require extensive comparisons with sequences from related species such as the mouse to search for conserved similarities (homologies). Searching a database for a particular DNA sequence may uncover these homologous sequences in a known gene from a model organism, revealing insights into the function of the corresponding human gene.
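The homology search described above can be illustrated with local sequence alignment. The sketch below scores a query against each entry of a toy database using the Smith-Waterman recurrence with a simple linear gap penalty, then ranks the entries by score. The scoring values, the function names, and the two database sequences are assumptions chosen for illustration. Production search tools use more elaborate scoring and faster heuristics.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best Smith-Waterman local alignment score between a and b."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                      # start a new local alignment
                previous[j - 1] + step,  # align a[i-1] with b[j-1]
                previous[j] + gap,       # gap in b
                current[j - 1] + gap,    # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best


def rank_hits(query, database):
    """Return (score, name) pairs for every database entry, best first."""
    scored = ((local_alignment_score(query, seq), name)
              for name, seq in database.items())
    return sorted(scored, reverse=True)


# Hypothetical toy database; the names and sequences are invented.
toy_database = {
    "mouse_like_fragment": "TTGACCGTAAGCTTAA",
    "unrelated_fragment": "GGGGGGGGCCCCCCCC",
}

hits = rank_hits("ACCGTAAGC", toy_database)
for score, name in hits:
    print(name, score)
```

The entry that contains the query as a conserved stretch ranks first, which mirrors how a database search can point from a human sequence to a known gene in a model organism.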
Correlating sequence information with genetic linkage data and disease gene research will reveal the molecular basis for human variation. If a newly identified gene is found to code for a flawed protein, the altered protein must be compared with the normal version to identify the specific abnormality that causes disease. Once the error is pinpointed, researchers must try to determine how to correct it in the human body, a task that will require knowledge about how the protein functions and in which cells it is active.
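Comparing an altered protein with its normal version can be as simple as a position-by-position scan when the change is a single substitution. The sketch below uses a well-known case chosen here as an illustration, not one named in the text: the first eight residues of mature beta-globin and the sickle-cell variant. The function name is hypothetical, and the approach does not handle insertions or deletions.

```python
def substitutions(normal, altered):
    """List (position, normal_residue, altered_residue), counting from 1."""
    if len(normal) != len(altered):
        raise ValueError("this sketch handles substitutions only; "
                         "sequences must have equal length")
    return [
        (position, n, a)
        for position, (n, a) in enumerate(zip(normal, altered), start=1)
        if n != a
    ]


# First eight residues of mature beta-globin (initiator Met removed).
normal_beta_globin = "VHLTPEEK"
sickle_beta_globin = "VHLTPVEK"

differences = substitutions(normal_beta_globin, sickle_beta_globin)
print(differences)  # [(6, 'E', 'V')]
```

The single difference is glutamic acid replaced by valine at position 6, the kind of specific abnormality the paragraph describes pinpointing.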
Correct protein function depends on the three-dimensional (3-D), or folded, structure that proteins assume in biological environments; thus, understanding protein structure will be essential in determining gene function. DNA sequences will be translated into amino acid sequences, and researchers will try to make inferences about functions either by comparing protein sequences with each other or by comparing their specific 3-D structures (Fig. 15: Understanding Gene Function).
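Translating a DNA sequence into an amino acid sequence follows the standard genetic code, which can be sketched in a few lines. The example below builds the codon table from the conventional compact ordering (bases in the order T, C, A, G) and reads a coding strand codon by codon until a stop codon. The function name and the input sequence are illustrative, and the sketch assumes the reading frame starts at the first base.

```python
from itertools import product

# Standard genetic code, one-letter amino acids, codons ordered TTT, TTC, TTA, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding-strand DNA sequence, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")  # X marks an unknown codon
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Toy sequence, not taken from any real gene.
print(translate("ATGGCCAAGTGGTAA"))  # MAKW
```

The resulting protein string is the form in which sequences are compared with each other in the searches described above.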
Because the 3-D structure patterns (motifs) that protein molecules assume are much more evolutionarily conserved than amino acid sequences, this type of homology search could prove more fruitful. Particular motifs may serve similar functions in several different proteins, information that would be valuable in genome analyses. Currently, however, only a few protein motifs can be recognized at the sequence level. Continued development of analytic capabilities to facilitate grouping protein sequences into motif families will make homology searches more successful.
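Recognizing a motif at the sequence level often amounts to pattern matching. As a minimal sketch, the example below searches a protein for the N-glycosylation site pattern N-{P}-[ST]-{P} (asparagine, any residue except proline, serine or threonine, any residue except proline), expressed as a regular expression. This pattern was chosen for illustration and is not mentioned in the text; the function name and the test sequence are hypothetical.

```python
import re

# N-{P}-[ST]-{P} written as a regular expression. The lookahead lets
# overlapping occurrences be reported as well.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")


def find_motif(protein, pattern=N_GLYCOSYLATION):
    """Return (position, matched_residues) pairs, counting positions from 1."""
    return [(m.start() + 1, m.group(1)) for m in pattern.finditer(protein)]


# Invented toy protein sequence.
sites = find_motif("MKNGSAPNPTQNLTV")
print(sites)  # [(3, 'NGSA'), (12, 'NLTV')]
```

Proteins that share hits for the same pattern could then be grouped into a candidate motif family, although a sequence-level match alone does not establish a shared 3-D structure or function.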
The Genome Data Base (GDB), located at Johns Hopkins University (Baltimore, Maryland), provides location, ordering, and distance information for human genetic markers, probes, and contigs linked to known human genetic disease. GDB is presently working on incorporating physical mapping data. Also at Hopkins is the Online Mendelian Inheritance in Man database, a catalog of inherited human traits and diseases.
The Human and Mouse Probes and Libraries Database (located at the American Type Culture Collection in Rockville, Maryland) and the GBASE mouse database (located at Jackson Laboratory, Bar Harbor, Maine) include data on RFLPs, chromosomal assignments, and probes from the laboratory mouse.
Public databases containing the complete nucleotide sequence of the human genome and those of selected model organisms will be one of the most useful products of the Human Genome Project. Four major public databases now store nucleotide sequences: GenBank and Genome Sequence DataBase (GSDB) in the United States, European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database in the United Kingdom, and the DNA Data Bank of Japan (DDBJ). The databases collaborate to share sequences, which are compiled from direct author submissions and journal scans. The four databases now house a total of almost 200 Mb of sequence. Although human sequences predominate, more than 8000 species are represented. [Paragraph updated 7/94.]
The major protein sequence databases are the Protein Identification Resource (National Biomedical Research Foundation), SWISS-PROT, and GenPept (distributed with GenBank). In addition to sequence information, they contain information on protein motifs and other features of protein structure.