THE U.S. HUMAN GENOME PROJECT
The First Five Years: Fiscal Years 1991-1995
III. SCIENTIFIC GOALS
A. Mapping and Sequencing the Human Genome
The human genome consists of 50,000 to 100,000 genes located on 23 pairs of chromosomes. One chromosome in each pair is inherited from the mother, the other from the father. Each chromosome contains a long molecule of DNA, the chemical of which genes are made. The DNA, in turn, is a double-stranded molecule in which each strand is a linear array of units called nucleotides or bases. There are four different bases, called A, T, G, and C. The bases on one DNA strand are precisely paired with the bases on the other strand, so that A is always opposite T and G is always opposite C.
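The pairing rule can be illustrated with a minimal Python sketch. The function name `complement_strand` and the example sequence are illustrative choices, not part of this report; the function simply returns the base that sits opposite each base of a given strand, position by position.

```python
# Pairing rule from the text: A pairs with T, and G pairs with C.
PAIRING = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement_strand(strand: str) -> str:
    """Return the base found opposite each base of `strand`, in the same order."""
    return "".join(PAIRING[base] for base in strand.upper())


print(complement_strand("ATGC"))  # TACG
```

Because the pairing is symmetric, applying the function twice returns the original strand.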
The order of the four bases on the DNA strand determines the information content of a particular gene or piece of DNA. Genes differ in length, ranging in size from roughly 2,000 to as many as 2 million base pairs. Mapping is the process of determining the position and spacing of genes, or other genetic landmarks, on the chromosomes relative to one another. There are basically two types of maps, genetic and physical, which differ in the methods used to construct them and in the metric that is used to measure the distance between genes. Sequencing is the process of determining the order of the nucleotides, or base pairs, in a DNA molecule.
Although mapping of human genes began early in the twentieth century, it has been intensively pursued only for the past two decades. For most of this period the methods that were developed, though original and ingenious, have been inadequate for comprehensive mapping and have only allowed the construction of relatively crude maps with very little detail. Recently, much more effective technology has been introduced. To date, about 1,700 of the estimated 50,000 to 100,000 human genes (roughly 2 to 3 percent) have been mapped.
A frequently asked question is: whose genome will be sequenced? The answer is, no one's. The first complete human genome to be sequenced will be a composite of sequences from many sources, most of these being cell lines that have existed in laboratories all over the world for some time. The sequence will be a generic sequence representative of humans in general and not of any particular individual. The complete sequence will provide a standard against which other partial sequences can be compared. It has been suggested that, due to the great variability between individual human beings, a single sequence would not be very useful.
While it is true that much valuable insight will come from comparing many different human sequences, the presumption is that functionally important DNA is conserved among humans, just as it is between humans and mice in those areas that have been studied. DNA regions of particular interest, such as genes involved in genetic disease, will be sequenced from many individuals in the course of research on those diseases. As more information about the extent of genetic variation accumulates from these and other studies in the next few years, it will be evaluated to determine the impact on strategy for the human genome project.
1. Genetic Map
Genetic maps have many uses, including identification of the genes associated with genetic diseases and other biological properties. Genetic maps also form the essential backbone or scaffold needed to guide a physical mapping effort.
Genetic maps are constructed by determining how frequently two "markers", such as a physical trait, a particular medical syndrome, or a detectable DNA sequence, are inherited together. Genes that lie close together on a chromosome have a much higher chance of being inherited together than do genes that lie farther apart. Genetic studies of families, to determine how frequently two traits are inherited together, lead to the production of "genetic maps" in which distance between genes is measured in centimorgans (in honor of the American geneticist Thomas Hunt Morgan). Two markers are one centimorgan apart if they are separated one percent of the time during transmission from parents to children. The physical or molecular distance to which a centimorgan corresponds varies a great deal, but the genome-wide average distance for a centimorgan is believed to be roughly 1 million base pairs.
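The centimorgan definition above can be sketched as a small calculation. This is a minimal illustration, assuming closely linked markers (for which map distance is approximately the percentage of transmissions in which the markers separate) and using the genome-wide average of 1 million base pairs per centimorgan quoted above. The function names and the family counts are hypothetical.

```python
# Genome-wide average quoted in the text; the local value varies a great deal.
AVERAGE_BP_PER_CM = 1_000_000


def centimorgans(separated: int, transmissions: int) -> float:
    """Map distance for closely linked markers.

    `separated` is the number of parent-to-child transmissions in which the two
    markers were separated; `transmissions` is the total number observed.
    """
    if transmissions <= 0:
        raise ValueError("transmissions must be positive")
    return 100.0 * separated / transmissions


def approximate_base_pairs(distance_cm: float) -> float:
    """Rough physical distance implied by the genome-wide average."""
    return distance_cm * AVERAGE_BP_PER_CM


# Hypothetical data: markers separated in 3 of 200 observed transmissions.
distance = centimorgans(3, 200)
print(distance)                          # 1.5
print(approximate_base_pairs(distance))  # 1500000.0
```

The conversion to base pairs is only an average; as the text notes, the physical distance corresponding to a centimorgan varies widely across the genome.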
The development of genetic mapping tools is prominent among the technical advances that led to the Human Genome Initiative. The introduction of DNA markers, such as restriction fragment length polymorphisms, or RFLPs, to detect genetic variation among individuals has been one of the most important innovations. Such markers are relatively easy to find in large numbers and have been used to construct genetic maps. Advances in this area have continued over the past two years: new types of DNA markers have been defined, and techniques such as denaturing-gradient gel electrophoresis have been adapted to detect subtle variations in DNA sequences. As a result, the number of useful markers has increased.
It is estimated that 3000 well-spaced and informative markers will be needed to achieve a completely linked map, with markers an average of one centimorgan apart as recommended by the NRC. For the first five years, the genome program has set as its goal the creation of a 2 to 5 centimorgan map, which would require 600 to 1500 such markers. Each marker should be identified by a sequence-tagged site (STS) as defined in the section on physical mapping. A working group has been established to develop a plan for achieving this goal.
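The marker counts follow from simple division, sketched below. The total genetic length of 3,000 centimorgans is an assumption inferred from the figure of 3000 markers at an average spacing of one centimorgan; the function name is illustrative.

```python
import math

# Assumption: total genetic length implied by "3000 markers, one centimorgan apart".
GENETIC_LENGTH_CM = 3000


def markers_needed(spacing_cm: float) -> int:
    """Number of evenly spaced markers needed to cover the map at a given spacing."""
    return math.ceil(GENETIC_LENGTH_CM / spacing_cm)


for spacing in (1, 2, 5):
    print(spacing, markers_needed(spacing))
# 1 3000
# 2 1500
# 5 600
```

This assumes ideally spaced markers; in practice markers fall unevenly, so more candidates must be tested to reach a given average spacing.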
2. Physical Map
The distance between sites on physical maps is measured in units of physical length, such as numbers of nucleotide pairs. Physical maps can be constructed in a variety of different ways. They are used as the basis for the isolation and characterization of individual genes or other DNA regions of interest, as well as to provide the starting material for DNA sequencing. The ability to construct physical maps derives from recombinant DNA techniques that allow the isolation and cloning of DNA fragments, the identification of specific sequence markers on DNA, and the determination of the order of and distance between such markers on a chromosome.
There are several kinds of physical maps, which can be categorized into two general types. The first type describes the order and spacing of markers on a DNA molecule. One example is the cytogenetic map, which is based on microscopic analysis and records the location of genes or DNA markers relative to visible landmarks on the chromosomes. This is the oldest type of physical map and its resolution (precision in locating markers) is rather low, on the order of 10 million base pairs. Nevertheless, the cytogenetic map is still an extremely valuable tool and markers continue to be mapped in this way. At the recent 10th Human Gene Mapping Workshop, the number of mapped markers was reported to be 4362, as opposed to 2057 only two years ago. Another example of this type of physical map is the long-range restriction map, which records the order of and distance between specific sequences, known as restriction sites, on chromosomes. The resolution of long-range restriction maps is between 100,000 and 2 million base pairs.
The second type of physical map consists of a collection of cloned pieces of DNA that represent a complete chromosome or chromosomal segment, together with information about the order of the cloned pieces. There are a variety of techniques for cloning DNA and a number of methods for determining the order of the clones. The technology for constructing overlapping clone sets (known as "contigs") is continually improving. At present, a collection of ordered clones is typically the starting material for sequencing. However, novel approaches that do not require cloning, but still allow the investigator access to the DNA to be sequenced, are under development.
In the past two years, improvements in several techniques have made the initial stages in the construction of physical maps of large genomes significantly easier and more rapid than was predictable at the time of the NRC recommendations. These techniques include pulsed-field gel electrophoresis, yeast artificial chromosome cloning, the polymerase chain reaction (PCR), fluorescence in situ hybridization, and radiation hybrid analysis. Currently, the U.S. government supports research projects to physically map the DNA of all or parts of 11 of the 24 human chromosomes (there are 23 pairs of chromosomes, but the X and Y sex chromosomes are not like each other, resulting in 24 different chromosomes).
NIH is supporting, through its extramural grants program, projects for physical mapping of three chromosomes (3, 4, 18). The DOE is supporting projects in the Los Alamos and Livermore National Laboratories to produce complete overlapping clone maps of two others (16, 19), and the two agencies are funding separate but complementary physical mapping efforts on another six chromosomes (5, 11, 17, 21, 22, X). These projects involve the construction of physical maps of both types, using both state-of-the-art techniques and new methods under development. The DOE also supports the preparation of clone libraries representing the various chromosomes under study at Los Alamos and Livermore.
There are still several technological barriers to the rapid, inexpensive, and routine construction of physical maps. One is the relatively short length of DNA over which a continuous, or uninterrupted, set of overlapping clones can be readily established. Contigs are typically small, consisting of between two and six cosmid clones (a cosmid is a type of vector that can carry a maximum of 40 thousand base pairs). To be more than minimally useful, the length of DNA over which the physical map shows continuity, or "connectivity," must be considerably longer.
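The idea of a contig as an uninterrupted set of overlapping clones can be sketched as interval merging. This is a simplified illustration that assumes each clone's start and end position is already known; in real mapping projects overlap must be inferred experimentally, which is the hard part. The coordinates and the function name `build_contigs` are hypothetical.

```python
from typing import List, Tuple

Clone = Tuple[int, int]  # (start, end) in base pairs; hypothetical coordinates


def build_contigs(clones: List[Clone]) -> List[Clone]:
    """Merge overlapping clones into contigs (continuously covered stretches)."""
    contigs: List[Clone] = []
    for start, end in sorted(clones):
        if contigs and start < contigs[-1][1]:
            # This clone overlaps the current contig, so extend that contig.
            last_start, last_end = contigs[-1]
            contigs[-1] = (last_start, max(last_end, end))
        else:
            # No overlap with the previous contig: a gap, so start a new one.
            contigs.append((start, end))
    return contigs


# Four hypothetical 40,000-base-pair cosmid clones; the last one is isolated.
clones = [(0, 40_000), (30_000, 70_000), (65_000, 105_000), (200_000, 240_000)]
for start, end in build_contigs(clones):
    print(start, end, end - start)
# 0 105000 105000
# 200000 240000 40000
```

In this toy case three overlapping cosmids yield a contig of only about 105,000 base pairs, which illustrates why contigs built from a handful of cosmids cover so little DNA.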
A challenging but reasonable goal for physical mapping research projects is to extend to about 2 million base pairs the length of a DNA segment that can be covered by a single contig or spanned by a set of closely spaced, ordered markers. If physical mapping of human chromosomes is to be achieved within the next five years, it is important that current physical mapping efforts give their highest priority to the problem of completing maps, that is, of achieving uninterrupted continuity of physical mapping data over large regions of DNA.
Another difficulty faced by those trying to assemble physical maps of chromosomes has been the inability to compare the results of one mapping method directly with those of another and to combine maps constructed by two different techniques into a single map. This problem is addressed by the recent proposal of a new concept or definition of a useful physical map. According to the proposed system, data from any of a variety of physical mapping techniques can be reported in a common "language." In this system, each mapped element (individual clone, contig, or sequenced region) is defined by a unique "sequence-tagged site" or STS, which is basically a short DNA sequence that has been shown to be unique. A map is then constructed showing the order and spacing of the STSs.
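The two ideas in this proposal, a short sequence that occurs only once and a map that records the order and spacing of such sites, can be sketched in a few lines. This is a toy model only: it treats the DNA as a short in-memory string, whereas in practice uniqueness is established experimentally. The `STS` record, its field names, the function names, and the sequences are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class STS:
    name: str       # hypothetical label
    sequence: str   # the short tag sequence
    position: int   # hypothetical position on the mapped DNA, in base pairs


def is_unique(sequence: str, dna: str) -> bool:
    """True if `sequence` occurs exactly once in `dna` (overlapping hits count)."""
    first = dna.find(sequence)
    return first != -1 and dna.find(sequence, first + 1) == -1


def sts_map(markers: List[STS]) -> Tuple[List[str], List[int]]:
    """Return marker names in map order and the spacing between neighbors."""
    ordered = sorted(markers, key=lambda m: m.position)
    spacings = [b.position - a.position for a, b in zip(ordered, ordered[1:])]
    return [m.name for m in ordered], spacings


dna = "AAGCTTGGCCATATGGCC"          # toy sequence
print(is_unique("CATATG", dna))     # True
print(is_unique("GGCC", dna))       # False (occurs twice)

markers = [STS("sts_b", "CATATG", 9), STS("sts_a", "AAGCTT", 0)]
print(sts_map(markers))             # (['sts_a', 'sts_b'], [9])
```

Because every mapping method reports the same kind of record, maps built by different techniques can be merged by sorting their STSs together.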
The STS system, as proposed, appears to have several advantages. The STS map can be represented electronically and stored in a database that is publicly available and contains sufficient information to enable any scientist to recover de novo any mapped chromosomal region in his/her own laboratory. Thus, the proposed STS system will facilitate the scientific community's access to the human physical map. Quality control and project accountability will also be improved because the mapping results reported by any individual laboratory can readily be checked elsewhere.
Access to mapped DNA through the information in the STS database will obviate the need for an expensive, long term, centralized repository of clones, although it will not eliminate the need to generate and map such clones nor the need to store them in and distribute them from the laboratory in which they are produced. The proposed STS system will also facilitate the integration of results from different laboratories, regardless of the methods used, to produce a single, useful physical map and will establish a uniform criterion for determining how complete the map of a particular region is. Finally, an STS map may in the future be the appropriate starting point for DNA sequencing.
The STS proposal is still under discussion in the scientific community and few, if any, mapping projects have started to use the STS system. Another uncertainty is the additional cost of generating STS markers. The NIH and DOE have established a joint working group to develop more detailed plans for testing and implementing the STS approach to physical mapping.
Over the next five years, in addition to generation of STS maps, efforts should be continued to generate complete contig maps of large regions of the human genome. Because current technology is not yet sufficient for this task, however, it is unclear what fraction of the genome can be cloned and ordered during this time. An STS map, with one STS characterized approximately every 100,000 base pairs, is an achievable goal. Such a map will assist continued efforts to isolate the intervening DNA.
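The scale of this goal can be estimated with a short calculation. It assumes a genome of about 3 billion base pairs (the figure this report uses) and evenly spaced sites; the constant and function names are illustrative.

```python
GENOME_SIZE_BP = 3_000_000_000  # about 3 billion base pairs, as used in this report


def landmarks_needed(spacing_bp: int, genome_bp: int = GENOME_SIZE_BP) -> int:
    """Number of evenly spaced landmarks needed at the given spacing."""
    return -(-genome_bp // spacing_bp)  # ceiling division


print(landmarks_needed(100_000))  # 30000
```

At one STS per 100,000 base pairs, the map would therefore require on the order of 30,000 characterized sites.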
3. DNA Sequencing
Nearly four decades ago, when Francis Crick and James Watson elucidated the double helix structure of DNA, there was no way to determine the sequence of even short DNA molecules. Only years later, with the advent of recombinant DNA technology in the early 1970s, was it possible to think of isolating individual genes. That breakthrough, combined with the development of powerful DNA sequencing techniques, provided the technological basis for the Human Genome Initiative.
To date, the only organisms for which a complete DNA sequence has been determined are viruses. The largest published viral genome sequence is that of the Epstein-Barr virus, a sequence of 170,000 base pairs. Scientists are now attempting to sequence the DNA of certain bacteria, approximately 4.5 million base pairs long. The size and complexity of human DNA, however, still make the sequencing of the human genome awesome to contemplate. Although many short stretches of human DNA have been sequenced (slightly more than 5 million base pairs altogether), the human genome comprises about 3 billion base pairs of DNA and is roughly 700 times larger than such a bacterial genome.
If such a large amount of DNA is to be sequenced, a substantial increase in the speed and reduction in the cost of sequencing technology will be required. The current cost of DNA sequencing, in laboratories that do it routinely, is estimated to be about $2 to $5 per base pair of finished sequence, that is, sequence whose accuracy has been adequately confirmed. In laboratories that sequence DNA only occasionally, the costs are much higher. The costs of DNA preparation, salaries and overhead are included in these figures. These costs must be reduced below 50 cents a base pair before large scale sequencing will be cost effective.
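The following sketch shows why these per-base-pair figures matter. It assumes a genome of about 3 billion base pairs, each counted once at the finished-sequence cost; the names are illustrative, and the totals are rough orders of magnitude rather than budget estimates.

```python
GENOME_SIZE_BP = 3_000_000_000  # about 3 billion base pairs


def total_cost_dollars(cost_per_bp: float, genome_bp: int = GENOME_SIZE_BP) -> float:
    """Cost of finished sequence for the whole genome at a flat per-base-pair rate."""
    return genome_bp * cost_per_bp


for rate in (5.0, 2.0, 0.5):
    billions = total_cost_dollars(rate) / 1e9
    print(f"${rate:.2f} per base pair -> ${billions:.1f} billion")
# $5.00 per base pair -> $15.0 billion
# $2.00 per base pair -> $6.0 billion
# $0.50 per base pair -> $1.5 billion
```

At current rates the sequence alone would cost $6 billion to $15 billion under these assumptions, compared with about $1.5 billion at 50 cents per base pair.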
Sequencing technology has improved significantly in the past two years. Machines that automatically identify the order of base pairs in appropriately prepared DNA samples are now readily available. In the most advanced laboratories it is possible, using these machines, for one individual to generate about 2000 base pairs of finished DNA sequence per day per machine, starting with properly prepared cloned DNA.
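To put this throughput in perspective, the sketch below estimates how long the whole genome would take at 2000 base pairs of finished sequence per day per machine. The genome size of about 3 billion base pairs comes from this report; the number of machines and the assumption of 365 working days per year are hypothetical, chosen only for illustration.

```python
GENOME_SIZE_BP = 3_000_000_000   # about 3 billion base pairs
BP_PER_MACHINE_DAY = 2000        # finished sequence per day per machine (from the text)


def machine_days(genome_bp: int = GENOME_SIZE_BP) -> float:
    """Total machine-days needed to produce the finished sequence once."""
    return genome_bp / BP_PER_MACHINE_DAY


def years_needed(machines: int, working_days_per_year: int = 365) -> float:
    """Elapsed years for a hypothetical number of machines running in parallel."""
    return machine_days() / (machines * working_days_per_year)


print(machine_days())                 # 1500000.0
print(round(years_needed(100), 1))    # 41.1
```

Even with 100 machines running every day of the year, the work would take roughly four decades at this rate, which is why large gains in speed are needed.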
One approach to lowering the cost of DNA sequencing is further automation. The maximum reduction in cost of current sequencing technology will come from the creation of a fully automated assembly line for rapid DNA sequencing. Efforts are underway in both DOE and NIH-sponsored projects, as well as in private companies, to automate most of the preparatory steps in the sequencing process through the development of high speed robotic workstations for sample handling.
During the next five years, pilot projects will be undertaken in order to test strategies and develop technologies for larger sequencing projects, with the aim of reducing costs to well below $1 per base pair by the end of the first five-year period. These projects should analyze biologically interesting regions in the size range of 200,000 to 1 million base pairs. In these developmental efforts, it will be more important to complete the sequence of chosen segments rather than to merely obtain a very high number of base pairs of sequence comprising many smaller segments. This approach will maximize the possibility of successfully identifying and developing the technology needed to proceed with large-scale genomic analysis.
In addition, the amount of biological information obtained in the sequencing of human DNA in the course of these developmental research programs will be significantly increased if parallel efforts to sequence equivalent regions in the mouse are undertaken. Such comparative approaches will be encouraged.
In order to keep the costs of the human genome project within the original estimates, the cost of routine large-scale sequencing will ultimately have to be reduced to well below 50 cents per base pair. Therefore, sequencing projects larger than these pilot projects, such as the sequencing of an entire human chromosome, will not be considered until the cost of sequencing is reduced to that level. The cost of sequencing will be assessed in five years and a recommendation made as to further technological developments needed before large sequencing projects are undertaken.
It is by no means certain that enhancement of current technology, as described above, will bring the cost of sequencing down sufficiently. Therefore, entirely new approaches to DNA sequencing will also be encouraged. There are a number of techniques that hold some promise, including the use of capillary gel electrophoresis, the use of stable isotopes and mass spectrometry, and new imaging techniques, such as scanning tunneling or atomic force microscopy and X-ray imaging. Projects of this sort are being pursued under support from the DOE and the NIH, as well as in private industry.
File posted April 3, 1995.
Last modified: Monday, April 19, 2010
Document Use and Credits
Publications and webpages on this site were created by the U.S. Department of Energy Genome Program's Biological and Environmental Research Information System (BERIS). Permission to use these documents is not needed, but please credit the U.S. Department of Energy Genome Programs and provide the website http://genomics.energy.gov. All other materials were provided by third parties and not created by the U.S. Department of Energy. You must contact the person listed in the citation before using those documents.
Site sponsored by the U.S. Department of Energy Office of Science, Office of Biological and Environmental Research, Human Genome Program