Announcements on the First Analysis of Genome Sequence
February 12, 2001
"The distribution of genes on mammalian chromosomes is uneven, making for a striking appearance," said Bob Waterston, Director of the Genome Center at Washington University in St. Louis. "In some regions, genes are crowded together much like buildings in urban centers. In other areas, genes are spread over the vast expanses like farmhouses on the prairie. And then there are large tracts of desert, where only non-coding 'junk DNA' can be found. Each region tells a unique story about the history of our species and what makes us tick."
This landscape contrasts starkly with the genomes of many other organisms, such as the mustard weed, the worm, and the fly. Their genomes more closely resemble uniform, sprawling suburbs, with genes relatively evenly spaced along chromosomes.
The human genome's gene-dense urban centers are predominantly composed of the DNA building blocks G and C and are called "GC-rich regions." In contrast, the junk-DNA deserts are AT rich. GC- and AT-rich regions can actually be seen through a microscope as light and dark bands on chromosomes.
On each human chromosome, there are large and noticeable swings in GC content; one stretch might have 60 percent while an adjacent stretch might have only 30 percent. These swings could never occur randomly and represent a definite organization of "neighborhoods" with local accents.
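To make the idea of a GC "swing" concrete, here is a minimal Python sketch that measures GC content in consecutive windows along a sequence. The function names, the window size, and the toy sequence are illustrative assumptions, not part of the consortium's analysis; real chromosome surveys use windows of many thousands of letters.

```python
def gc_fraction(seq):
    """Return the fraction of G and C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_by_window(seq, window):
    """GC fraction for consecutive, non-overlapping windows of the given size."""
    return [gc_fraction(seq[i:i + window]) for i in range(0, len(seq), window)]


# Toy sequence (made up): a 60 percent GC stretch next to a 30 percent GC stretch.
toy = "GCGCGCATAT" + "GCGATATATA"
profile = gc_by_window(toy, 10)
print(profile)
```

Plotting such a profile along a chromosome would show the GC-rich "cities" and AT-rich "deserts" as peaks and valleys.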
The urban centers contain a ten-fold higher density of genes than the deserts. "It is as though a détente has been established between genes and the long, repeating segments of junk DNA - a treaty whereby certain repeat elements have agreed to occupy the deserts, leaving the cities for the genes," says Eric Lander, the Director of the Whitehead Institute Center for Genome Research.
Another interesting feature is that so-called "HOX gene clusters," which play an important role in development, are never invaded by junk DNA. This suggests that evolution has a reason for retaining the integrity of these gene clusters.
Near the gene cities are neighboring regions of the dinucleotide CpG - stretches of up to 30,000 letters with only two bases, C and G, repeating over and over. Usually underrepresented throughout the genome, many CpG regions help regulate
Predictions regarding the number of genes in the human genome have been as variable as the biotech index in the NASDAQ, with estimates ranging anywhere from 20,000 to 120,000 genes.
Ending this decade-long period of wild speculation, the international consortium now reports that they have arrived at a more accurate and stable estimate. They have concluded that the genome contains between 30,000 and 35,000 genes. Although small gaps in the human genome sequence must be filled before scientists can arrive at an exact number, they now have almost all of the data they need to make an accurate projection.
This news represents a humbling reality check for those who have long harbored hubris about the number of genes in humans. The new estimate indicates that humans have only twice as many genes as the worm or the fly! How can human complexity be explained by a genome with such a paucity of genes? It turns out humans are very thrifty with their genes, able to do more with what they have than other species. Instead of producing only one protein per gene, human genes can produce several different proteins.
Humans use a process called "alternative splicing," in which different parts of a protein can be rearranged as needed-much like parts of tinker toys-to make different proteins from the same basic components. Alternative splicing is possible because human genes are spread out over large regions of genomic DNA, and regions that code for proteins are not necessarily continuous, allowing one gene to code for different parts of a protein. On the average, each human gene probably makes three proteins, more than worms and flies do.
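As a rough illustration of how one gene can yield several products, the Python sketch below enumerates the transcripts obtained when some coding segments (exons) may be skipped. It models only simple exon skipping, one of several forms of alternative splicing, and the exon names and the choice of which exons are optional are hypothetical.

```python
from itertools import combinations


def exon_skipping_isoforms(exons, optional):
    """List the exon combinations made by skipping any subset of optional exons.

    Exons that are not optional are always kept, and genomic order is preserved.
    """
    skippable = [e for e in exons if e in optional]
    isoforms = []
    for k in range(len(skippable) + 1):
        for skipped in combinations(skippable, k):
            isoforms.append([e for e in exons if e not in skipped])
    return isoforms


# Hypothetical gene with four exons; the two middle exons may be skipped.
gene = ["E1", "E2", "E3", "E4"]
isoforms = exon_skipping_isoforms(gene, optional={"E2", "E3"})
for iso in isoforms:
    print("-".join(iso))
```

In this toy case, two optional exons give four possible transcripts from a single gene. Real genes typically use only some of the combinations that are possible on paper.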
Because genes make up only a tiny fraction of the human genome, they are challenging to identify in the genome sequence. Thus, the predicted genes and protein sets described by scientists are not final; they will continue to be fine-tuned as better gene-finding tools are developed.
The low number of genes comes as good news for scientists in academia and pharmaceutical companies. Gene hunters who want to compile a compendium of all genes, and pharmaceutical companies looking for finite numbers of drug targets, have just had their work, time, and expense cut significantly.
Why reinvent the wheel when you've got a strategy that works? Evolution certainly seems to operate that way, especially where human proteins are concerned.
The full set of proteins (the proteome) encoded by the human genome is more complex than those of invertebrates largely because vertebrates have rearranged old protein domains into a richer collection of new architectures. In other words, humans have achieved innovations by rearranging and expanding tried-and-true strategies from other species - not by developing novel strategies of their own. "The cheapest way to invent something new is to take a good invention and tweak it to suit a new purpose," says Sir John Sulston, former Director of the Sanger Centre.
Another way we humans innovate is by expanding protein families. Scientists report that some 60 percent of protein families in humans are superfamilies, with more family members than in any of the other four sequenced organisms. This suggests that gene duplication has been a major evolutionary force during vertebrate evolution.
Many of the families that have undergone expansions in humans are involved in distinctive aspects of vertebrate physiology. One example is the family of immunoglobulin (Ig) domains, first identified in antibodies thirty years ago. Classic Ig domains are absent from the yeast and the mustard weed. In vertebrates, the Ig repertoire includes a wide range of immune functions and is a testament to the notion that a single family of proteins can be extremely versatile, mounting a multi-pronged, orchestrated response to infection.
Another example of a family that has proliferated in humans is epithelial proteins such as keratin. This protein family probably grew to support and line the various organs in humans, including the lining of the small intestine and cilia in the inner ear.
Finally, at least some families of genes in the human genome seem to be shrinking. More than half of our smell receptors appear to be broken. This is curious, given that smell receptors belong to one of the biggest gene families (with more than 1000 members). Smell was key to survival in our vertebrate ancestors, but humans seem to have lost their dependence on it; for us, vision is probably more important for survival.
Only a tiny fraction of the human genome, about 1.5 percent, consists of protein-coding regions. More than half of the genome consists of repetitive sequences, or "junk DNA," that have been hopping around the genome for 3 billion years.
Junk DNA has helped scientists come to terms with one of the human genome's most perplexing paradoxes-that our genome is 200 times larger than that of baker's yeast but 200 times smaller than that of amoeba! Scientists chalked up this discrepancy in genome sizes to the existence of junk DNA collecting in organisms and the lack of routine housecleaning. Even so, scientists didn't fully appreciate the value of junk DNA-until now.
Junk DNA represents a rich fossil record of clues to our evolutionary past. It is possible to date groups of repeats to when in the evolutionary process they were "born" and to follow their fates in different regions of the genome or in different species. The HGP scientists used 3 million such elements as dating tools.
Based on such "DNA dating," scientists can build family trees of the repeats, showing exactly where they came from and when. These repeats have reshaped the genome by rearranging it, creating entirely new genes, and modifying and reshuffling existing genes.
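The release does not spell out how "DNA dating" is computed. One common approach, sketched below in Python as an assumption rather than as the consortium's method, is to compare each repeat copy with the consensus sequence of its family, correct the observed mismatch fraction for repeated hits at the same site (the Jukes-Cantor correction), and divide by a substitution rate. The sequences and the rate used here are placeholders chosen only for illustration.

```python
import math


def mismatch_fraction(copy, consensus):
    """Fraction of aligned positions where a repeat copy differs from its consensus."""
    if len(copy) != len(consensus):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(a != b for a, b in zip(copy, consensus))
    return diffs / len(consensus)


def jukes_cantor_distance(p):
    """Convert an observed mismatch fraction p into substitutions per site."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the Jukes-Cantor correction")
    return -0.75 * math.log(1 - 4 * p / 3)


def estimate_age_years(p, rate_per_site_per_year):
    """Rough age of a repeat copy: corrected divergence divided by an assumed rate."""
    return jukes_cantor_distance(p) / rate_per_site_per_year


# Made-up example: a 20-letter copy that differs from its consensus at 2 positions.
consensus = "ACGTACGTACGTACGTACGT"
copy = "ACGTACGAACGTACGTACCT"
ASSUMED_RATE = 2.0e-9  # placeholder substitutions per site per year, not from the source

p = mismatch_fraction(copy, consensus)
print(p, estimate_age_years(p, ASSUMED_RATE))
```

The more a copy has drifted from its family consensus, the older it is estimated to be; grouping copies by estimated age gives the "birth cohorts" used to build family trees of repeats.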
Calculating the evolutionary age of the repeat elements in the human genome has turned up a wealth of interesting, shocking, and curious facts about the stuff that we are made of (see next two vignettes for details).
One of the most interesting aspects of the repeat elements is that, as a species, we humans seem to be pack-rats, in stark contrast to other organisms. The amount of junk we've accumulated in our genome far exceeds that collected by our early evolutionary cousins (with the amoeba being a notable exception).
We have a greater percentage of repeats in our genomes-50 percent-than the mustard weed (11 percent), the worm (7 percent) or the fly (3 percent). Also, our repeat elements are much older-actually, really ancient-when compared to those found in the other organisms.
"This suggests that we haven't been fastidious with our house-cleaning. We have been slow to clean out our drawers, closets, or attics," says Arian Smit, bioinformatics scientist at the Institute for Systems Biology. When we calculate the half-life of some of these elements, we find that while the fly did its last house-cleaning 12 million years ago, mammals last cleaned house 800 million years ago. These features of the human genome probably apply to all mammals.
But one feature-shockingly-does not. It seems that there has been a dramatic decrease in repeats in the human genome over the past 50 million years. It's as if we decided 50 million years ago to stop collecting junk. In contrast, there seems to be no such decline in repeats in rodents.
What's more, it seems as though some of our really ancient junk is extinct and some other junk is teetering on the brink of extinction. But these extinct or near-extinct repeats (DNA transposons and LTR retroposons, respectively) are alive and kicking in the mouse genome. The contrast between human and mouse genomes suggests that the extinction or near extinction of these repeat elements may be accounted for by some fundamental differences between hominids and rodents.
"Population structure and dynamics would seem to be likely suspects," says Eric Lander, Director, Whitehead Genome Center. "Rodents tend to have large populations, whereas hominid populations are typically small and may undergo frequent bottlenecks. Also responsible may be such factors as inbreeding and genetic drift," Lander continued. Scientists hope that further studies will shed more light on these differences.
And now comes the curious story about one repeat element, a phenomenon researchers call "the mystery of the Alu." This mystery revolves around how the repeat element SINE Alu, known to be a "second-class citizen" of the human genome, found its way into the fancy neighborhoods of the human genome.
Repeat elements, or junk DNA, in the human genome come in four varieties: the "extinct" type (DNA transposons), the near-extinct type (the LTR retroposons), and two other types that still are active in the human genome (LINE elements and SINE elements).
When researchers looked at the distribution of these elements by GC content (or gene-rich neighborhoods), they found a pattern that, at first glance, defied logic and baffled them.
Most repeat elements-second-class citizens in a kingdom where genes rule-wind up in less desirable neighborhoods in the genome-regions that are AT rich and GC poor. But SINE elements seem to have landed in the really fancy neighborhoods of the genome-the regions that are gene rich.
Scientists reckoned there were two possible explanations. One is that the wily SINEs somehow trick their way into the GC-rich neighborhoods. The other hypothesis is that most SINEs land in GC-poor neighborhoods to begin with, and evolution favors the SINEs that happened to land in GC-rich real estate.
The scientists used the draft genome sequence to investigate this mystery by comparing the proclivities of young, adolescent, middle-aged and old Alus. Strikingly, young Alus live in the AT-rich regions and progressively older Alus have a tendency to move up to the GC-rich neighborhoods.
This pattern supports the latter hypothesis: evolution favors the SINEs that sit near genes. Over the years, SINE elements have acquired a bad reputation among scientists for what looked like parasitic behavior. But this reputation may be unjustified; it appears that SINE elements have remained in the genome over time because they are helpful symbionts that earn their keep in the genome.
The human genome sequence, with its large database of repeat elements, provides a powerful resource for addressing the unusual history of the Y chromosome.
By dating the 3 million repeat elements and examining the pattern of interspersed repeats on the Y chromosome, scientists estimated the relative mutation rates in the X and the Y chromosomes and in the male and female germ lines. They found that there are twice as many mutations in males as in females.
To do this, scientists identified the repeat elements from recent subfamilies (effectively, birth cohorts dating from the past 50 million years) and measured the substitution rates for subfamily members on the X and the Y chromosomes. They found that the ratio of mutations in males versus females is 2:1.
Scientists point to several possible reasons for the higher mutation rate in the male germ line, including the fact that there are a greater number of cell divisions involved in the formation of sperm than eggs and the existence of different repair mechanisms in sperm and eggs.
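How do substitution rates on the X and the Y chromosomes translate into a male-to-female ratio? The release does not give the formula, but a standard line of reasoning (often attributed to Miyata and colleagues, and assumed here) runs as follows: the Y chromosome is always passed through males, while an X chromosome spends about one third of its time in males and two thirds in females. The Python sketch below turns that reasoning into arithmetic; the input rates are illustrative, not measured values.

```python
def expected_y_over_x(alpha):
    """Y-to-X substitution rate ratio expected for a male-to-female ratio alpha.

    rate_Y = mu_m                    (Y is always in males)
    rate_X = (mu_m + 2 * mu_f) / 3   (X is in males one third of the time)
    so rate_Y / rate_X = 3 * alpha / (2 + alpha), with alpha = mu_m / mu_f.
    """
    return 3 * alpha / (2 + alpha)


def male_to_female_ratio(y_rate, x_rate):
    """Solve the relation above for alpha, given measured Y and X rates."""
    r = y_rate / x_rate
    if not 0 < r < 3:
        raise ValueError("Y/X ratio must lie between 0 and 3 under this model")
    return 2 * r / (3 - r)


# Illustrative numbers only: a Y rate 1.5 times the X rate implies a 2:1 ratio.
print(male_to_female_ratio(1.5, 1.0))
```

Under this simple model, a Y chromosome that accumulates substitutions about one and a half times as fast as the X chromosome corresponds to the 2:1 male-to-female ratio reported above.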
Scientists have identified more than 200 genes in the human genome whose closest relatives are in bacteria. Analogous genes are not found in non-vertebrate organisms such as the worm, fly, and yeast. This suggests that these genes were acquired at a more recent evolutionary moment, perhaps after the birth of vertebrates. Most probably, infections led to a transfer of DNA from bacteria to the chromosomes of a human ancestor. Scientists didn't find any single bacterial source for the transferred genes, indicating that several independent gene transfers from different bacteria occurred.
This process, called horizontal transfer, is unlikely to happen today because human eggs and sperm, which pass DNA on to the next generation, are isolated from the outside world, and humans have highly developed immune systems to guard against foreign invaders.
But here's the kicker! Many of the transferred genes are far from trivial and appear to be involved in important physiological functions, which may have provided a survival advantage for vertebrate ancestors. As a result, these genes have been maintained in the human genome over evolution. For instance, monoamine oxidase (MAO), an enzyme that is important in processing neurotransmitters, is involved in psychiatric disorders. Other important acquisitions include RAG1 and RAG2, enzymes critical to the immune system's antibody response.
In a companion volume to the Book of Life, scientists have created a catalogue of 1.4 million single-letter differences, or single nucleotide polymorphisms (SNPs), with their exact locations in the human genome. This SNP map, the world's largest publicly available catalogue of SNPs, promises to revolutionize both mapping diseases and tracing human history.
Without the SNP map, scientists studying a disease gene had to go through a laborious, time-consuming, and costly process of comparing the genomes of many individuals and finding a set of SNPs within the gene of interest. Now, a scientist studying a disease gene can first turn to the SNP map to find the gene variations. Since the average gene is about 30,000 letters long, many SNPs can be identified in a typical gene in one short computer session.
"We are using the SNP map in everyday science already. Last month, we were able to ask, 'does a gene that affects how much testosterone is produced by the body affect prostate cancer risk?' We pulled 15 SNPs off the web, and typed them in our patients. The 15 SNPs came in only four combinations. So that gene can now be reduced to four flavors. The whole process took about two weeks, whereas before it would have been a massive, costly project," explains David Altshuler, a Research Scientist at the Whitehead Genome Center.
The SNP map goes beyond being a reference for disease genes and answers questions about the history of human populations. It supports an existing population-genetics model that postulates that a very small number of people expanded rapidly to populate the whole earth in the last 10,000 to 100,000 years.
Supporting the prediction, scientists report that SNPs aren't evenly distributed and concentrations vary widely throughout the human genome. Some areas of the genome are SNP deserts without a single SNP, while others have a great number. Areas with few SNPs may result from evolution selecting one form of a gene to be maintained throughout time. For example, little variation is seen in the X chromosome. But the HLA region, which codes for proteins on the surface of blood cells that elicit the strongest immune response, has a lot of diversity.
The current SNP map results from the combined efforts of the International Human Genome Sequencing Consortium and The SNP Consortium. The SNP Consortium is an unusual public/private partnership between academic institutions, pharmaceutical companies, and charities, to create a map that would be available to the public without charge. The consortium has far outperformed its original goal of discovering 300,000 SNPs by April of 2001. The catalogue of 1.4 million SNPs is not a complete set of all the SNPs in the genome, but it is more than enough to enable genetic studies that were not possible before.
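The prostate cancer example above, in which 15 SNPs "came in only four combinations," can be pictured with a small Python sketch. It simply counts the distinct SNP combinations (haplotypes) seen across a set of chromosomes. The data are made up, and only five SNP positions are used to keep the example short; this is an illustration of the idea, not the procedure the researchers followed.

```python
from collections import Counter


def count_haplotypes(haplotypes):
    """Count how often each distinct combination of SNP letters occurs."""
    return Counter(haplotypes)


# Made-up data: the letters at 5 SNP positions on 8 chromosomes.
samples = [
    "ACGTA", "ACGTA", "GCGTA", "GTATC",
    "ACGTA", "GTATC", "GCATC", "GCGTA",
]
counts = count_haplotypes(samples)

possible = 2 ** 5  # each SNP has two forms, so 5 SNPs allow 32 combinations
print(f"{len(counts)} of {possible} possible combinations observed")
for haplotype, n in counts.most_common():
    print(haplotype, n)
```

When far fewer combinations turn up than are possible on paper, a gene can be described by a handful of "flavors" rather than by every SNP individually.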
In April of 1999, the Human Genome Project put together a group called the hard-core analysis group. Chaired by Eric Lander, Director of the Whitehead Institute Center for Genome Research, this group was composed of 40 analysts, including experts in a diverse array of genomic topics, such as proteins, genes, gene assembly, evolution, and repeat elements.
The group pored over the sequence data for six solid months and, over weekly conference calls and meetings at Whitehead and in Philadelphia, began to conduct the initial analysis of the human genome sequence. Meanwhile, a group at the University of California, Santa Cruz assembled the genome sequence into a "goldenpath"-a tongue-in-cheek reference to the fact that this was still an imperfect sequence.
The genome analysis group represented the largest group of sequence analysts pulled together for any task. E-mails flew back and forth-5,000 in all-across three continents and seven countries. By Thanksgiving of 2000, the group had its analysis together.
The group began writing the Nature paper in October and submitted it in December. Tk tk compares the task to writing a travel guide to the U.S. for which the editor needed to pull together a diverse set of experts. They needed some who, in essence, could write authoritatively about white water rafting on the Colorado River and others who knew the ins and outs of clubbing in Greenwich Village, in New York.
"We needed someone to describe the history of Route 66 and others to talk about cruising Sixth Avenue. We needed someone to paint the big picture descriptions of topographic features like the Rocky Mountains, and also someone to give us food reviews of hole-in-the-wall restaurants in San Francisco," says Lander, Director, Whitehead Genome Center. "It was a challenge, but it was also a heck of a lot of fun."
For a complete list of the Genome Analysis Group members, refer to the Nature paper.
Last modified: Wednesday, October 29, 2003
Document Use and Credits
Publications and webpages on this site were created by the U.S. Department of Energy Genome Program's Biological and Environmental Research Information System (BERIS). Permission to use these documents is not needed, but please credit the U.S. Department of Energy Genome Programs and provide the website http://genomics.energy.gov. All other materials were provided by third parties and not created by the U.S. Department of Energy. You must contact the person listed in the citation before using those documents.
Base URL: www.ornl.gov/hgmis
Site sponsored by the U.S. Department of Energy Office of Science, Office of Biological and Environmental Research, Human Genome Program