A calculated risk enables ORNL to break the petaflop barrier.
The announcement shook the scientific world. With no forewarning, the Japanese in 2002 unveiled a new generation of supercomputer that would transform the capabilities and horizons of scientific research. Called the Earth Simulator, the Japanese machine could perform at an unheard-of speed of 36 "teraflops," or 36 trillion floating-point operations per second. Dedicated primarily to climate research, the Earth Simulator was not merely the world's most powerful computer: its extraordinary performance exceeded the combined capabilities of the 20 largest American machines.
In the United States, the implications of the Earth Simulator were clear. If America yielded leadership in high-performance computing, the likely result would be a similar loss of status in the international scientific community, followed by an inevitable decline of American economic competitiveness.
In a response reminiscent of the American reaction to the Russian launch of the Sputnik satellite in 1957, the Department of Energy announced plans to build what the agency called a "Leadership-Class Computing Facility." DOE's Office of Science envisioned a facility that possessed, in addition to a collection of intellectual talent, power and cooling infrastructure that surpassed by an order of magnitude any of the existing computer centers housed in American universities or national laboratories. Indeed, the Leadership-Class Facility would provide a machine not just larger than the Earth Simulator, but eventually capable of attaining the unimaginable speed of a "petaflop," or 1,000 trillion calculations per second.
The goal of this computational "arms race" was not just the bragging rights that come with the world's largest computer. Of far greater significance was the potential of high-performance computing to enable transformational discoveries in some of science's most important and most daunting challenges. With computational tools that could barely be conceived only a decade earlier, scientists predicted that the modeling and simulation made possible by a petaflop machine would recast the discussion of long-term scientific challenges such as fusion energy and climate change. The ability to reduce the time needed to process and analyze massive volumes of data—in some cases from months to days—held out the potential for comparable reductions in the time required to move new technologies from the laboratory to the marketplace.
The high-performance computing sector is one where technological breakthroughs often are measured in months rather than years. Despite the Department of Energy's desire to move rapidly, building a Leadership-Class Computing Facility faced an immediate dilemma. While a number of institutions possessed the intellectual talent necessary to design the software and hardware architecture for a petaflop machine, virtually none of these computing centers had readily available either the building or the infrastructure needed to support a supercomputer that demanded power and cooling at levels five hundred to one thousand times greater than most existing machines.
One exception was at Oak Ridge National Laboratory. Using a creative new scheme that utilized private funds and land deeded from the Department of Energy, UT-Battelle, the laboratory's managing contractor, had just completed construction of a $73 million, 350,000 square foot facility that included a full acre of America's most modern computational space. An adequate and reliable source of power, a critical element that was increasingly unavailable in some states, was provided by a new substation built by the Tennessee Valley Authority on the ORNL campus.
The design for ORNL's computational center also included networking and data handling resources to support a future petaflop machine. The new facility boasted 10-gigabit-per-second connections to the ESnet and Internet2 networks, and a scalable high-performance storage system for simulation data. The disk subsystem could transfer data at speeds greater than 200 gigabytes per second.
An enormous risk
From one perspective, the decision by ORNL officials to build the nation's largest computer center was an enormous risk. The facility, named the National Center for Computational Sciences, was constructed with essentially no large-scale program to operate, leaving open the possibility that Oak Ridge could have become home to the world's most modern roller skating rink. In this instance, the new Oak Ridge facility represented the confluence of foresight, boldness, and luck. More than a year before the announcement of the Japanese Earth Simulator, ORNL officials privately concluded that it was only a matter of time before the U.S. government made major investments in building a high-performance supercomputer. While others may have shared this prediction, the Oak Ridge team was certainly among the first to grasp the implications of such a machine for supporting infrastructure that would dwarf previous computer centers in size and cost.
The prediction proved accurate. When in 2003 the Department of Energy invited competitive bids for a $500 million, four-year project to build a petaflop machine, the Oak Ridge proposal contained a critical feature: ORNL would have a new facility ready on day one, with state-of-the-art connectivity, reliable power, and space for future expansion. Unknown to many was the fact that, in anticipation of a new program, ORNL had quietly been hiring some 200 computational scientists from around the world with expertise in quantum physics, astrophysics, materials science, climate, chemistry, and biology.
The gamble paid off. In May 2004, after a spirited competition among America's leading computational programs, Oak Ridge was selected as the site for the Leadership-Class Computing Facility. DOE's charge was straightforward. Models and simulations on the supercomputer would offer scientists a "third pillar of science," a transformational addition to the historic pillars of theory and experiment. By December 2008, equipped with a new generation of software and operated with high standards of efficiency, a petaflop machine would enable researchers to explore biology, chemistry, and physics in ways previously unimaginable. DOE, quite simply, expected ORNL to provide scientists with virtual laboratories unmatched by any other computing facility in the world.
In a class by itself
Beginning with a 26-teraflop system in 2005, Oak Ridge embarked upon a three-year series of aggressive upgrades designed to build the world's most powerful computing system. The existing Cray XT was upgraded to 119 teraflops in 2006 and to 263 teraflops in 2007.
Four years later, the dream of building a petaflop machine and restoring U.S. leadership in high-performance computing is a reality. On November 14, 2008, Oak Ridge officials announced the successful testing of the new Cray XT, called Jaguar, with a peak performance of 1.64 petaflops. Smashing through the petaflop barrier, Jaguar incorporates a 1.382-petaflop XT5 system and a 263-teraflop XT4 system. With approximately 182,000 AMD Opteron processing cores, the new 1.64-petaflop system is more than 60 times larger than its original predecessor.
Aided by modern facilities, Jaguar is also the culmination of a close partnership between ORNL and Cray, dedicated to pushing computing capability relentlessly upward through a series of upgrades. The most recent upgrade occurred in 2008, when a 263-teraflop Cray XT4 was linked to a 1.4-petaflop Cray XT5. The combined system uses an InfiniBand network, a 10-petabyte file system, and approximately 182,000 processing cores to form Oak Ridge's current 1.64-petaflop system. Occupying 284 cabinets, Jaguar uses the latest generation of quad-core Opteron processors from AMD and features 362 terabytes of memory. The machine has 578 terabytes per second of memory bandwidth and unprecedented input/output bandwidth of 284 gigabytes per second to tackle the biggest bottleneck in supercomputing systems—moving data into and out of processors.
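The combined peak cited above follows directly from the two partitions' figures. As a back-of-envelope sketch (the even per-core split is an illustrative assumption; real per-core rates differ between the XT4 and XT5 partitions):

```python
# Back-of-envelope check of Jaguar's combined peak performance.
# Partition figures are from the article; the per-core math is illustrative.

xt5_pflops = 1.382   # Cray XT5 partition, petaflops
xt4_pflops = 0.263   # Cray XT4 partition (263 teraflops), petaflops

combined = xt5_pflops + xt4_pflops
print(f"Combined peak: {combined:.3f} petaflops")  # 1.645, reported as 1.64

# Rough average across ~182,000 Opteron cores (assumption: peak spread
# evenly over all cores).
cores = 182_000
per_core_gflops = combined * 1e6 / cores  # petaflops -> gigaflops
print(f"~{per_core_gflops:.1f} gigaflops per core")
```

The roughly 9 gigaflops per core this yields is consistent with a quad-core Opteron executing several floating-point operations per clock cycle.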
Keeping the machines from melting through the floor is no small task. The XT5 portion of Jaguar has a power density of more than 2,000 watts per square foot, creating commensurate heat that must be constantly dissipated. To cool the system, Cray worked with its partner, Liebert, to develop ECOphlex, a technology that pipes a liquid refrigerant through an evaporator on the top and bottom of each cabinet. Fans flush heat into the evaporator, where it vaporizes the refrigerant; the vaporization process absorbs the heat. The coolant is then condensed back to the liquid phase in a chilled-water heat exchange system, transferring the heat to chilled water. This extremely efficient cooling process is a critical element in making possible the design of increasingly powerful supercomputers. The new cooling technology also benefits the computer center's efficiency. While cooling often adds some 80 percent to the power required at computing centers, at Oak Ridge the new cooling process adds only 30 percent.
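The efficiency gain from those overhead percentages can be made concrete with a short sketch. The 80 and 30 percent figures are from the article; the 7-megawatt compute load is a hypothetical number chosen purely for illustration:

```python
# Illustrative comparison of cooling overhead. Overhead percentages are
# from the article; the 7 MW compute load is a hypothetical example value.

compute_mw = 7.0          # hypothetical IT load, megawatts
typical_overhead = 0.80   # cooling adds ~80% at many computing centers
ecophlex_overhead = 0.30  # ~30% with ECOphlex cooling at Oak Ridge

typical_total = compute_mw * (1 + typical_overhead)
ecophlex_total = compute_mw * (1 + ecophlex_overhead)
savings = typical_total - ecophlex_total

print(f"Typical center: {typical_total:.1f} MW total")
print(f"With ECOphlex:  {ecophlex_total:.1f} MW total")
print(f"Savings:        {savings:.1f} MW")  # 3.5 MW for this load
```

At multi-megawatt scale, halving the cooling overhead saves megawatts of continuous power, which is why the cooling design matters as much as the processors.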
It's all about the science
As the world's first petaflop system available for open research, Jaguar is already in high demand by scientists who are honing their codes to take advantage of the machine's blistering speed. Jaguar represents a unique balance among speed, power, and other elements essential to scientific discovery. Several design elements make Jaguar the machine of choice for computational sciences—more memory than any other machine, more powerful processors, more I/O bandwidth, and the high-speed SeaStar network developed specifically for very-high-performance computing.
Researchers thus far have been enormously successful in utilizing Jaguar's architecture. From a programming standpoint, the upgraded Jaguar is essentially the same as the XT4 that scientists in Oak Ridge have been using for three years. A consistent programming model enables users to continue to evolve existing codes rather than develop new ones. Applications that ran on previous versions of Jaguar can be recompiled, tuned for efficiency, and then run on the new machine. As the CPU performance continues to grow, the system's basic programming model remains intact. For users, such continuity is critically important for applications that typically last for 20 to 30 years.
Speed and efficiency aside, Jaguar's ultimate value will be measured by the science the machine can deliver. The Department of Energy has dedicated Jaguar, unlike most supercomputers, to addressing a relatively small number of "grand scientific challenges" too large and complex for most existing systems. A single project on Jaguar might consume millions of processor hours and generate an avalanche of data. Proposed projects are peer-reviewed and funded by DOE's Innovative and Novel Computational Impact on Theory and Experiment program.
Early results are encouraging. A report released in October 2008 by the DOE Office of Advanced Scientific Computing Research showcased ten scientific computing milestones, including five projects conducted at Oak Ridge National Laboratory. Among the highlighted ORNL-based research was one of the largest simulations ever produced of plasma confinement in a fusion reactor, which could potentially pave the way for carbon-free sustainable energy production. Jaguar also performed the largest simulation to date of the dark matter halo of the Milky Way galaxy, tracking a billion particles in the cloud of dark matter that holds our galaxy together. The report noted ORNL scientists who completed combustion simulations that dissected how flames stabilize, extinguish, and reunite, showing the path to cleaner, more efficient diesel-engine designs.
ORNL's Associate Laboratory Director for Computational Sciences, Thomas Zacharia, believes the research community is just beginning to understand Jaguar's potential. "We now for the first time have the tools to address some of science's most intimidating questions: How does the earth's atmosphere affect ocean circulation? How do enzymes aid biofuels production? How do proteins misfold in certain diseases?" Zacharia says. The answers, he contends, will open up dramatic opportunities, not just for science, but for American economic growth. Already, he notes, leading companies such as Boeing and General Motors have used Jaguar's simulations to improve materials for their products.
To Zacharia and his colleagues at ORNL, the end is by no means in sight. Indeed, his plans for 2009 call for Oak Ridge to offer two supercomputers with a combined performance of more than 2.5 petaflops. The goal is not considered a fantasy at Oak Ridge, where since 1991 computational power has increased a millionfold. From their perspective, the risk is worth the opportunity.
Web site provided by Oak Ridge National Laboratory's Communications and External Relations