Like buildings, supercomputers have different architectures. Picture four central processing units (CPUs) and four data-storage units (computer memory). Give each processor its own memory unit and then connect the processors. If one processor wants to read data in memory attached to another processor, it must ask the other processor for the data. This arrangement is called distributed memory, and the collection of processors is called a cluster. If, instead, each processor is connected to each of the four memory units and can access data directly, this arrangement is called shared memory. Now, put these four processors and their memory units in one box in a shared memory arrangement and call it a node.
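The contrast between the two architectures can be sketched in a few lines of code. This is an illustrative analogy only, using Python threads and queues to stand in for processors and memory; real supercomputers use hardware interconnects and message-passing libraries such as MPI, and all names here are invented.

```python
# Analogy only: threads stand in for processors. In the "distributed"
# style, data is private and must be sent as a message; in the "shared"
# style, every worker touches the same memory directly.
import threading
import queue

def distributed_worker(outbox):
    # "Distributed memory": this worker owns its data privately and must
    # send a message when another worker wants it.
    local_data = 42          # lives only in this worker
    outbox.put(local_data)   # explicit message to whoever asked

def shared_worker(counter, lock):
    # "Shared memory": every worker reads and writes the same variable
    # directly; a lock keeps simultaneous updates from colliding.
    with lock:
        counter[0] += 1

# Distributed-memory style: the data arrives only as a message.
outbox = queue.Queue()
t = threading.Thread(target=distributed_worker, args=(outbox,))
t.start()
t.join()
print(outbox.get())      # -> 42

# Shared-memory style: four workers update one shared counter in place.
counter = [0]
lock = threading.Lock()
threads = [threading.Thread(target=shared_worker, args=(counter, lock))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter[0])        # -> 4
```

The trade-off mirrors the hardware: message passing scales to many boxes but costs a send and a receive for every remote access, while shared memory is direct but requires coordination when processors touch the same data.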
Eagle, the IBM RS/6000 SP supercomputer at ORNL, is a cluster of 176 four-processor nodes, combining both distributed and shared memory in a single system. In 1998, when DOE’s Center for Computational Sciences (CCS) at ORNL was planning to purchase a next-generation supercomputer, it signed a contract with IBM that called for 16-processor nodes. At the time, a performance evaluation team led by Pat Worley of ORNL’s Computer Science and Mathematics Division (CSMD) compared the 4-processor and 16-processor IBM nodes to determine which architecture would work best for the codes that were to be run on the machine.
“We found that smaller nodes work better for our science applications,” says Buddy Bland, head of CSMD’s Systems and Operations Group. “So we changed the contract with IBM from 16-processor nodes to 4-processor nodes. As a result, we obtained Eagle eight months earlier at a cost 20% less than the total in the original contract. Now, we must decide which architecture will work best for the 10-teraflop supercomputer we want built for 2003. Already climate modelers are writing codes that will port to this future supercomputer.”
Worley and his team have focused their recent performance evaluation efforts on the Compaq AlphaServer SC machine at ORNL known as Falcon and on a new IBM machine at ORNL known as Cheetah. Falcon uses 4-processor nodes, like Eagle. Cheetah uses the new IBM p690 nodes, which each have 32 processors. Worley’s team has found that IBM Power4 processors used in a p690 node are two-and-a-half times faster than Eagle’s processors and twice as fast as Falcon’s processors for a variety of application codes.
Unlike the earlier 16-processor IBM nodes, the 32-processor p690 node offers up to four times Eagle’s bandwidth for communication within a node. Hence, a larger volume of messages and other data can be passed more quickly among Cheetah’s processors than among Eagle’s. As Bland puts it, “If you have a really fast water pump, you want a fire hose, not a straw, to increase the speed and volume of flow. Cheetah has the bandwidth equivalent of a fire hose.”
As part of their performance evaluation, Worley and his team do “benchmarking.” They test existing parallel-computing codes to determine whether each code runs faster on, for example, the IBM or Compaq machines. Then they “diagnose” the performance of the code.
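The benchmarking step can be caricatured as a simple timing harness: run the same kernel several times, keep the best wall-clock time, and compare that number across machines or code variants. This is a minimal sketch under that assumption; the kernel and function names are invented, not one of the team’s actual science codes.

```python
# Minimal benchmarking sketch: time a kernel repeatedly and keep the best
# run, which filters out interference from other work on the machine.
import time

def kernel(n=200_000):
    # Stand-in for a science code: a simple floating-point reduction.
    total = 0.0
    for i in range(1, n):
        total += 1.0 / (i * i)
    return total

def benchmark(fn, repeats=5):
    """Return the best elapsed wall-clock time over several runs, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

elapsed = benchmark(kernel)
print(f"best of 5 runs: {elapsed:.4f} s")
```

Taking the best of several runs, rather than the average, is a common benchmarking convention: the fastest run is the one least perturbed by the operating system and other jobs.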
“We try to determine why a code runs faster on one machine than another,” Worley says. “We investigate whether a code may run more slowly on one machine because of the coding style—the way a computer program is written. If so, we can advise code developers on how to alter their style so the code will run faster on a particular machine.”
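As a toy illustration of how coding style alone can change a code’s speed, consider two hypothetical Python functions that compute the same result. This is not one of the Fortran or C idioms the team actually tunes, just an analogous example: the first rebuilds a list from scratch on every step (quadratic time), while the second extends it in place (linear time).

```python
# Two coding styles, identical results, very different running times.

def squares_slow(n):
    result = []
    for i in range(n):
        result = result + [i * i]   # copies the whole list each iteration
    return result

def squares_fast(n):
    result = []
    for i in range(n):
        result.append(i * i)        # extends the existing list in place
    return result

# Same answer either way; only the cost differs as n grows.
assert squares_slow(1000) == squares_fast(1000)
```

Advice of the kind Worley describes amounts to spotting the `squares_slow` patterns in a real code and showing the developer the `squares_fast` equivalent for that machine.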
ORNL team members also do performance engineering. They can tune a code to improve its performance on a specific machine. In addition, Worley’s team tells vendors which problems they need to solve in designing their next-generation machines so that certain codes will run faster.
“Our customers are code writers and users, vendors, and system administrators,” Worley says. “We provide advice on how to configure and run their systems and on what machines they should buy next. We guide the development of both codes and supercomputers.
“In our most recent efforts we have focused on evaluating the performance of Falcon and Cheetah in running climate, car crash, computational chemistry, human genome analysis, and materials codes. We measure how fast each code runs and predict how much time and how many processors are needed to get the computing job done.”
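One classical way to make the kind of prediction Worley mentions is Amdahl’s law: if a fraction p of a code can run in parallel, its speedup on n processors is 1 / ((1 − p) + p / n). The sketch below applies that simple model; the 95% parallel fraction and 100-hour runtime are made-up illustration values, not measurements from Cheetah or Falcon.

```python
# Amdahl's-law sketch: predict runtime on n processors from a serial
# runtime and the fraction of the code that parallelizes.

def amdahl_speedup(parallel_fraction, processors):
    """Speedup predicted by Amdahl's law: 1 / ((1 - p) + p / n)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / processors)

def predicted_runtime(serial_runtime, parallel_fraction, processors):
    """Predicted wall-clock time on a given number of processors."""
    return serial_runtime / amdahl_speedup(parallel_fraction, processors)

# Hypothetical example: a code that is 95% parallel and takes 100 hours
# on one processor. Note the diminishing returns as processors are added.
for n in (1, 16, 64, 256):
    print(n, predicted_runtime(100.0, 0.95, n))
```

The model also shows why the serial 5% eventually dominates: no matter how many processors are added, the predicted runtime never drops below the 5 hours of serial work.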
The ORNL team was the first to show that a supercomputer made in the United States (Falcon) could exceed a performance goal (5 seconds per model day) for modeling the global climate. Later the team also showed that Eagle can exceed that goal. Without the input of the CCS performance evaluation team, ORNL’s supercomputers would not have nearly as good an output.