The Center for Computational Sciences: High-Performance Computing Comes to ORNL
by Kenneth Kliewer
Ken Kliewer points to Sandia National Laboratories (SNL), whose Intel Paragon is connected with ORNL's Intel Paragons by a high-speed network. The high-performance computers are communicating with each other and running calculations together in parallel to solve complex problems that cannot be solved by one Paragon alone. Photograph by Tom Cerniglio.
ORNL's Center for Computational Sciences is one of the world's premier computational science centers. Its facilities include parallel Intel Paragon computers having 200 gigaflops of computing power and a data storage system exceeding 100 terabytes in capacity. The Center's focus is the solution of Grand Challenge-level problems, requiring extensive expertise in the development of a broad range of parallel application codes and algorithms. The Center has taken an innovative step into the computing-of-the-future arena by linking its huge Paragons with a Paragon at Sandia National Laboratories through state-of-the-art networks. This extraordinarily powerful distributed computer makes possible solutions of problems too complex to be solved at either site alone.
Computers became a visible part of the national technical scene in the 1950s. Thereafter, improvements came rapidly as both hardware and software (systems, strategies, and concepts) evolved and matured. On through the 1970s, improvements and innovations were dramatic, with corresponding advances in performance.
Throughout this period, the basic computer remained the same. The heart of the machine was the central processing unit (CPU) which, as programmed by the user, performed the desired calculations or processes step by step in time, that is, sequentially. But then, with a mighty nudge from Mother Nature, a new perspective emerged.
The subsequent discussion is simplified with the introduction of a measure of speed for computers: FLOPS, which stands for FLoating-point OPerations per Second. For our purposes here, it is appropriate to think of FLOPS as arithmetic operations per second.
While CPUs were moving through the kiloflops class to megaflops (from thousands to millions of mathematical operations per second), it was clear that Mother Nature had imposed a limit. When computing, CPUs operate through sequences of electrical signals, but these signals cannot move arbitrarily quickly. Their motion is limited by the speed with which light (or electrical signals) can travel, which is fixed by natural law. However, the expectations of computer users and the computing requirements associated with demanding problems knew no such limits.
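The scale of this limit can be illustrated with a rough calculation. The sketch below is an idealization, not a model of any real CPU: it assumes each operation requires one signal to cross the machine and that the signal moves at the vacuum speed of light (real electrical signals are slower). The function name is hypothetical.

```python
# Speed of light in vacuum, in metres per second (exact by definition).
SPEED_OF_LIGHT_M_PER_S = 299_792_458


def max_signal_distance_m(operations_per_second: float) -> float:
    """Upper bound on how far a signal can travel during one operation.

    Assumption: one signal traversal per operation, at the vacuum speed
    of light. Real hardware does worse than this bound.
    """
    return SPEED_OF_LIGHT_M_PER_S / operations_per_second


for label, rate in [
    ("kiloflops", 1e3),
    ("megaflops", 1e6),
    ("gigaflops", 1e9),
    ("teraflops", 1e12),
]:
    print(f"{label:>9}: {max_signal_distance_m(rate):.3e} m per operation")
```

Under these assumptions, a sequential machine running at one gigaflops leaves a signal only about 30 centimeters of travel per operation, which suggests why a single CPU could not simply be made ever faster.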
The Parallel Solution
Thus emerged a new strategy: Put multiple CPUs into a single computer and have all of them work together simultaneously to solve problems. This procedure is now known as parallel computing. Machines providing the highest level of performance today are all parallel computers, not sequential machines. Within the world of parallel computing, there are two fundamentally different ways of handling machine memory. One way is to ensure that all CPUs are in direct contact with the entire machine memory, a configuration called shared memory. An alternative is for each CPU to have its own share of the total system memory. In this arrangement, any CPU that requires information stored in the memory of another must send a message requesting this information. The requested information is then supplied through a response message. This configuration, referred to as distributed memory, involves sophisticated strategies for message passing. One message-passing arrangement is shown in Fig. 1.
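The request and response exchange described above can be sketched in a few lines of Python. This is a toy illustration only: threads and in-process queues stand in for separate nodes and the interconnect, and the `Node` class, `remote_read` function, and message layout are hypothetical, not the Paragon's actual message-passing interface.

```python
import queue
import threading


class Node:
    """A toy node with private memory and an inbox for incoming messages."""

    def __init__(self, name, memory):
        self.name = name
        self.memory = dict(memory)  # private: other nodes never read this directly
        self.inbox = queue.Queue()

    def serve(self):
        """Answer data requests until a 'stop' message arrives."""
        while True:
            kind, key, reply_to = self.inbox.get()
            if kind == "stop":
                break
            # Supply the requested value through a response message.
            reply_to.put(("response", key, self.memory.get(key)))


def remote_read(requester, owner, key):
    """Send a request message to `owner` and wait for the response message."""
    owner.inbox.put(("request", key, requester.inbox))
    _kind, _key, value = requester.inbox.get()
    return value


node_a = Node("A", {"temperature": 293.5})
node_b = Node("B", {"pressure": 101.3})

server = threading.Thread(target=node_a.serve)
server.start()
value = remote_read(node_b, node_a, "temperature")  # B asks A for its data
node_a.inbox.put(("stop", None, None))
server.join()
print(value)
```

The point of the sketch is that node B never touches node A's memory; it obtains the value only by sending a message and waiting for the reply, which is why the cost of messages matters so much on distributed memory machines.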
Distributed memory machines currently provide the highest performance, but the ease of programming shared memory machines has compelled a concerted, largely U.S. effort to develop ever higher-performance machines of this type. ORNL has made seminal contributions to message-passing strategies, as discussed in the article "Algorithms, Tools, and Software Aid Use of High-Performance Computers."
Fig. 1. The interconnect architecture of the Intel Paragon is an example of a message-passing system. The squares represent nodes connected in a two-dimensional mesh. Each node contains memory and two or more CPUs, one for message passing and the others for computing. Message-passing paths are indicated by arrows. Typical input/output (I/O) systems and devices are shown on the right.
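A two-dimensional mesh like the one in Fig. 1 can be described with simple coordinate arithmetic. The sketch below assumes nodes are addressed by (row, column) and that a message moves one link at a time along rows and columns, so the minimum number of links crossed is the sum of the row and column offsets. The addressing scheme and function names are assumptions made for illustration, not details taken from the Paragon's design.

```python
def mesh_neighbors(row, col, rows, cols):
    """Return the nodes directly linked to (row, col) in a rows x cols mesh."""
    candidates = [(row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]
    # Nodes on an edge or corner have fewer links: a mesh does not wrap around.
    return [(r, c) for r, c in candidates if 0 <= r < rows and 0 <= c < cols]


def hop_count(src, dst):
    """Minimum number of links a message crosses between two mesh nodes."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


print(mesh_neighbors(0, 0, 4, 4))   # a corner node has two neighbors
print(mesh_neighbors(1, 1, 4, 4))   # an interior node has four
print(hop_count((0, 0), (3, 2)))    # three rows down plus two columns across
```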
U.S. Computing Initiative
How did CCS become one of the world's premier computational centers? The imaginative federal High-Performance Computing and Communications (HPCC) Initiative opened the door for CCS, and a successful proposal from ORNL to the Department of Energy's Office of Scientific Computing [OSC, now the Mathematical, Information, and Computational Sciences (MICS) Division] was the first step. The proposal established ORNL's CCS as a DOE High-Performance Computing Research Center (HPCRC) and the Intel Supercomputer Systems Division (SSD), later called the Scalable Systems Division, with its message-passing, parallel Paragon computers, as the major computer supplier.
Ken Kliewer and Buddy Bland check operations of the Intel Paragon XP/S 150, a parallel supercomputer at ORNL. Photograph by Curtis Boles.
However, CCS's technical path to its current Intel Paragon XP/S 150 centerpiece had a different origin: The first machine was the 32-processor Kendall Square Research KSR-1 computer, delivered in September 1991. This shared memory machine was truly innovative at that time. But the defining HPCC event was a cooperative research and development agreement (CRADA) between Intel SSD and ORNL. In September 1992, this CRADA brought to CCS the first of the Intel Paragons, a 66-node XP/S 5 with the general purpose (GP) architecture (i.e., two CPUs per node, one for computing and the other for message passing, as depicted in Fig. 1). The architecture of this and subsequent Paragons was the two-dimensional mesh shown in Fig. 1. (The number following XP/S in the names of Intel machines is a measure of the peak computing speed of the overall machine in billions of FLOPS, or gigaflops.)
Building from this CRADA, ORNL and Intel SSD signed a contract in June 1992 that culminated in the Paragon XP/S 150, but this machine was preceded by a smaller machine, planned initially to be part of CCS only until the arrival of the XP/S 150. The smaller machine, the Paragon XP/S 35, also with GP architecture, had 512 nodes with 16 million bytes [megabytes (MB)] of memory per node; it was also delivered in September 1992. In the same month, our Kendall Square machine was expanded to a 64-processor KSR-1. This Kendall Square machine was an important component of our computing capabilities until August 1997.
Using the XP/S 35 as our computing workhorse, we had achieved an initial goal: we could solve very large-scale and formidable problems, often called Grand Challenges. Also, we could utilize to great advantage the excellent computational environment we had assembled. The small XP/S 5 was a machine for code development, initiating users into the realities of parallel computing, and system and computational experiments. The XP/S 35 was reserved for those who had demonstrated the requisite level of expertise and were prepared for major computational projects. To preserve this highly effective arrangement and to extend our computing capabilities, we renegotiated the contract with Intel SSD in early 1994 to make the XP/S 35 a permanent CCS machine and to increase its memory to 32 MB per node. Preserving this hierarchical machine arrangement has been a major factor in the success of CCS.
At this point, Intel modified the structure of its forefront Paragon nodes to the multi-processor (MP) form: three processors per node, two for computing and one for message passing. This is the architecture of the XP/S 150: 1024 MP nodes, each with at least 64 MB of memory, with mesh connectivity as is illustrated in Fig. 1. As part of the aforementioned CRADA, we began receiving an MP node machine in May 1994, thereby providing our users an opportunity to familiarize themselves with and develop strategies for more effectively using the soon-to-be available XP/S 150.
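The node counts and model numbers quoted for the two large machines allow a quick back-of-the-envelope check. The sketch below takes the number after XP/S as a nominal peak in gigaflops and divides it by the number of compute processors. The resulting per-processor figures are derived estimates under that assumption, not vendor specifications, and the `Paragon` class is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paragon:
    name: str
    nodes: int
    compute_cpus_per_node: int   # excludes the message-passing CPU
    nominal_peak_gflops: float   # the number following "XP/S" in the name

    @property
    def compute_cpus(self) -> int:
        return self.nodes * self.compute_cpus_per_node

    @property
    def mflops_per_compute_cpu(self) -> float:
        # 1 gigaflops = 1000 megaflops; an estimate, not a measured value.
        return self.nominal_peak_gflops * 1000 / self.compute_cpus


xps35 = Paragon("XP/S 35", nodes=512, compute_cpus_per_node=1, nominal_peak_gflops=35)
xps150 = Paragon("XP/S 150", nodes=1024, compute_cpus_per_node=2, nominal_peak_gflops=150)

for machine in (xps35, xps150):
    print(
        f"{machine.name}: {machine.compute_cpus} compute CPUs, "
        f"about {machine.mflops_per_compute_cpu:.1f} megaflops each"
    )
```

On these assumptions the XP/S 150 has 2048 compute processors at roughly 73 megaflops each, and the XP/S 35 has 512 at roughly 68 megaflops each: the large machine gains its speed mainly from having more processors rather than faster ones.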
Our initial plan was fully implemented on January 4, 1995, when Intel delivered the Intel Paragon XP/S 150 to CCS following successful completion of an arduous set of acceptance tests. Now we have a computer that can do 150 billion mathematical operations per second. But the capabilities in high-performance computing and scientific expectations are escalating rapidly. Accordingly, we are currently discussing with DOE the purchase of a machine with power reaching into trillions of mathematical operations per second (teraflops).
Our current computational environment (shown in Fig. 2) has made possible a striking array of scientific successes, a number of which are described in other articles in this issue. The accomplishments of CCS also point clearly to our sensitivity to what is expected of us through the HPCC initiative. Foremost among these expectations were
- bringing immature parallel systems to production-level capabilities,
- making meaningful progress in solving Grand Challenge problems, and
- accelerating the use of high-performance systems by the U.S. industrial sector.
Fig. 2. The current CCS computational environment.
The results achieved using the CCS machines and the ever-improving machine reliability make clear our success with the first two expectations. Our successes with the third expectation are described in the article "Industrial-Strength Computing: ORNL's Computational Center for Industrial Innovation."
CCS Users and the User System
The major fraction of our computational resources is provided to Grand Challenge groups selected by DOE-MICS through a competitive process. These groups, whose members are located around the country, include high-energy physicists working on quantum chromodynamics, groundwater transport and remediation investigators, materials scientists focusing on studies emphasizing fundamental principles, and computational chemistry modelers addressing pollution chemistry. An additional major user is the DOE Computer Hardware, Advanced Mathematics, Model Physics (CHAMMP) program, in which complex models of the atmosphere and ocean are developed and used with the goal of understanding global climate change over extended periods of time.
A share of CCS computing resources is provided to the CCS director, with the Computational Center for Industrial Innovation (CCII) using the most. ORNL also benefits from CCS resources allocated to internally funded Laboratory Directed Research and Development proposals and to others through a competitive proposal process. We also encourage innovative tests of parallel computing strategies and benchmarking and provide resources to individuals and groups working in these areas.
Most of our users employ FORTRAN in their calculations, but we also have other languages available, including C, C++, and High Performance FORTRAN (HPF). A large number of codes have been written for the Paragon, many of which were developed at ORNL for specific research problems. Others, like Dyna-3D and KIVA-3, are well-known codes from the sequential world that have been rewritten in parallel form for the Paragons. Several of these codes are described in articles in this issue.
We provide a User Services group to assist users with debugging, coding, and eliminating possible machine and algorithm problems; to handle questions related to accounts and scheduling; and to organize educational courses to train users to become effective parallel programmers. An irreplaceable component of CCS is the VizLab, which provides an array of scientific visualization systems and services for our users. The capabilities and accomplishments of the VizLab are described in the article "Scientific Visualization at ORNL."
The impression that may have been imparted to this point is that the only consideration in establishing our computational environment was processor power. Not so. Of equal, or some may say, more importance are data storage and access systems. Powerful computers generate vast quantities of data, and there must be systems in place to obtain, store, and catalog all of these data. Further, those systems must be able to provide rapidly and precisely all these data upon request. And all of these tasks must be done with absolute reliability.
Working collaboratively with the DOE Atmospheric Radiation Measurement (ARM) Program, we have assembled an extraordinary storage system. Currently using NSL-UniTree software, the CCS/ARM storage system is a state-of-the-art client/server file storage system. Principal components include
- a StorageTek 4410 tape robot, updated with advanced tape drives and with a capacity of about 30 trillion bytes (TB);
- nearly 2 TB of disk storage; and
- a pair of IBM 3494 tape robots, the storage system centerpiece, each with a capacity of about 50 TB. One has eight high-performance tape drives, and the other has four.
As may be clear from the specifics here, data storage systems are generally hierarchical. From the computer memory, data will normally flow into disk storage, providing high-speed access, and then, ordinarily, into slower-speed but higher-capacity tape storage environments.
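The component list above can be tallied by tier to see how capacity is distributed across the hierarchy. The sketch below uses the approximate capacities quoted in the list ("about" and "nearly" in the text), so the totals are rough. The data layout and function name are hypothetical.

```python
# Approximate capacities from the component list, in terabytes (TB).
# Each entry: (component, tier, number of units, capacity of each unit in TB)
COMPONENTS = [
    ("disk storage", "disk", 1, 2),
    ("StorageTek 4410 tape robot", "tape", 1, 30),
    ("IBM 3494 tape robot", "tape", 2, 50),
]


def capacity_by_tier(components):
    """Sum the approximate capacity of each storage tier."""
    totals = {}
    for _name, tier, units, tb_each in components:
        totals[tier] = totals.get(tier, 0) + units * tb_each
    return totals


totals = capacity_by_tier(COMPONENTS)
total_tb = sum(totals.values())
print(totals)
print(f"total: about {total_tb} TB")
```

The tally comes to roughly 2 TB of fast disk in front of roughly 130 TB of tape, about 132 TB in all. The small, fast tier feeding a much larger, slower one is the hierarchical pattern described above.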
Although the storage system software is currently NSL-UniTree, the future looks very different. CCS, together with Sandia National Laboratories, Lawrence Livermore National Laboratory, Los Alamos National Laboratory, and IBM Global Government Industry, has been developing a software system having far higher capability, the High-Performance Storage System (HPSS). HPSS is indeed the storage system of the future, providing data transfer rates extending into the multigigabit-per-second range. An advantage of HPSS is that it is network centered. That is, the control network can be distinct from the data transfer network, allowing the data to be transferred at network speeds rather than be slowed by having to flow through the memory of a control device. A possible configuration is sketched in Fig. 3.
Fig. 3. The innovative data storage configuration made possible by the High-Performance Storage System (HPSS) software. Note that the control network and the data-transfer network are logically distinct. Also, HPSS permits parallel data transfers to and from disks and tapes.
While NSL-UniTree is sequential, HPSS is parallel. Thus, data from a single data file or multiple data files can be transferred into or out of the storage system through multiple channels simultaneously. Having these parallel capabilities is essential for many high-performance computing environments. Effective use of our multiple tape-drive and Redundant Array of Inexpensive Disks (RAID) storage systems requires these parallel capabilities, as does our research with medical records and medical image archives.
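The idea of moving a single file through several channels at once can be sketched as striping: the file is cut into fixed-size blocks, the blocks are dealt round-robin onto the channels, and the receiver interleaves them back into order. The code below is a minimal illustration using threads as stand-in channels. It is not HPSS's actual mechanism or interface, and all names and parameters are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_stripes(data: bytes, channels: int, block_size: int):
    """Deal fixed-size blocks round-robin onto `channels` stripes."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    return [blocks[ch::channels] for ch in range(channels)]


def transfer(stripe):
    """Stand-in for moving one stripe over its own channel (here, a copy)."""
    return [bytes(block) for block in stripe]


def reassemble(stripes):
    """Interleave the stripes back into the original block order."""
    out = []
    longest = max(len(stripe) for stripe in stripes)
    for i in range(longest):
        for stripe in stripes:
            if i < len(stripe):
                out.append(stripe[i])
    return b"".join(out)


def parallel_transfer(data: bytes, channels: int = 4, block_size: int = 8) -> bytes:
    """Send `data` through several channels at once and rebuild it."""
    stripes = split_into_stripes(data, channels, block_size)
    with ThreadPoolExecutor(max_workers=channels) as pool:
        received = list(pool.map(transfer, stripes))  # all stripes in flight together
    return reassemble(received)


payload = bytes(range(256)) * 3 + b"tail"
assert parallel_transfer(payload) == payload
print(f"moved {len(payload)} bytes over 4 channels")
```

In a real system each channel would be a separate tape drive, disk, or network path, so the aggregate rate can approach the sum of the individual rates rather than being capped by one device.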
Experiments with HPSS are under way in our computer room. Following extensive stress testing of this system, HPSS is to become our production storage system software in 1997. We are also very pleased that HPSS, now an IBM product, received a 1997 R&D 100 Award from R&D magazine.
Essential to all of our computing are networks that provide for quick and efficient movement of information. CCS uses many different types of networking hardware to provide access to our machines as well as to the CCS/ARM storage system. Within our computer center, High-Performance Parallel Interface (HiPPI) networks provide 800 megabit-per-second (Mb/s) connections among our Paragon systems and the CCS/ARM storage system.
Connectivity between the computer center and our ORNL CCS users occurs through both fiber distributed data interface (FDDI at 100 Mb/s) and Ethernet (10 Mb/s). CCS's principal connection to the outside world occurs through the Energy Sciences Network (ESnet), the DOE Energy Research branch of the Internet where very important innovations and additions are taking place. General connectivity through ESnet, currently at the speed denoted T3, or 45 Mb/s, will reach 622 Mb/s in 1998.
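To give a feel for what these link speeds mean in practice, the sketch below estimates how long a data set would take to cross each network. It is an idealized calculation: it assumes the full nominal rate is available and ignores protocol overhead, latency, and contention. The 1-gigabyte data set size is a hypothetical example, not a figure from CCS.

```python
# Nominal link rates quoted in the text, in megabits per second (Mb/s).
LINK_RATES_MBPS = {
    "Ethernet": 10,
    "ESnet at T3": 45,
    "FDDI": 100,
    "ESnet planned for 1998": 622,
    "HiPPI": 800,
}


def transfer_seconds(num_bytes: float, rate_mbps: float) -> float:
    """Ideal time to move `num_bytes` over a link of `rate_mbps` megabits/s."""
    return num_bytes * 8 / (rate_mbps * 1_000_000)


DATASET_BYTES = 1_000_000_000  # hypothetical 1 GB result file

for name, rate in sorted(LINK_RATES_MBPS.items(), key=lambda item: item[1]):
    seconds = transfer_seconds(DATASET_BYTES, rate)
    print(f"{name:<24} {rate:>4} Mb/s  {seconds:8.1f} s")
```

Even in this best case, the same file that crosses the HiPPI network in about 10 seconds needs about 800 seconds over Ethernet, which is why the faster links are reserved for traffic between the Paragons and the storage system.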
As we emphasized earlier, the building blocks of a lightning-fast parallel supercomputer are small computers linked together that simultaneously solve pieces of a complex problem. So, most computing experts agree that the next logical step in high-performance computing is to link these fast parallel supercomputers together. But what if two of the world's fastest computers are separated by hundreds or even thousands of miles? Then the challenge becomes finding a way to connect them through a very-high-speed network to ensure the combined power made possible by this connection.
Scientists from CCS and Sandia National Laboratories (SNL) are demonstrating the effectiveness of such an arrangement (see Fig. 4). These two laboratories possess a striking level of computer power, making such a connection noteworthy.
The basic idea is to link the three large Intel Paragons (an 1840-compute-processor XP/S 140 at SNL and the 2048-compute-processor XP/S 150 plus the 512-compute-processor XP/S 35 in the CCS) over a high-speed asynchronous transfer mode (ATM) network, to solve problems too large for one machine alone, extraordinarily formidable problems relevant to both ORNL and SNL. By extending computational parallelism into the network and surmounting an extensive array of technical hurdles on this path to distributed high-speed computing, the researchers are also making important contributions to the technology.
Fig. 4. Through high-speed ATM networks, Intel Paragons at SNL and ORNL will work together to solve complex problems related to national security, materials modeling, and global climate change, among others.
To use this distributed computing power effectively requires codes developed with the ORNL-SNL communication time in mind. One example is a materials science code written by ORNL researchers to model the magnetic structure of complex magnetic alloys. Another is a global change code that couples the ORNL code modeling the atmosphere with the Los Alamos National Laboratory code modeling the ocean to provide a superior climate simulation over extended times. Additional codes that address national security are also being prepared.
To enable the Paragon computers to work together, ATM boards were specially designed and built for them by the small company GigaNet, with support from SNL. The excellence of these boards, which recently received an R&D 100 Award, was initially demonstrated through high-performance links connecting the SNL, ORNL, and Intel booths at the Supercomputing 95 exhibition in San Diego.
Our work is focused on ensuring compatibility among codes, operating systems, ATM, and communications software (PVM and MPI). We are also seeing a fascinating array of performance challenges that must be overcome in order to meet scientific goals. Significant questions concerning network connections and network availability are being addressed.
We do want to emphasize again that our objective with the ORNL-SNL connection is to solve major problems. We are not simply doing a connectivity demonstration. The work being done through CCS is state-of-the-art. The GigaNet ATM boards operate at the speed designated OC-12, or 622 Mb/s. Using an ATM board on each of the XP/S 150 and XP/S 35 supercomputers in the CCS computer center, we are doing experiments at this speed. However, our connections to SNL at this point are no faster than OC-3, or 155 Mb/s. We anticipate having access to SNL over networks providing OC-12 connectivity in 1998. Our goal is OC-48, about 2.5 gigabits per second, to be achieved through use of four GigaNet boards in parallel at each site. This networking capability will permit a huge step in the sizes of the materials science and national security problems that we can solve and will open avenues to effective distributed computing with far larger machines than we are currently using.
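The OC designations follow a simple rule: in the SONET hierarchy, an OC-n link carries n times the base OC-1 line rate of 51.84 Mb/s. The sketch below applies that rule to the three levels mentioned here and checks that four OC-12 boards working in parallel match one OC-48 link. These are nominal line rates, which include framing overhead, so the data rate seen by an application would be somewhat lower. The function name is hypothetical.

```python
OC1_MBPS = 51.84  # SONET OC-1 base line rate in megabits per second


def oc_rate_mbps(n: int) -> float:
    """Nominal line rate of an OC-n link in megabits per second."""
    return n * OC1_MBPS


for level in (3, 12, 48):
    print(f"OC-{level:<2} {oc_rate_mbps(level):8.2f} Mb/s")

# Four boards at OC-12, used in parallel, give the aggregate rate of OC-48.
boards = 4
aggregate = boards * oc_rate_mbps(12)
print(f"{boards} x OC-12 = {aggregate:.2f} Mb/s")
```

By this rule OC-3 is 155.52 Mb/s, OC-12 is 622.08 Mb/s, and OC-48 is 2488.32 Mb/s, or just under 2.5 gigabits per second.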
In conclusion, the Laboratory's CCS has emerged as one of the world's leading computational science centers. The scientific emphases in CCS are applications and the solution of large-scale, complex problems. It is, however, the technical expertise embodied in the powerful computing environment established in CCS that has made these applications successes possible. This expertise and the innovative spirit in CCS and at SNL have been key features in extending the capabilities and impact of distributed computing through the networked ORNL-SNL system.
KENNETH L. KLIEWER, director of the Center for Computational Sciences (CCS) at ORNL, is a theoretical physicist. Computing and computational science have been at the center of his career. He has a Ph.D. degree in physics from the University of Illinois at Urbana-Champaign. In addition to the CCS directorship, he is the leader of the ORNL High-Performance Storage System (HPSS) development team and a member of the HPSS Executive Committee. Prior to joining ORNL in 1992, he was dean of the School of Science, assistant vice-president for research, and professor of physics at Purdue University. He came to Purdue from Argonne National Laboratory, where he was Associate Laboratory Director for Physical Research. His research interests include surface physics, optical physics, and electrochemistry. He is a fellow of the American Physical Society.