Algorithms, Tools, and Software Aid Use of High-Performance Computers
By Ed DAzevedo, Jack Dongarra, Tom Dunigan, Al Geist, Chuck Romine, and Pat Worley
Al Geist (left) discusses the Intel Paragon XP/S 150 (shown here) with Jack Dongarra, Ed DAzevedo, Tom Dunigan, Pat Worley, and Chuck Romine. Photograph by Curtis Boles.
ORNL and University of Tennessee at Knoxville (UTK) researchers are at the forefront in developing tools and techniques for using high-performance computers efficiently. Parallel Virtual Machine software enables computers connected around the world to work together to solve complex scientific, industrial, and medical problems. The Distributed Object Library makes it easier to program the Intel Paragon super-computer, recently enabling the ORNL Paragon to break a computational record in molecular dynamics simulations. The Distributed Object Network I/O Library increases the performance in data input and results output on the Intel Paragon, often by as much as an order of magnitude. ORNL and UTK researchers are also leaders in ongoing efforts to standardize the message-passing parallel programming paradigm and to develop improved techniques for evaluating parallel computers.
In the past decade, the world has experienced one of the most exciting periods in computer development, and ORNL researchers have played key roles in these developments. Computer performance improvements have been dramatica trend that promises to continue for the next several years.
One reason for the improved performance is the rapid advance in microprocessor technology. Microprocessors have become smaller, denser, and more powerful. Indeed, if cars had made equal progress, you could buy a car for a few dollars, drive it across the country in a few minutes, and park the car in your pocket! The result is that microprocessor-based supercomputing is rapidly becoming the technology of preference for attacking some of the most important problems of science and engineering.
To exploit microprocessor technology, vendors have developed massively parallel computers. ORNL has always been at the forefront in acquiring and using the latest of these computers. In 1986, ORNL acquired Serial No. 1 of the Intel iPSC-2, containing 64 Intel 386 processors. More recently, ORNL has installed the Intel Paragon XP/S 150, one of the largest parallel computing systems in the world. It contains 3204 high-speed processors and has a peak performance of 150 gigaflop/sthat is, it can perform 150 billion basic arithmetic operations per second.
Massively parallel systems offer the enormous computational power needed for solving Grand Challenge problems such as simulating the climate. Unfortunately, software development has not kept pace with hardware advances. New programming paradigms, languages, scheduling and partitioning techniques, and algorithms are needed to exploit fully the power of these massively parallel machines.
A major new trend for scientific problem solving is distributed computing. In distributed computing, computers connected by a network are used collectively to solve a single large problem. Many scientists are discovering that their computational requirements are best served not by a single, monolithic computer but by a variety of distributed computing resources linked by high-speed networks. This worldwide popularity of distributed computing can be traced back mostly to software developed in ORNLs distributed computing research project.
Combining several parallel super-computers with distributed computing techniques can produce unprecedented computational power. Recently, researchers at ORNL and Sandia National Laboratories (SNL) combined the 3204-processor computer at ORNL with the 2680-processor computer at SNL to solve a large computational problem in materials science.
In this article, we explore some of the issues involved in the use of high-performance computers. These issues are at the heart of effective use of the fastest supercomputers. In particular, we consider software tools and performance evaluation.
The widespread acceptance of parallel computers as an environment for high-performance scientific computation has been inhibited by the lack of efficient and portable parallel software. Writing parallel software is much more difficult than writing sequential software: parallel programs must express both the sequential computations and the interactions among the sequential computations that define the parallelism.
A number of software tools developed at ORNL, discussed below, promise to reduce the complexity of parallel programming.
Enabling heterogeneous network computing. Until recently, all computers on a network were considered separate units connected merely to allow users to transfer files and send electronic mail to each other. Today, researchers in high-performance computing are combining computers on the network in such a way that users can exploit their aggregate performance and memory to run a single parallel application.
A collection of computers that differ in their architecture or in their method of representing data are called heterogeneous. Heterogeneous network computing offers several advantages over large-scale parallel computers:
On the other hand, getting heterogeneous computers to communicate and cooperate with each other is a challenging computer science problem.
- Cost is reduced by using existing hardware, usually a collection of workstations.
- Performance can be improved by assigning each individual task to the most appropriate architecture.
- Development and debugging use the familiar tools running on the users personal workstation.
- User-level or program-level fault tolerance (the ability to operate continuously in the event of a failure) can be implemented either in the application or in the underlying operating system.
To solve this problem, ORNLs distributed computing research project, in collaboration with the University of Tennessee at Knoxville (UTK), produced the Parallel Virtual Machine (PVM) software package. PVM permits a heterogeneous collection of computers linked by a network to be used as a single large parallel computer.
PVM enables users to exploit their existing computer hardware to solve much larger problems at minimal additional cost. The software and documentation are available from the World Wide Web site listed at the end of this article. ORNL researchers are continuing to explore the frontiers of distributed computing, working on security, fault tolerance, high-speed asynchronous transfer mode (ATM) networks, and graphical interfaces to assist users in programming for distributed computing.
Today, hundreds of sites around the world are using PVM to solve important scientific, industrial, and medical problems. Automotive, aerospace, chemical, computer, environmental, medical, pharmaceutical, and oil companies are all using this software as a cost-effective way to design new products. DOE national laboratories, National Science Foundation supercomputer centers, and National Aeronautics and Space Administration research centers, as well as numerous universities around the country, are using PVM both for research and as a teaching tool. With thousands of users, PVM has become the de facto standard for distributed computing worldwide.
Standardizing the message-passing model. The message-passing model is a programming paradigm used widely on parallel computers and on networks of workstations. The basic concept of communicating through messages is well understood, and over the past 10 years, many significant applications have been recast into this paradigm. More recently, several public-domain systems have demonstrated that a message-passing system can be implemented efficiently and portably.
Message passing is used to specify the communication among a set of processes forming a concurrent program. The message-passing paradigm is attractive because it is portable and scalable. Message passing is compatible with distributed-memory multicomputers, shared-memory multiprocessors, networks of workstations, and combinations of these elements. Many diverse message-passing systems were developed, but until 1994, no standard was agreed upon.
In 1992, ORNL and UTK spear-headed an international effort to define a standard interface for message passing. The effort involved more than 80 people from approximately 40 organizations from the United States, Asia, and Europe, including computer vendors and researchers from universities, government laboratories, and industry. This effort recently culminated in the publication of the Message-Passing Interface (MPI) standard.
The MPI standard defines the syntax and semantics of a core of library routines useful to a wide range of users writing portable message-passing programs in Fortran 77 or C. MPI also forms a possible target for compilers of languages such as High-Performance Fortran. Both commercial and public-domain implementations of MPI exist. These run both on tightly coupled, massively parallel processors and on networks of workstations.
Emulating shared memory. The Intel Paragon XP/S 150 at ORNLs Center for Computational Sciences consists of a collection of processors, each with its own local memory, connected by a high-speed communication network. The processors coordinate the computation and exchange of data via message passing. However, efficient coding for these processors requires careful decomposition of data structures and explicit calls to pass data among processors.
Researchers in ORNLs Computer Science and Mathematics Division have developed the Distributed Object Library (DOLIB) to make the Paragon easier to program. DOLIB is a library of Fortran- and C-callable routines that create a shared-memory programming environment for the distributed-memory Paragon.
The original motivation for the creation of DOLIB was in the context of parallel Lagrangian particle tracking on the Paragon, where the message traffic pattern changes dynamically depending on the location of particles at run time. (Lagrangian particle tracking is a numerical technique used in computational fluid dynamics for such applications as the transport of contaminants in groundwater and the advection of moisture in the atmosphere.) Previous efforts to parallelize particle tracking on the Paragon generally involved decomposing the computational grid into subdomains, with padding (or extended ghost regions) to contain flow-field information from neighboring regions. The time step was then restricted to ensure that no particle could exit the extended region. This approach is relatively simple to program, but creates a serious dilemma. If the extended regions are small (to conserve memory), a severe constraint may be placed on the size of the allowable time step. On the other hand, too large an extended region incurs a heavy cost in memory use and communication traffic. For three-dimensional problems with relatively high-velocity flow, the dilemma may have no acceptable solution.
A more natural approach is to consider the flow field as a globally shared array. Processors directly access only the flow-field information they require, using gather and scatter operations. As a result, particles can be allocated dynamically to the processors. DOLIB has been used to parallelize several codes for the Paragon. One such code is a semi-Lagrangian transport (SLT) code, which is used for climate simulations. Using DOLIB, researchers were able to parallelize SLT easily. The performance of the resulting code is competitive with a hand-parallalized code that uses explicit message passing and extended regions. Moreover, on high-resolution refined grids, the amount of memory available for extended regions severely constrains the time step. Thus, the DOLIB version has an advantage.
SOTON-PAR, a molecular dynamics code, has also benefited from DOLIB. ORNL researchers modified SOTON-PAR to use DOLIBs global shared memory. The DOLIB shared memory version allows more efficient use of global aggregate memory, which allows running a billion-particle simulation on the Intel Paragon at ORNL. The DOLIB version shattered the previous world records for simulation size (600 million particles on the 1024-node CM-5 and 400 million particles on ORNLs Paragon). Figure 1 shows that the code has almost ideal scalability with problem size and number of processors.
Fig. 1. Using DOLIB, the SOTON-PAR molecular dynamics code shows linear speedup in number of processors and also in the number of atoms.
Speeding the input/output. Getting data in (and results out) of a parallel supercomputer fast enough is a challenge for many high-performance applications. To speed disk input and output on the Intel Paragon, ORNL researchers developed the Distributed Object Network Input/Output (I/O) Library (DONIO). The system uses the aggregate global shared memory provided by DOLIB as a large (gigabyte) disk cache. On most high-performance supercomputers such as the Intel Paragon, the communication network has a much higher bandwidth (bytes per second) than the disk subsystem. By caching data in the memory of remote processors instead of writing to disk, DONIO can achieve an order-of-magnitude speedup over native system routines. The library also reads or writes data to multiple disks concurrently to achieve near-peak I/O performance.
The DONIO library has been used in several significant applications. For example, researchers used DONIO in a computational chemistry application to reduce time for checkpointing (saving intermediate results used for resuming computation) from 30 minutes to about 50 seconds. In another application, a code for seismic analysis used DONIO to reduce I/O time from nearly 2.5 hours to only 5 minutes. Researchers at Lawrence Livermore National Laboratory are also interested in adapting DONIO for their PARFLOW saturated groundwater code to run on the Cray T3D.
The software tools described previously are used to provide a programming environment or to improve performance on high-performance computers. Performance evaluation is concerned with measuring performance and determining whether and how performance can be improved. For example, performance studies of I/O on the Paragon were used to optimize the implementation of DONIO on the Intel Paragon.
Several metrics are used in evaluating parallel computer architectures. One of the principal scientific metrics is megaflops, millions of floating-point operations per second that a benchmark or application can achieve on a given computer system. For a parallel computer, we usually measure the mega-flops on a single processor and then calculate the performance when multiple processors are used to execute the benchmarks or applications. Ideally, we hope for scalability: that an application will run 10 times faster on 10 processors than it did on one processor, and 1000 times faster on 1000 processors.
Fig. 2. Message-passing performance.
Another indicator of performance involves communication by sending messages. Typically, two separate metrics are involved: latency, or the time it takes to send a small message between processors, and throughput, or the data rate (in bytes per second) achieved when transferring a large message between processors. Figure 2 illustrates the message-passing performance of several parallel processors using these two metrics. The upper-left region of the plot is the high-performance area, whereas the lower right represents the performance of a typical local area network like Ethernet. Another related metric is the latency and bandwidth of I/O, whether reading/writing to a disk or exchanging messages with some other (slower) external device.
In the following subsections, we discuss benchmarking, in which these and other metrics are measured, and performance evaluation methodology.
Benchmarking. The term benchmarking is drawn from its use in surveying, where it represents a mark on a stationary object whose position and elevation have been measured. Once made, the mark is used as a reference point in tidal observations and surveys. Analogously, benchmarking of a computer system is intended to measure new systems relative to a reference point on current systems. In particular, benchmarks are standardized computer programs for which there is a history of measurement data (typically timings) for executions of the programs with specifically defined input and repeatable output.
Benchmarking the performance of a computer is a complicated issue because it is a function of many interrelated quantities. These quantities include the application, the algorithm, the size of the problem, the choices of high-level language and implementation, the level of human effort used to optimize the problem, and the compilers ability to optimize, as well as the operating system, architecture, and hardware characteristics of the system under test.
Because of this complexity, bench-marks typically are only one of many criteria used to evaluate computer performance. Nevertheless, such bench-marks are helpful to algorithm designers seeking the optimal coding style for a given system, to system developers seeking to match machine characteristics to the requirements defined by their target workloads and represented to them through the bench-marks, and to individuals and groups seeking to procure the appropriate computing system for a given installation.
Several benchmarks have evolved for both scientific and commercial processors. Among the best known for scientific computing are the Livermore Fortran Kernels, the Los Alamos Benchmarks, the NAS Kernels, and the LINPACK tests. Recently, motivated by a growing concern in the supercomputing community that existing bench-marks are too simplistic to fully represent scientific computing, researchers have attempted to define new standards. These efforts include the ParkBench activity, initiated by researchers at ORNL and UTK, and the System Performance Evaluation Cooperative (SPEC) project.
Performance evaluation methodology. While benchmarking is primarily concerned with performance measurement, the performance evaluation process also involves deciding which benchmark (and other performance) data to collect and how to use it to answer specific performance-related questions.
For example, the performance evaluation of a computer system may determine whether the system is an appropriate platform for some given task or for a given workload (set of application programs or tasks). This type of evaluation is a crucial part of the procurement process, identifying which computer system to purchase or buy time on. If a particular computer system cannot do the required job in the required time, its cost does not matter.
Performance evaluation can also be used to identify how components of the computer system interact. Such evaluation is important in identifying which components are most limiting to performance and how they might be modified to improve performance. Similarly, evaluation activities are important for providing guidance in the use of the computer, identifying what the computer is good at doing, and which tasks are expensive to perform (and should be avoided or minimized). For example, on distributed-memory machines like the Intel Paragon or IBM SP2, there is a range of costs in accessing data, depending on where those data currently reside: in the local memory of the processor that is scheduled to use the data, in the local memory associated with a different processor but that is accessible via a fast network connection, or in some storage device that is accessible only via a slower network connection. Programming to take advantage of data locality (assigning data to the local memory of processors that are most likely to need the data) is generally a good idea but is not always possible or may require costly restructuring of an existing program. Understanding the relative costs of accessing data from different levels of the memory hierarchy can indicate how important it is to exploit locality (and minimize accesses to nonlocal data).
The evaluation process examines the computer system at a variety of levels. In low-level tests, the individual elements of the system are exercisedfor example, determining how quickly a processor can compute a result, or determining how quickly a processor can access data from local memory, from nonlocal memory, or from an external storage device. These tests try to determine not only peak performance (the best that can be achieved), but also the performance that is typically achieved, and what factors distinguish between peak and typical performance.
The next level is kernel performance: looking at the ability of the system to solve common algorithmic problems or execute system functions. Here, the focus is on seeing how quickly a single processor can calculate, say, a matrix-matrix multiplication or compute a fast Fourier transform, how quickly a group of processors might accomplish the same or larger versions of these tasks, and how quickly one processor can broadcast data to all of the other processors. The reason behind the performance of a given kernel may not always be clear from these tests, but the kernels are chosen to be important functions whose execution time is of intrinsic interest.
The final level is performance of application programs. Initially, these are programs in the required workload that are just being ported to the new system, or they are already ported and optimized programs that are representative of the programs in the workload. For the first type of program, the performance reflects the minimum that can be achieved, but if these application programs cannot or will not be modified, their performance is the most (or only) relevant measurement in the evaluation process. Usually, however, programs will be adapted to make better use of the computer system over time, and the kernel and low-level performance measurements can be used to indicate how much improvement is possible. The application program tests are also the best evaluation of the compilers, which have been the weakest component in some high-performance computer systems. It is frustrating for an application developer to have access to a fast computer system, only to find that the compilers perform so poorly on application programs that the desired performance is not achieved. For example, Fig. 3 shows megaflops per processor as a function of the number of processors for a climate-related parallel application code run on the Intel Paragon, the IBM SP2, and the Cray Research T3D.
Fig. 3. Per-processor performance for a series of different problem sizes on the Intel Paragon (P), the IBM SP2 (S), and the Cray Research T3D (T).
The relative flatness of the curves for the Paragon indicates good scalabilitythat is, per-processor performance is retained when the number of processors increases. One implication of these results is that the interconnection network is fast, and communication costs are manageable. However, the Paragon results also indicate poor performance for the compiled Fortran code when compared to the peak performance of the microprocessors. The results for the SP2 and T3D show better per-processor absolute performance, but are approximately the same fraction of the peak rate for the underlying microprocessor.
A recent addition to the evaluation process is the evaluation of different programming paradigms, or models of computation that are used when designing and implementing programs. Examples of this are the shared memory programming paradigm, in which data locality is hidden from the programmer, and data movement from remote locations is managed automatically by the system software; the message-passing programming paradigm, in which all data movement from nonlocal memory is handled explicitly by the programmer through a special set of system calls; and mixed models like the High-Performance Fortran standard, in which data locality issues can be addressed by specifying what should be local but the data movement required by the program is taken care of automatically. Many of the current high-performance computing systems nominally support two or more of these paradigms, but with vastly different performance characteristics. Moreover, there may be support for more than one type of shared memory or message-passing programming model. For example, most of the current multiprocessor vendors provide both their own proprietary message-passing primitives and one or both of PVM and messaging-passing interface (MPI).
It is important to emphasize that computer systems need to be continually reevaluated. Hardware and software upgrades can change performance dramatically (one hopes) by removing performance bottlenecks in one or more components of the system and by sometimes introducing new performance problems in other components. For this reason, a system administrator or performance specialist should establish a coherent methodology for reevaluating performance at a computer site and for automating the evaluation process as much as possible.
Finally, the evaluation process must take future needs into account. Will the next generation of those programs in the current workload have stricter performance requirements? Is the current computer system scalable; in other words, can increased computational requirements be satisfied by adding more processors or by replacing existing components with faster ones? Answers to these questions typically rely on extrapolation from the characteristics of the current computer system and the current workload. Nevertheless, peak performance measurements can be used reliably to indicate negative resultsfor example, to show that a larger problem cannot be solved quickly enough or that a computer system is not scalable.
When Is Fast Fast Enough?
As parallel computers become ever faster, it is tempting to suppose that they will eventually become fast enough. In practice, however, the users appetite for increased computing power is inexhaustible: the extra power is quickly consumed in an attack on problems previously considered intractable.
The continuing evolution of massively parallel computers and distributed computing systems promises unprecedented speed and power. Some effort will always be necessary to achieve the best performance on these systems. However, current research in parallel and distributed computing at ORNL and elsewhere should enable us to more rapidly capitalize on this power.
- On high-performance computing:
- On software tools:
- On performance evaluation:
EDUARDO DAZEVEDO is a research staff member in ORNLs Computer Science and Mathematics Division (CSMD). He has a Ph.D. degree in computer science from the University of Waterloo, Canada. He first came to the Mathematical Sciences Section at ORNL under an Oak Ridge Associated Universities postdoctoral fellowship in 1990. Since that time, he has been involved in research in numerical linear algebra, triangular mesh generation, and high-performance computing with applications in modeling groundwater flow and contaminant transport.
JACK DONGARRA holds a joint appointment as Distinguished Professor of Computer Science in the Computer Science Department at the University of Tennessee and as Distinguished Scientist in ORNLs Mathematical Sciences Section, CSMD, under the UT/ORNL Science Alliance Program. He specializes in numerical algorithms in linear algebra, parallel computing, use of advanced-computer architectures, programming methodology, and tools for parallel computers. He was involved in the design and implementation of the software packages EISPACK, LINPACK, BLAS, LAPACK, ScaLAPACK, Netlib/XNetlib, PVM/HeNCE, MPI, and the National High-Performance Software Exchange. His current research includes the design of algorithms and techniques for high-performance computer architectures and the development, testing, and documentation of high-quality mathematical software. You can learn more about him through his UT web page: http://www.netlib.org/utk/people/JackDongarra.html.
TOM DUNIGAN is a research staff member in CSMD. He has worked at all three Oak Ridge facilities since the mid-1970s. After he joined the Computer Science Research Group in 1982, he established ORNLs first Internet connection. His research interests are in parallel computing and high-speed networking. He has been involved in beta testing and performance analysis of several first-generation parallel computers at ORNL. He has a Ph.D. degree in computer science from the University of North Carolina. He is an adjunct faculty member of the University of Tennessee Computer Science Department.
AL GEIST, who joined ORNL in 1983, is leader of the Computer Science Group in the Mathematical Sciences Section of CSMD. One of the developers of the Parallel Virtual Machine (PVM) software system and one of the designers of MPI, he has published papers in areas such as solar energy, materials science, solid-state physics, parallel computing, scientific computation, distributed computing, and numerical linear algebra. To learn more, see http://www.epm.ornl.gov/~geist.
CHUCK ROMINE joined the Mathematical Sciences Section of CSMD in 1986 after receiving his Ph.D. degree in applied mathematics from the University of Virginia. His main areas of research interest include parallel numerical linear algebra and software tools for parallel numerical computation. Most recently, as a member of the Partnership in Computational Science team, he has been developing software tools to support parallel models for groundwater flow and contaminant transport on high-performance supercomputers such as the Intel Paragon at ORNL.
PAT WORLEY is a research staff member in CSMD. He works in numerical analysis, parallel algorithm design, computer performance evaluation, and software tools. He is a principal investigator for the Center for Computational Sciences Evaluation of Early Systems Project and a member of the ORNL CHAMMP project team, which is part of the Global Change Research Program. He has a Ph.D. degree in computer science from Stanford University. He joined ORNL in 1987 as a researcher in the Mathematical Sciences Section. For more information, see http://www.epm.ornl.gov/~worley.
Next article | Contents | Search | Mail | Review Home Page | ORNL Home Page