High-Performance Computing: Innovative Assistant to Science
By James Arthur Kohl
Jim Kohl (left) and Phil Papadopoulos enjoy the ambience of Kohl's office, where scientific experiments are simulated and their results are viewed on the computer using CUMULVS. Photograph by Tom Cerniglio.
ORNL researchers push the envelope to put heavy-duty computer power within scientists' grasp using the CUMULVS system developed at ORNL. It helps scientists simulate experiments and change the parameters in midcourse to influence the results, saving time and money.
Scientists today are frequently turning to high-performance computer simulation as a more cost-effective alternative to physical prototyping or experimentation. The on-line nature of computer simulation encourages more flexible collaboration scenarios for researchers across the country, and the interactivity that is possible can often shorten the design cycle. Yet many issues must be overcome when developing a high-performance computer simulation, especially when many scientists will interact with the simulation over a wide-area network. The scientists must be able to observe the progress of the simulation and coordinate control over it, and the user environment must be capable of withstanding or recovering from system failures. The efficient handling of these issues requires a special expertise in computer science and a level of effort higher than the typical application scientist is willing to expend.
CUMULVS, a new software system developed at ORNL by James Kohl and Philip Papadopoulos, bridges the gap between scientists and their computer simulations. CUMULVS allows a collaborating group of scientists to attach dynamically to an ongoing computer simulation in order to view the progress of the computation and control the simulation interactively. CUMULVS also provides a framework for integrating fault tolerance into the simulation program, for saving user-directed checkpoints, for migrating the simulation program across heterogeneous computer architectures, for automatically restarting failed tasks, and for reconfiguring the simulation while it is running.
What's in a Name?
People sometimes ask us, "Why do you call your system CUMULVS, and why do you spell it with a V?" Well, here's the answer.
Originally, Phil Papadopoulos and I wrote a simple system to integrate PVM message passing with AVS visualization, called PVMAVS. This system worked well, but it was not very general or flexible. So, we rewrote the system from scratch and then called it StovePipe, which vaguely stood for Steering and Visualization of Parallel Programs. Unfortunately, we later found out that "stovepipe" was a term used to describe out-of-date technology in the Department of Defense. To avoid bad connotations, we changed the name of the system from StovePipe to CUMULVS, which in some sense refers to a bunch of hot air, or a voluminous amorphous form. Seriously speaking, CUMULVS is an acronym for Collaboration, User Migration, User Library for Visualization and Steering. My comment is always that, since I am the visualization guy on the team, I must have a V in there somewhere. So we mimicked old Roman lettering and replaced the last "u" in "cumulus" with a "v". And that's the last word on the next-to-the-last letter. JAK
With CUMULVS, each of the collaborators can start up an independent viewer program that will connect to the running simulation program. This viewer allows scientists to browse through the various data fields being computed and observe the ongoing convergence toward a solution. Especially for long-running simulations, it can save countless hours waiting for results that might have gone awry in the first few moments of the run. Once connected, the scientists each view the simulation from their own perspective, receiving a steady sequence of snapshots of the program data that is of interest to them. CUMULVS supports a variety of visualization systems for viewing these snapshots, from commercial packages such as AVS, to public domain interfaces such as Tcl/Tk (developed at Sun Microsystems). CUMULVS takes care of the many complicated tasks of collecting information and coordinating communication between the various viewers and the simulation program. The result is a single, simple picture of the computation space, presented as a uniform field of data even if the actual data are distributed across a set of parallel tasks.
Scientists can use CUMULVS for collaborative computational steering in which certain parameters of a physical simulation or of an algorithm can be adjusted while the program is running. In a typical scenario, several scientists would attach to a simulation to view its progress. One of them might discover that something has gone wrong or is heading in the wrong direction. For example, if they were trying to synthesize a new material, they might decide the cooling rate must be changed to make the material stronger. At this point the scientist could adjust, or steer, certain physical or algorithmic features to try to fix the simulation, or the simulation could simply be restarted with a new set of inputs (such as an altered cooling rate). This type of interaction can save immense amounts of time by shortening the experimentation cycle. The scientist need not wait for the entire simulation to be completed before making the next adjustment. CUMULVS provides mechanisms that allow groups of scientists to cooperatively manipulate a simulation, with automatic locking capabilities that are invoked to prevent conflicting steering requests for any single parameter. Although only one scientist at a time can adjust the value of any single parameter, any number of different parameters can all be adjusted simultaneously.
To interact with a simulation using CUMULVS, the simulation program must be instrumented to describe the primary computational data fields and algorithmic or physical parameters. These special declarations consist of the data type, the size and cardinality of arrays, and any distributed data decompositions. CUMULVS needs declarations only for the data that are to be viewed and the parameters that are to be steered. At the point in the simulation where the data values are deemed valid, a single library call is made to temporarily pass control to CUMULVS. Here, any pending viewer requests are handled and any steering parameter updates are processed. If no viewers are attached, this library call carries only the overhead to check once for an incoming message, so that there is negligible intrusion to the simulation.
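The article does not reproduce the CUMULVS programming interface, so the fragment below is only a rough sketch of what this kind of instrumentation might look like in C. The cumulvs_* functions, the compute_one_iteration routine, and the field and parameter names are invented here for illustration; they are not the actual CUMULVS calls, but they follow the pattern just described: declare the data once, then hand control to the library briefly at each point where the data are valid.

    /* Sketch only: the cumulvs_* functions and compute_one_iteration are
     * invented placeholders for illustration, not the real CUMULVS calls.
     * They show the pattern described above: declare the viewable fields
     * and steerable parameters once, then briefly hand control to the
     * library at the point in each iteration where the data are valid. */
    #define NX 200
    #define NY 100

    void cumulvs_init(const char *app_name);
    void cumulvs_declare_field(const char *name, double *data, int rank, const int *dims);
    void cumulvs_declare_param(const char *name, double *value);
    void cumulvs_service(int iteration);
    void compute_one_iteration(double p[NX][NY], double mach);

    static double pressure[NX][NY];    /* data field that viewers may request */
    static double mach_number = 0.8;   /* parameter that viewers may steer    */

    int main(void)
    {
        int dims[2] = { NX, NY };

        cumulvs_init("flow");          /* the name viewers use to attach      */
        cumulvs_declare_field("pressure", &pressure[0][0], 2, dims);
        cumulvs_declare_param("mach_number", &mach_number);

        for (int iter = 0; iter < 10000; iter++) {
            compute_one_iteration(pressure, mach_number);

            /* The data are consistent here, so let the library answer any
             * pending viewer requests and apply any steering updates. With
             * no viewers attached this is just a quick check for a message. */
            cumulvs_service(iter);
        }
        return 0;
    }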
The various communication protocols that CUMULVS uses to coordinate the interactions between the viewers and the simulation are tolerant of computer faults and network failures. In addition, CUMULVS provides a checkpointing mechanism for making the simulation program itself fault-tolerant. Using this mechanism, the state of a simulation program can be saved periodically. Then the data stored in the checkpoint can be used to automatically restart the simulation if a computer node should crash or if a network should fail. CUMULVS checkpoints can also be used to migrate parallel simulation tasks across heterogeneous computer architectures on the fly. This capability is typically not possible with traditional checkpointing schemes, but CUMULVS uses a special user-directed checkpointing approach. Because the scientist has precisely described the data in the simulation program, CUMULVS has the additional semantic information necessary to automatically migrate a task using a minimal amount of information. CUMULVS can save the state of a simulation task and then restore it, even if the new task's computer is of a different architecture or data format. Beyond that, CUMULVS can actually reconfigure an entire application by reorganizing the checkpoint data to fit a new data decomposition. So, a checkpoint saved on a cluster of workstations can be restarted to continue executing on a large parallel machine with many more nodes, or vice versa.
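Again as a hedged sketch rather than the real interface, user-directed checkpointing might look something like the following in C; the cumulvs_* names are invented for illustration. The essential idea is that the program itself names the variables that make up its restart state and says when that state is consistent, which is the extra semantic information that lets the library rebuild the state on a different architecture or data decomposition.

    /* Sketch only: invented cumulvs_* names, not the real CUMULVS API.
     * User-directed checkpointing means the program itself declares which
     * variables make up its restart state and says when that state is
     * consistent enough to save. */
    void cumulvs_checkpoint_var(const char *name, void *data, long bytes);
    void cumulvs_checkpoint_commit(int iteration);

    void save_state(double *pressure, long npoints, int iteration)
    {
        /* Declare which data make up the restart state ...              */
        cumulvs_checkpoint_var("pressure", pressure, npoints * (long)sizeof(double));

        /* ... and save it at a point where it is known to be consistent. */
        cumulvs_checkpoint_commit(iteration);
    }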
To better understand the usefulness of these CUMULVS features, consider a sample scenario. Suppose you're an engineer on a big project team that is designing a new high-tech jet airplane. Your job is to make sure that the air flows smoothly over the wings and around the engine intakes. If your design is off by even a small amount, you might bring down a multimillion-dollar aircraft, not to mention making a lifelong enemy out of the poor pilot, assuming he or she survives.
What you need is some way to really test out your ideas, try things out a few different ways, and make sure you've got it right before they start forming those expensive prototype airfoils. So you put your expertise to work, sit down at the computer, and whip up a computational fluid dynamics (CFD) simulation of the air flowing around one of your wings. You sit back and patiently wait for the results. And you wait. And you wait some more. Whew! Finally, after many hours of waiting for your program to converge on a solution, you get the answer. But something went wrong. Your wing design didn't produce the smooth flow you expected. "What happened?" you ask.
With CUMULVS at your side, the problems with your wing simulation can easily be revealed. You decide to apply CUMULVS to the simulation program to see what's going wrong. In your CFD airflow program you add a few special declarations so that CUMULVS knows what's in your simulation and where. You give your simulation the name "flow" so your viewer can find it. You describe the main computational data fields, pressure and density. You also declare a few of the steerable parameters, like the Mach number and the wing's angle of attack. After recompiling your program, you are ready to go.
This time, you start up your simulation and can immediately attach a CUMULVS viewer to see what's happening. You request the main pressure data field; it's huge, but you want to see an overview of the whole array anyway. You tell CUMULVS to view the entire region of computation but at a coarse granularity, showing only every tenth data point. Using this smaller collection of data points for viewing greatly reduces the intrusion to your simulation, and the load on your network, while exploring such an immense dataset. The CUMULVS view of pressure appears as requested, and you begin to watch it slowly change as the simulation proceeds. From this high-level view, you can already see that something isn't quite right. It looks like the angle of attack of the wing is off by a mile. But it turns out to be a simple program bug, and you fix it in no time.
Now you're ready to try again, so you start up the simulation and connect up with CUMULVS. Your wing looks much better now, so you disconnect your viewer and let the simulation run while you go out to lunch. When you get back, you connect up again to see how things are going. The simulation is working, getting closer to an answer, but you can see that the performance of the wing will not be as good as was hoped. While watching through your viewer, you tell CUMULVS to adjust one of the wing model parameters to see if you can improve the design. After a moment you see the changes to the wing appear in your viewer, but it takes several more iterations before the effects of the new information begin to be seen in the simulation results.
After tediously tweaking your model over the next few hours, you decide that your simulation program is just too darn slow. You need to split your simulation program into smaller, independent pieces that can run simultaneously, or in parallel, so you can get the job done faster. With several different computers all working together on the problem, your simulation program might run in minutes instead of hours. You use a system like the MPI message-passing standard or PVM, the Parallel Virtual Machine system developed jointly by Emory University, ORNL, and the University of Tennessee under the leadership of Al Geist of ORNL and Jack Dongarra of ORNL and the University of Tennessee, both of whom have also been instrumental in the development of MPI. Either of these systems allows you to write a parallel program. You parallelize the CFD algorithm by breaking the original calculation down into cells and then assigning sets of these cells to a collection of parallel tasks that will cooperate to solve your problem. After each task finishes its iteration of work, the tasks talk to each other, sending messages among themselves to share intermediate results on the way to a solution.
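As a concrete but deliberately generic illustration of this style of programming (it is not the article's actual CFD code), the following C fragment uses the standard MPI interface to give each task a block of cells and to exchange boundary values with neighboring tasks after every iteration:

    /* Illustrative MPI sketch, not the article's CFD code: each task owns a
     * contiguous block of cells plus one "ghost" cell on each side, and after
     * every iteration it swaps boundary values with its neighbors. */
    #include <mpi.h>

    #define LOCAL_N 1000                    /* cells owned by each task */

    int main(int argc, char **argv)
    {
        int rank, size;
        double cells[LOCAL_N + 2];          /* [0] and [LOCAL_N+1] are ghosts */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
        int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

        for (int i = 0; i < LOCAL_N + 2; i++)
            cells[i] = 0.0;                 /* initial condition */

        for (int iter = 0; iter < 100; iter++) {
            /* ... update cells[1..LOCAL_N] from their neighbors here ... */

            /* Send my rightmost cell to the right neighbor while receiving
             * its leftmost cell into my left ghost, and vice versa. */
            MPI_Sendrecv(&cells[LOCAL_N], 1, MPI_DOUBLE, right, 0,
                         &cells[0],       1, MPI_DOUBLE, left,  0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Sendrecv(&cells[1],           1, MPI_DOUBLE, left,  1,
                         &cells[LOCAL_N + 1], 1, MPI_DOUBLE, right, 1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }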
You start up a run of your new parallel program on a few workstations, and sure enough, you get the answer back in a fraction of the time. But things are way off again, and it's worse than before. There's a huge region of turbulence off the tip of the wing. Something has gone wrong with the way you reorganized your simulation program. "Now what?!" you exclaim. It's time for CUMULVS again.
Parallel programming is notoriously difficult because it lacks the single thread of control you would have in a conventional serial program. In addition, the data used in your parallel computation are probably spread across a number of distributed computer systems. This is done to capitalize on locality by leveraging faster local data accesses against more costly remote data accesses. Often, the data decompositions that make your parallel program the fastest are the ones that are the most complicated. CUMULVS helps a great deal with these complex data decompositions because it un-jumbles the data and presents them to the scientist in their original form, as if all of the data were present on a single computer.
To help CUMULVS collect the parallel data in your wing simulation, you need to enhance the special declarations that you made for the serial version. This effort involves defining the way each data field has been broken apart and distributed among the parallel tasks. After adding a few extra lines of decomposition declarations for CUMULVS, you are ready to go.
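Continuing the earlier hypothetical sketch (the cumulvs_* names are again invented, not the real CUMULVS interface), a decomposition declaration amounts to each task telling the library which slice of the global array it owns, so that viewers can be shown one seamless field:

    /* Sketch only, continuing the invented API from the earlier fragment.
     * Each parallel task reports which global indices of the field it owns;
     * with that map the library can reassemble one seamless global array
     * for the viewers. */
    void cumulvs_declare_decomp(const char *field_name, const int *global_dims,
                                const int *my_lo, const int *my_hi);

    void declare_my_block(int rank, int ntasks)
    {
        int global_dims[2] = { 200, 100 };
        int rows_per_task  = 200 / ntasks;      /* simple block split on rows */

        int lo[2] = { rank * rows_per_task, 0 };
        int hi[2] = { (rank + 1) * rows_per_task - 1, 99 };

        cumulvs_declare_decomp("pressure", global_dims, lo, hi);
    }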
You start up a fresh parallel wing simulation and then attach your CUMULVS viewer to see what is happening. Sure enough, almost immediately you begin to see turbulence form off the wing tip. But it's still not clear to you why it's there. Something must be wrong with the parallel algorithm. You go back into your simulation code and add some CUMULVS declarations for the residuals array. This array doesn't represent the physical model, but instead describes the error associated with your mathematical computation. Viewing the residuals data field with CUMULVS should indicate whether there is an algorithmic problem.
This time when you connect your viewer, you request the residuals data field. "Aha!" you declare. It looks like the residual error in one corner of your computation space is stuck. One column of the space is not being updated from iteration to iteration, causing a fixed boundary condition that appears as turbulence in the model. You must have figured your array bounds wrong for that pesky parallel data decomposition.
With the bug fixed, you are now up to full speed and back to the task at hand: designing that airplane wing. The parallel simulation runs much faster than the serial version, and you are able to try a wide range of variations on your design very quickly. After a few days of successful experimentation, suddenly everything clicks, and you arrive at a solid arrangement of the wing parameters. You excitedly call up one of your colleagues and ask her to take a look at your design. She hooks up her CUMULVS viewer to the parallel simulation running on your workstation cluster and checks out your new design. "This looks great!" she says. She decides to try some minor enhancements, so she takes the last CUMULVS checkpoint saved for your simulation and reconfigures it to run on her big 512-node multiprocessor system.
She picks up where you left off on your workstation cluster and begins to explore some slight adjustments to your model. She steers one wing parameter, just a little bit, to smooth off a final rough edge, and then calls you up to look at the simulation. You connect another viewer to her simulation and immediately agree that the wing is even better. The two of you contact the rest of the team, and everybody brings up their CUMULVS viewers to take a look at the new wing design. Everyone agrees that it is time to build a physical prototype of the wing. Success at last!
Fig. 1. CUMULVS view for a computational fluid dynamics (CFD) simulation of the airflow over a jet airplane wing. This demonstration won ORNL a silver medal for innovation at the Supercomputing '96 conference.
Fig. 2. CUMULVS view of the residuals field for the CFD simulation in Fig. 1. The spike in the corner reveals an omitted computation cell in the parallel decomposition.
This type of CUMULVS functionality was recently put to the test as part of the High Performance Computing Challenge competition of the Supercomputing '96 conference held in November 1996 in Pittsburgh. CUMULVS won a silver medal for innovation for its contribution to high-performance scientific computing. The ORNL team, consisting of James Kohl, Philip Papadopoulos, Dave Semeraro, and Al Geist, demonstrated a collaborative visualization of three-dimensional airflow over an aircraft wing (much like the hypothetical scenario described above). Figure 1 illustrates a CUMULVS view of the running CFD airflow simulation over a wing. Figure 2 shows a view of the residuals field that helped Dave Semeraro debug the simulation.
In the High Performance Computing Challenge at the Supercomputing '95 conference, CUMULVS won the award for best interface and fault tolerance. Beyond winning awards, though, the hope is that CUMULVS will win over many scientists.
JAMES ARTHUR KOHL is a staff scientist in the Mathematical Sciences Section of ORNL's Computer Science and Mathematics Division, which he joined in 1993. He received a Ph.D. degree in electrical and computer engineering from the University of Iowa, and he holds B.S. and M.S. degrees in this field from Purdue University. He was involved in visualization projects for parallel programming at the IBM T. J. Watson Research Center in 1992 and at DOE's Argonne National Laboratory, where he worked from 1983 through 1990 on regular short-term appointments. He is currently a member of the Parallel Virtual Machine (PVM) research group at ORNL, where he developed the widely used XPVM visualization interface. In 1996 Kohl received a Division Director's Award from the Computer Science and Mathematics Division for innovative work on user interfaces and reliability in distributed computing environments (the preliminary CUMULVS work). For three consecutive years, he has won recognition on ORNL teams in Heterogeneous and High-Performance Computing Challenges at Supercomputing conferences. He was recently selected as editor of PT Digest, a new weekly e-mail newsletter for parallel software tools, a joint project by ORNL and the University of Tennessee sponsored by the National HPCC Software Exchange. His research interests include program visualization, user interface design, and parallel computer architecture and software development. Kohl has been a member of the IEEE Computer Society, ACM SIGARCH, SIGSOFT, and SIGCHI. He is a member of the Order of the Engineer, Eta Kappa Nu, Tau Beta Pi, Phi Kappa Phi, Golden Key, Phi Eta Sigma, and Mensa International.