Software Components to Facilitate
By Jack Dongarra, Noel Nachtigal, Esmond Ng, Barry Peyton, Bill Shelton, and David Walker
The authors in the library of the Computer Science and Mathematics Building: sitting from left are Jack Dongarra, Barry Peyton, and David Walker; standing from left are Bill Shelton, Esmond Ng, and Noel Nachtigal. Photograph by Tom Cerniglio.
ORNL researchers have developed computationally efficient linear algebra packages that are portable across different computer architectures, including workstations and massively parallel processors. These packages include ScaLAPACK, which is a parallel version of the LAPACK library, an iterative methods library known as the quasi-minimal residual package (QMRPACK), and several sparse matrix solvers. These computational tools are used in several important scientific and engineering applications, resulting in state-of-the-art algorithms that perform simulations in a fraction of the time of the original solvers they replaced. These applications range from first-principles electronic structure codes that are being developed as part of the DOE Grand Challenge in materials science to codes used in the automotive and aerospace industries.
Sophisticated mathematical software libraries relieve the scientist of the burden of having to be an expert in parallel programming, numerical methods, and computer science. Grand Challenge application codes make use of these software libraries to solve complex mathematical equations on sequential and parallel computers. Very often these equations can be formulated as linear algebra problems, so most of the software components that we have developed perform linear algebra computations. Numerical linear algebra is the branch of applied mathematics that deals with the computational aspects of matricesorderly arrays of symbols by rows and columns.
In this article, we discuss the main features and uses of a parallel library for dense linear algebra computations (ScaLAPACK), a package based on the quasi-minimal residual algorithm (QMRPACK), and sparse linear algebra techniques. These software components, which were developed at ORNL, provide a common interface, thus making the codes highly portable between different computer architectures ranging from personal computers and workstations to massively parallel platforms. The algorithms in these software libraries are numerically stable and highly reliable, and they are optimized to achieve the highest possible computational performance.
A good example of the use of these software components is the large system multiple scattering (LSMS) method, whose continuing development is part of a Grand Challenge application in materials science; it is sponsored by DOEs High-Performance Computation and Communications Program (HPCCP). Several of the software libraries have been incorporated into the most computationally intensive sections of the LSMS method, resulting in a highly efficient algorithm. These sections include the calculation of the inverse of a large matrix, solutions of a system of linear equations, and several matrix operations such as dot products, matrix-vector, and matrix-matrix multiplication.
A Parallel Dense Linear Algebra Library
Software libraries are widely used on personal computers, workstations, and conventional supercomputers, so it is natural to want to use the same libraries on parallel supercomputers. The use of software libraries makes it easier to build complex application programs and ensures that these applications are portable between different types of computers. Unfortunately, few high-quality software libraries are available for parallel computers, making applications more difficult to program and often resulting in duplication of effort. The Linear Algebra PACKage (LAPACK) is a popular library of dense linear algebra routines that is used on personal computers, workstations, and conventional super-computers. The Scalable LAPACK (ScaLAPACK) project is developing a software library for performing dense linear algebra computations on parallel computers. The term scalable refers to the ability of the software to make effective use of additional computational resources as the problem size increases; in a highly scalable algorithm, computational efficiency is preserved as the problem size and number of processors increase together. ScaLAPACK and LAPACK both perform dense linear algebra computations. These sorts of computations involve matrices in which most of the entries are nonzero, contrasting with sparse matrices, discussed in the next section, in which a significant fraction of the entries are zero. ScaLAPACK is designed to achieve good performance while being easy to use and simple to port between different computers. Dense linear algebra is important in a number of areas, particularly in the DOE Grand Challenge in materials science, where it has been used to investigate magnetic moment formation and stability in magnet materials. These types of materials are used in magnetic motors, data storage devices, and recording media, representing billions of dollars in annual revenues. In addition, dense linear algebra problems arise in boundary integral methods used in computing water flow past ships and ocean structures (such as oil rigs) and in solving electromagnetic scattering problems. Both LAPACK and ScaLAPACK are being used to help computational scientists solve these and other Grand Challenge problems.
The memory of high-performance computers consists of several hierarchical layers, each with a different access time. Usually layers that have more memory take longer to access. The uppermost levels are the registers that can be accessed most rapidly. The next levels are cached memory and main memory. In addition, on a parallel computer, nonlocal memory residing on a remote processor represents another memory layer.
The performance of an application depends on how data are accessed in these different layers. To get good performance, we want to move as little data as possible between different layers of memory. Where data movement is necessary, we move it in a few big chunks, rather than in many small chunks. These considerations are central to the design of ScaLAPACK and LAPACK, thus leading to the formulation of most library routines as block-partitioned algorithms. In block-partitioned algorithms, most of the work involves operations on two or more matrices, resulting in the efficient use of the memory hierarchy.
A dense linear algebra problem typically involves matrices and vectors. On a parallel computer, these data must be split up and assigned to different processors so that each processor can perform its particular part of the problem. The data distribution must be done carefully to make sure that each processor has the same amount of work to do in each phase of an application. If some processors have more work to do, then other processors may have to wait while they “catch up.” This situation, known as load imbalance, results in poorer performance. In ScaLAPACK, matrices and vectors are distributed over processors using a block-cyclic data distribution. In the block-cyclic data distribution of a matrix, the data assigned to a particular process are not contiguous but are scattered in a regular way over the whole matrix. This arrangement ensures that good load balance is maintained in the major parts of an algorithm.
Good performance is not the only goal in designing a software library. It must also be portable and easy to use. One problem in using distributed memory parallel computers is that each processor has its own local memory that is separate from the other processors; therefore, it knows only about its own data—that is, the local portions of the matrices and vectors that have been assigned to it by the data distribution. A processor cannot directly refer to part of a matrix on another processor.
ScaLAPACK is easy to use because matrices and vectors are referenced as global objects. Thus, when referring to a particular matrix entry, we can use global indices, rather than specifying it by giving the processor number and the local indices. This makes programming with ScaLAPACK much easier, and the resulting code looks very similar to the sequential version using LAPACK.
ScaLAPACK addresses the issue of portability by constructing the library routines out of lower-level components. Most computations on a single processor are performed using the basic linear algebra subprograms (BLAS). These routines are widely available, and on many platforms have been optimized to give very high performance. Message passing between processors is done using the basic linear algebra communication subprograms (BLACS). Versions of BLACS have been written based on the parallel virtual machine (PVM) and message-passing interface (MPI), as well as for the native message-passing systems of a number of parallel computers. The parallel BLAS (PBLAS) are parallel versions of most of the operations available in BLAS, and they are based on BLAS and BLACS. PBLAS make it easier for library designers to extend ScaLAPACK to include new algorithms. ScaLAPACK is constructed using BLAS, PBLAS, and BLACS as building blocks; it is therefore portable to essentially any computer that uses PVM or MPI. Since its first release into the public domain in December 1994, more than 1000 copies of the ScaLAPACK software have been distributed. Current work is extending ScaLAPACK to include out-of-core routines and other types of matrices, such as banded matrices. Further information is available on the World Wide Web at http://www.netlib.org/scalapack/.
Sparse Matrices and Structural Dynamics
When an automobile, airplane, space shuttle, or rocket is in motion, it inevitably undergoes stresses and strains induced by a significant amount of vibration. Can a next-generation automobile, airplane, space shuttle, or rocket be designed to be more reliable than todays models under such conditions? This question is extremely important because the cost of building cars must be very low, while the cost of building aircraft and spacecraft will be very high. How do scientists and engineers study this type of problem? One solution is to build prototypes, perform experiments with the prototypes, and construct models to analyze the experimental results. The experimental results will allow scientists to adjust the parameters of the models so that new and improved prototypes can be built. This iterative process eventually converges on prototypes that will survive the experiments. The ultimate products are then constructed on the basis of the prototypes. This is one essential application of structural dynamics.
Structural dynamics modeling is computationally intensive. We have been working with researchers in the Structural Dynamics and Vibration Group at Sandia National Laboratories to help them improve the efficiency of their modeling effort. Our efforts have met with much success.
The heart of many structural dynamics applications is a numerical linear algebra problem, and the Sandia modeling effort is no exception. Ultimately, the overwhelming majority of the total computer time is spent in a small number of linear algebra routines that solve symmetric, positive, definite systems of linear equations, where the coefficient matrices are large and have relatively few nonzero entries; such matrices are said to be sparse. Routines such as these, which are critical to the overall performance of the modeling software, are known as the key kernels within the software.
Efficient solution of sparse linear systems requires careful exploitation and preservation of the zero entries in the coefficient matrices. The way in which we solve such linear systems is to factorize each coefficient matrix into the product of two triangular matrices. The solution to each of the original linear systems can then be obtained by solving two triangular linear systems. An important note about this approach is that extra nonzero entries, or fill entries, are introduced into the triangular factors during the factorization. Thus, a crucial step in the solution process is to arrange the computation so that the number of fill entries is small.
Variable-banded methods, or profile methods, which attempt to limit fill entries around the diagonal of a matrix, are commonly used to solve the linear systems in many structural analysis applications. Such a linear-system solver was originally incorporated into Sandias structural dynamics modeling code. Unfortunately, these solvers do not make efficient use of either storage or computing time; consequently, the modeling software originally could deal effectively only with models too small to be of practical interest to scientists and engineers.
Fig. 1. Nonzero counts in sparse Cholesky factors.
Fig. 2. Performance of sparse Cholesky factorizations.
Fortunately, however, much better technology now exists to solve linear systems of this kind. General sparse solvers, which employ new techniques to limit the number of fill entries introduced and to take advantage of the sparsity pattern, have proven very effective on modern workstations and vector supercomputers, especially after dense matrix operations have been incorporated into the solution method to exploit the memory hierarchies found on modern uniprocessor machines. Over the past few years, we have further improved the algorithms on which this technology is based, and then we developed a sparse linear-system solver based on the new algorithms. As an illustration of advances in sparse matrix technology, a good implementation of one of the profile methods factors a sparse, symmetric, positive definite matrix of order 19,600 in 51.49 s on an IBM RS/6000 work-station; our solver takes only 3.24 s to factor the same matrix on the same machine. The number of matrix elements stored in the profile method is more than 3 million, but our sparse matrix code stores fewer than 1 million matrix elements. More examples are provided in Figs. 1 and 2. Figure 1 compares the numbers of matrix elements stored and manipulated in a profile method (RCM) and in a method (MD) that uses the minimum-degree algorithm to reduce fill. Figure 2 compares the times required to factor the matrices in RCM and MD; the difference between MD and MD/B is that MD does not exploit dense matrix operations while MD/B does. The matrices were derived from finite element discretizations of a square domain; their dimensions range from 10,000 to 40,000.
We have been working with Sandia researchers to incorporate our new fast, sparse linear-system solver in a large-scale structural dynamics modeling effort. Incorporation of our package into the modeling software enabled it to solve large problems of interest to scientists and engineers. This approach increased the problem size that can be dealt with in an effective manner and reduced significantly the time required to solve almost all problems over the approach originally employed in the software. As a result of the collaboration, studies of the response of structures to stresses and strains induced by heavy vibration can now be done much more quickly and with greater accuracy.
The new code enables Sandia and an industrial partner in California to perform state-of-the-art modeling in a fraction of the time required using the solver it replaced. The solver is the most heavily used module within the modeling software, and because of our input, the amount of time required from a products inception to production could be reduced drastically. Application of Sandias structural dynamics modeling package could increase the competitiveness of a U.S. automaker, improve the design of missiles developed at Sandia, and help assess the safety of the aging bridges found within the U.S. highway system.
Structural dynamics modeling is not the only application that gives rise to large, sparse, symmetric, positive, definite linear systems. Other examples include numerical solution of self-adjoint elliptic partial differential equations and numerical optimization. In particular, our sparse linear-system solver has been used in interior-point methods for solving linear programming problems.
Applications of an Algorithm
One of the major scientific projects at ORNL involves the computation of material properties starting from first principles. The goal is to obtain methods that can predict the material properties of an alloy, for example, using as the only inputs the constitutive elements of the alloy; in other words, no experimental data are used. Such methods are called parameter-free methods, and because they require no experimental inputs, relying only on basic principles of physics to describe the interactions among the atoms making up the system, they are also referred to as ab initio methods. If successful, these methods would allow us to predict properties of a material without having to first build samples of that material, thus reducing considerably the time and money spent designing new technologically advanced materials, such as those required by the power-generation or aerospace industries.
One important class of problems in ab initio materials modeling involves the computation of the minimum energy state of a system of electrons interacting in the field of atomic nuclei using an approach called density functional theory. This computation requires the solution of a set of Schroedinger equations, which in turn can be reduced to solving several linear systems of equations whose matrices describe the interactions among the different particles in the system. These matrices are not particularly large—the largest ones treated had only 2000 to 3000 unknowns—but, unfortunately, they are dense and thus have several million nonzero elements. An overwhelming fraction of the total computational effort is spent solving these linear systems, so it is here that we focused our attention.
The previous approach used to solve these linear systems was to factor the matrix into a product of two triangular matrices, and then use these to obtain the solution to the original system. This approach worked well while the matrices were still small. Unfortunately, the amount of time spent factoring each matrix grows rapidly as more and more atoms are included in the atom clusters: the time grows by a factor of 8 every time the number of atoms in the cluster is doubled. As a result, the technique can no longer be used once the cluster of atoms has grown to around 100.
We replaced this factorization technique with a completely different method, an algorithm from the class of iterative methods for the solution of linear systems, the quasi-minimal residual (QMR) algorithm. Like other iterative methods, the QMR algorithm successively computes approximations to the exact solution of the linear system, each time attempting to solve these linear systems to improve upon the last approximate solution. While it can be shown that the algorithm generates a sequence of approximate solutions that does, in fact, eventually converge to the exact solution, the hope is that in practice, it will find very good approximations after only a few iterations, as each iteration is relatively inexpensive.
Whether this is the case depends on the properties of the matrix. In our case, it turns out that the matrices that describe the interactions among atoms in clusters are very well suited for the QMR algorithm and that the algorithm rapidly gives very good approximations to the exact solution. In fact, for the materials tested, QMR solved the system up to 40 times faster when compared to the previous factorization-based method. Even more important, it turned out that when the number of atoms in the cluster is doubled, the time spent using QMR only doubles, instead of growing by a factor of 8; as a result, much larger cluster sizes can be treated. This advance enables better prediction of material properties and pushes the first-principles methods to systems previously envisioned as unattainable.
We have developed efficient linear algebra packages for use on high-performance workstations and massively parallel computers. These linear algebra packages have been incorporated in important computational-intensive applications, resulting in significant improvements in performance. Our current work involves increasing the capabilities of linear algebra packages that we have developed. These enhancements will enable these packages to treat a much wider class of matrix problems than is currently possible. As a result, necessary computational tools will be provided to an even larger number of researchers, allowing them to concentrate on their research and not on computer science and numerical linear algebra issues. Recent increases in efficiency are already allowing investigators to push their research into areas once thought intractable. By extending their research into new areas, scientists and engineers will provide important technological information that can be used to develop new efficient products, improving the overall competitiveness of U.S. industries.
Necessary computational tools will be provided to an even larger number of researchers, allowing them to concentrate on their research and not on computer science.
JACK DONGARRA was profiled in the previous article Algorithms, Tools, and Software Aid Use of High-Performance Computers.
NOEL NACHTIGAL is a research staff member in the Mathematical Sciences Section of ORNLs Computer Science and Mathematics Division. A native of Romania, Nachtigal obtained his Ph.D. degree in applied mathematics in 1991 from the Massachusetts Institute of Technology. After a 2-year stint as a postdoctoral scientist at the Research Institute for Advanced Computer Science at NASA Ames in Moffett Field, Nachtigal joined the Mathematical Sciences Section as the 1993 Householder Fellow. He is a coauthor of the quasi-minimal residual method for solving non-Hermitian linear systems and of the QMRPACK software package. More recently, he has been involved in research projects in materials science and computational fluid dynamics.
ESMOND G. NG, a native of Kwangtung, China, received his Ph.D. degree in computer science from the University of Waterloo in Waterloo, Ontario, in 1983. He joined ORNL in 1985. He is currently a senior research staff member and a group leader in the Mathematical Sciences Section of the Computer Science and Mathematics Division. Ng is also an adjunct professor in the Department of Computer Science at the University of Tennessee at Knoxville. He serves on the Technical Committee on Computational Linear Algebra of the International Association for Mathematics and Computers in Simulation (IMACS). He is also a member of the IMACS Technical Committee on Parallel- and Supercomputing. His research interests include sparse matrix computations, numerical linear algebra, parallel computing, and mathematical software development and software engineering. He is a co-author of the book Parallel Algorithms for Matrix Computations, which was published in 1990 by SIAM Publications.
BARRY W. PEYTON is a research staff member in the Mathematical Sciences Section of the Computer Science and Mathematics Division. A native of Charlotte, North Carolina, he received his Ph.D. degree in mathematical sciences from Clemson University in 1986. He joined ORNL in 1988. His research interests include sparse matrix computations, numerical linear algebra, parallel computing, and combinatorial algorithms. Peyton was a cowinner of the 1988 Gordon Bell Prize in recognition of the outstanding performance achieved by a parallel version of a sparse matrix code that he codeveloped prior to coming to ORNL. He is a coauthor of the book Parallel Algorithms for Matrix Computations, which was published in 1990 by SIAM Publications.
BILL SHELTON has a biographical sketch located with the article Developing a Grand Challenge Materials Application Code: An Interdisciplinary Approach.
DAVID WALKER is a senior research staff member in the Mathematical Sciences Section of ORNLs Computer Science and Mathematics Division. He holds a B.A. degree in mathematics from Jesus College, Cambridge University, an M.S. degree in astrophysics, and a Ph.D. degree in physics, both from Queen Mary College, University of London. He held postdoctoral appointments at the University of London and at the Jet Propulsion Laboratory. In 1986 he became a staff scientist in the Concurrent Computational Project at the California Institute of Technology, and in 1988 he was appointed to the University of South Carolina mathematics faculty as an associate professor. In 1990 he joined the ORNL staff. His research interests focus on software and algorithms for the scientific use of message-passing computers. He has been closely involved in the development of the ScaLAPACK parallel software library and the MPI message-passing standard. He has also contributed to the design of a parallel version of the Community Climate Model for predicting future climate. Walker has published a number of papers on the parallel implementation of particle-in-cell algorithms for plasma simulations. He has also been involved in the benchmarking of science and engineering applications codes on parallel computers. He has published more than 60 papers on parallel computing and has coauthored three books on the subject. In 1992 he founded a series of conferences on high-performance computing under the auspices of the Gordon Research Conference organization.
Next article | Contents | Search | Mail | Review Home Page | ORNL Home Page