The Science
A team of researchers from Oak Ridge National Laboratory (ORNL), Carnegie Mellon University (CMU), University of California, Berkeley (UCB), and Penn State University (PSU) developed a novel algorithm for resilient, communication-efficient parallel matrix multiplication in high-performance computing (HPC) systems.
The algorithm, known as 3D Coded SUMMA:
- performs communication-efficient parallel matrix multiplication and can recover from compute node failures by adding redundancy through coded computation.
- requires 50% less redundancy than traditional replication and incurs an execution-time overhead of only 5-10%.
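The idea of recovering from a node failure through coded computation can be illustrated with a deliberately simplified sketch. This is not the 3D Coded SUMMA algorithm from the publication: it assumes a toy setup with three hypothetical workers, where matrix A is split into two row blocks and a third "parity" block (their sum) is added. The function names and the `failed_worker` parameter are illustrative assumptions. Integer matrices with an even number of rows are assumed so that recovery is exact.

```python
def matmul(X, Y):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def add(X, Y):
    """Element-wise sum of two equally sized matrices."""
    return [[a + b for a, b in zip(r, s)] for r, s in zip(X, Y)]


def sub(X, Y):
    """Element-wise difference of two equally sized matrices."""
    return [[a - b for a, b in zip(r, s)] for r, s in zip(X, Y)]


def coded_matmul(A, B, failed_worker=None):
    """Compute A @ B with one parity block, tolerating one lost worker.

    Worker 0 gets A1, worker 1 gets A2, worker 2 gets the parity A1 + A2.
    `failed_worker` (hypothetical) names the worker whose result is lost.
    Assumes A has an even number of rows.
    """
    half = len(A) // 2
    A1, A2 = A[:half], A[half:]
    tasks = [A1, A2, add(A1, A2)]

    # Simulate the workers; a failed worker returns nothing.
    results = [None if i == failed_worker else matmul(block, B)
               for i, block in enumerate(tasks)]
    C1, C2, Cp = results

    # (A1 + A2) B = A1 B + A2 B, so a missing block is the parity
    # result minus the surviving block.
    if C1 is None:
        C1 = sub(Cp, C2)
    elif C2 is None:
        C2 = sub(Cp, C1)

    # Stack the two row blocks to form the full product.
    return C1 + C2


A = [[1, 2], [3, 4], [5, 6], [7, 8]]
B = [[1, 0], [2, 1]]
expected = matmul(A, B)
for lost in (None, 0, 1, 2):
    assert coded_matmul(A, B, failed_worker=lost) == expected
```

In this toy setup a single parity worker protects two data workers, whereas plain replication would need a duplicate of each data worker to tolerate the same single failure. The figures reported above come from the published algorithm, not from this sketch.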
The Impact
Current HPC strategies for obtaining timely and efficient results (through checkpoint/restart, algorithm-based fault tolerance, etc.) may not provide the necessary failure tolerance at a reasonable cost in systems that experience high failure rates.
The developed algorithm:
- offers a new capability for such systems by providing the failure tolerance of traditional redundant computing at significantly lower cost.
- applies the latest advances in coding theory to failure-tolerant computing, opening up an entirely new area of research.
PI(s): Pulkit Grover (CMU) and Christian Engelmann (ORNL)
ASCR Program/Facility: Early Career
Funding: DOE/ASCR for ORNL, NSF for CMU, UCB, and PSU
Publication: Haewon Jeong, Yaoqing Yang, Christian Engelmann, Vipul Gupta, Tze Meng Low, Pulkit Grover, Viveck Cadambe, and Kannan Ramchandran. 3D Coded SUMMA: Communication-Efficient and Robust Parallel Matrix Multiplication. In Lecture Notes in Computer Science: Proceedings of the 26th European Conference on Parallel and Distributed Computing (Euro-Par) 2020, Warsaw, Poland, August 24-28, 2020. DOI: https://doi.org/10.1007/978-3-030-57675-2_25.