Research Highlight

3D Coded SUMMA: Communication-Efficient and Robust Parallel Matrix Multiplication

The Science

A team of researchers from ORNL, Carnegie Mellon University (CMU), University of California, Berkeley (UCB), and Penn State University (PSU) developed a novel algorithm for resilient and communication-efficient parallel matrix multiplication in HPC systems.

The algorithm, known as 3D Coded SUMMA:

  • performs communication-efficient parallel matrix multiplication and recovers from compute node failures by adding redundancy through coded computation.
  • requires 50% less redundancy than traditional replication and has an execution time overhead of only 5-10%.
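
To make the idea of coded computation concrete, here is a minimal sketch in plain Python. It is not the 3D Coded SUMMA algorithm itself: it assumes a much simpler single-parity (checksum) code over two row blocks of A, with three simulated workers, and every name in it (`coded_multiply`, `failed_worker`, and the helper functions) is hypothetical. It only illustrates how a coded block can stand in for a lost result without duplicating every worker.

```python
# Illustrative sketch only: a single-parity (checksum) code over row blocks of A.
# This is NOT the 3D Coded SUMMA algorithm; all names and sizes are hypothetical.

def matmul(X, Y):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def add(X, Y):
    """Element-wise sum of two equally sized matrices."""
    return [[a + b for a, b in zip(r, s)] for r, s in zip(X, Y)]

def sub(X, Y):
    """Element-wise difference of two equally sized matrices."""
    return [[a - b for a, b in zip(r, s)] for r, s in zip(X, Y)]

def coded_multiply(A, B, failed_worker=None):
    """Compute A times B with three simulated workers, tolerating one failure.

    Assumes A has an even number of rows so it splits into two equal blocks.
    """
    half = len(A) // 2
    A1, A2 = A[:half], A[half:]       # two data blocks
    blocks = [A1, A2, add(A1, A2)]    # third block is the parity A1 + A2

    # Each worker multiplies its block by B; a failed worker returns nothing.
    results = [None if i == failed_worker else matmul(blk, B)
               for i, blk in enumerate(blocks)]
    C1, C2, P = results

    # Since (A1 + A2) B = A1 B + A2 B, a missing data result is the parity
    # result minus the surviving data result.
    if C1 is None:
        C1 = sub(P, C2)
    elif C2 is None:
        C2 = sub(P, C1)
    return C1 + C2                    # stack the row blocks back together

A = [[1, 2], [3, 4], [5, 6], [7, 8]]
B = [[1, 0], [2, 1]]
expected = matmul(A, B)
recovered = coded_multiply(A, B, failed_worker=1)
```

In this toy setup the parity worker adds one extra block for every two data blocks, whereas duplicating each worker would add two. The redundancy and overhead figures quoted above come from the paper's own scheme, not from this sketch.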

The Impact

Current HPC strategies for obtaining timely and efficient results (through checkpoint/restart, algorithm-based fault tolerance, etc.) may not provide the necessary failure tolerance at a reasonable cost in systems that experience high failure rates.

The developed algorithm:

  • offers a new capability for such systems by providing the failure tolerance of traditional redundant computing at significantly lower cost.
  • applies the latest advances in coding theory to failure-tolerant computing, opening up an entirely new area of research.

PI(s): Pulkit Grover (CMU) and Christian Engelmann (ORNL)
ASCR Program/Facility: Early Career
Funding: DOE/ASCR for ORNL, NSF for CMU, UCB, and PSU
Publication: Haewon Jeong, Yaoqing Yang, Christian Engelmann, Vipul Gupta, Tze Meng Low, Pulkit Grover, Viveck Cadambe, and Kannan Ramchandran. 3D Coded SUMMA: Communication-Efficient and Robust Parallel Matrix Multiplication. In Lecture Notes in Computer Science: Proceedings of the 26th European Conference on Parallel and Distributed Computing (Euro-Par) 2020, Warsaw, Poland, August 24-28, 2020 DOI: