Skip to main content
SHARE
Publication

3D Coded SUMMA: Communication-Efficient and Robust Parallel Matrix Multiplication...

Publication Type
Conference Paper
Book Title
Lecture Notes in Computer Science: Proceedings of the 26th European Conference on Parallel and Distributed Computing (Euro-Par) 2020
Publication Date
Page Numbers
392 to 407
Volume
12247
Publisher Location
Switzerland
Conference Name
26th European Conference on Parallel Processing (Euro-Par) 2020
Conference Location
Warsaw, Poland
Conference Sponsor
N/A
Conference Date
-

In this paper, we propose a novel fault-tolerant parallel matrix multiplication algorithm called 3D Coded SUMMA that achieves higher failure-tolerance than replication-based schemes for the same amount of redundancy. This work bridges the gap between recent developments in coded computing and fault-tolerance in high-performance computing (HPC). The core idea of coded computing is the same as algorithm-based fault-tolerance (ABFT), which is weaving redundancy in the computation using error-correcting codes. In particular, we show that MatDot codes, an innovative code construction for parallel matrix multiplications, can be integrated into three-dimensional SUMMA (Scalable Universal Matrix Multiplication Algorithm [30]) in a communication-avoiding manner. To tolerate any two node failures, the proposed 3D Coded SUMMA requires ∼50% less redundancy than replication, while the overhead in execution time is only about 5–10%.