DISP: Optimizations towards Scalable MPI Startup...

by Huansong Fu, Swaroop S Pophale, Manjunath Gorentla Venkata, Weikuan Yu

Publication Type

Conference Paper

Publication Date

November 2016

Page Numbers

53 to 62

Publisher Location

Piscataway, New Jersey, United States of America

Conference Name

COM-HPC '16

Conference Location

Salt Lake City, Utah, United States of America

Conference Date

Nov 13, 2016 - Nov 18, 2016

Abstract

Despite the popularity of MPI for high-performance computing, the startup of MPI programs faces a scalability challenge as both the execution time and memory consumption increase drastically at scale. We have examined this problem using the collective modules of Cheetah and Tuned in Open MPI as representative implementations. Previous improvements for collectives have focused on algorithmic advances and hardware off-load. In this paper, we examine the startup cost of the collective module within a communicator and explore various techniques to improve its efficiency and scalability. Accordingly, we have developed a new scalable startup scheme with three internal techniques, namely Delayed Initialization, Module Sharing, and Prediction-based Topology Setup (DISP). Our DISP scheme greatly benefits the collective initialization of the Cheetah module. At the same time, it helps boost the performance of non-collective initialization in the Tuned module. We evaluate the performance of our implementation on the Titan supercomputer at ORNL with up to 4096 processes. The results show that our delayed initialization can speed up the startup of Tuned and Cheetah by an average of 32.0% and 29.2%, respectively; our module sharing can reduce the memory consumption of Tuned and Cheetah by up to 24.1% and 83.5%, respectively; and our prediction-based topology setup can speed up the startup of Cheetah by up to 80%.
