Publication

DISP: Optimizations towards Scalable MPI Startup...

by Huansong Fu, Swaroop S Pophale, Manjunath Gorentla Venkata, Weikuan Yu
Publication Type
Conference Paper
Publication Date
Page Numbers
53 to 62
Publisher Location
Piscataway, New Jersey, United States of America
Conference Name
COM-HPC '16
Conference Location
Salt Lake City, Utah, United States of America
Conference Date
-

Despite the popularity of MPI for high-performance computing, the startup of MPI programs faces a scalability challenge: both execution time and memory consumption increase drastically at scale. We have examined this problem using the Cheetah and Tuned collective modules in Open MPI as representative implementations. Previous improvements for collectives have focused on algorithmic advances and hardware offload. In this paper, we examine the startup cost of the collective module within a communicator and explore various techniques to improve its efficiency and scalability. Accordingly, we have developed a new scalable startup scheme with three internal techniques, namely Delayed Initialization, Module Sharing, and Prediction-based Topology Setup (DISP). Our DISP scheme greatly benefits the collective initialization of the Cheetah module. At the same time, it helps boost the performance of non-collective initialization in the Tuned module. We evaluate the performance of our implementation on the Titan supercomputer at ORNL with up to 4096 processes. The results show that our delayed initialization can speed up the startup of Tuned and Cheetah by an average of 32.0% and 29.2%, respectively; our module sharing can reduce the memory consumption of Tuned and Cheetah by up to 24.1% and 83.5%, respectively; and our prediction-based topology setup can speed up the startup of Cheetah by up to 80%.