Strategies for Integrating Deep Learning Surrogate Models with HPC Simulation Applications...

by Junqi Yin, Feiyi Wang, Mallikarjun Shankar
Publication Type
Conference Paper
Book Title
2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
Page Numbers
1256 to 1265
Publisher Location
New Jersey, United States of America
Conference Name
ExSAIS 2022: Workshop on Extreme Scaling of AI for Science, co-Located with IPDPS 2022
Conference Location
Lyon, France

The emerging trend of the convergence of high performance computing (HPC), machine learning/deep learning (ML/DL), and big data analytics presents a host of challenges for large-scale computing campaigns that seek best practices to interleave traditional scientific simulation-based workloads with ML/DL models. A portfolio of systematic approaches to incorporate deep learning into modeling and simulation serves a vital need when we support AI for science at a computing facility. In this paper, we evaluate several strategies for deploying deep learning surrogate models in a representative physics application on supercomputers at the Oak Ridge Leadership Computing Facility (OLCF). We discuss a set of recommended deployment architectures and implementation approaches. We analyze and evaluate these alternatives and show their performance and scalability up to 1000 GPUs on two mainstream platforms equipped with different deep learning hardware and software stacks.