HVAC: Removing I/O Bottleneck for Large-Scale Deep Learning Applications

Publication Type

Conference Paper

Book Title

2022 IEEE International Conference on Cluster Computing (CLUSTER)

Publication Date

September, 2022

Page Numbers

324 to 335

Publisher Location

New Jersey, United States of America

Conference Name

IEEE International Conference on Cluster Computing (CLUSTER 2022)

Conference Location

Heidelberg, Germany

Conference Sponsor

IEEE

Conference Date

Sep 6, 2022 - Sep 9, 2022

Abstract

Scientific communities are increasingly adopting deep learning (DL) models in their applications to accelerate scientific discovery. However, with rapid growth in the computing capabilities of HPC supercomputers, large-scale DL applications have to spend a significant portion of training time performing I/O to a parallel storage system. Previous research has investigated optimization techniques such as prefetching and caching. Unfortunately, there are non-trivial challenges to adopting the existing solutions on HPC supercomputers for large-scale DL training applications, including poor performance and/or failures at extreme scale, lack of portability and generality in design, complex deployment methodology, and being limited to a specific application or dataset. To address these challenges, we propose High-Velocity AI Cache (HVAC), a distributed read-cache layer that targets and fully exploits node-local or near node-local storage technology. HVAC seamlessly accelerates read I/O by aggregating node-local or near node-local storage, avoiding metadata lookups and file locking while preserving portability in the application code. We deploy and evaluate HVAC on 1,024 nodes (with over 6,000 NVIDIA V100 GPUs) of the Summit supercomputer. In particular, we evaluate the scalability, efficiency, accuracy, and load distribution of HVAC compared to GPFS and XFS-on-NVMe. With four different DL applications, we observe an average 25% performance improvement atop GPFS and a 9% drop against XFS-on-NVMe, which scales linearly and is considered the performance upper bound. We envision HVAC as an important caching library for upcoming HPC supercomputers such as Frontier.
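
The abstract does not say how HVAC locates a cached file. As an illustration only, one common way for a distributed read cache to avoid a metadata lookup is deterministic hash placement: every client computes the same owner node from the file path alone. The sketch below assumes that approach; `owner_node` and `load_distribution` are hypothetical helper names, not part of HVAC.

```python
import hashlib


def owner_node(path: str, num_nodes: int) -> int:
    """Map a file path to a node index (hypothetical helper, not HVAC's API).

    Every client that hashes the same path gets the same answer, so no
    central metadata service has to be consulted to find the cached copy.
    """
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_nodes


def load_distribution(paths, num_nodes):
    """Count how many paths land on each node under hash placement."""
    counts = [0] * num_nodes
    for p in paths:
        counts[owner_node(p, num_nodes)] += 1
    return counts


if __name__ == "__main__":
    # Illustrative file names; the real datasets are not described here.
    paths = [f"/dataset/train/sample_{i:06d}.bin" for i in range(10000)]
    print(load_distribution(paths, 8))
```

Because placement depends only on the path and the node count, the per-node counts give a rough view of load distribution, one of the properties the abstract says was evaluated.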
