Abstract
Heterogeneous computing with accelerators is growing in importance in high performance computing (HPC), deep learning (DL), and other areas. Recently, application datasets have expanded beyond the memory capacity of these accelerators, and often beyond the capacity of their hosts. Meanwhile, non-volatile memory (NVM) storage has emerged as a pervasive component of nearly all computing systems, including HPC systems, because NVM provides massive memory capacity at affordable cost and power. Currently, accelerator applications that use NVM must manually orchestrate data movement across multiple memories. This effort typically requires careful restructuring of the application, and it performs well only for applications with simple data-access patterns. To address this issue, we have developed DRAGON, a solution that enables all classes of GP-GPU applications to transparently compute on terabyte datasets residing in NVM, while ensuring the integrity of data buffers stored in NVM. DRAGON leverages the page-faulting mechanism on recent NVIDIA GPUs by extending the capabilities of CUDA Unified Memory (UM). Further, DRAGON improves overall performance by dynamically optimizing accesses to NVM. We empirically evaluate DRAGON on an NVIDIA P100 GPU and a 2.4 TB Micron 9100 NVMe card using traditional HPC kernels and popular DL workloads. Our experimental results show that DRAGON transparently expands memory capacity and exploits Linux's page-cache mechanism to achieve speedups of up to 2.3x over CUDA-UM by automatically overlapping I/O, data transfer, and computation.