Concepts for OpenMP Target Offload Resilience

by Christian Engelmann, Geoffroy R Vallee, Swaroop S Pophale

Publication Type

Conference Paper

Book Title

OpenMP: Conquering the Full Hardware Spectrum

Publication Date

August 2019

Page Numbers

78 to 93

Volume

11718

Conference Name

15th International Workshop on OpenMP (IWOMP 2019)

Conference Location

Auckland, New Zealand

Conference Sponsor

N/A

Conference Date

Sep 11, 2019 - Sep 13, 2019

Abstract

Recent reliability issues with one of the fastest supercomputers in the world, Titan at Oak Ridge National Laboratory (ORNL), demonstrated the need for resilience in large-scale heterogeneous computing. OpenMP currently does not address error and failure behavior. This paper takes a first step toward resilience for heterogeneous systems by providing the concepts for resilient OpenMP offload to devices. Using real-world error and failure observations, the paper describes the concepts and terminology for resilient OpenMP target offload, including error and failure classes and resilience strategies. It details the experienced general-purpose computing graphics processing unit (GPGPU) errors and failures in Titan. It further proposes improvements in OpenMP, including a preliminary prototype design, to support resilient offload to devices for efficient handling of errors and failures in heterogeneous high-performance computing (HPC) systems.
