Early Experiences Writing Performance Portable OpenMP 4 Codes

by Veronica G Melesse Vergara, Wayne D Joubert, Matthew G Lopez, Oscar R Hernandez Mendoza

Publication Type

Conference Paper

Publication Date

May 2016

Conference Name

Cray User Group Conference 2016

Conference Location

London, United Kingdom

Conference Date

May 8, 2016 - May 12, 2016

Abstract

In this paper, we evaluate the recently available directives in OpenMP 4 to parallelize a computational kernel using both the traditional shared-memory approach and the newer accelerator-targeting capabilities. In addition, we explore various transformations that attempt to increase application performance portability, and we examine the expressiveness and performance implications of these approaches. For example, we want to understand whether the target map directives in OpenMP 4 improve data locality when mapped to a shared-memory system, compared with the first-touch policy used in traditional OpenMP. To that end, we use recent Cray and Intel compilers to measure the performance variations of a simple application kernel when executed on the OLCF’s Titan supercomputer with NVIDIA GPUs and on the Beacon system with attached Intel Xeon Phi accelerators. To better understand these trade-offs, we compare our results from traditional OpenMP shared-memory implementations with the newer accelerator programming model when it is used to target both the CPU and an attached heterogeneous device. We believe the results and lessons learned presented in this paper will be useful to the larger user community by providing guidelines that can assist programmers in developing performance-portable code.
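To make the contrast in the abstract concrete, the following is a minimal sketch of the two programming styles it compares. It is not the kernel studied in the paper: the `axpy_*` functions, the `run_comparison` driver, and the array names are hypothetical and chosen only for illustration. The first variant uses a traditional shared-memory `parallel for`, with arrays initialized in parallel so that a first-touch policy can place pages near the threads that later use them. The second uses the OpenMP 4 `target` construct with `map` clauses. When no accelerator is available, an OpenMP implementation would typically run the target region on the host.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical kernel for illustration only: y[i] = a * x[i] + y[i]. */

/* Traditional shared-memory OpenMP: host threads operate on host arrays. */
void axpy_host(int n, float a, const float *x, float *y)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* OpenMP 4 accelerator model: map clauses state which data moves to the
 * device (x) and which moves both ways (y). */
void axpy_target(int n, float a, const float *x, float *y)
{
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Runs both variants on identical input and returns the number of
 * elements on which they disagree (0 is expected), or -1 if memory
 * allocation fails. */
int run_comparison(int n)
{
    assert(n > 0);

    float *x = malloc((size_t)n * sizeof *x);
    float *y_host = malloc((size_t)n * sizeof *y_host);
    float *y_target = malloc((size_t)n * sizeof *y_target);
    if (!x || !y_host || !y_target) {
        free(x);
        free(y_host);
        free(y_target);
        return -1;
    }

    /* First-touch initialization: the same static schedule as axpy_host,
     * so each thread first touches the pages it will later compute on. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i) {
        x[i] = 1.0f;
        y_host[i] = 2.0f;
        y_target[i] = 2.0f;
    }

    axpy_host(n, 3.0f, x, y_host);
    axpy_target(n, 3.0f, x, y_target);

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (y_host[i] != y_target[i])
            ++mismatches;

    free(x);
    free(y_host);
    free(y_target);
    return mismatches;
}
```

Calling `run_comparison(1 << 20)` from a `main` function and timing each variant separately would be one way to start a comparison of this kind. The pragmas are ignored when OpenMP is not enabled at compile time, in which case both variants run serially and still produce the same result.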
