Abstract
The proliferation of GPU-based HPC systems calls for a method of assessing simulation performance on heterogeneous machines. Adding GPUs to a system introduces multiple hierarchical levels of parallelism into the node architecture. In this paper, we demonstrate that the traditional load imbalance metric is insufficient for capturing load imbalance on GPU-based machines, because it treats the GPU as a monolithic entity and ignores its internal parallelism. We propose a new hierarchical metric that improves the correlation between measured performance and application workload by up to 20.61%. When our metric replaces the traditional metric as the input to the load balancing algorithm, the residual load imbalance in our application is reduced by up to 4×.