Abstract
In this work, we demonstrate the challenges of predicting HPC cluster power consumption in the face of significant temporal skew in power consumption patterns. Predicting large power swings that span several megawatts has significant operational value for HPC centers; however, such prediction is challenging because these events are relatively rare and deviate abruptly or disjointly from average power consumption levels. To study the impact of this challenge, we trained a recurrent neural network (RNN), as a reasonably sophisticated model, to predict power consumption using one year's worth of node power consumption data from the Summit supercomputer located at the Oak Ridge Leadership Computing Facility. By studying the prediction results, we found that although a simple application of RNN models can provide good results for average power consumption levels, it fails to predict the power swings that have greater operational value. Based on these results, we discuss potential next steps for addressing these issues, aiming toward robust use of power prediction techniques in HPC operations.