Achievement
In an era where supercomputers perform at unprecedented speeds, their immense GPU energy consumption has become a critical challenge. Research from Oak Ridge National Laboratory has demonstrated an effective strategy for improving the energy efficiency of these systems. The team showed that limiting the power supplied to a next-generation "superchip" can yield significant GPU energy savings with a manageable impact on performance. Their core achievement was to evaluate and compare analytical methods for identifying the best power settings for different computational tasks within a large-scale scientific application.
The approach is analogous to finding the most fuel-efficient speed for different parts of a journey; just as a car consumes fuel differently on a highway versus a city street, a supercomputer performs different types of calculations that can be optimized for GPU energy use. By running a complex materials science application on a state-of-the-art integrated CPU-GPU platform, the researchers created a decision-making framework to analyze the energy-runtime performance trade-offs for each distinct processing cycle. This work provides a clear methodology for identifying the "sweet spot" where a supercomputer can perform its work with balanced runtime performance and GPU energy consumption, paving the way for more sustainable high-performance computing (HPC).
Significance and Impact
The findings from this research have important implications for the future of high-performance computing, offering a practical pathway toward greater GPU energy efficiency and sustainability in exascale systems. The key insights from this work include:
- Establishes a Method for Significant GPU Energy Savings: Demonstrates that carefully tuning power limits for individual tasks can substantially reduce GPU energy consumption in large-scale scientific applications.
- Informs Future Hardware and Software Strategies: Provides valuable insights into how to manage power on tightly integrated CPU-GPU architectures, guiding future system designs.
- Highlights Task-Specific Optimization: Shows that a one-size-fits-all power setting is inefficient; different computational jobs require different power levels for an optimal balance of speed and GPU energy use.
- Creates a Foundation for Adaptive Power Management: Lays the groundwork for developing automated tools that can dynamically adjust power caps in real-time, enhancing overall supercomputer efficiency.
This analysis of energy-runtime performance trade-offs provides a foundational methodology for developing the adaptive, automated tools needed for smarter supercomputing environments.
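The point about one-size-fits-all settings can be illustrated with a minimal sketch. All numbers below are synthetic and invented for illustration; they are not measurements from the study, and the task names are hypothetical. The sketch compares the best single power cap for a whole application against caps chosen per task, looking only at GPU energy and ignoring runtime for brevity.

```python
# Synthetic GPU energy (joules) per task at each power cap (watts).
# These values are made up for illustration, not taken from the paper.
energy_j = {
    "task_a": {200: 3600.0, 400: 3000.0, 600: 3300.0, 1000: 4000.0},
    "task_b": {200: 1500.0, 400: 1900.0, 600: 2400.0, 1000: 3000.0},
}

# Every task was "measured" at the same set of caps.
caps = sorted(next(iter(energy_j.values())))


def total_energy(cap_for_task):
    """Sum GPU energy over all tasks, given the cap chosen for each task."""
    return sum(energy_j[task][cap_for_task[task]] for task in energy_j)


# One-size-fits-all: a single cap applied to every task.
best_global_cap = min(caps, key=lambda c: total_energy({t: c for t in energy_j}))
global_total = total_energy({t: best_global_cap for t in energy_j})

# Task-specific: each task gets the cap that minimises its own energy.
per_task_caps = {t: min(by_cap, key=by_cap.get) for t, by_cap in energy_j.items()}
per_task_total = total_energy(per_task_caps)

print(f"best single cap: {best_global_cap} W, total {global_total:.0f} J")
print(f"per-task caps:   {per_task_caps}, total {per_task_total:.0f} J")
```

With this synthetic data, the per-task choice uses less total energy than any single cap, because the two tasks reach their lowest energy at different power levels. A practical tool would also weigh the runtime cost of each cap.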
Research Details
To identify opportunities for GPU energy savings, the research team followed a systematic, multi-step process that focused on measuring the real-world impact of power constraints on a scientific application. The high-level methodology involved the following key steps:
- The team ran a complex materials science application on a state-of-the-art computer system featuring an NVIDIA GH200 superchip, which tightly integrates a CPU and GPU.
- They systematically applied a range of power limits to the chip, from a low setting of 200 watts up to the default maximum of 1,000 watts.
- During each run, they used specialized tools to precisely measure GPU energy consumption and execution time for the application's key computational tasks.
- Finally, they used two different analytical methods to evaluate the trade-offs between performance and GPU energy savings, identifying the most efficient power setting for each distinct task.
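The last step, choosing a power setting per task from measured runtime and energy, can be sketched as follows. The summary above does not name the two analytical methods the team used, so this sketch substitutes two common selection rules as assumptions: the energy-delay product (EDP) and a minimum-energy choice under a slowdown budget. The `Run` type, the task names, and all measurements are hypothetical and synthetic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    cap_w: int        # power cap applied during the run, in watts
    runtime_s: float  # measured task runtime, in seconds
    energy_j: float   # measured GPU energy for the task, in joules


def best_by_edp(runs):
    """Assumed rule 1: pick the run with the lowest energy-delay product."""
    return min(runs, key=lambda r: r.energy_j * r.runtime_s)


def best_by_slowdown_budget(runs, max_slowdown=0.10):
    """Assumed rule 2: pick the lowest-energy run whose runtime stays within
    max_slowdown (a fraction) of the fastest observed runtime."""
    fastest = min(r.runtime_s for r in runs)
    allowed = [r for r in runs if r.runtime_s <= fastest * (1.0 + max_slowdown)]
    return min(allowed, key=lambda r: r.energy_j)


# Synthetic measurements, invented for illustration only.
measurements = {
    "task_a": [Run(200, 20.0, 3600.0), Run(400, 12.0, 3000.0),
               Run(600, 10.5, 3300.0), Run(1000, 10.0, 4000.0)],
    "task_b": [Run(200, 8.4, 1500.0), Run(400, 8.1, 1900.0),
               Run(600, 8.0, 2400.0), Run(1000, 8.0, 3000.0)],
}

for task, runs in measurements.items():
    edp_choice = best_by_edp(runs)
    budget_choice = best_by_slowdown_budget(runs, max_slowdown=0.25)
    print(f"{task}: EDP picks {edp_choice.cap_w} W, "
          f"25% slowdown budget picks {budget_choice.cap_w} W")
```

On this synthetic data the two rules agree for one task and disagree for the other, which is why comparing selection metrics matters before settling on a cap for each task.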
Research Context and Support
This research was made possible through the collaboration of a dedicated team and the support of leading national scientific institutions and funding bodies.
Facility
- Oak Ridge Leadership Computing Facility at Oak Ridge National Laboratory.
Sponsor/Funding
- US Department of Energy’s Office of Science, Advanced Scientific Computing Research program. Oak Ridge National Laboratory is managed by UT-Battelle LLC under contract DE-AC05-00OR22725 with the US Department of Energy (DOE).
Lead Author and Team
- Lead Author: Maria Patrou, Oak Ridge National Laboratory
- Team: Thomas Wang (Camas High School); Wael Elwasif, Markus Eisenbach, Ross Miller, William Godoy, and Oscar Hernandez (Oak Ridge National Laboratory).
Citation and DOI
- Citation: Maria Patrou, Thomas Wang, Wael Elwasif, Markus Eisenbach, Ross Miller, William Godoy, and Oscar Hernandez. 2025. Power-Capping Metric Evaluation for Improving Energy Efficiency in HPC Applications. In High Performance Computing: ISC High Performance 2025 International Workshops, Hamburg, Germany, June 10–13, 2025, Revised Selected Papers. Springer-Verlag, Berlin, Heidelberg, 231–244. https://doi.org/10.1007/978-3-032-07612-0_18
- DOI: https://doi.org/10.1007/978-3-032-07612-0_18
Summary
This research demonstrates that applying custom, task-specific power limits to modern superchips is an effective strategy for saving significant amounts of GPU energy in high-performance computing. By analyzing the relationship between power, performance, and GPU energy consumption, the study shows that a one-size-fits-all approach is inefficient. Instead, fine-grained power management, tailored to the demands of each computational task, can yield substantial efficiency gains, often with a manageable impact on computational speed. The resulting methodology for balancing speed against GPU energy use contributes to sustainability and intelligent resource management in the exascale computing era, and it lays the groundwork for automated systems that make this trade-off dynamically.