Achievement
As supercomputers become exponentially more powerful, their energy consumption has emerged as a major operational and environmental concern. This research contributes to the understanding of how to measure and analyze the energy used by graphics processing units (GPUs), the powerful computing chips at the heart of modern supercomputers, when running two different large-scale scientific simulations. By providing a detailed, application-level view of power traces and energy performance, the work offers crucial insights for co-designing more sustainable and cost-effective high-performance computing (HPC) systems for science applications. The research team, led by ORNL in collaboration with the University of Wisconsin-Madison and the University of California, Davis, performed a direct comparison of the energy performance of three high-end GPUs: two from NVIDIA (the A100 and the newer H100) and one from AMD (the MI250X).
To create a realistic testing environment, the team ran two distinct types of scientific software that are widely used on the world's most powerful supercomputers: QMCPACK, which simulates the structure of materials from first principles using Quantum Monte Carlo techniques, and AMReX-Castro, which models astrophysical phenomena such as exploding stars using adaptive mesh refinement (AMR) methods. A key element of the study was quantifying the energy savings achieved by a faster, lower-precision calculation method ("mixed precision") relative to the standard, high-precision (double-precision) method, demonstrating a practical path toward greater efficiency; the basic trade-off is sketched below. These detailed measurements provide a data-driven foundation for building the next generation of energy-efficient supercomputers.
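To make the precision trade-off concrete, here is a minimal NumPy sketch, not taken from QMCPACK or AMReX-Castro, showing one reason lower precision can save energy: single-precision (float32) arrays move half the bytes of double-precision (float64) arrays, so memory-bound kernels finish sooner and draw power for less time. The array sizes and the kernel itself are arbitrary illustrative choices.

```python
# Hypothetical illustration of the double- vs. mixed-precision trade-off.
# float32 values occupy 4 bytes instead of float64's 8, halving the memory
# traffic of a bandwidth-bound update like the one below.
import numpy as np

n = 10_000_000

for dtype in (np.float64, np.float32):
    a = np.ones(n, dtype=dtype)
    b = np.full(n, 2.0, dtype=dtype)
    c = a + 0.5 * b                      # simple memory-bound update
    bytes_moved = 3 * n * a.itemsize     # read a, read b, write c
    print(f"{np.dtype(dtype).name}: ~{bytes_moved / 1e6:.0f} MB of traffic")
```

In a true mixed-precision code, only the error-tolerant parts of the calculation are demoted to single precision while sensitive accumulations stay in double precision, preserving the scientific result.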
Significance and Impact
Understanding the specific energy demands of major scientific applications is critical for shaping the future of high-performance computing. This research moves beyond traditional performance metrics like speed to answer a more pressing question: how can we maximize scientific discovery for every joule of energy consumed? The work's primary contributions provide tangible benefits for hardware designers, software developers, and the scientific community.
- Informing Future Supercomputer Design: This analysis provides crucial data for the "co-design" of next-generation supercomputer architectures. By understanding how real-world applications use energy, hardware vendors and system architects can create more balanced and energy-efficient systems from the ground up.
- Demonstrating Energy Savings: The study quantifies significant energy savings, ranging from 6% to 25% for the materials science code and up to 45% for the astrophysics code when run on NVIDIA GPUs. The analysis highlighted that this benefit was not observed on the AMD hardware tested, pointing to important differences in how different architectures handle mixed-precision workloads.
- Creating New Performance Metrics: The research proposes and applies a "science-per-energy" metric (e.g., scientific throughput per Joule). This shifts the focus from just achieving the fastest time-to-solution to optimizing the amount of scientific progress achieved per unit of energy consumed, promoting a more sustainable approach to computation (see the sketch following this list).
- Evaluating Vendor Tools: The work identified specific gaps in the monitoring tools for AMD GPUs, noting that reported utilization fluctuated rapidly in ways that did not match the workload, making it harder for developers to optimize their code. This feedback is invaluable for both application developers and the hardware vendor in refining the supporting software ecosystem.
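The following short sketch illustrates the science-per-energy idea. The function name, inputs, and numbers are illustrative assumptions, not the paper's definitions: "work_done" stands in for whatever application-specific throughput measure is appropriate, such as Monte Carlo samples or simulation time steps completed.

```python
# Minimal sketch of a "science-per-energy" figure of merit (illustrative,
# not the paper's exact formulation).

def science_per_joule(work_done: float, energy_joules: float) -> float:
    """Scientific throughput achieved per Joule of GPU energy consumed."""
    return work_done / energy_joules

# Hypothetical example: two runs complete the same 1.0e6 units of work,
# but the mixed-precision run uses 25% less energy, so it scores higher.
double_prec = science_per_joule(1.0e6, 800_000.0)
mixed_prec = science_per_joule(1.0e6, 600_000.0)
print(f"double: {double_prec:.3f}, mixed: {mixed_prec:.3f} work units/J")
```

Under this metric, a run that is slightly slower but substantially more frugal with energy can rank above the fastest run, which is exactly the shift in perspective the work advocates.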
These findings highlight a path forward where performance and energy efficiency are treated as equally important goals, a necessary step in the evolution of sustainable computing.
Research Details
To achieve their results, the researchers followed a systematic process designed to capture detailed, real-world energy usage data across different hardware and software configurations. The methodology focused on running production-level scientific codes on state-of-the-art hardware and measuring their behavior with precision.
- Ran advanced scientific simulations: The team used two established, large-scale software packages to represent common workloads on supercomputers. QMCPACK was used to simulate the electronic structure of nickel oxide, a representative materials science problem, while AMReX-Castro was used to simulate a Sedov blast wave, a standard benchmark in astrophysics.
- Tested on state-of-the-art hardware: These simulations were executed on three powerful, modern GPUs from leading manufacturers: the NVIDIA A100, the NVIDIA H100, and the AMD MI250X. This allowed for direct comparisons of performance and energy efficiency across different architectures.
- Measured energy use with an open-source tool: The team developed and used a specialized, open-source software tool, HWEnergyTracer.jl, to run alongside the simulations. This tool continuously recorded power draw, temperature, and utilization of the GPUs by querying the manufacturers' own management libraries (an illustrative sketch of this sampling approach follows this list).
- Compared calculation methods: For both simulations, the researchers compared the energy consumed when running with standard, high-precision (double-precision) calculations against a more energy-efficient, mixed-precision approach that uses a combination of single and double precision.
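HWEnergyTracer.jl itself is written in Julia; the Python sketch below only illustrates the general approach of such tools: poll the vendor's management library (here NVIDIA's NML interface via the pynvml bindings) at a fixed interval, then integrate the power samples into an energy total that can be compared across precision modes. The sampling interval, duration, and device index are arbitrary choices for this sketch, and the AMD path would query ROCm's equivalent library instead.

```python
# Illustrative stand-in for the kind of sampling HWEnergyTracer.jl performs;
# not the actual tool. Requires an NVIDIA GPU and the pynvml bindings
# (pip install nvidia-ml-py).
import time
import pynvml

def trace_power(duration_s: float = 10.0, interval_s: float = 0.1):
    """Sample GPU power (watts) at a fixed interval; return (times, watts)."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    times, watts = [], []
    start = time.monotonic()
    try:
        while time.monotonic() - start < duration_s:
            times.append(time.monotonic() - start)
            # nvmlDeviceGetPowerUsage reports milliwatts; convert to watts.
            watts.append(pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0)
            time.sleep(interval_s)
    finally:
        pynvml.nvmlShutdown()
    return times, watts

def energy_joules(times, watts):
    """Trapezoidal integration of the power trace: energy = integral of P(t) dt."""
    return sum(0.5 * (watts[i] + watts[i - 1]) * (times[i] - times[i - 1])
               for i in range(1, len(times)))

# Run the trace alongside an application, then compare totals, e.g.,
# energy_joules(*trace_power()) for double- vs. mixed-precision runs.
```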
This methodical approach ensured that the results were grounded in the real-world conditions experienced at major high-performance computing centers.
Facility
This research used resources of the Oak Ridge Leadership Computing Facility and the Experimental Computing Laboratory at the Oak Ridge National Laboratory.
Sponsor/Funding
This work was primarily supported by the US Department of Energy (DOE), Office of Science, Office of Advanced Scientific Computing Research through EXPRESS: 2023 Exploratory Research for Extreme Scale Science. Additional support for author P.R.C. Kent was provided by the DOE’s Office of Science, Basic Energy Sciences, Materials Sciences and Engineering Division as part of the Computational Materials Sciences Program and the Center for Predictive Simulation of Functional Materials.
Principal Investigator and Team
- Lead Researcher: William F. Godoy, Oak Ridge National Laboratory
- Team: Oscar Hernandez, Paul R. C. Kent, Maria Patrou, Kazi Asifuzzaman, Narasinga Rao Miniskar, Pedro Valero-Lara, Jeffrey S. Vetter (Oak Ridge National Laboratory); Matthew D. Sinclair (University of Wisconsin-Madison); Jason Lowe-Power, Bobby R. Bruce (University of California, Davis).
Citation and DOI
- Citation: William F. Godoy, Oscar Hernandez, Paul R. C. Kent, Maria Patrou, Kazi Asifuzzaman, Narasinga Rao Miniskar, Pedro Valero-Lara, Jeffrey S. Vetter, Matthew D. Sinclair, Jason Lowe-Power, and Bobby R. Bruce. 2025. Characterizing GPU Energy Usage in Exascale-Ready Portable Science Applications. In High Performance Computing: ISC High Performance 2025 International Workshops, Hamburg, Germany, June 10–13, 2025, Revised Selected Papers. Springer-Verlag, Berlin, Heidelberg, 177–190. https://doi.org/10.1007/978-3-032-07612-0_14
- DOI: https://doi.org/10.1007/978-3-032-07612-0_14
Summary
As the financial and environmental costs of supercomputing continue to rise, this research provides an essential blueprint for the co-design of future hardware and software. By profiling two major scientific codes on the latest GPUs from NVIDIA and AMD, the study delivers actionable insights for the high-performance computing community.
The key findings are clear: newer-generation GPUs like the NVIDIA H100 are demonstrably more energy-efficient than their predecessors, and computational strategies such as mixed-precision calculation offer a practical pathway to significant energy savings without compromising core scientific goals. Ultimately, this work provides a methodology for measuring, understanding, and optimizing computational performance, helping to ensure that the future of scientific discovery is not only more powerful but also more sustainable.