Achievement
Choosing the right number of compute nodes for a job in High-Performance Computing (HPC) is a critical challenge, requiring a difficult balance between job speed and power consumption. To address this, researchers developed an automated framework that uses artificial intelligence to make the decision more effectively. The system’s first innovation is a method that learns from noisy system data by paying "attention" to the most informative signals. Its second is a data-efficient approach that intelligently selects only the most useful data for training. In tests on real-world supercomputer data, the method consistently found better compromises in the runtime-versus-power trade-off than existing techniques.
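A minimal sketch of the attention idea is shown below in PyTorch. The architecture, feature set, and sizes are illustrative assumptions rather than the authors' exact model: each job feature is embedded as a token so that self-attention can learn which noisy signals matter most when predicting runtime and power.

```python
# Hypothetical sketch of an attention-based surrogate. Feature names,
# sizes, and architecture are illustrative assumptions, not the paper's
# exact model. It regresses runtime and power from noisy job features,
# letting attention weights emphasize the most informative ones.
import torch
import torch.nn as nn

class AttentionSurrogate(nn.Module):
    def __init__(self, n_features: int, d_model: int = 32):
        super().__init__()
        # Embed each scalar feature as a token so attention can weigh features.
        self.embed = nn.Linear(1, d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 2),  # outputs: [predicted runtime, predicted power]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_features) -> tokens: (batch, n_features, d_model)
        tokens = self.embed(x.unsqueeze(-1))
        # `weights` can be inspected to see which features the model attends to.
        attended, weights = self.attn(tokens, tokens, tokens)
        return self.head(attended.mean(dim=1))  # pool over feature tokens

# Toy usage: features might be node count, requested cores, input size, etc.
model = AttentionSurrogate(n_features=6)
jobs = torch.randn(128, 6)   # stand-in for noisy historical job records
preds = model(jobs)          # (128, 2): predicted runtime and power
```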
Significance and Impact
- Accelerates Scientific Discovery: Automates the complex and error-prone task of resource configuration, allowing scientists to focus on their research rather than on system tuning.
- Improves HPC System Efficiency: Boosts resource utilization and energy conservation, building on previous work that demonstrated the potential to cut resource usage by 42%.
- Lowers Operational Costs: By improving efficiency and reducing power consumption, the technology helps lower the substantial operational costs of multi-million-dollar supercomputing facilities.
- Provides Actionable Decision Support: Delivers a range of high-quality, balanced options (the Pareto front) to system operators and users, offering flexible choices rather than a single, rigid recommendation.
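To make the Pareto front concrete, here is a minimal sketch of extracting the non-dominated (runtime, power) options from a set of candidate node counts. The candidate values are made-up placeholders, not measured data; in practice they would come from the surrogate's predictions.

```python
# Minimal sketch of extracting the Pareto front over node-count options.
def pareto_front(points):
    """Return points not dominated by any other (minimize both objectives)."""
    front = []
    for i, (n_i, rt_i, pw_i) in enumerate(points):
        dominated = any(
            (rt_j <= rt_i and pw_j <= pw_i) and (rt_j < rt_i or pw_j < pw_i)
            for j, (n_j, rt_j, pw_j) in enumerate(points) if j != i
        )
        if not dominated:
            front.append((n_i, rt_i, pw_i))
    return sorted(front, key=lambda p: p[1])  # sort by runtime

# Placeholder candidates: (nodes, runtime in s, power in kW).
candidates = [(16, 900, 4.2), (32, 520, 7.9), (64, 300, 15.1),
              (128, 210, 31.0), (48, 600, 13.0)]
for nodes, runtime, power in pareto_front(candidates):
    print(f"{nodes:>4} nodes: {runtime:>5.0f} s, {power:>5.1f} kW")
```

Here the 48-node option is filtered out because the 32-node option is both faster and less power-hungry; the remaining points are the flexible menu of balanced choices described above.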
Research Details
- The research was grounded in practice by using real-world data from two production supercomputers, PM100 and Adastra, to ensure the approach was relevant and robust.
- The team developed an AI-driven method that intelligently samples a small but highly informative subset of the available data (a minimal sketch follows this list). This approach significantly reduced training overhead, enabled faster convergence, and produced smoother, more stable trade-off recommendations by suppressing erratic results.
- A core part of the study involved comparing their attention-informed approach against standard methods to determine which could find a better set of compromises between job runtime and power usage.
- The quality and diversity of the trade-off solutions found by each method were measured using standard performance metrics to provide an objective comparison (one such metric is sketched below, after the sampling example).
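The sampling sketch referenced above: one plausible reading of "intelligently samples a small but highly informative subset" is uncertainty-driven selection, shown here with ensemble disagreement as the acquisition signal. The data, model, and batch sizes are illustrative assumptions; the authors' exact criterion may differ.

```python
# Hedged sketch of data-efficient sampling via ensemble disagreement.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(5000, 6))                            # synthetic job features
y_pool = X_pool[:, 0] ** 2 + rng.normal(scale=0.5, size=5000)  # noisy runtime proxy

labeled = list(rng.choice(len(X_pool), size=50, replace=False))
for _ in range(5):                                  # a few acquisition rounds
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X_pool[labeled], y_pool[labeled])
    # Per-tree predictions; high variance = high model uncertainty.
    per_tree = np.stack([t.predict(X_pool) for t in model.estimators_])
    uncertainty = per_tree.std(axis=0)
    uncertainty[labeled] = -np.inf                  # never re-pick chosen rows
    labeled.extend(np.argsort(uncertainty)[-25:])   # take 25 most uncertain

print(f"trained on {len(labeled)} of {len(X_pool)} records")
```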
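And the metric sketch: the hypervolume indicator is one standard way to score a Pareto front's quality and coverage (larger is better when minimizing both objectives). Whether the study uses hypervolume or another metric is an assumption here, and the reference point and fronts below are illustrative.

```python
# Sketch of the 2-D hypervolume indicator for (runtime, power) fronts.
def hypervolume_2d(front, ref):
    """Area dominated by `front` up to reference point `ref` (minimization)."""
    pts = sorted(front)               # ascending runtime; power then descends
    hv, prev_power = 0.0, ref[1]
    for runtime, power in pts:
        # Each point contributes a rectangle between it and the reference corner.
        hv += (ref[0] - runtime) * (prev_power - power)
        prev_power = power
    return hv

front_a = [(300, 15.1), (520, 7.9), (900, 4.2)]   # placeholder fronts
front_b = [(350, 14.0), (600, 9.0)]
ref = (1000, 40.0)                                 # assumed reference point
print(hypervolume_2d(front_a, ref), hypervolume_2d(front_b, ref))
```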
Facility
The analysis was conducted on the Stampede3 supercomputer at the Texas Advanced Computing Center (TACC), using production data from the PM100 and Adastra supercomputers.
Sponsor/Funding
This work was supported by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research program, under contract DE-AC05-00OR22725.
Team
- Ashna Nawar Ahmed, Texas State University
- Banooqa Banday, Texas State University
- Terry Jones, Oak Ridge National Laboratory
- Tanzima Z. Islam, Texas State University
Citation and DOI
Ahmed, A. N., Banday, B., Jones, T., & Islam, T. Z. "Attention-Informed Surrogates for Navigating Power-Performance Trade-offs in HPC." Proceedings of the ML for Systems Workshop (co-located with NeurIPS), San Diego, CA. (to appear)
Summary
This research introduces a data-efficient, AI-driven framework for making smarter scheduling decisions in High-Performance Computing. By using attention-based techniques and intelligent data sampling, the method effectively models the complex trade-off between performance and power. This work paves the way for more sustainable next-generation supercomputing systems that accelerate scientific discovery while minimizing operational costs.