Intelligent Systems and Facilities Group
Addressing system software challenges for scientific instruments and facilities.
Mission Statement
We enable scientific breakthroughs with innovative system software across the heterogeneous computing continuum that spans from scientific instruments and devices at the edge, through cloud computing systems, to supercomputers. Our research advances scientific discovery with smart computing systems and infrastructures; automated and autonomous experiments; quantum computing; and artificial intelligence-driven design, discovery, and evaluation. Our work revolutionizes the scientific computing ecosystem with intelligent tools and autonomous agents that improve the operation of supercomputers, computing facilities, and computing, data, and networking infrastructures. We focus on research and development of intelligent system architectures, software frameworks, and software tools that enable scientists to seamlessly use diverse resources across experimental, computing, and data facilities while ensuring performance portability and efficiency.
Overview
This Oak Ridge National Laboratory (ORNL) research group collaborates with computer, computational, instrument, and domain science experts across national laboratories and universities. It fosters scientific leadership in smart scientific computing systems and infrastructures, autonomous scientific experiments and laboratories, and intelligently operated science facilities to further the U.S. Department of Energy’s (DOE’s) science mission.
- The group is transforming how modern scientific research uses modeling, simulation, and data analytics to predict and understand the outcomes of experiments. The scientific laboratory of the future will have self-driving capabilities with embedded artificial intelligence: the human scientist will define the hypothesis, and the robot scientist will autonomously design, perform, and analyze the experiments. The group’s research in the Interconnected Science Ecosystem (INTERSECT) realizes this vision by federating scientific instruments, computing resources, and data resources so that instrument and computational science workflows can be orchestrated across the instrument-to-cloud-to-supercomputer ecosystem, applying the collective power of DOE science facilities within an Integrated Research Infrastructure. The group created the INTERSECT architecture, consisting of science use case design patterns, a system-of-systems architecture, and a microservice architecture. It designed a Scientific Data Layer that defines and links data assets on storage subsystems. It developed resource management techniques, such as co-scheduling instrument and computing resources (a minimal co-scheduling sketch follows this list). It created capabilities to integrate quantum computing systems into the federated ecosystem.
- Our team leads the Software Tools Ecosystem Project (STEP), DOE’s effort to address a thorny problem: as computers have grown in complexity and scale, using them effectively has become much more difficult. Today’s most powerful systems feature heterogeneous technologies for both computation and storage, and applications rarely take full advantage of machine resources without considerable tuning. Advanced computing tools enable performance gains that dramatically increase the effectiveness of supercomputers. Comprising five universities and two national laboratories, STEP is pioneering advanced software technologies for understanding performance bottlenecks and for runtime mitigation of performance-degrading phenomena (a toy bottleneck analysis appears after this list). For instance, the STEP team recently helped ExaWind, a high-resolution simulation of wind-based power generation systems, speed up its execution by a factor of 24 on ORNL's Frontier supercomputer. When advanced computing tools provide deep insight into the most complicated applications, they unleash the full potential of supercomputers to realize scientific discoveries.
- To realize a sustainable and intelligent scientific infrastructure, our research reimagines energy as a dynamic, adaptive resource across the edge-to-supercomputer continuum. We focus on establishing intelligent systems that optimize power and performance under user-level constraints, creating a contract in which energy use is orchestrated alongside compute resources and guided by both application behavior and user-defined priorities. Current systems operate with limited visibility, relying on static power caps, coarse-grained sensors, and blind over-provisioning (e.g., of clock frequency), which wastes energy and constrains scalability. Our work involves understanding and validating low-level power sensors across large-scale systems (see the power-sampling sketch after this list), identifying hardware control knobs, and integrating them into a runtime-aware framework. We are developing systems that dynamically adjust power settings based on real-time application phases and performance-energy tradeoffs, while minimizing the need for application changes through autotuning and algorithmic adaptations (e.g., mixed precision). In parallel, we explore hardware simulators and construct detailed power-performance models to inform predictive control strategies and support intelligent facility operations. This comprehensive approach enables fully autonomous, adaptive energy management aligned with diverse scientific workflows. By bridging measurement, modeling, and control, our research lays the foundation for the next generation of smart laboratories and supercomputers that operate efficiently, responsively, and with minimal manual tuning, building toward self-managing, data-driven scientific ecosystems.
- The emergence of quantum computing brings new opportunities for scientific discoveries that can exceed the capabilities of classical computing (i.e., quantum advantage). Our research focuses on integrating quantum computing with supercomputing through software tools and resource management capabilities that make effective use of quantum resources in a scientific computing ecosystem. We design and develop a software framework to build and execute hybrid applications; the envisioned software architecture masks heterogeneity while allowing for performance-critical optimizations. Our work emphasizes a separation of concerns through a layered design that includes platform, application, and tool viewpoints. Because coordination and scheduling of quantum computing resources are critically important, we are exploring approaches that balance the constraints of traditional batch-scheduled computers interleaved with remote or on-premises quantum computers (a stub hybrid loop appears after this list). We are also investigating the connection of distributed quantum computers and the communication middleware needed to map lower-level primitives to higher-level programming environments. In keeping with our group’s mission, this research is carried out in concert with facility staff and other quantum computing experts at ORNL to ensure it is well grounded in the current state of research and practice.
- Writing code once and running it on diverse hardware is challenging because of the performance, correctness, and reliability requirements of scientific applications. Our research in high-productivity languages and performance portability focuses on advancing the Julia programming language for high-performance computing. Our research team at ORNL developed JACC (Julia for accelerators), a library that scientists can use to test ideas that exploit CPU/GPU parallel capabilities without modifying their code (see the JACC example after this list). JACC advances the Julia language as it continues to be adopted by science teams for its combination of performance with a rich mathematical, data analysis, and AI ecosystem.
- AI solutions, in particular large language models (LLMs), offer an invaluable opportunity to increase the productivity of portable HPC software development. Our initial work exploring AI-assisted HPC software produced ChatBLAS, the first AI-generated Basic Linear Algebra Subprograms (BLAS) implementation, which demonstrated the potential of the approach. That work has expanded into “ChatHPC,” adding numerical libraries, programming models, parallel I/O, and performance tools. We build on important open-source LLM investments, such as CodeLlama, to fine-tune models for high-performance computing software and to elevate developers’ productivity and trust in AI (a validation-loop sketch appears after this list).
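To make the INTERSECT co-scheduling idea concrete, here is a minimal, hypothetical sketch in Julia of an atomic reservation across an instrument and a compute partition. The types and names (`ResourceRequest`, `co_schedule!`, the resource strings) are illustrative inventions, not the INTERSECT SDK.

```julia
# Hypothetical sketch of INTERSECT-style co-scheduling: reserve an instrument
# time slot and a compute allocation together, or neither.

struct ResourceRequest
    resource::String      # e.g., "neutron-beamline-3" or "gpu-partition-a"
    start_time::Float64   # epoch seconds
    duration_s::Float64
end

mutable struct Scheduler
    reservations::Vector{ResourceRequest}
end

# A resource is free if no existing reservation overlaps the requested window.
function is_free(s::Scheduler, r::ResourceRequest)
    all(x -> x.resource != r.resource ||
             r.start_time >= x.start_time + x.duration_s ||
             x.start_time >= r.start_time + r.duration_s, s.reservations)
end

# Co-scheduling: commit all requests atomically, or none of them.
function co_schedule!(s::Scheduler, requests::Vector{ResourceRequest})
    all(r -> is_free(s, r), requests) || return false
    append!(s.reservations, requests)
    return true
end

sched = Scheduler(ResourceRequest[])
ok = co_schedule!(sched, [
    ResourceRequest("neutron-beamline-3", 0.0, 3600.0),  # instrument time
    ResourceRequest("gpu-partition-a", 0.0, 3600.0),     # analysis compute
])
println(ok ? "instrument and compute reserved together" : "request rejected")
```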
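As a toy illustration of the kind of bottleneck analysis that STEP-style performance tools automate (this is not a STEP tool), the snippet below times a streaming triad and reports bandwidth, throughput, and arithmetic intensity; the low flop/byte ratio identifies the kernel as memory-bandwidth bound.

```julia
# Streaming triad: 2 flops per 24 bytes moved, so it cannot approach peak
# compute and is limited by memory bandwidth instead.

function triad!(a, b, c, s)
    @inbounds @simd for i in eachindex(a, b, c)
        a[i] = b[i] + s * c[i]
    end
end

n = 50_000_000
a, b, c = zeros(n), rand(n), rand(n)
triad!(a, b, c, 2.0)                      # warm up / force compilation
t = @elapsed triad!(a, b, c, 2.0)

bytes = 3 * 8 * n                          # read b, read c, write a
flops = 2 * n                              # one add, one multiply per element
println("bandwidth: ", round(bytes / t / 1e9; digits=1), " GB/s")
println("compute:   ", round(flops / t / 1e9; digits=2), " GFLOP/s")
println("arithmetic intensity: ", round(flops / bytes; digits=3), " flop/byte")
```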
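The energy research starts from low-level telemetry. Below is a minimal sketch of sampling package power through the Linux powercap interface for Intel RAPL; it assumes a readable `/sys/class/powercap/intel-rapl:0` directory, and paths and units differ across vendors, which is precisely why sensor validation matters.

```julia
# Sample average package power over a short window via Linux powercap/RAPL.
# Requires Linux with Intel RAPL exposed and read permission on the sysfs files.

const RAPL = "/sys/class/powercap/intel-rapl:0"

read_uj(f) = parse(Int, strip(read(joinpath(RAPL, f), String)))

function sample_power_watts(window_s = 1.0)
    maxrange = read_uj("max_energy_range_uj")
    e0 = read_uj("energy_uj")
    sleep(window_s)
    e1 = read_uj("energy_uj")
    delta = e1 >= e0 ? e1 - e0 : e1 + maxrange - e0   # handle counter wraparound
    return delta / 1e6 / window_s                      # microjoules -> watts
end

println("package power: ", round(sample_power_watts(); digits=1), " W")
```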
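For the hybrid quantum-classical work, the sketch below shows the shape of the loop our framework must support: a classical optimizer repeatedly dispatching evaluations to a quantum resource. `quantum_expectation` is a stand-in stub (a noisy classical function), not a call into our framework or any real device.

```julia
# Stub "quantum device": a noisy expectation value with minimum at theta = pi/2.
quantum_expectation(theta) = cos(theta)^2 + 0.02 * randn()

# Each iteration interleaves cheap classical updates with (simulated) quantum
# evaluations -- the pattern that drives batch/quantum co-scheduling needs.
function hybrid_minimize(theta; step = 0.1, iters = 50)
    for _ in 1:iters
        e_plus  = quantum_expectation(theta + step)
        e_minus = quantum_expectation(theta - step)
        theta += e_minus > e_plus ? step : -step   # move toward lower energy
    end
    return theta, quantum_expectation(theta)
end

theta, energy = hybrid_minimize(0.3)
println("theta = ", round(theta; digits=3), ", <E> = ", round(energy; digits=3))
```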
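A minimal JACC example, adapted from the examples in the JACC.jl repository (https://github.com/JuliaORNL/JACC.jl); the API may have evolved, so treat this as a sketch. The same kernel runs on CPU threads or on NVIDIA, AMD, or Intel GPUs depending on the configured JACC backend, with no source change.

```julia
import JACC

# Kernel body: JACC supplies the index i; the backend decides how to map it
# to CPU threads or GPU threads.
function axpy(i, alpha, x, y)
    @inbounds x[i] += alpha * y[i]
end

N = 1_000_000
x = JACC.Array(ones(Float64, N))   # arrays placed on the active backend
y = JACC.Array(ones(Float64, N))
JACC.parallel_for(N, axpy, 2.5, x, y)
```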
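Finally, a hypothetical sketch of the validate-before-trust loop that AI-generated kernels should pass before being admitted into a library. `llm_generate` stands in for a call to a fine-tuned model; none of these names come from ChatBLAS or ChatHPC.

```julia
using LinearAlgebra

# Stand-in for model output: a candidate dot-product implementation.
llm_generate(prompt::String) = (x, y) -> sum(x .* y)

# Accept a generated kernel only if it matches a trusted reference across
# many randomized trials.
function validate(candidate, reference; trials = 100, n = 256, rtol = 1e-12)
    for _ in 1:trials
        x, y = randn(n), randn(n)
        isapprox(candidate(x, y), reference(x, y); rtol) || return false
    end
    return true
end

candidate = llm_generate("Write a Julia BLAS-1 dot product")
println(validate(candidate, LinearAlgebra.dot) ? "accepted" : "rejected")
```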