New project seeks to ensure resilience in workflow management systems
Future computational workflows will span distributed research infrastructures that include multiple instruments, resources, and facilities to support and accelerate scientific discovery. However, the diversity and distributed nature of these resources makes harnessing their full potential difficult. To address this problem, a team of researchers from the University of Southern California (USC), the Renaissance Computing Institute (RENCI) at the University of North Carolina, and Oak Ridge, Lawrence Berkeley and Argonne National Laboratories have received a grant from the U.S. Department of Energy (DOE)’s Advanced Scientific Computing Research Program to develop the fundamentals of a computational platform that is fault tolerant, robust to various environmental conditions and adaptive to workloads and resource availability. The grant is planned for five years and includes $8.75 million of funding.
“Researchers are faced with challenges at all levels of current distributed systems, including application code failures, authentication errors, network problems, workflow system failures, filesystem and storage failures and hardware malfunctions,” said Ewa Deelman, research professor, research director at the USC Information Sciences Institute and the project PI. “Making the computational platform performant and resilient is essential for empowering DOE researchers to achieve their scientific pursuits in an efficient and productive manner.”
Swarm intelligence
Key to the project is swarm intelligence, a term derived from the behavior of social animals (e.g., ants) that collectively achieve success by working in groups. Swarm Intelligence, or SI, in computing refers to a class of artificial intelligence (AI) methods used to design and develop distributed systems that emulate the desirable features of these social animals – flexibility, robustness and scalability.
“In Swarm Intelligence, agents currently have limited computing and communication capabilities and can suffer from slow convergence and suboptimal decisions,” said Prasanna Balaprakash, director of AI programs and distinguished R&D staff scientist at Oak Ridge, and co-PI of the newly funded project. “Our aim is to enhance traditional SI-based control and autonomy methods by exploiting advancements in AI techniques and in high-performance computing.”
The enhanced metasystem, called SWARM (Scientific Workflow Applications on Resilient Metasystem), will enable robust execution of DOE-relevant scientific workflows such as astronomy, genomics, molecular dynamics and weather modeling across a continuum of resources – from edge devices near sensors and instruments through wide-area networks to leadership-class systems.
Distributed workflows and challenges
The project develops a distributed approach to workflow development and profiling. The research team assumes that DOE scientists will submit jobs and workflows to a distributed workload pool. Once a set of workflows becomes available in the workflow pool, the agents need to estimate each task’s characteristics, which determine the resource requirements. To improve the estimation cost, the team will compare the performance and prediction capabilities of analytical and ML models in different settings such as online continual learning. “Such methods balance the accuracy of a given model on new data while retaining useful information from previous data. The research will include mathematically rigorous performance modeling and online continual learning methods.” remarked Krishnan Raghavan, an assistant computational mathematician in Argonne’s Mathematics and Computer Science division and a co-PI of SWARM.
In SWARM there is no central controller: the agents must reach a consensus on the best resource allocation. “In imitation of biological swarms, we will investigate how coalitions can adapt to various fault tolerance strategies and can reassign tasks if necessary,” said Argonne senior computer scientist Franck Cappello, who is leading the development efforts on fault recovery and adaptation algorithms. Here the agents will coordinate decision-making for optimal resource allocation while minimizing communication between agents such as by formation of hierarchies and by adoption of adaptive communication strategies.
Evaluation
To demonstrate the efficacy of the swarm intelligence-inspired approach, the team will evaluate the method by swarm simulations at both the micro and macro level, by emulation and prototyping on testbeds. A variety of real-world DOE scientific workflows will be tested – from instrument workflows involving telescope and light source data to domain simulation workflows that perform molecular dynamics simulations.
“Of particular interest are edge and instrument-in-the-loop computing workflows,” said co-PI Anirban Mandal, assistant director for network research and infrastructure at RENCI. “We expect a growing role for automation of these workflows. With these essential tools, DOE scientists will be more productive and the time to discovery will be decreased”. “In this project, we will re-imagine how workflows can be managed to improve both compute and networking at micro and macro levels”, said Mariam Kiran, Group Leader for Quantum Communications and Networking at ORNL.
Making an impact
With innovative SI solutions, AI, resilience, compute, and network systems research, Deelman and her team are confident that their research will make an impact on how science networks and infrastructure should communicate and cooperate to support robust DOE scientific workflow executions.
UT-Battelle manages ORNL for DOE’s Office of Science, the single largest supporter of basic research in the physical sciences in the United States. The Office of Science is working to address some of the most pressing challenges of our time. For more information, please visit energy.gov/science.