Abstract
Next-generation science workflows are expected to be executed over complex federations composed of supercomputers, science instruments, storage systems, and networks, with new additions of edge and cloud systems and services. The sheer complexity of these multi-domain federations makes them hard to manage and optimize, since small impedance mismatches that develop dynamically between systems can drastically degrade the performance of the entire federation. The recent proliferation of Software-Defined Everything (SDX) technologies, combined with containerization frameworks, provides custom instruments that can monitor and collect critical measurements at various levels to support diagnosis and performance optimization; however, the resulting data are far too voluminous for human operators and analysts to process and turn into decisions. Machine Learning (ML) methods that extract critical parameters, relationships, and trends from these data offer general solutions. Artificial Intelligence (AI) and ML methods must be custom-developed for these problems on solid, rigorous foundations, since black-box approaches are often ineffective and unsound.
We propose to develop comprehensive AI-Science methods for the performance of science federations that (i) monitor and control storage, networks, experiments, and computing systems across multiple domains via softwarization layers, at speeds and scales orders of magnitude beyond current practice, (ii) optimally realize and orchestrate complex workflows with high performance by using dynamic state and performance estimation methods, and (iii) aggregate measurements across sites and over time to develop infrastructure-level profiles, optimizations, and diagnoses using AI-Science grounded in foundational principles from the areas of ML, game theory, and information fusion.