Abstract
The operation of large US Department of Energy (DOE) research facilities, such as the DIII-D National Fusion Facility, results in the collection of complex multi-dimensional scientific datasets, both experimental and model-generated. In the future, it is envisioned that integrated data analysis coupled with large-scale high performance computing (HPC) simulations will be used to improve experimental planning and operation. In practice, massive datasets from these simulations provide the physics basis for generating both reduced semi-analytic and machine-learning-based models. Storage of both HPC simulation datasets (generated at US DOE leadership computing facilities) and experimental datasets presents significant challenges. In this paper, we present a vision for a DOE-wide data management workflow that integrates US DOE fusion facilities with leadership computing facilities. Data persistence and long-term availability beyond the lifetime of allocated projects are essential, particularly for verification and recalibration of artificial intelligence and machine learning (AI/ML) models. Because these datasets are often generated and shared among hundreds of users across multiple leadership computing facilities, they would benefit from cross-platform accessibility, persistent identifiers (e.g. digital object identifiers, DOIs), and provenance tracking. The need to handle different data access patterns suggests that a combination of low-cost, high-latency systems (e.g. for storing ML training sets) and high-cost, low-latency systems (e.g. for real-time, integrated machine control feedback) may be needed.