Abstract
Large-scale simulations and scientific experiments produce petabytes of data per run. This poses massive challenges for I/O and storage when scientific analysis workflows are run manually offline. Unsupervised deep-learning techniques that extract patterns and non-linear relations from these large volumes of data offer a way to build scientific understanding from raw data, reducing the need for manual pre-selection of analysis steps, but they require exascale compute and memory to process the full available dataset. In this paper, we demonstrate a heterogeneous streaming workflow in which plasma simulation data is streamed directly to a Machine Learning (ML) application that trains a model on the simulation data in transit, completely circumventing the capacity-constrained filesystem bottleneck. The workflow uses openPMD to provide a high-level interface for describing the scientific data and ADIOS2 to transfer data volumes that exceed the capabilities of the filesystem. We employ experience replay to avoid catastrophic forgetting while learning continually from this non-steady-state process, and we adapt it to improve model convergence during in-transit training. As a proof of concept, we address the ill-posed inverse problem of predicting particle dynamics from radiation in a particle-in-cell (PIConGPU) simulation of the Kelvin-Helmholtz instability (KHI). We detail hardware-software co-design challenges encountered while scaling PIConGPU to full Frontier, the Top-1 system on the June 2024 Top500 list.