Publication

Machine Learning Assisted HPC Workload Trace Generation for Leadership Scale Storage Systems...

by Arnab Kumar Paul, Jong Youl Choi, Ahmad Maroof Karimi, Feiyi Wang
Publication Type
Conference Paper
Journal Name
The 31st International Symposium on High-Performance Parallel and Distributed Computing
Book Title
HPDC '22: Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing
Publication Date
Page Numbers
199 to 212
Publisher Location
New York, United States of America
Conference Name
The 31st International Symposium on High-Performance Parallel and Distributed Computing (HPDC)
Conference Location
Minneapolis, Minnesota, United States of America
Conference Sponsor
ACM SIGARCH, ACM SIGHPC, University of Minnesota
Conference Date
-

Monitoring and analyzing a wide range of I/O activities in an HPC cluster is important for maintaining mission-critical performance in a large-scale, multi-user, parallel storage system. Center-wide I/O traces can provide both high-level information and fine-grained activity per application or per user running in the system. Studying such large-scale traces can provide helpful insights into the system. They can be used to develop predictive methods for making decisions, adjusting scheduling policies, or informing the design of next-generation systems. However, sharing real-world I/O traces to expedite such research raises two concerns: i) the traces are costly to share because of their large size, and ii) they pose privacy risks.

We address these issues by building an end-to-end machine learning (ML) workflow that can generate I/O traces for large-scale HPC applications. We leverage ML-based feature selection and generative models for I/O trace generation. The generative models are trained on I/O traces collected by the Darshan I/O characterization tool over a period of one year. We present a two-step generation process consisting of two deep-learning models, called the feature generator and the trace generator. Combining the two generative models provides robustness by reducing the bias of the model and accounting for the stochastic nature of the I/O traces across different runs of an application. We evaluate the performance of the generative models and show that the two-step model can generate time-series I/O traces with less than 20% root mean square error.
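
To make the reported error metric concrete, the following minimal sketch computes a root mean square error expressed as a percentage for a pair of time series. The abstract does not say how the percentage is normalized; dividing by the value range of the real series is an assumption made here for illustration, and the function name `percent_rmse` and the sample values are hypothetical rather than taken from the paper.

```python
import math


def percent_rmse(real, generated):
    """RMSE between two equal-length series, as a percentage of the
    real series' value range (normalization choice is an assumption)."""
    if len(real) != len(generated) or not real:
        raise ValueError("series must be non-empty and of equal length")
    # Mean of the squared point-wise differences.
    mse = sum((r - g) ** 2 for r, g in zip(real, generated)) / len(real)
    value_range = max(real) - min(real)
    if value_range == 0:
        raise ValueError("real series is constant; range normalization is undefined")
    return 100.0 * math.sqrt(mse) / value_range


# Hypothetical values standing in for a real and a generated I/O time series.
real_trace = [0.0, 10.0, 20.0, 10.0]
generated_trace = [1.0, 9.0, 21.0, 9.0]
error = percent_rmse(real_trace, generated_trace)
```

Other normalizations, such as dividing by the mean of the real series, would give different percentages for the same pair of traces, so the figure is only comparable across studies that use the same convention.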