Skip to main content

Understanding and Leveraging the I/O Patterns of Emerging Machine Learning Analytics...

Publication Type
Conference Paper
Book Title
Driving Scientific and Engineering Discoveries Through the Integration of Experiment, Big Data, and Modeling and Simulation
Publication Date
Page Numbers
119 to 138
Publisher Location
Cham, Switzerland
Conference Name
Smoky Mountains Computational Science and Engineering Conference (SMC)
Conference Location
Kingsport, Tennessee, United States of America
Conference Sponsor
UT-Battelle and DOE
Conference Date

The scientific community is currently experiencing unprecedented amounts of data generated by cutting-edge science facilities. Soon facilities will be producing up to 1 PB/s which will force scientist to use more autonomous techniques to learn from the data. The adoption of machine learning methods, like deep learning techniques, in large-scale workflows comes with a shift in the workflow’s computational and I/O patterns. These changes often include iterative processes and model architecture searches, in which datasets are analyzed multiple times in different formats with different model configurations in order to find accurate, reliable and efficient learning models. This shift in behavior brings changes in I/O patterns at the application level as well at the system level. These changes also bring new challenges for the HPC I/O teams, since these patterns contain more complex I/O workloads. In this paper we discuss the I/O patterns experienced by emerging analytical codes that rely on machine learning algorithms and highlight the challenges in designing efficient I/O transfers for such workflows. We comment on how to leverage the data access patterns in order to fetch in a more efficient way the required input data in the format and order given by the needs of the application and how to optimize the data path between collaborative processes. We will motivate our work and show performance gains with a study case of medical applications.