Publication

A Big Data Analytics Framework for HPC Log Data: Three Case Studies Using the Titan Supercomputer Log...

Publication Type
Conference Paper
Book Title
HPCMASPA at 2018 IEEE International Conference on Cluster Computing (CLUSTER)
Publication Date
Page Numbers
571 to 579
Conference Name
IEEE International Conference on Cluster Computing (CLUSTER 2018)
Conference Location
Belfast, United Kingdom
Conference Sponsor
IEEE
Conference Date
-

Reliability, availability, and serviceability (RAS) logs of high performance computing (HPC) resources, when closely investigated in spatial and temporal dimensions, can provide invaluable information regarding system status, performance, and resource utilization. These data are often generated by multiple logging systems and sensors that cover many components of the system. Analyzing these data for persistent temporal and spatial insights faces two main difficulties: the volume of RAS logs makes manual inspection difficult, and the unstructured nature and unique properties of the log data produced by each subsystem add another dimension of difficulty in identifying implicit correlations among recorded events. To address these issues, we recently developed a multi-user Big Data analytics framework for HPC log data at Oak Ridge National Laboratory (ORNL). This paper introduces three in-progress data analytics projects that leverage this framework to assess system status, mine event patterns, and study correlations between user applications and system events. We describe the motivation of each project and detail its workflow using three years of log data collected from ORNL's Titan supercomputer.