Publication

STREAM: A Scalable Federated HPC Telemetry Platform...

by Ryan M Adamson, Timothy D Osborne, Corwin P Lester, Rachel L Palumbo
Publication Type
Conference Paper
Book Title
CUG Conference Proceedings
Publication Date
Page Numbers
1 to 6
Issue
1
Publisher Location
United States of America
Conference Name
Cray User Group 2023 (CUG)
Conference Location
Helsinki, Finland
Conference Sponsor
Various
Conference Date
-

Obtaining and analyzing high performance computing (HPC) telemetry in real time is a complex task that can impact algorithmic performance, operating costs, and ultimately scientific outcomes. For organizations that operate multiple HPC systems, filesystems, and clusters, telemetry streams can be synthesized to ease the operational and analytics burden. To collect this telemetry, the Oak Ridge Leadership Computing Facility (OLCF) has deployed STREAM (Streaming Telemetry for Resource Events, Analytics, and Monitoring), a distributed, high-performance message bus based on Apache Kafka. STREAM collects center-wide performance information and must interface with many sources, including five HPE-deployed supercomputers, each with its own Kafka cluster managed by HPCM. OLCF supercomputers and their attached scratch filesystems currently send more than 300 million messages to over 200 topics, producing around 1.3 terabytes of telemetry data per day to STREAM. This paper describes the architectural principles that enable STREAM to be both resilient and highly performant while supporting multiple upstream Kafka clusters and other data sources. It also discusses the design challenges and decisions faced in adapting our existing system-monitoring infrastructure to support the first Exascale computing platform.