Publication

Understanding the Impact of Interconnect Failures on System Operation...

by Matthew A Ezell
Publication Type
Conference Paper
Conference Name
Cray User Group
Conference Location
Napa Valley, California, United States of America

Hardware failures are inevitable on large high-performance computing systems. Faults or performance degradations in the high-speed network can reduce the performance of the entire system. Since the introduction of the Gemini interconnect, Cray systems have become resilient to many networking faults that were fatal on previous-generation systems. These network reliability and resiliency features have enabled higher uptimes on Cray systems by allowing them to continue running with reduced network performance. Oak Ridge National Laboratory has developed a set of user-level diagnostics that stress the high-speed network and search for components that are not performing as expected. Nearest-neighbor bandwidth tests check every network chip and network link in the system. Additionally, performance counters stored in the network ASIC’s memory-mapped registers (MMRs) are used to better understand the state of the network. Applications have also been characterized under various suboptimal network conditions to better understand the impact that network problems have on user codes.
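
As a rough illustration of how results from a nearest-neighbor bandwidth test might be screened for underperforming links, the sketch below flags any link whose measured bandwidth falls well below the median across all links. It is not the ORNL diagnostic itself: the function name `flag_slow_links`, the 80% threshold, the data layout, and the sample values are all assumptions made for this example.

```python
from statistics import median


def flag_slow_links(bandwidth_gbps, min_fraction=0.8):
    """Return links whose bandwidth is below min_fraction of the median.

    bandwidth_gbps maps a (source, destination) pair to a measured
    bandwidth. Both the mapping layout and the default threshold are
    hypothetical choices for this sketch.
    """
    if not bandwidth_gbps:
        return []
    typical = median(bandwidth_gbps.values())
    cutoff = min_fraction * typical
    # Sort so the report is stable from run to run.
    return sorted(link for link, bw in bandwidth_gbps.items() if bw < cutoff)


# Hypothetical measurements (GB/s) for four neighbor pairs.
measurements = {
    ("n0", "n1"): 5.9,
    ("n1", "n2"): 6.0,
    ("n2", "n3"): 3.1,  # noticeably slower than its peers
    ("n3", "n0"): 6.1,
}

suspects = flag_slow_links(measurements)
print(suspects)
```

Comparing against the median rather than a fixed expected rate keeps the check independent of the hardware's nominal bandwidth, though it assumes that most links are healthy at the time of the test.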