Finally, A Way to Measure Frontend I/O Performance...

by Christopher J Zimmer, Veronica G Melesse Vergara, Saurabh Gupta
Publication Type
Conference Paper
Publication Date
Conference Name
Cray Users Group 2016
Conference Location
London, United Kingdom
Conference Date

Identifying sources of variability in the Spider 2
file system on Titan is challenging because it spans multiple
networks with layers of hardware performing various functions
to fulfill the needs of the parallel file system. Several efforts
have targeted file system monitoring, but they have focused only on
metric logging for the storage side of the file system. In
this work, we enhance that view by designing and deploying
a low-impact network congestion monitoring system tailored
to the I/O routers deployed on service nodes
within the Titan Cray XK7 Gemini network. To the best of our
knowledge, this is the first tool that provides live monitoring
of performance bottlenecks at the I/O router
level. Our studies show a high correlation between I/O router
congestion and I/O bandwidth. Ultimately, we plan to use
this tool for I/O hotspot identification within Titan and for guided
scheduling of large I/O.