Abstract
Identifying sources of variability in the Spider 2
file system on Titan is challenging because it spans multiple
networks with layers of hardware performing various functions
to fulfill the needs of the parallel file system. Several efforts
have targeted file system monitoring, but they focused only on metric
logging associated with the storage side of the file system. In
this work, we enhance that view by designing and deploying
a low-impact network congestion monitoring system tailored
specifically to the I/O routers that reside on service nodes
within the Titan Cray XK7 Gemini network. To the best of our
knowledge, this is the first tool that provides live
monitoring of performance bottlenecks at the I/O router
level. Our studies show a high correlation between I/O router
congestion and I/O bandwidth. Ultimately, we plan to use
this tool for I/O hotspot identification within Titan and for guided
scheduling of large I/O.