Understanding Application and System Performance Through System-Wide Monitoring

Bibliographic Details
Published in: 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 1702-1710
Main Authors: Evans, R. Todd; Browne, James C.; Barth, William L.
Format: Conference Proceeding
Language: English
Published: IEEE 01.05.2016
Summary: TACC Stats is a continuous monitoring tool for HPC systems that collects data at the core and process level for every job executing on a monitored system. That data can be aggregated at the system, group, user, application, job, node, or core level. TACC Stats has been in production use for about 5 years and is now used by numerous HPC systems around the world. This paper reports on a major new version of TACC Stats and the additional analyses that can now be accomplished. The data collected now covers a comprehensive range of metrics spanning all system resources, including energy consumption, vectorization, I/O activity, and network activity, as well as a full set of computationally oriented metrics. TACC Stats also includes a new capability that enables online monitoring of the resource-use data it gathers. TACC Stats automatically customizes itself for different chip architectures and has been extended to execute on Cray systems. In addition to describing the new capabilities, we also describe several analyses, some incorporating the new data, such as I/O behavior. These analyses and reports can give insights that help identify performance issues with jobs and applications, diagnose system and job errors, and understand the resource needs of users.
DOI:10.1109/IPDPSW.2016.145