“Endless” Workload Analysis of Large-Scale Supercomputers

Bibliographic Details
Published in Lobachevskii Journal of Mathematics, Vol. 42, no. 1, pp. 184-194
Main Authors Shvets, P. A., Voevodin, V. V.
Format Journal Article
Language English
Published Moscow Pleiades Publishing 2021
Springer Nature B.V
Summary: Modern supercomputers are so large and complex that some of their hardware components inevitably fail from time to time. Supercomputer systems therefore require constant and careful health monitoring, and such monitoring is part of everyday practice at any large HPC center. However, close attention should also be paid to the quality of supercomputer usage, that is, how fully and efficiently computational resources are utilized. This task is still far from solved, so system administrators of most supercomputers know very little about the quality of their job flow or about possible ways to improve it. In this paper, we present a looped report system that makes it possible to obtain and analyze information at any level of detail about all important aspects of supercomputer workload quality, from overall system functioning down to individual job launches. It provides great flexibility by offering an “endless” number of workload analysis scenarios, which makes it possible to determine the root causes of various cases of performance degradation using the same approach. The report system is built upon the previously developed TASC software package, which is aimed at identifying and analyzing performance issues both at the level of individual parallel applications and at the level of the entire supercomputer.
ISSN: 1995-0802, 1818-9962
DOI:10.1134/S1995080221010236