“Endless” Workload Analysis of Large-Scale Supercomputers
Published in | Lobachevskii journal of mathematics Vol. 42; no. 1; pp. 184 - 194 |
---|---|
Main Authors | , |
Format | Journal Article |
Language | English |
Published | Moscow, Pleiades Publishing, 2021, Springer Nature B.V |
Subjects | |
Summary: | Modern supercomputers are so large and complex that some of their hardware components inevitably fail from time to time. Supercomputer systems therefore require constant and careful health monitoring, and such monitoring is part of the everyday practice of any large HPC center. However, considerable attention should also be paid to the quality of supercomputer usage, that is, how fully and efficiently computational resources are utilized. This task is still far from solved, so system administrators of most supercomputers know very little about the quality of their supercomputer job flow or about possible ways to improve it. In this paper, we present a looped report system that makes it possible to obtain and analyze information at any level of detail about all important aspects of supercomputer workload quality, from overall system functioning down to individual job launches. It provides great flexibility by offering an “endless” number of workload analysis scenarios, which makes it possible to determine the root causes of various cases of performance degradation using the same approach. This report system is built upon the previously developed TASC software package, which is aimed at identifying and analyzing performance issues both at the level of individual parallel applications and at the level of the entire supercomputer. |
---|---|
ISSN: | 1995-0802 1818-9962 |
DOI: | 10.1134/S1995080221010236 |