Automating Workload Analysis of Large-Scale Supercomputer Systems

Bibliographic Details
Published in	Lobachevskii Journal of Mathematics, Vol. 42, no. 7, pp. 1547-1559
Main Authors	Shvets, P. A., Voevodin, V. V., Zhumatiy, S. A.
Format	Journal Article
Language	English
Published	Moscow: Pleiades Publishing, 01.07.2021; Springer Nature B.V.
Subjects	Algebra; Analysis; Automation; Efficiency; Geometry; Mathematical Logic and Foundations; Mathematics; Mathematics and Statistics; Probability Theory and Stochastic Processes; Supercomputers; Workload; Workloads; efficiency; monitoring data; data analysis; workload analysis; supercomputing; system software; high-performance computing

Summary:	The architecture of modern supercomputers is extremely complex, so it is exceedingly difficult to monitor and maintain the efficiency of their functioning. Even if it is possible to collect the necessary data on the operation of all important supercomputer components, how can one avoid drowning in this "sea of information" and missing the onset of a critical situation? This requires automating the workload analysis process. One possible solution is to create a set of rules that automatically detect certain critical situations, or cases of a significant decrease in the efficiency of supercomputer functioning, and notify supercomputer administrators about them. Such an approach makes it possible to quickly identify the situations most interesting and important to the administrator, and to correctly prioritize the workload analysis process as a whole. This article describes the process of developing a set of 19 rules, each of which defines a way to detect the onset of a certain critical situation, describes the possible causes of its occurrence, and specifies the criticality of the situation that has arisen. These rules allow monitoring different aspects of supercomputer behavior: the efficiency of using application packages, the operation of the queue system, the load and availability of service servers, the presence of global performance issues in user applications, and the peculiarities of using separate partitions of the supercomputer. The developed rules formed the basis of a software solution that was implemented and evaluated on the petaflop-level Lomonosov-2 supercomputer.
ISSN:	1995-0802, 1818-9962
DOI:	10.1134/S1995080221070210
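
Illustrative sketch (not from the article): the summary says each rule pairs a detection condition with a description of possible causes and a criticality level. One way such a rule set might be represented is shown below. The rule names, metric keys, thresholds, and the criticality scale are all hypothetical and chosen only for illustration; the record does not list the actual 19 rules or describe how the Lomonosov-2 software implements them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A snapshot of monitoring data; the metric names used below are hypothetical.
Metrics = Dict[str, float]


@dataclass(frozen=True)
class Rule:
    name: str
    detect: Callable[[Metrics], bool]  # True when the situation has occurred
    possible_causes: str               # hint shown to the administrator
    criticality: int                   # hypothetical scale: higher is more urgent


# Hypothetical rules and thresholds, loosely following the aspects
# named in the summary (queue system, service servers, partitions).
RULES: List[Rule] = [
    Rule(
        name="queue_wait_too_long",
        detect=lambda m: m.get("avg_queue_wait_hours", 0.0) > 24.0,
        possible_causes="partition overloaded or scheduler misconfigured",
        criticality=2,
    ),
    Rule(
        name="service_server_unavailable",
        detect=lambda m: m.get("service_servers_down", 0.0) >= 1.0,
        possible_causes="hardware failure or network outage",
        criticality=3,
    ),
    Rule(
        name="low_partition_utilization",
        detect=lambda m: m.get("partition_cpu_load", 1.0) < 0.2,
        possible_causes="few submitted jobs or nodes blocked by reservations",
        criticality=1,
    ),
]


def evaluate(rules: List[Rule], metrics: Metrics) -> List[Rule]:
    """Return the triggered rules, most critical first."""
    triggered = [rule for rule in rules if rule.detect(metrics)]
    return sorted(triggered, key=lambda rule: rule.criticality, reverse=True)


if __name__ == "__main__":
    sample = {
        "avg_queue_wait_hours": 30.0,
        "service_servers_down": 1.0,
        "partition_cpu_load": 0.5,
    }
    for rule in evaluate(RULES, sample):
        print(f"[{rule.criticality}] {rule.name}: {rule.possible_causes}")
```

Sorting the triggered rules by criticality reflects the prioritization goal stated in the summary: the administrator sees the most urgent situations first.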