Automating Workload Analysis of Large-Scale Supercomputer Systems

Bibliographic Details
Published in Lobachevskii Journal of Mathematics, Vol. 42, no. 7, pp. 1547-1559
Main Authors Shvets, P. A., Voevodin, V. V., Zhumatiy, S. A.
Format Journal Article
Language English
Published Moscow Pleiades Publishing 01.07.2021
Springer Nature B.V
Summary: The architecture of modern supercomputers is extremely complex, so it is exceedingly difficult to monitor and maintain the efficiency of their functioning. And even if it is possible to collect the necessary data on the operation of all important supercomputer components, how does one avoid drowning in this "sea of information" and missing the onset of a critical situation? This requires automating the workload analysis process. One possible solution is to create a set of rules that automatically detect certain critical situations, or cases of a significant decrease in the efficiency of supercomputer functioning, and notify supercomputer administrators about them. Such an approach allows the administrator to quickly identify the most interesting and important situations and to correctly prioritize the workload analysis process as a whole. This article describes the process of developing a set of 19 rules, each of which determines a way to detect the onset of a certain critical situation, describes the possible causes of its occurrence, and specifies the criticality of the situation that has arisen. These rules allow monitoring different aspects of supercomputer behavior: the efficiency of using application packages, the operation of the queue system, the load and availability of service servers, the presence of global performance issues in user applications, and the peculiarities of using separate partitions of the supercomputer. The developed rules formed the basis of a software solution that was implemented and evaluated on the petaflop-level Lomonosov-2 supercomputer.
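The summary says each rule combines a detection condition, a description of possible causes, and a criticality level. A minimal Python sketch of that structure could look like the following. It is an illustration only, not the article's implementation: the rule names, metric names (avg_queue_wait_hours, service_server_load), thresholds, and criticality labels are all hypothetical and do not come from the article.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A snapshot of monitoring data: metric name -> value (hypothetical metrics).
Metrics = Dict[str, float]


@dataclass(frozen=True)
class Rule:
    """One rule: how to detect a situation, why it may occur, how critical it is."""
    name: str
    condition: Callable[[Metrics], bool]
    possible_causes: str
    criticality: str  # assumed scale: "low", "medium", "high"


# Two made-up rules; the article's actual 19 rules are not reproduced here.
RULES = [
    Rule(
        name="queue_wait_too_long",
        condition=lambda m: m.get("avg_queue_wait_hours", 0.0) > 24.0,
        possible_causes="Partition overloaded or scheduler misconfigured.",
        criticality="medium",
    ),
    Rule(
        name="service_server_overloaded",
        condition=lambda m: m.get("service_server_load", 0.0) > 0.9,
        possible_causes="Too many concurrent sessions on a service server.",
        criticality="high",
    ),
]

# Lower number means the rule is reported first.
_PRIORITY = {"high": 0, "medium": 1, "low": 2}


def evaluate(rules: List[Rule], metrics: Metrics) -> List[Rule]:
    """Return the triggered rules, most critical first."""
    triggered = [r for r in rules if r.condition(metrics)]
    return sorted(triggered, key=lambda r: _PRIORITY[r.criticality])


if __name__ == "__main__":
    sample = {"avg_queue_wait_hours": 30.0, "service_server_load": 0.95}
    for rule in evaluate(RULES, sample):
        print(f"[{rule.criticality}] {rule.name}: {rule.possible_causes}")
```

Sorting the triggered rules by criticality mirrors the prioritization idea in the summary: the administrator sees the most critical situations first.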
ISSN: 1995-0802
1818-9962
DOI: 10.1134/S1995080221070210