Global Experiences with HPC Operational Data Measurement, Collection and Analysis

Bibliographic Details
Published in 2020 IEEE International Conference on Cluster Computing (CLUSTER), pp. 499-508
Main Authors Ott, Michael; Shin, Woong; Bourassa, Norman; Wilde, Torsten; Ceballos, Stefan; Romanus, Melissa; Bates, Natalie
Format Conference Proceeding
Language English
Published IEEE 01.09.2020
Summary: As we move into the exascale era, supercomputers grow larger, denser, more heterogeneous, and ever more complex. Operating such machines reliably and efficiently requires deep insight into the operational parameters of the machine itself as well as its supporting infrastructure. To fulfill this need, early adopter sites have started the development and deployment of Operational Data Analytics (ODA) frameworks allowing the continuous monitoring, archiving, and analysis of near realtime performance data from the machine and infrastructure levels, providing immediately actionable information for multiple operational uses. To understand their ODA goals, requirements, and use cases, we have conducted a survey among eight early adopter sites from the US, Europe, and Japan that operate top 50 high-performance computing systems. We have assessed the technologies leveraged to build their ODA frameworks, identified use cases and other push and pull factors that drive the sites' ODA activities, and report on their operational lessons.
ISSN: 2168-9253
DOI: 10.1109/CLUSTER49012.2020.00071