Improvements to Supercomputing Service Availability Based on Data Analysis

Bibliographic Details
Published in Applied sciences Vol. 11; no. 13; p. 6166
Main Authors Lee, Jae-Kook; Kwon, Min-Woo; An, Do-Sik; Yoon, Junweon; Hong, Taeyoung; Woo, Joon; Kim, Sung-Jun; Li, Guohua
Format Journal Article
Language English
Published Basel MDPI AG 01.07.2021
Summary: As the demand for high-performance computing (HPC) resources has increased in the field of computational science, service availability has become an unavoidable consideration in large cluster systems such as supercomputers. In particular, the factor that most affects availability in supercomputing services is the job scheduler used to allocate resources. An analysis of user job data submitted through the job scheduler showed that 25.6% of jobs failed because of program errors, scheduler errors, or I/O errors. Based on this analysis, we propose a K-hook method for scheduling to increase the success rate of job submissions and improve the availability of supercomputing services. By applying this method, the job-submission success rate was improved by 15% without negatively affecting users’ waiting time. We also achieved a mean time between interrupts (MTBI) of 24.3 days and maintained average system availability at 97%. Because this research was verified on the Nurion supercomputer in a real service environment, it is expected to yield significant service improvements.
ISSN: 2076-3417
DOI: 10.3390/app11136166