Improvements to Supercomputing Service Availability Based on Data Analysis
Published in | Applied sciences Vol. 11; no. 13; p. 6166 |
---|---|
Main Authors | , , , , , , , |
Format | Journal Article |
Language | English |
Published | Basel: MDPI AG, 01.07.2021 |
Summary: | As the demand for high-performance computing (HPC) resources in computational science has grown, service availability in large cluster systems such as supercomputers has become an unavoidable concern. In particular, the factor that most affects availability in supercomputing services is the job scheduler used to allocate resources. An analysis of user job data submitted through the job scheduler showed that 25.6% of jobs failed because of program errors, scheduler errors, or I/O errors. Based on this analysis, we propose a K-hook scheduling method to increase the success rate of job submissions and improve the availability of supercomputing services. Applying this method improved the job-submission success rate by 15% without negatively affecting users’ waiting time. We also achieved a mean time between interrupts (MTBI) of 24.3 days and maintained average system availability at 97%. Because this research was verified on the Nurion supercomputer in a real service environment, it is expected to translate into significant service improvements. |
---|---|
ISSN: | 2076-3417 |
DOI: | 10.3390/app11136166 |