Can I/O Variability Be Reduced on QoS-Less HPC Storage Systems?



Bibliographic Details
Published in: IEEE Transactions on Computers, Vol. 68, no. 5, pp. 631-645
Main Authors: Huang, Dan; Liu, Qing; Choi, Jong; Podhorszki, Norbert; Klasky, Scott; Logan, Jeremy; Ostrouchov, George; He, Xubin; Wolf, Matthew
Format: Journal Article
Language: English
Published: New York: IEEE, 01.05.2019
Publisher: The Institute of Electrical and Electronics Engineers, Inc. (IEEE)

More Information
Summary: For a production high-performance computing (HPC) system, where storage devices are shared among multiple applications and managed in a best-effort manner, I/O contention is often a major problem. In this paper, we propose a balanced messaging-based re-routing scheme combined with throttling at the middleware level. This work tackles two key challenges that have not been fully resolved in the past: whether I/O variability can be reduced on a QoS-less HPC storage system, and how to design a runtime scheduling system that can scale to a large number of cores. The proposed scheme uses a two-level messaging system to re-route I/O requests to less congested storage locations, which improves write performance, while throttling the re-routing to limit its impact on reads. An analytical model is derived to guide the choice of the optimal throttling factor. We thoroughly analyze the overhead of the virtual messaging layer and explore whether in-transit buffering is effective in managing I/O variability. Contrary to intuition, in-transit buffering cannot completely solve the problem: it can reduce the absolute variability but not the relative variability. The proposed scheme is verified against a synthetic benchmark and is also used by production applications.
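The general idea of throttled re-routing described in the summary can be illustrated with a minimal sketch. This is not the authors' implementation: the class ThrottledRerouter, its per-target load list (an assumed congestion metric such as queue depth), and the throttle parameter (an assumed cap on the fraction of writes that may leave their home storage target) are all hypothetical, and the paper's two-level messaging layer and analytical model are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class ThrottledRerouter:
    """Hypothetical sketch: send a write to the least congested storage
    target, but cap the fraction of re-routed writes at `throttle` so
    that most data stays at its home target (limiting the read penalty)."""
    loads: list[float]   # assumed per-target congestion metric
    throttle: float      # assumed max fraction of writes that may be re-routed
    total: int = 0       # writes routed so far
    rerouted: int = 0    # writes sent away from their home target

    def route(self, home: int) -> int:
        self.total += 1
        best = min(range(len(self.loads)), key=lambda i: self.loads[i])
        # Re-route only if a strictly less loaded target exists and the
        # re-routed fraction would stay within the throttle budget.
        within_budget = self.rerouted + 1 <= self.throttle * self.total
        if best != home and self.loads[best] < self.loads[home] and within_budget:
            self.rerouted += 1
            target = best
        else:
            target = home
        self.loads[target] += 1.0  # account for the newly placed write
        return target
```

With loads of [5.0, 0.0, 0.0] and a throttle of 0.5, four writes whose home is target 0 are placed on targets 0, 1, 0, 2: only half of them leave the congested target. A larger throttle improves write balance at the cost of scattering data that readers later have to gather, which is the trade-off the paper's analytical model is said to address.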
Bibliography: USDOE Office of Science (SC)
None
ISSN: 0018-9340, 1557-9956
DOI: 10.1109/TC.2018.2881709