Integration framework for online thread throttling with thread and page mapping on NUMA systems
Published in | Journal of Parallel and Distributed Computing, Vol. 205, p. 105145 |
---|---|
Format | Journal Article |
Language | English |
Published | Elsevier Inc., 01.11.2025 |
Summary: | Non-Uniform Memory Access (NUMA) systems are prevalent in HPC, where optimal thread-to-core allocation and page placement are crucial for enhancing performance and minimizing energy usage. Moreover, considering that NUMA systems have hardware support for a large number of hardware threads and many parallel applications have limited scalability, artificially decreasing the number of threads through Dynamic Concurrency Throttling (DCT) may bring further improvements. However, the optimal configuration (thread mapping, page mapping, number of threads) for energy and performance, quantified by the Energy-Delay Product (EDP), varies with the system hardware, application, and input set, even during execution. Because of this dynamic nature, adaptability is essential, making offline strategies much less effective. Online strategies, despite their effectiveness, introduce additional execution overhead: learning at run time and the cost of transitions between configurations, including cache warm-ups and thread and data reallocation. Thus, balancing learning time against solution quality becomes increasingly significant. In this scenario, this work proposes a framework that combines the search for such optimal configurations into a single, online, and efficient approach. Our experimental evaluation shows that our framework improves EDP and performance compared to online state-of-the-art techniques of thread/page mapping (by up to 69.3% and 43.4%) and DCT (by up to 93.2% and 74.9%), while being fully adaptive and requiring minimal user intervention.
•Thread-to-core allocation and page placement are key to performance and energy efficiency in NUMA systems.
•Parallel applications often scale poorly, so Dynamic Concurrency Throttling reduces the thread count to improve performance.
•The optimal thread mapping, page placement, and thread count for the Energy-Delay Product depend on hardware and input variability.
•Online strategies often add runtime overhead, requiring a balance between learning time and solution quality.
•This study presents a framework that combines these optimization strategies into a unified, efficient online solution.
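The highlights describe Dynamic Concurrency Throttling as artificially reducing the thread count when an application scales poorly. The paper's actual algorithm is not given here; the sketch below shows only the general idea with a greedy search that halves the thread count while the measured EDP keeps improving. The `run_region` callback and the synthetic scalability model are hypothetical:

```python
def throttle_search(run_region, max_threads):
    """Greedy DCT sketch: halve the thread count as long as the measured
    EDP improves. run_region(n) is a hypothetical callback that executes
    the parallel region with n threads and returns (energy_J, time_s)."""
    best_n = max_threads
    energy, time = run_region(best_n)
    best_edp = energy * time
    n = max_threads // 2
    while n >= 1:
        energy, time = run_region(n)
        if energy * time < best_edp:
            best_edp, best_n = energy * time, n
            n //= 2
        else:
            break  # EDP stopped improving; keep the previous thread count
    return best_n

# Synthetic region with limited scalability: runtime has a parallel part
# plus an overhead term that grows with the thread count.
def model_region(n):
    time = 16.0 / n + 0.5 * n
    return n * time, time  # energy assumed proportional to n * time
```

On this model, `throttle_search(model_region, 8)` settles on 4 threads: halving from 8 to 4 lowers EDP, while halving again to 2 raises it, so the search stops. This mirrors the trade-off the abstract describes between learning time and solution quality.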
ISSN: | 0743-7315 |
DOI: | 10.1016/j.jpdc.2025.105145 |