Integration framework for online thread throttling with thread and page mapping on NUMA systems
Published in | Journal of Parallel and Distributed Computing, Vol. 205, p. 105145 |
---|---|
Format | Journal Article |
Language | English |
Published | Elsevier Inc., 01.11.2025 |
Summary: | Non-Uniform Memory Access (NUMA) systems are prevalent in HPC, where optimal thread-to-core allocation and page placement are crucial for enhancing performance and minimizing energy usage. Moreover, considering that NUMA systems have hardware support for a large number of hardware threads and many parallel applications have limited scalability, artificially decreasing the number of threads through Dynamic Concurrency Throttling (DCT) may bring further improvements. However, the optimal configuration (thread mapping, page mapping, number of threads) for energy and performance, quantified by the Energy-Delay Product (EDP), varies with the system hardware, application, and input set, even during execution. Because of this dynamic nature, adaptability is essential, making offline strategies much less effective. Online strategies, despite their effectiveness, introduce additional execution overhead: learning at run time and the cost of transitions between configurations, including cache warm-ups and thread and data reallocation. Thus, balancing learning time against solution quality becomes increasingly significant. In this scenario, this work proposes a framework that combines the search for such optimal configurations into a single, online, and efficient approach. Our experimental evaluation shows that our framework improves EDP and performance compared to online state-of-the-art techniques of thread/page mapping (by up to 69.3% and 43.4%) and DCT (by up to 93.2% and 74.9%), while being fully adaptive and requiring minimal user intervention.
•Thread-to-core allocation and page placement are key to performance and energy efficiency in NUMA systems.
•Parallel applications often scale poorly, so Dynamic Concurrency Throttling reduces the thread count to improve performance.
•The optimal thread mapping, page placement, and thread count for the Energy-Delay Product depend on hardware and input variability.
•Online strategies often add runtime overhead, requiring a balance between learning time and solution quality.
•This study presents a framework that combines these optimization strategies into a unified, efficient online solution.
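The highlights describe Dynamic Concurrency Throttling as artificially reducing the thread count when an application scales poorly. The paper's actual algorithm is not given here; the sketch below shows only the general idea with a greedy search that halves the thread count while the measured EDP keeps improving. The `run_region` callback and the synthetic scalability model are hypothetical:

```python
def throttle_search(run_region, max_threads):
    """Greedy DCT sketch: halve the thread count as long as the measured
    EDP improves. run_region(n) is a hypothetical callback that executes
    the parallel region with n threads and returns (energy_J, time_s)."""
    best_n = max_threads
    energy, time = run_region(best_n)
    best_edp = energy * time
    n = max_threads // 2
    while n >= 1:
        energy, time = run_region(n)
        if energy * time < best_edp:
            best_edp, best_n = energy * time, n
            n //= 2
        else:
            break  # EDP stopped improving; keep the previous thread count
    return best_n

# Synthetic region with limited scalability: runtime has a parallel part
# plus an overhead term that grows with the thread count.
def model_region(n):
    time = 16.0 / n + 0.5 * n
    return n * time, time  # energy assumed proportional to n * time
```

On this model, `throttle_search(model_region, 8)` settles on 4 threads: halving from 8 to 4 lowers EDP, while halving again to 2 raises it, so the search stops. This mirrors the trade-off the abstract describes between learning time and solution quality.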
ISSN: | 0743-7315 |
DOI: | 10.1016/j.jpdc.2025.105145 |