Optimizing parallel heterogeneous system efficiency: Dynamic task graph adaptation with recursive tasks

Bibliographic Details
Published in: Journal of Parallel and Distributed Computing, Vol. 205, p. 105157
Main Authors: Furmento, Nathalie; Guermouche, Abdou; Lucas, Gwenolé; Morin, Thomas; Thibault, Samuel; Wacrenier, Pierre-André
Format: Journal Article
Language: English
Published: Elsevier Inc., 01.11.2025

Summary: Task-based programming models are currently a prominent approach to leveraging heterogeneous parallel systems productively (OpenACC, Kokkos, Legion, OmpSs, PaRSEC, StarPU, XKaapi, ...). Among these models, the Sequential Task Flow (STF) model is widely embraced (PaRSEC's DTD, OmpSs, StarPU) since it allows task graphs to be expressed naturally through a sequential-looking submission of tasks, with task dependencies inferred automatically. However, STF is limited to task graphs whose task sizes are fixed at submission, which makes determining the optimal task granularity challenging. Notably, in heterogeneous systems the optimal task size varies across processing units, so a single task size does not fit all units. StarPU's recursive tasks allow graphs with several task granularities by turning some tasks into sub-graphs dynamically at runtime. The decision to transform these tasks into sub-graphs is made by a StarPU component called the Splitter. Once the tasks to transform have been chosen, classical scheduling approaches are used, making this component generic and orthogonal to the scheduler. In this paper, we propose a new policy for the Splitter, designed for heterogeneous platforms, which relies on linear programming to minimize execution time and maximize resource utilization. This results in a dynamic, well-balanced set comprising both small tasks to fill multiple CPU cores and large tasks for efficient execution on accelerators such as GPUs. We then present an experimental evaluation showing that just-in-time adaptations of the task graph lead to improved performance across various dense linear algebra algorithms.
•A new mechanism determines before execution whether a task should be subdivided.
•An algorithm based on linear programming allows automatic task granularity adaptation.
•Experimental evaluation shows this approach matches or surpasses state-of-the-art libraries.
•The genericity of the solution makes it easy to apply to other linear algebra applications.
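To make the STF style described in the summary concrete, here is a minimal sketch (not taken from the paper) of a StarPU program that submits two tasks operating on the same data handle; the runtime infers the dependency between them from the sequential submission order. The scale_cpu kernel, the vector size, and the scaling factors are illustrative placeholders.

#include <starpu.h>

/* Illustrative CPU kernel: scales a registered vector in place. */
static void scale_cpu(void *buffers[], void *cl_arg)
{
    double *v = (double *)STARPU_VECTOR_GET_PTR(buffers[0]);
    unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
    double factor;
    starpu_codelet_unpack_args(cl_arg, &factor);
    for (unsigned i = 0; i < n; i++)
        v[i] *= factor;
}

static struct starpu_codelet scale_cl =
{
    .cpu_funcs = { scale_cpu },
    .nbuffers  = 1,
    .modes     = { STARPU_RW },
};

int main(void)
{
    double x[1024];
    for (unsigned i = 0; i < 1024; i++)
        x[i] = 1.0;

    if (starpu_init(NULL) != 0)
        return 1;

    /* Register the vector so StarPU can track accesses to it. */
    starpu_data_handle_t xh;
    starpu_vector_data_register(&xh, STARPU_MAIN_RAM, (uintptr_t)x,
                                1024, sizeof(double));

    double two = 2.0, three = 3.0;

    /* Both tasks access xh in read-write mode: StarPU infers that the
     * second task depends on the first from the submission order. */
    starpu_task_insert(&scale_cl, STARPU_RW, xh,
                       STARPU_VALUE, &two, sizeof(two), 0);
    starpu_task_insert(&scale_cl, STARPU_RW, xh,
                       STARPU_VALUE, &three, sizeof(three), 0);

    starpu_task_wait_for_all();
    starpu_data_unregister(xh);
    starpu_shutdown();
    return 0;
}

The recursive tasks and the Splitter policy studied in the paper operate on top of this submission flow: at runtime, the Splitter may decide that a submitted task should instead be expanded into a sub-graph of finer-grained tasks, so that CPU cores and accelerators each receive tasks of a suitable granularity.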
ISSN: 0743-7315, 1096-0848
DOI: 10.1016/j.jpdc.2025.105157