Combining data and computation distribution directives for hybrid parallel programming: a transformation system

Bibliographic Details
Published in: International Journal of Parallel Programming, Vol. 44, no. 6, pp. 1268-1295
Main Authors: Habel, Rachid; Silber-Chaussumier, Frédérique; Irigoin, François; Brunet, Elisabeth; Trahay, François
Format: Journal Article
Language: English
Published: Springer Verlag, 01.12.2016
Summary: This paper describes dstep, a directive-based programming model for hybrid shared and distributed memory machines. The originality of our work lies in the definition and implementation of a unified high-level programming model that addresses both data and computation distribution and provides particularly fine control over the computation. The goal is to improve programmer productivity while providing good performance in terms of execution time and memory usage. We define a generic compilation scheme for computation mapping and communication generation. We implement the solution in a source-to-source compiler together with a runtime library. We provide a series of optimizations to improve the performance of the generated code, with a special focus on reducing communication time. We evaluate our solution on several scientific kernels as well as on the more challenging NAS BT benchmark, and compare our results with the hand-written Fortran MPI and UPC implementations. The results show, first, that our solution makes it possible to express the non-trivial parallel execution of the NAS BT benchmark explicitly using the dstep directives. Second, they show that our generated MPI + OpenMP BT program achieves a speedup of 83.35 over the original NAS OpenMP C benchmark on a hybrid cluster composed of 64 quad-core processors (256 cores). Overall, our solution dramatically reduces the programming effort while providing good execution time and memory usage. This programming model is suitable for a large variety of machines, such as multi-core and accelerator clusters.
ISSN: 0885-7458, 1573-7640
DOI: 10.1007/s10766-016-0428-3