Combining data and computation distribution directives for hybrid parallel programming: a transformation system

Bibliographic Details
Published in	International journal of parallel programming Vol. 44; no. 6; pp. 1268 - 1295
Main Authors	Habel, Rachid; Silber-Chaussumier, Frédérique; Irigoin, François; Brunet, Elisabeth; Trahay, François
Format	Journal Article
Language	English
Published	Springer Verlag 01.12.2016
Subjects	Computer Science; Distributed, Parallel, and Cluster Computing; Distributed-memory; OpenMP; MPI; Source-to-source transformation; Shared-memory; Optimization

More Information
Summary:	This paper describes dstep, a directive-based programming model for hybrid shared and distributed memory machines. The originality of our work is the definition and implementation of a unified high-level programming model addressing both data and computation distributions, providing particularly fine control of the computation. The goal is to improve programmer productivity while providing good performance in terms of execution time and memory usage. We define a generic compilation scheme for computation mapping and communication generation. We implement the solution in a source-to-source compiler together with a runtime library. We provide a series of optimizations to improve the performance of the generated code, with a special focus on reducing communication time. We evaluate our solution on several scientific kernels as well as on the more challenging NAS BT benchmark, and compare our results with the handwritten Fortran MPI and UPC implementations. The results show first that our solution makes it possible to express explicitly the non-trivial parallel execution of the NAS BT benchmark using the dstep directives. Second, the results show that our generated MPI + OpenMP BT program achieves a speedup of 83.35 over the original NAS OpenMP C benchmark on a hybrid cluster composed of 64 quad-core processors (256 cores). Overall, our solution dramatically reduces the programming effort while providing good execution time and memory usage. This programming model is suitable for a large variety of machines, such as multi-core and accelerator clusters.
ISSN:	0885-7458, 1573-7640
DOI:	10.1007/s10766-016-0428-3