CUDA-Zero: a framework for porting shared memory GPU applications to multi-GPUs

Bibliographic Details
Published in	Science China. Information Sciences, Vol. 55, no. 3, pp. 663-676
Main Authors	Chen, DeHao; Chen, WenGuang; Zheng, WeiMin
Format	Journal Article
Language	English
Published	Heidelberg: SP Science China Press; Springer Nature B.V., 01.03.2012
Subjects	Annotations; Arrays; Computation; Computer Science; Distributed memory; GPU; Information Systems and Communication Service; Kernel functions; Mathematical models; Message passing; Parallel processing; Ports; Programming; Research Paper; Workload; shared memory; application framework; program porting; programming model; access pattern; CUDA; parallelization; multi-GPU; data access pattern

More Information
Summary:	With the growing prevalence of general-purpose computation on GPUs, shared-memory programming models were proposed to ease the pain of GPU programming. However, as workloads become more intensive, it is desirable to port GPU programs to more scalable distributed-memory environments, such as multi-GPU systems. To achieve this, programs need to be rewritten with mixed programming models (e.g., CUDA and message passing). Programmers must work carefully not only on workload distribution but also on scheduling mechanisms to ensure efficient execution. In this paper, we study the possibility of automating the parallelization of programs for multiple GPUs. Starting from a GPU program written in a shared-memory model, our framework analyzes the access patterns of arrays in kernel functions to derive data partition schemes. To acquire the access patterns, we propose a three-tier approach: static analysis, profile-based analysis, and user annotation. Experiments show that most access patterns can be derived correctly by the first two tiers, which means that zero effort is needed to port an existing application to a distributed-memory environment. We use our framework to parallelize several applications and show that, for certain kinds of applications, CUDA-Zero can achieve efficient parallelization in a multi-GPU environment.
Bibliography:	11-5847/TP
ISSN:	1674-733X, 1869-1919
DOI:	10.1007/s11432-011-4497-z