Adaptive heterogeneous scheduling for integrated GPUs

Bibliographic Details
Published in: 2014 23rd International Conference on Parallel Architecture and Compilation Techniques (PACT), pp. 151-162
Main Authors: Kaleem, Rashid; Barik, Rajkishore; Shpeisman, Tatiana; Lewis, Brian T.; Hu, Chunling; Pingali, Keshav
Format: Conference Proceeding
Language: English
Published: ACM, 01.08.2014

Summary: Many processors today integrate a CPU and GPU on the same die, which allows them to share resources such as physical memory and lowers the cost of CPU-GPU communication. As a consequence, programmers can effectively utilize both the CPU and GPU to execute a single application. This paper presents novel adaptive scheduling techniques for integrated CPU-GPU processors. We present two online profiling-based scheduling algorithms: naïve and asymmetric. Our asymmetric scheduling algorithm uses low-overhead online profiling to automatically partition the work of data-parallel kernels between the CPU and GPU without input from application developers. It performs profiling on the CPU and GPU in a way that does not penalize GPU-centric workloads that run significantly faster on the GPU. It adapts to application characteristics by addressing: 1) load imbalance due to irregularity caused by, e.g., data-dependent control flow, 2) different amounts of work in each kernel call, and 3) multiple kernels with different characteristics. Unlike many existing approaches, which primarily target NVIDIA discrete GPUs, our scheduling algorithm requires no offline processing. We evaluate our asymmetric scheduling algorithm on a desktop system with an Intel 4th Generation Core processor, using a set of sixteen regular and irregular workloads from diverse application areas. On average, our asymmetric scheduling algorithm achieves throughput within 3.2% of a CPU-and-GPU oracle that always chooses the best work partitioning between the CPU and GPU. These results underscore the feasibility of online profile-based heterogeneous scheduling on integrated CPU-GPU processors.
DOI: 10.1145/2628071.2628088
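
The summary above outlines the key mechanism: profile a small slice of a data-parallel kernel's iteration space on each device, then split the remaining iterations in proportion to the measured throughputs. The sketch below illustrates that idea only and is not the paper's runtime; the Kernel type, the chunk sizes, and the sequential dispatch at the end (a real scheduler would run the two partitions concurrently) are assumptions made to keep the example self-contained. The paper's asymmetric variant additionally shapes the profiling phase so that GPU-centric workloads are not slowed down by CPU profiling; here that is only approximated by keeping the profiling chunks small.

// Minimal sketch of online profiling-based CPU-GPU work partitioning.
// NOT the authors' implementation: the devices are modeled as plain
// callables so the example compiles and runs on any machine.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <functional>

using Kernel = std::function<void(int begin, int end)>;

// Time a small profiling chunk on one device and return its observed
// throughput in iterations per second.
static double profile_throughput(const Kernel& run, int begin, int end) {
    auto t0 = std::chrono::steady_clock::now();
    run(begin, end);
    auto t1 = std::chrono::steady_clock::now();
    double seconds = std::chrono::duration<double>(t1 - t0).count();
    return (end - begin) / std::max(seconds, 1e-9);
}

// Partition the iteration range [0, n) between a CPU kernel and a GPU
// kernel: profile a small chunk on each device, then split the remaining
// iterations in proportion to the measured throughputs.
void schedule(const Kernel& cpu, const Kernel& gpu, int n, int profile_chunk = 1024) {
    if (n < 8) { cpu(0, n); return; }               // too small to be worth profiling
    int p = std::max(1, std::min(profile_chunk, n / 4));
    double cpu_rate = profile_throughput(cpu, 0, p);
    double gpu_rate = profile_throughput(gpu, p, 2 * p);

    int rest_begin = 2 * p;
    double gpu_fraction = gpu_rate / (cpu_rate + gpu_rate);
    int gpu_count = static_cast<int>((n - rest_begin) * gpu_fraction);

    // A real runtime would launch these two ranges concurrently, one per device.
    gpu(rest_begin, rest_begin + gpu_count);
    cpu(rest_begin + gpu_count, n);

    std::printf("GPU fraction: %.2f (CPU %.0f it/s, GPU %.0f it/s)\n",
                gpu_fraction, cpu_rate, gpu_rate);
}

int main() {
    // Hypothetical stand-ins for CPU and GPU versions of the same data-parallel loop.
    Kernel cpu = [](int b, int e) { for (int i = b; i < e; ++i) { volatile double x = i * 0.5; (void)x; } };
    Kernel gpu = [](int b, int e) { for (int i = b; i < e; ++i) { volatile double x = i * 0.5; (void)x; } };
    schedule(cpu, gpu, 1 << 20);
}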