GPU Accelerating for Rapid Multi-core Cache Simulation

Bibliographic Details
Published in: 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum, pp. 1387-1396
Main Authors: Wan Han, Long Xiang, Gao Xiaopeng, Li Yi
Format: Conference Proceeding
Language: English
Published: IEEE, 01.05.2011

Summary: To find the best memory system for emerging workloads, traces are obtained during an application's execution, and caches with different configurations are then simulated using these traces. Since program traces can be several gigabytes in size, simulating cache performance is a time-consuming process. Compute Unified Device Architecture (CUDA) is a software development platform that enables programmers to accelerate general-purpose applications on the graphics processing unit (GPU). This paper presents a real-time multi-core cache simulator that uses the Pin tool to collect memory references, together with a fast method for multi-core cache simulation on a CUDA-enabled GPU. The proposed method is accelerated by the following techniques: exploiting execution parallelism, hiding memory latency, and a novel trace compression methodology. We describe how these techniques can be incorporated into CUDA code. Experimental results show that the hybrid parallel method presented here, which combines time-partitioning with set-partitioning, achieves an 11.10× speedup over the serial CPU simulation algorithm. The simulator can characterize the cache performance of single-threaded or multi-threaded workloads at speeds of 6-15 MIPS, and it can simulate 6 cache configurations in a single pass at these speeds, whereas CMP$im can simulate only one cache configuration per simulation pass at speeds of 4-10 MIPS.
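
The set-partitioning idea mentioned in the summary can be sketched as follows: because each cache set evolves independently, one GPU thread can own one set, scan the shared trace, and simulate only the references that map to its set. The sketch below is illustrative rather than the authors' implementation; the kernel name, the synthetic trace, and the fixed 1024-set, 4-way LRU geometry are assumptions made for this example.

```cuda
// Illustrative set-partitioned cache simulation on a CUDA GPU.
// Assumed geometry: 1024 sets, 4-way associative, 64-byte lines, LRU.
#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

constexpr int NUM_SETS  = 1024;  // assumed cache geometry
constexpr int ASSOC     = 4;     // 4-way set associative
constexpr int LINE_BITS = 6;     // 64-byte lines

// One thread simulates one cache set: it scans the whole trace and
// processes only the references whose set index matches its own.
__global__ void simulate_sets(const uint64_t *trace, size_t n,
                              unsigned long long *misses)
{
    int set = blockIdx.x * blockDim.x + threadIdx.x;
    if (set >= NUM_SETS) return;

    uint64_t tags[ASSOC];      // per-set tag store
    uint64_t last_use[ASSOC];  // LRU timestamps
    for (int w = 0; w < ASSOC; ++w) { tags[w] = UINT64_MAX; last_use[w] = 0; }

    unsigned long long local_misses = 0;
    for (size_t i = 0; i < n; ++i) {
        uint64_t addr = trace[i];
        if (((addr >> LINE_BITS) % NUM_SETS) != (uint64_t)set) continue;
        uint64_t tag = addr >> LINE_BITS;

        int hit = -1, victim = 0;
        for (int w = 0; w < ASSOC; ++w) {
            if (tags[w] == tag) hit = w;
            if (last_use[w] < last_use[victim]) victim = w;
        }
        if (hit >= 0) {
            last_use[hit] = i;      // refresh LRU order on a hit
        } else {
            ++local_misses;         // miss: evict the LRU way
            tags[victim] = tag;
            last_use[victim] = i;
        }
    }
    misses[set] = local_misses;
}

int main() {
    // Synthetic trace standing in for a Pin-collected memory reference trace.
    const size_t n = 1 << 20;
    uint64_t *h = new uint64_t[n];
    for (size_t i = 0; i < n; ++i) h[i] = (uint64_t)(i * 64 + (i % 7) * 4096);

    uint64_t *d_trace; unsigned long long *d_miss;
    cudaMalloc(&d_trace, n * sizeof(uint64_t));
    cudaMalloc(&d_miss, NUM_SETS * sizeof(unsigned long long));
    cudaMemcpy(d_trace, h, n * sizeof(uint64_t), cudaMemcpyHostToDevice);

    simulate_sets<<<(NUM_SETS + 255) / 256, 256>>>(d_trace, n, d_miss);

    unsigned long long miss[NUM_SETS], total = 0;
    cudaMemcpy(miss, d_miss, sizeof(miss), cudaMemcpyDeviceToHost);
    for (int s = 0; s < NUM_SETS; ++s) total += miss[s];
    printf("misses: %llu of %zu references\n", total, n);

    cudaFree(d_trace); cudaFree(d_miss); delete[] h;
    return 0;
}
```

The paper's hybrid scheme additionally partitions the trace along the time axis so that chunks are simulated concurrently; that step, the trace compression, and the simultaneous evaluation of multiple cache configurations per pass are beyond the scope of this sketch.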
ISBN:9781612844251
1612844251
ISSN:1530-2075
DOI:10.1109/IPDPS.2011.295