Real Time Cache Performance Analyzing for Multi-core Parallel Programs

Bibliographic Details
Published in 2013 International Conference on Cloud and Service Computing, pp. 16-23
Main Authors Rui Wang, Yuan Gao, Guolu Zhang
Format Conference Proceeding
Language English
Published IEEE 01.11.2013
Summary: Modern processors rely on caches to hide memory access latency, so cache performance is critical to application programs. A detailed cache performance analysis gives programmers a clear view of their program's behavior, helping them identify performance bottlenecks and optimize the source code. As the chip industry has turned to integrating multiple cores on one chip, multi-core/many-core processors have become the main way to sustain Moore's Law, and parallel programs are now important even on personal computers. In parallel programs, the interaction between tasks is a major source of bugs and is hard for most programmers to handle, and detailed cache behavior information greatly helps programmers find such errors and optimize their programs. However, existing cache performance analysis tools, limited by the hardware performance counters they rely on for data, cannot gather as much data as expected: they cannot reveal how program routines behave on the shared cache, and their limited miss metrics cannot expose the sources of cache misses. In this paper, we propose a method to obtain and analyze real-time cache performance with binary instrumentation and cache emulation. We instrument the parallel program while it is running to collect a memory-access trace, then feed the trace to a carefully configured cache emulation module to obtain detailed cache behavior information. The emulation module not only gathers more information than hardware performance counters but can also be configured to simulate different target hardware environments. In addition, we use the performance data to form a group of cache performance metrics that intuitively help programmers optimize their code. The accuracy of the method is demonstrated by comparing its summary results with hardware performance counters. Finally, we design a cache performance analysis tool named CC-Analyzer for parallel programs. Compared with existing technologies, CC-Analyzer can analyze the causes of cache misses and gather far more performance statistics while the parallel program runs on different cache architectures.
DOI:10.1109/CSC.2013.11
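
To make the trace-to-emulator pipeline described in the abstract concrete, below is a minimal sketch of a trace-driven, set-associative LRU cache emulator that replays a per-thread memory-access trace and tallies hits and misses. The Access struct, the CacheEmulator class, and the 512-set/8-way/64-byte-line configuration are illustrative assumptions for this sketch, not the paper's actual CC-Analyzer implementation or its binary-instrumentation front end.

// Sketch only: a trace-driven set-associative LRU cache emulator.
// Names and parameters are assumptions, not the paper's CC-Analyzer code.
#include <cstdint>
#include <iostream>
#include <list>
#include <unordered_map>
#include <vector>

struct Access { int thread_id; uint64_t address; bool is_write; };

class CacheEmulator {
public:
    CacheEmulator(size_t sets, size_t ways, size_t line_bytes)
        : sets_(sets), ways_(ways), line_bytes_(line_bytes), lru_(sets) {}

    // Replays one memory access; returns true on a cache hit.
    bool access(const Access& a) {
        uint64_t line = a.address / line_bytes_;   // cache-line number
        size_t set = line % sets_;                 // set index
        auto& way_list = lru_[set];                // most-recently-used at front
        for (auto it = way_list.begin(); it != way_list.end(); ++it) {
            if (*it == line) {                     // hit: move line to MRU position
                way_list.erase(it);
                way_list.push_front(line);
                ++hits_;
                return true;
            }
        }
        // Miss: insert the line, evicting the LRU entry if the set is full.
        if (way_list.size() >= ways_) way_list.pop_back();
        way_list.push_front(line);
        ++misses_;
        ++misses_per_thread_[a.thread_id];
        return false;
    }

    void report() const {
        std::cout << "hits=" << hits_ << " misses=" << misses_ << "\n";
        for (const auto& [tid, m] : misses_per_thread_)
            std::cout << "  thread " << tid << ": " << m << " misses\n";
    }

private:
    size_t sets_, ways_, line_bytes_;
    std::vector<std::list<uint64_t>> lru_;         // per-set LRU stacks of line numbers
    uint64_t hits_ = 0, misses_ = 0;
    std::unordered_map<int, uint64_t> misses_per_thread_;
};

int main() {
    // Hypothetical shared cache: 512 sets x 8 ways x 64-byte lines (256 KiB).
    CacheEmulator shared_cache(512, 8, 64);
    // A tiny hand-written trace standing in for instrumentation output.
    std::vector<Access> trace = {
        {0, 0x1000, false}, {1, 0x1040, false}, {0, 0x1000, true}, {1, 0x9000, false},
    };
    for (const auto& a : trace) shared_cache.access(a);
    shared_cache.report();
    return 0;
}

In the paper's setting, the trace would be produced online by binary instrumentation of the running parallel program rather than a hand-written vector, and the emulation module would be configured to match the target cache hierarchy and extended to classify the causes of cache misses.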