Understanding the Performance of Dynamic Data Race Detection

Bibliographic Details
Published in: 2021 IEEE/ACM 5th International Workshop on Software Correctness for HPC Applications (Correctness), pp. 33-40
Main Authors: Protze, Joachim; Tharigen, Isabel; Wahle, Jonas
Format: Conference Proceeding
Language: English
Published: IEEE, 01.11.2021
Summary: With increasing per-node concurrency, interest in dynamic data race detection for OpenMP applications has increased significantly in recent years. Benchmarks such as DataRaceBench (DRB) help evaluate the classification quality of data race detection tools for simple memory access patterns. Various publications also use short-running benchmark kernels from OmpSCR and DRB for performance benchmarking of data race detection tools. Because of the short execution time, one-time initialization overhead dominates the measurement, so such results are not representative of the overhead with real codes. This paper proposes a new problem class for the SPEC OMP 2012 benchmark designed to analyze the runtime overhead of data race detection tools. Prior work reported runtime overheads of 80× and higher for the OpenMP data race detection tool Archer (i.e., execution time with the tool is 80 times as long as without a tool). For a specific application, we report 500× runtime overhead in this paper. This overhead stands in contrast to the 2-20× runtime overhead claimed by the underlying tool ThreadSanitizer. We use our newly proposed input data set to observe and investigate significant runtime overhead of dynamic data race detection for specific applications. With the help of performance analysis tools and hardware performance counters, we identify massively concurrent read accesses to the same shared variable as the root cause. We identify parallel matrix-vector multiplication as an application pattern responsible for such huge runtime overheads in data race analysis. Finally, we propose a modification of ThreadSanitizer that limits the runtime overhead for these applications to less than 40×.
DOI:10.1109/Correctness54621.2021.00010
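
Note: The summary's point that one-time initialization overhead dominates measurements on short-running kernels can be illustrated with a small arithmetic sketch. The additive cost model and all numbers below are assumptions chosen for illustration only; they are not measurements or a model taken from the paper.

```python
def measured_slowdown(base_seconds, init_seconds, steady_factor):
    """Apparent slowdown of a run with a tool attached.

    Assumed model (hypothetical): the tool adds a fixed start-up cost
    and slows the instrumented code by a constant factor.
    """
    tool_seconds = init_seconds + steady_factor * base_seconds
    return tool_seconds / base_seconds


# Hypothetical values, not taken from the paper.
INIT_SECONDS = 0.5     # assumed one-time tool initialization cost
STEADY_FACTOR = 10.0   # assumed slowdown of the instrumented code itself

# A short benchmark kernel (10 ms) versus a long-running application (100 s).
short_kernel = measured_slowdown(0.01, INIT_SECONDS, STEADY_FACTOR)
long_run = measured_slowdown(100.0, INIT_SECONDS, STEADY_FACTOR)

print(f"short kernel: {short_kernel:.1f}x")
print(f"long run:     {long_run:.3f}x")
```

Under these assumed numbers the short kernel appears roughly 60 times slower while the long run stays close to the assumed steady factor of 10, which is why a longer-running input is needed to measure the overhead of the instrumented code itself.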