SourcererCC: Scaling Code Clone Detection to Big-Code

Despite a decade of active research, there has been a marked lack in clone detection techniques that scale to large repositories for detecting near-miss clones. In this paper, we present a token-based clone detector, SourcererCC, that can detect both exact and near-miss clones from large inter-proje...

Full description

Saved in:
Bibliographic Details
Published inProceedings / International Conference on Software Engineering pp. 1157 - 1168
Main Authors Sajnani, Hitesh, Saini, Vaibhav, Svajlenko, Jeffrey, Roy, Chanchal K., Lopes, Cristina V.
Format Conference Proceeding
LanguageEnglish
Published ACM 01.05.2016
Subjects
Online AccessGet full text

Cover

Loading…
More Information
Summary:Despite a decade of active research, there has been a marked lack in clone detection techniques that scale to large repositories for detecting near-miss clones. In this paper, we present a token-based clone detector, SourcererCC, that can detect both exact and near-miss clones from large inter-project repositories using a standard workstation. It exploits an optimized inverted-index to quickly query the potential clones of a given code block. Filtering heuristics based on token ordering are used to significantly reduce the size of the index, the number of code-block comparisons needed to detect the clones, as well as the number of required token-comparisons needed to judge a potential clone. We evaluate the scalability, execution time, recall and precision of SourcererCC, and compare it to four publicly available and state-of-the-art tools. To measure recall, we use two recent benchmarks: (1) a big benchmark of real clones, BigCloneBench, and (2) a Mutation/Injection-based framework of thousands of fine-grained artificial clones. We find SourcererCC has both high recall and precision, and is able to scale to a large inter-project repository (25K projects, 250MLOC) using a standard workstation.
ISSN:1558-1225
DOI:10.1145/2884781.2884877