TCCCD: Triplet-Based Cross-Language Code Clone Detection


Bibliographic Details
Published in	Applied sciences Vol. 13; no. 21; p. 12084
Main Authors	Fang, Yong; Zhou, Fangzheng; Xu, Yijia; Liu, Zhonglin
Format	Journal Article
Language	English
Published	Basel: MDPI AG, 01.11.2023
Subjects	Analysis; Cloning; code clone detection; Code reuse; Computational linguistics; cross-language; Deep learning; Language processing; Machine learning; Methods; Natural language interfaces; Neural networks; pre-trained model; Programming languages; Semantics; Software quality; Syntax; Text analysis; triplet learning; Unix; Vector space

More Information
Summary:	Code cloning is a common practice in software development: developers reuse existing code to work faster and more efficiently. Existing clone-detection methods mainly focus on clones within a single programming language. To address code clones that arise in cross-platform development, we propose TCCCD (Triplet-Based Cross-Language Code Clone Detection), a machine-learning method for detecting code clones across different programming languages. We use the pre-trained model UniXcoder to map programs written in different languages into the same vector space and learn their code representations, and then fine-tune the model with triplet learning to improve its effectiveness in cross-language clone detection. We evaluated the approach in comparative experiments on the dataset provided by the CLCDSA paper (Cross Language Code Clone Detection using Syntactical Features and API Documentation). TCCCD outperformed the state-of-the-art baselines, with precision, recall, and F1-measure scores of 0.96, 0.91, and 0.93, respectively.
ISSN:	2076-3417
DOI:	10.3390/app132112084
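
The triplet learning mentioned in the summary can be illustrated with a minimal sketch. The record does not state the paper's exact loss, distance function, or margin, so the cosine distance, the margin of 0.5, and every name below (`cosine_distance`, `triplet_loss`, `f1_score`, and the toy vectors) are assumptions for illustration only. The three-dimensional vectors are hand-written stand-ins, not UniXcoder output. The last function checks that the reported F1-measure agrees with the reported precision and recall.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge-style triplet loss (assumed form, not taken from the paper).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    gap = cosine_distance(anchor, positive) - cosine_distance(anchor, negative)
    return max(0.0, gap + margin)


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Toy, hand-written embeddings (hypothetical; not UniXcoder output).
java_sort = [0.9, 0.1, 0.0]      # anchor: a Java program
python_sort = [0.8, 0.2, 0.1]    # positive: a Python clone of the anchor
python_parser = [0.0, 0.2, 0.9]  # negative: an unrelated Python program

# Clone is already much closer than the non-clone, so nothing to correct.
print(triplet_loss(java_sort, python_sort, python_parser))  # prints 0.0

# Reported precision 0.96 and recall 0.91 give the reported F1 of 0.93.
print(round(f1_score(0.96, 0.91), 2))  # prints 0.93
```

Swapping the positive and negative arguments yields a positive loss, which is the signal fine-tuning would use to pull cross-language clones together and push non-clones apart in the shared vector space.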