RCFG2Vec: Considering Long-Distance Dependency for Binary Code Similarity Detection
Binary code similarity detection(BCSD), as a fundamental technique in software security, has various applications, including malware family detection, known vulnerability detection and code plagiarism detection. Recent deep learning-based BCSD approaches have demonstrated promising performance. Howe...
Saved in:
Published in | IEEE/ACM International Conference on Automated Software Engineering : [proceedings] pp. 770 - 782 |
---|---|
Main Authors | , , , , |
Format | Conference Proceeding |
Language | English |
Published |
ACM
27.10.2024
|
Subjects | |
Online Access | Get full text |
ISSN | 2643-1572 |
DOI | 10.1145/3691620.3695070 |
Cover
Loading…
Summary: | Binary code similarity detection(BCSD), as a fundamental technique in software security, has various applications, including malware family detection, known vulnerability detection and code plagiarism detection. Recent deep learning-based BCSD approaches have demonstrated promising performance. However, they face two significant challenges that limit detection performance. First, most approaches that use sequence networks (like RNN and Transformer) utilize coarse-grained tokenization methods, which results in large vocabulary size and severe out-of-vocabulary (OOV) problem. Second, CFG-based methods typically use variants of graph convolutional networks, which only consider local structural information and discard long-distance dependencies between basic blocks.To address these challenges, this paper proposes Syntax Tree-based instruction embedding and introduces the acyclic graph neural network. The former decomposes assembly instructions into fine-grained tokens and employs a tree-structured neural network to generate vector representations for instructions. The latter transforms CFGs into directed acyclic graphs based on their reducibility, and further captures the dependency between basic blocks with a directed acyclic graph neural network. We implemented these two techniques in a prototype named RCFG2Vec and conducted comprehensive evaluation on two public datasets. The experiment results demonstrate that RCFG2Vec outperforms almost all baselines and achieves detection performance comparable with jTrans, a large model-based approach. Meanwhile, when integrated with our proposed techniques, several baseline approaches exhibit significant improvements in detection performance.CCS CONCEPTS* Security and privacy → Software reverse engineering; * Computing methodologies → Neural networks. |
---|---|
ISSN: | 2643-1572 |
DOI: | 10.1145/3691620.3695070 |