A Novel Vulnerability‐Detection Method Based on the Semantic Features of Source Code and the LLVM Intermediate Representation

Bibliographic Details
Published in	Journal of software : evolution and process Vol. 37; no. 5
Main Authors	Chen, Jinfu; Zhou, Jiapeng; Lin, Wei; Towey, Dave; Cai, Saihua; Chen, Haibo; Chen, Jingyi; Yin, Yemin
Format	Journal Article
Language	English
Published	Chichester: Wiley Subscription Services, Inc., 01.05.2025
Subjects	Artificial neural networks; C++ (programming language); deep learning; intermediate representation; program representation; Representations; Security; Semantics; Software; Source code; Virtual environments; vulnerability detection

Summary:	With attacks on software systems becoming increasingly frequent, software security is an issue that must be addressed, and automated detection of software vulnerabilities is an important part of it. Most existing vulnerability detectors rely on the features of a single code type (e.g., source code or intermediate representation [IR]), so they may fail to capture the global features of the code slices and the memory operation information. In particular, vulnerability detection based on source‐code features usually cannot include macro or type‐definition content. In this paper, we propose a vulnerability‐detection method that combines the semantic features of source code and the low level virtual machine (LLVM) IR. The approach first slices C/C++ source files using improved slicing techniques to cover more comprehensive code information. It then extracts semantic information from the LLVM IR based on the executable source code, which enriches the features fed to the artificial neural network (ANN) model for learning. We conducted an experimental evaluation using a publicly available dataset of 11,381 C/C++ programs. The experimental results show that the vulnerability‐detection accuracy of the proposed method exceeds 96% for code slices generated according to four different slicing criteria, outperforming most of the compared detection methods. The proposed model integrates source code with the semantic features of the LLVM IR, incorporating new code information for enhanced detection. It also uses a deep residual network and nonlinear bidirectional feature fusion to reduce noise and improve learning ability, which can significantly boost detection accuracy.
Bibliography:	This work was partly supported by the National Natural Science Foundation of China (NSFC) (grant nos. 62172194, 62202206, and U1836116), the Postgraduate Research & Practice Innovation Program of Jiangsu Province (grant no. SJCX24_2396), the Natural Science Foundation of Jiangsu Province (grant no. BK20220515), the Leading‐edge Technology Program of Jiangsu Natural Science Foundation (grant no. BK20202001), and the China Postdoctoral Science Foundation (grant no. 2023T160275).
ISSN:	2047-7473 2047-7481
DOI:	10.1002/smr.70026
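
Illustration (not part of the catalog record): the Summary describes slicing C/C++ source files according to slicing criteria before features are extracted. The sketch below shows only the general idea of a backward slice and is not the authors' improved slicing technique. It assumes straight-line code and follows data dependences only; the `Stmt` layout, the `backward_slice` function, the toy `PROGRAM`, and the choice of a `memcpy` call as the slicing criterion are all hypothetical names chosen for this example.

```python
from typing import List, Set, Tuple

# Hypothetical statement record: (line number, source text, variables defined, variables used).
Stmt = Tuple[int, str, Set[str], Set[str]]


def backward_slice(stmts: List[Stmt], criterion_line: int) -> List[int]:
    """Return the line numbers the criterion line depends on (data dependences only).

    Assumes straight-line code: no branches, loops, calls, or pointer aliasing.
    """
    by_line = {s[0]: s for s in stmts}
    # Variables whose most recent definition we still need to find.
    wanted = set(by_line[criterion_line][3])
    kept = [criterion_line]

    # Walk upward from the criterion, nearest statement first.
    for line, _text, defs, uses in sorted(stmts, key=lambda s: s[0], reverse=True):
        if line >= criterion_line:
            continue
        if defs & wanted:
            kept.append(line)
            wanted -= defs   # this definition satisfies the request...
            wanted |= uses   # ...but its own inputs are now needed
    return sorted(kept)


# Toy C-like program; the criterion is the memcpy call on line 6.
PROGRAM: List[Stmt] = [
    (1, "char buf[8];",              {"buf"},    set()),
    (2, "int n = read_len();",       {"n"},      set()),
    (3, "int unused = 42;",          {"unused"}, set()),
    (4, "char *src = get_input();",  {"src"},    set()),
    (5, "n = n + 1;",                {"n"},      {"n"}),
    (6, "memcpy(buf, src, n);",      {"buf"},    {"buf", "src", "n"}),
]

if __name__ == "__main__":
    for line in backward_slice(PROGRAM, 6):
        print(line, by_line_text := dict((s[0], s[1]) for s in PROGRAM)[line])
```

In this toy program, line 3 never influences the `memcpy` arguments, so it is left out of the slice; a real slicer would also have to handle control dependences and calls across functions.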