Analyzing source code vulnerabilities in the D2A dataset with ML ensembles and C-BERT

Bibliographic Details
Published in	Empirical software engineering : an international journal Vol. 29; no. 2; p. 48
Main Authors	Pujar, Saurabh; Zheng, Yunhui; Buratti, Luca; Lewis, Burn; Chen, Yunchung; Laredo, Jim; Morari, Alessandro; Epstein, Edward; Lin, Tsungnan; Yang, Bo; Su, Zhong
Format	Journal Article
Language	English
Published	New York: Springer US, 01.03.2024; Springer Nature B.V.
Subjects	Community participation; Compilers; Computer Science; Datasets; Deep learning; False alarms; Interpreters; Machine learning; Programming Languages; Software Engineering/Programming and Operating Systems; Source code; Special Issue on Software Engineering in Practice; Static code analysis; AI; D2A; Bert; Leaderboard; Dataset; Vulnerability detection
Summary:	Static analysis tools are widely used for vulnerability detection as they can analyze programs with complex behavior and millions of lines of code. Despite their popularity, static analysis tools are known to generate an excess of false positives. The recent ability of Machine Learning models to learn from programming language data opens new possibilities for reducing false positives when applied to static analysis. However, existing datasets to train models for vulnerability identification suffer from multiple limitations such as limited bug context, limited size, and synthetic and unrealistic source code. We propose Differential Dataset Analysis, or D2A, a differential-analysis-based approach to label issues reported by static analysis tools. The dataset built with this approach is called the D2A dataset. The D2A dataset is built by analyzing version pairs from multiple open-source projects. From each project, we select bug-fixing commits and run static analysis on the versions before and after such commits. If some issues detected in a before-commit version disappear in the corresponding after-commit version, they are very likely to be real bugs that got fixed by the commit. We use D2A to generate a large labeled dataset. We then train both classic machine learning models and deep learning models for vulnerability identification using the D2A dataset. We show that the dataset can be used to build a classifier to identify possible false alarms among the issues reported by static analysis, hence helping developers prioritize and investigate potential true positives first. To facilitate future research and contribute to the community, we make the dataset generation pipeline and the dataset publicly available. We have also created a leaderboard based on the D2A dataset, which has already attracted attention and participation from the community.
ISSN:	1382-3256 1573-7616
DOI:	10.1007/s10664-023-10405-9
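
The differential labeling idea described in the summary can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `Issue` record, its fields, the `label_issues` function, and the sample file and procedure names are all hypothetical, and treating an issue that survives the fix as a likely false alarm (label 0) is an assumption made here for illustration. The sketch only shows the core comparison: an issue reported before a bug-fixing commit that is no longer reported afterwards is labeled as a likely real bug.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    """Hypothetical issue signature; real analyzer reports carry more detail."""
    bug_type: str
    file: str
    procedure: str


def label_issues(before, after):
    """Label each before-commit issue by comparing against the after-commit run.

    1 = issue disappeared after the fix (likely a real bug),
    0 = issue is still reported (assumed here to be a likely false alarm).
    """
    after_set = set(after)
    return {issue: int(issue not in after_set) for issue in before}


# Made-up analyzer output for one bug-fixing commit.
before = [
    Issue("BUFFER_OVERRUN", "src/parse.c", "read_header"),
    Issue("NULL_DEREFERENCE", "src/util.c", "make_node"),
]
after = [
    Issue("NULL_DEREFERENCE", "src/util.c", "make_node"),
]

labels = label_issues(before, after)
for issue, label in labels.items():
    print(issue.bug_type, issue.procedure, label)
```

The signature in this sketch deliberately leaves out line numbers, on the assumption that they shift between versions and would make an unchanged issue look like a different one; how issues are actually matched across versions is a design decision the summary does not specify.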