Analyzing source code vulnerabilities in the D2A dataset with ML ensembles and C-BERT

Bibliographic Details
Published in Empirical software engineering : an international journal Vol. 29; no. 2; p. 48
Main Authors Pujar, Saurabh; Zheng, Yunhui; Buratti, Luca; Lewis, Burn; Chen, Yunchung; Laredo, Jim; Morari, Alessandro; Epstein, Edward; Lin, Tsungnan; Yang, Bo; Su, Zhong
Format Journal Article
Language English
Published New York Springer US 01.03.2024
Springer Nature B.V
Abstract Static analysis tools are widely used for vulnerability detection as they can analyze programs with complex behavior and millions of lines of code. Despite their popularity, static analysis tools are known to generate an excess of false positives. The recent ability of Machine Learning models to learn from programming language data opens new possibilities of reducing false positives when applied to static analysis. However, existing datasets to train models for vulnerability identification suffer from multiple limitations such as limited bug context, limited size, and synthetic and unrealistic source code. We propose Differential Dataset Analysis or D2A, a differential analysis based approach to label issues reported by static analysis tools. The dataset built with this approach is called the D2A dataset. The D2A dataset is built by analyzing version pairs from multiple open source projects. From each project, we select bug fixing commits and we run static analysis on the versions before and after such commits. If some issues detected in a before-commit version disappear in the corresponding after-commit version, they are very likely to be real bugs that got fixed by the commit. We use D2A to generate a large labeled dataset. We then train both classic machine learning models and deep learning models for vulnerability identification using the D2A dataset. We show that the dataset can be used to build a classifier to identify possible false alarms among the issues reported by static analysis, hence helping developers prioritize and investigate potential true positives first. To facilitate future research and contribute to the community, we make the dataset generation pipeline and the dataset publicly available. We have also created a leaderboard based on the D2A dataset, which has already attracted attention and participation from the community.
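The labeling rule described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `Issue` fields, the `label_issues` helper, and the sample findings are hypothetical, and issues are matched across versions by exact equality, which is a simplification. Treating an issue that persists after the commit as a candidate false alarm is likewise an assumption of this sketch.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    """One static-analysis finding. Fields are illustrative, not the D2A schema."""
    bug_type: str
    file: str
    procedure: str


def label_issues(before: set[Issue], after: set[Issue]) -> dict[Issue, int]:
    """Label every issue reported on the before-commit version.

    1: the issue disappears after the bug-fixing commit (likely a real bug).
    0: the issue is still reported afterwards (assumed candidate false alarm).
    """
    return {issue: int(issue not in after) for issue in before}


# Hypothetical analyzer output for one version pair around a bug-fixing commit.
before = {
    Issue("NULL_DEREFERENCE", "src/parse.c", "parse_header"),
    Issue("BUFFER_OVERRUN", "src/io.c", "read_block"),
}
after = {Issue("BUFFER_OVERRUN", "src/io.c", "read_block")}

labels = label_issues(before, after)
```

Here the null-dereference report is labeled 1 because it is absent from the after-commit results, while the buffer-overrun report is labeled 0. A real pipeline would need a more tolerant way of matching issues, since line numbers and traces shift between versions.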
ArticleNumber 48
Author Chen, Yunchung
Zheng, Yunhui
Lewis, Burn
Morari, Alessandro
Pujar, Saurabh
Epstein, Edward
Lin, Tsungnan
Yang, Bo
Laredo, Jim
Buratti, Luca
Su, Zhong
Author_xml – sequence: 1
  givenname: Saurabh
  orcidid: 0000-0002-9772-3162
  surname: Pujar
  fullname: Pujar, Saurabh
  email: saurabh.pujar@ibm.com
  organization: IBM T. J. Watson Research Center
– sequence: 2
  givenname: Yunhui
  surname: Zheng
  fullname: Zheng, Yunhui
  organization: IBM T. J. Watson Research Center
– sequence: 3
  givenname: Luca
  surname: Buratti
  fullname: Buratti, Luca
  organization: IBM T. J. Watson Research Center
– sequence: 4
  givenname: Burn
  surname: Lewis
  fullname: Lewis, Burn
  organization: IBM T. J. Watson Research Center
– sequence: 5
  givenname: Yunchung
  surname: Chen
  fullname: Chen, Yunchung
  organization: National Taiwan University
– sequence: 6
  givenname: Jim
  surname: Laredo
  fullname: Laredo, Jim
  organization: IBM T. J. Watson Research Center
– sequence: 7
  givenname: Alessandro
  surname: Morari
  fullname: Morari, Alessandro
  organization: IBM T. J. Watson Research Center
– sequence: 8
  givenname: Edward
  surname: Epstein
  fullname: Epstein, Edward
  organization: IBM T. J. Watson Research Center
– sequence: 9
  givenname: Tsungnan
  surname: Lin
  fullname: Lin, Tsungnan
  organization: National Taiwan University
– sequence: 10
  givenname: Bo
  surname: Yang
  fullname: Yang, Bo
  organization: IBM Research
– sequence: 11
  givenname: Zhong
  surname: Su
  fullname: Su, Zhong
  organization: IBM Research
ContentType Journal Article
Copyright The Author(s) 2024
The Author(s) 2024. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
DOI 10.1007/s10664-023-10405-9
Discipline Computer Science
EISSN 1573-7616
ISSN 1382-3256
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Issue 2
Keywords AI
D2A
Bert
Leaderboard
Dataset
Vulnerability detection
Language English
ORCID 0000-0002-9772-3162
OpenAccessLink https://proxy.k.utb.cz/login?url=http://link.springer.com/10.1007/s10664-023-10405-9
PublicationCentury 2000
PublicationDate 2024-03-01
PublicationDateYYYYMMDD 2024-03-01
PublicationDate_xml – month: 03
  year: 2024
  text: 2024-03-01
  day: 01
PublicationDecade 2020
PublicationPlace New York
PublicationPlace_xml – name: New York
– name: Dordrecht
PublicationSubtitle An International Journal
PublicationTitle Empirical software engineering : an international journal
PublicationTitleAbbrev Empir Software Eng
PublicationYear 2024
Publisher Springer US
Springer Nature B.V
Publisher_xml – name: Springer US
– name: Springer Nature B.V
StartPage 48
SubjectTerms Community participation
Compilers
Computer Science
Datasets
Deep learning
False alarms
Interpreters
Machine learning
Programming Languages
Software Engineering/Programming and Operating Systems
Source code
Special Issue on Software Engineering in Practice
Static code analysis
Title Analyzing source code vulnerabilities in the D2A dataset with ML ensembles and C-BERT
URI https://link.springer.com/article/10.1007/s10664-023-10405-9
https://www.proquest.com/docview/2930090084
Volume 29