Analyzing source code vulnerabilities in the D2A dataset with ML ensembles and C-BERT

Bibliographic Details
Published in	Empirical software engineering : an international journal Vol. 29; no. 2; p. 48
Main Authors	Pujar, Saurabh, Zheng, Yunhui, Buratti, Luca, Lewis, Burn, Chen, Yunchung, Laredo, Jim, Morari, Alessandro, Epstein, Edward, Lin, Tsungnan, Yang, Bo, Su, Zhong
Format	Journal Article
Language	English
Published	New York: Springer US; Springer Nature B.V., 01.03.2024
Subjects	Community participation; Compilers; Computer Science; Datasets; Deep learning; False alarms; Interpreters; Machine learning; Programming Languages; Software Engineering/Programming and Operating Systems; Source code; Special Issue on Software Engineering in Practice; Static code analysis; AI; D2A; Bert; Leaderboard; Dataset; Vulnerability detection

Abstract	Static analysis tools are widely used for vulnerability detection because they can analyze programs with complex behavior and millions of lines of code. Despite their popularity, they are known to generate an excess of false positives. The recent ability of machine learning models to learn from programming language data opens new possibilities for reducing false positives when applied to static analysis. However, existing datasets for training vulnerability identification models suffer from multiple limitations, such as limited bug context, limited size, and synthetic, unrealistic source code. We propose Differential Dataset Analysis, or D2A, a differential-analysis-based approach to labeling issues reported by static analysis tools. The dataset built with this approach is called the D2A dataset. It is built by analyzing version pairs from multiple open-source projects. From each project, we select bug-fixing commits and run static analysis on the versions before and after each such commit. If issues detected in a before-commit version disappear in the corresponding after-commit version, they are very likely to be real bugs that were fixed by the commit. We use D2A to generate a large labeled dataset. We then train both classic machine learning models and deep learning models for vulnerability identification using the D2A dataset. We show that the dataset can be used to build a classifier that identifies possible false alarms among the issues reported by static analysis, helping developers prioritize and investigate potential true positives first. To facilitate future research and contribute to the community, we make the dataset generation pipeline and the dataset publicly available. We have also created a leaderboard based on the D2A dataset, which has already attracted attention and participation from the community.
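
The differential labeling idea in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the `Issue` fields, the `label_issues` helper, and the rule of matching issues by exact equality of (bug type, file, procedure) are all assumptions made for illustration, and treating issues that persist after the fix as likely false alarms (label 0) is likewise an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass(frozen=True)
class Issue:
    """One static-analysis report. Fields are hypothetical, not the D2A schema."""
    bug_type: str
    file: str
    procedure: str


def label_issues(before: Set[Issue], after: Set[Issue]) -> Dict[Issue, int]:
    """Label every issue reported on the before-commit version.

    1: the issue disappears after the bug-fixing commit (likely a real bug).
    0: the issue is still reported afterwards (assumed here to be a likely
       false alarm with respect to this commit).
    """
    return {issue: int(issue not in after) for issue in before}


# Hypothetical reports for one bug-fixing commit.
overrun = Issue("BUFFER_OVERRUN", "parser.c", "parse_header")
null_deref = Issue("NULL_DEREFERENCE", "init.c", "setup")

labels = label_issues(before={overrun, null_deref}, after={null_deref})
for issue, label in labels.items():
    print(issue.bug_type, label)
```

Exact-equality matching is the simplest possible choice; a real analyzer's reports shift with line numbers and refactoring, so a practical matcher would need a more tolerant comparison than the one assumed here.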
Author Affiliations	Pujar, Saurabh (IBM T. J. Watson Research Center; ORCID 0000-0002-9772-3162; saurabh.pujar@ibm.com); Zheng, Yunhui (IBM T. J. Watson Research Center); Buratti, Luca (IBM T. J. Watson Research Center); Lewis, Burn (IBM T. J. Watson Research Center); Chen, Yunchung (National Taiwan University); Laredo, Jim (IBM T. J. Watson Research Center); Morari, Alessandro (IBM T. J. Watson Research Center); Epstein, Edward (IBM T. J. Watson Research Center); Lin, Tsungnan (National Taiwan University); Yang, Bo (IBM Research); Su, Zhong (IBM Research)
Copyright	The Author(s) 2024. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
DOI	10.1007/s10664-023-10405-9
EISSN	1573-7616
ISSN	1382-3256