Leveraging Large Language Models to Detect NPM Malicious Packages


Bibliographic Details
Published in: Proceedings / International Conference on Software Engineering, pp. 2625–2637
Main Authors: Zahan, Nusrat; Burckhardt, Philipp; Lysenko, Mikola; Aboukhadijeh, Feross; Williams, Laurie
Format: Conference Proceeding
Language: English
Published: IEEE, 26.04.2025

Summary: Existing malicious code detection techniques demand the integration of multiple tools to detect different malware patterns, often suffering from high misclassification rates. Therefore, malicious code detection techniques could be enhanced by adopting advanced, more automated approaches to achieve high accuracy and a low misclassification rate. The goal of this study is to aid security analysts in detecting malicious packages by empirically studying the effectiveness of Large Language Models (LLMs) in detecting malicious code. We present SocketAI, a malicious code review workflow to detect malicious code. To evaluate the effectiveness of SocketAI, we leverage a benchmark dataset of 5,115 npm packages, of which 2,180 contain malicious code. We conducted a baseline comparison of the GPT-3 and GPT-4 models with the state-of-the-art CodeQL static analysis tool, using 39 custom CodeQL rules developed in prior research to detect malicious JavaScript code. We also evaluate the effectiveness of static analysis as a pre-screener for the SocketAI workflow, measuring the number of files that need to be analyzed and the associated costs. Additionally, we performed a qualitative study to understand the types of malicious packages detected or missed by our workflow. Our baseline comparison demonstrates a 16% and 9% improvement over static analysis in precision and F1 scores, respectively. GPT-4 achieves higher accuracy with 99% precision and 97% F1 scores, while GPT-3 offers a more cost-effective balance at 91% precision and 94% F1 scores. Pre-screening files with a static analyzer reduces the number of files requiring LLM analysis by 77.9% and decreases costs by 60.9% for GPT-3 and 76.1% for GPT-4. Our qualitative analysis identified data theft, execution of arbitrary code, and suspicious domains as the top categories of detected malicious packages.
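As a rough illustration of the two-stage workflow the summary describes, the following Python sketch shows how a cheap static pre-screener might gate which files are forwarded to an LLM reviewer. All names and heuristics here (static_prescreen, llm_review, SUSPICIOUS_TOKENS) are hypothetical placeholders for illustration only; they are not the paper's 39 CodeQL rules, prompts, or API.

# A minimal sketch of the pre-screening pipeline described in the summary:
# a static pass flags suspicious JavaScript files, and only flagged files
# are forwarded to an LLM for review. The token list and stub functions
# are illustrative stand-ins, not the paper's actual rules or model calls.

from dataclasses import dataclass

# Toy stand-in for the paper's 39 custom CodeQL rules.
SUSPICIOUS_TOKENS = ("child_process", "eval(", "http.request", "Buffer.from(")

@dataclass
class Verdict:
    path: str
    malicious: bool
    reason: str

def static_prescreen(source: str) -> bool:
    """Stand-in for the static pre-screener: flag a file if any
    suspicious token appears in its source."""
    return any(tok in source for tok in SUSPICIOUS_TOKENS)

def llm_review(path: str, source: str) -> Verdict:
    """Placeholder for the GPT-3/GPT-4 review step. A real version
    would prompt the model with the file contents and parse its answer."""
    return Verdict(path, malicious=True, reason="flagged by pre-screener (stub)")

def review_package(js_files: dict[str, str]) -> list[Verdict]:
    """Run the two-stage review over a package's {path: source} files."""
    verdicts = []
    for path, source in js_files.items():
        # Pre-screening: skip files the static pass considers benign.
        # The paper reports this cuts LLM-analyzed files by 77.9%.
        if not static_prescreen(source):
            continue
        verdicts.append(llm_review(path, source))
    return verdicts

if __name__ == "__main__":
    demo = {
        "index.js": "module.exports = () => 42;",
        "postinstall.js": "require('child_process').exec('node steal.js');",
    }
    for v in review_package(demo):
        print(v)

The design point the sketch illustrates is that the LLM only ever sees files the inexpensive static pass has flagged, which is where the reported cost reductions of 60.9% (GPT-3) and 76.1% (GPT-4) come from.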
ISSN: 1558-1225
DOI: 10.1109/ICSE55347.2025.00146