Show Me Your Code! Kill Code Poisoning: A Lightweight Method Based on Code Naturalness
Neural code models (NCMs) have demonstrated extraordinary capabilities in code intelligence tasks. Meanwhile, the security of NCMs and NCMs-based systems has garnered increasing attention. In particular, NCMs are often trained on large-scale data from potentially untrustworthy sources, providing att...
Published in | Proceedings / International Conference on Software Engineering pp. 2663 - 2675 |
---|---|
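Main Authors | Sun, Weisong; Chen, Yuchen; Yuan, Mengzhe; Fang, Chunrong; Chen, Zhenpeng; Wang, Chong; Liu, Yang; Xu, Baowen; Chen, Zhenyu |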
Format | Conference Proceeding |
Language | English |
Published | IEEE, 26.04.2025 |
Subjects | |
Abstract | Neural code models (NCMs) have demonstrated extraordinary capabilities in code intelligence tasks. Meanwhile, the security of NCMs and NCM-based systems has garnered increasing attention. In particular, NCMs are often trained on large-scale data from potentially untrustworthy sources, which gives attackers the opportunity to manipulate them by inserting crafted samples into the data. This type of attack is called a code poisoning attack (also known as a backdoor attack). It allows attackers to implant backdoors in NCMs and thus control model behavior, which poses a significant security threat. However, there is still a lack of effective techniques for detecting various complex code poisoning attacks. In this paper, we propose an innovative and lightweight technique for code poisoning detection named KillBadCode. KillBadCode is designed based on our insight that code poisoning disrupts the naturalness of code. Specifically, KillBadCode first builds a code language model (CodeLM) on a lightweight n-gram language model. Then, given poisoned data, KillBadCode uses CodeLM to identify as trigger tokens those tokens in (poisoned) code snippets whose deletion makes the snippets more natural. Because removing some normal tokens from a single sample might also enhance code naturalness, leading to a high false positive rate (FPR), we aggregate the cumulative improvement of each token across all samples. Finally, KillBadCode purifies the poisoned data by removing all poisoned samples containing the identified trigger tokens. We conduct extensive experiments to evaluate the effectiveness and efficiency of KillBadCode, involving two types of advanced code poisoning attacks (a total of five poisoning strategies) and datasets from four representative code intelligence tasks.
The experimental results demonstrate that across 20 code poisoning detection scenarios, KillBadCode achieves an average FPR of 8.30% and an average Recall of 100%, significantly outperforming four baselines. More importantly, KillBadCode is very efficient, with a minimum time consumption of only 5 minutes, and is 25 times faster than the best baseline on average. |
---|---|
Author | Chen, Zhenyu; Wang, Chong; Sun, Weisong; Liu, Yang; Xu, Baowen; Chen, Zhenpeng; Chen, Yuchen; Yuan, Mengzhe; Fang, Chunrong |
Author_xml | – sequence: 1 givenname: Weisong surname: Sun fullname: Sun, Weisong email: weisong.sun@ntu.edu.sg organization: College of Computing and Data Science, Nanyang Technological University,Singapore – sequence: 2 givenname: Yuchen surname: Chen fullname: Chen, Yuchen email: yuc.chen@smail.nju.edu.cn organization: Nanjing University,State Key Laboratory for Novel Software Technology,Nanjing,China – sequence: 3 givenname: Mengzhe surname: Yuan fullname: Yuan, Mengzhe email: shiroha123321@gmail.com organization: Nanjing University,State Key Laboratory for Novel Software Technology,Nanjing,China – sequence: 4 givenname: Chunrong surname: Fang fullname: Fang, Chunrong email: fangchunrong@nju.edu.cn organization: Nanjing University,State Key Laboratory for Novel Software Technology,Nanjing,China – sequence: 5 givenname: Zhenpeng surname: Chen fullname: Chen, Zhenpeng email: zhenpeng.chen@ntu.edu.sg organization: College of Computing and Data Science, Nanyang Technological University,Singapore – sequence: 6 givenname: Chong surname: Wang fullname: Wang, Chong email: chong.wang@ntu.edu.sg organization: College of Computing and Data Science, Nanyang Technological University,Singapore – sequence: 7 givenname: Yang surname: Liu fullname: Liu, Yang email: yangliu@ntu.edu.sg organization: College of Computing and Data Science, Nanyang Technological University,Singapore – sequence: 8 givenname: Baowen surname: Xu fullname: Xu, Baowen email: bwxu@nju.edu.cn organization: Nanjing University,State Key Laboratory for Novel Software Technology,Nanjing,China – sequence: 9 givenname: Zhenyu surname: Chen fullname: Chen, Zhenyu email: zychen@nju.edu.cn organization: Nanjing University,State Key Laboratory for Novel Software Technology,Nanjing,China |
CODEN | IEEPAD |
ContentType | Conference Proceeding |
DBID | 6IE 6IH CBEJK RIE RIO |
DOI | 10.1109/ICSE55347.2025.00247 |
DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings; IEEE Proceedings Order Plan (POP) 1998-present by volume; IEEE Xplore All Conference Proceedings; IEEE Electronic Library (IEL); IEEE Proceedings Order Plans (POP) 1998-present |
DatabaseTitleList | |
Database_xml | – sequence: 1 dbid: RIE name: IEEE Xplore Digital Library url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/ sourceTypes: Publisher |
DeliveryMethod | fulltext_linktorsrc |
Discipline | Computer Science |
EISBN | 9798331505691 |
EISSN | 1558-1225 |
EndPage | 2675 |
ExternalDocumentID | 11029807 |
Genre | orig-research |
GrantInformation_xml | – fundername: Fundamental Research Funds for the Central Universities grantid: 14380029 funderid: 10.13039/501100012226 – fundername: Science, Technology and Innovation Commission of Shenzhen Municipality grantid: CJGJZD20200617103001003,2021Szvup057 funderid: 10.13039/501100010877 – fundername: National Research Foundation funderid: 10.13039/501100001321 – fundername: National Natural Science Foundation of China grantid: 61932012,62372228,U24A20337 funderid: 10.13039/501100001809 |
GroupedDBID | -~X .4S .DC 29O 5VS 6IE 6IF 6IH 6IK 6IL 6IM 6IN 8US AAJGR AAWTH ABLEC ADZIZ ALMA_UNASSIGNED_HOLDINGS ARCSS AVWKF BEFXN BFFAM BGNUA BKEBE BPEOZ CBEJK CHZPO EDO FEDTE I-F IEGSK IJVOP IPLJI M43 OCL RIE RIL RIO |
ID | FETCH-LOGICAL-i93t-3faf61eeeb18e861cf1e1d05e6a3e1eee30e74451f79e62333d8d523f1bd914e3 |
IEDL.DBID | RIE |
IngestDate | Wed Aug 27 01:40:13 EDT 2025 |
IsPeerReviewed | false |
IsScholarly | true |
Language | English |
LinkModel | DirectLink |
MergedId | FETCHMERGED-LOGICAL-i93t-3faf61eeeb18e861cf1e1d05e6a3e1eee30e74451f79e62333d8d523f1bd914e3 |
PageCount | 13 |
ParticipantIDs | ieee_primary_11029807 |
PublicationCentury | 2000 |
PublicationDate | 2025-April-26 |
PublicationDateYYYYMMDD | 2025-04-26 |
PublicationDate_xml | – month: 04 year: 2025 text: 2025-April-26 day: 26 |
PublicationDecade | 2020 |
PublicationTitle | Proceedings / International Conference on Software Engineering |
PublicationTitleAbbrev | ICSE |
PublicationYear | 2025 |
Publisher | IEEE |
Publisher_xml | – name: IEEE |
SSID | ssj0006499 |
Score | 2.3031695 |
SourceID | ieee |
SourceType | Publisher |
StartPage | 2663 |
SubjectTerms | Aggregates; code intelligence; code naturalness; code poisoning attack and defense; Codes; Data models; Implants; neural code models; Security; Software engineering; Training |
Title | Show Me Your Code! Kill Code Poisoning: A Lightweight Method Based on Code Naturalness |
URI | https://ieeexplore.ieee.org/document/11029807 |
hasFullText | 1 |
inHoldings | 1 |
isFullTextHit | |
isPrint | |
linkProvider | IEEE |
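
The detection pipeline summarized in the abstract (an n-gram CodeLM, per-token deletion scoring, aggregation across samples, and removal of samples that contain the identified triggers) can be illustrated with a small sketch. This is not the authors' implementation. It assumes a small clean corpus is available to train the language model, uses add-one smoothing and average cross-entropy as the naturalness measure, and picks the `top_k` tokens with the largest cumulative gain; all of these choices, and every function name below, are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def train_ngram(corpus, n=2):
    """Count n-grams and their contexts over tokenized snippets (hypothetical helper)."""
    ngrams, contexts, vocab = Counter(), Counter(), set()
    for tokens in corpus:
        padded = ["<s>"] * (n - 1) + list(tokens) + ["</s>"]
        vocab.update(padded)
        for i in range(n - 1, len(padded)):
            ctx = tuple(padded[i - n + 1:i])
            ngrams[ctx + (padded[i],)] += 1
            contexts[ctx] += 1
    return ngrams, contexts, len(vocab), n


def cross_entropy(tokens, model):
    """Average negative log2-probability per token; lower means more natural."""
    ngrams, contexts, vocab_size, n = model
    padded = ["<s>"] * (n - 1) + list(tokens) + ["</s>"]
    total = 0.0
    for i in range(n - 1, len(padded)):
        ctx = tuple(padded[i - n + 1:i])
        # Add-one smoothing (an assumption; the abstract does not name a scheme).
        p = (ngrams[ctx + (padded[i],)] + 1) / (contexts[ctx] + vocab_size)
        total -= math.log2(p)
    return total / (len(padded) - n + 1)


def find_triggers(samples, model, top_k=1):
    """Sum, per token, the naturalness gain from deleting it, across all samples."""
    gain = defaultdict(float)
    for tokens in samples:
        tokens = list(tokens)
        base = cross_entropy(tokens, model)
        for i, tok in enumerate(tokens):
            reduced = tokens[:i] + tokens[i + 1:]
            gain[tok] += base - cross_entropy(reduced, model)
    # top_k selection is an illustrative assumption, not a rule from the source.
    ranked = sorted(gain, key=gain.get, reverse=True)
    return set(ranked[:top_k])


def purify(samples, triggers):
    """Drop every sample that contains at least one identified trigger token."""
    return [s for s in samples if not triggers & set(s)]


# Toy data: "rb" plays the role of an injected trigger token.
clean_corpus = [
    "def f ( x ) : return x".split(),
    "def g ( y ) : return y".split(),
    "def h ( x , y ) : return x + y".split(),
]
dataset = [
    "def f ( x ) : rb return x".split(),
    "def g ( y ) : rb return y".split(),
    "def h ( x , y ) : return x + y".split(),
    "def f ( x ) : return x".split(),
]
model = train_ngram(clean_corpus, n=2)
triggers = find_triggers(dataset, model, top_k=1)
purified = purify(dataset, triggers)
```

Summing the gain over all samples, rather than thresholding within a single sample, mirrors the abstract's point that a normal token may improve naturalness in one snippet by chance, while a trigger token does so consistently across the poisoned samples.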