Enabling Privacy-Preserving Cyber Threat Detection with Federated Learning

Bibliographic Details
Main Authors	Bi, Yu; Li, Yekai; Feng, Xuan; Mi, Xianghang
Format	Journal Article
Language	English
Published	2024-04-07
Subjects	Computer Science - Cryptography and Security
Online Access	Get full text
DOI	10.48550/arxiv.2404.05130

Abstract	Despite good performance and wide adoption, machine-learning-based security detection models (e.g., malware classifiers) are subject to concept drift and to the evasive evolution of attackers, which makes up-to-date threat data a necessity. However, due to the enforcement of privacy protection regulations (e.g., GDPR), it is becoming increasingly challenging or even prohibitive for security vendors to collect individual-relevant and privacy-sensitive threat datasets, e.g., SMS spam/non-spam messages from mobile devices. To address such obstacles, this study systematically profiles the (in)feasibility of federated learning (FL) for privacy-preserving cyber threat detection in terms of effectiveness, Byzantine resilience, and efficiency. This is made possible by building multiple threat datasets and threat detection models and, more importantly, by designing realistic and security-specific experiments. We evaluate FL on two representative threat detection tasks, namely SMS spam detection and Android malware detection. The results show that FL-trained detection models can achieve performance comparable to that of centrally trained counterparts. Also, most non-IID data distributions have only minor or negligible impact on model performance, while a highly skewed label-based non-IID distribution can incur non-negligible fluctuation and delay in FL training. Then, under a realistic threat model, FL turns out to be resistant to both data poisoning and model poisoning attacks. In particular, a practical data poisoning attack causes no more than a 0.14% loss in model accuracy. Regarding FL efficiency, a bootstrapping strategy turns out to be effective in mitigating the training delay observed in label-based non-IID scenarios.
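For readers unfamiliar with federated learning, the following is a minimal sketch of one common aggregation scheme, federated averaging (FedAvg). The abstract does not state which aggregation algorithm, models, or hyperparameters the authors used, so everything here is an illustrative assumption: the function names (`local_update`, `fed_avg`, `train`), the one-parameter toy model, and the toy client data are hypothetical and are not taken from the paper. Each client fits the model on its own data, and only the resulting weights (never the raw samples) are sent to the server, which averages them weighted by client dataset size.

```python
# Toy FedAvg sketch (an assumption for illustration, not the paper's method).
# Model: y = w * x with a single scalar weight w, trained with squared loss.

def local_update(w, data, lr=0.1, epochs=5):
    """Run a few gradient-descent epochs on one client's private data."""
    for _ in range(epochs):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def fed_avg(weights, sizes):
    """Average client weights, weighted by each client's sample count."""
    total = sum(sizes)
    return sum(w * n for w, n in zip(weights, sizes)) / total

def train(clients, rounds=20):
    """Server loop: broadcast w, collect local updates, aggregate."""
    w = 0.0
    for _ in range(rounds):
        local_weights = [local_update(w, data) for data in clients]
        w = fed_avg(local_weights, [len(data) for data in clients])
    return w

# Hypothetical clients; every sample follows y = 2 * x, but the
# clients hold different amounts and ranges of data.
clients = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(0.5, 1.0)],
    [(1.5, 3.0), (3.0, 6.0), (1.0, 2.0)],
]
global_w = train(clients)
print(round(global_w, 6))
```

In this toy setting the aggregated weight approaches the true slope even though no client shares its samples. A real threat-detection deployment would replace the scalar model with a classifier and would need the poisoning defenses and non-IID handling that the study evaluates.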
Copyright	http://arxiv.org/licenses/nonexclusive-distrib/1.0
IsDoiOpenAccess	true
IsOpenAccess	true
IsPeerReviewed	false
IsScholarly	false
Language	English
OpenAccessLink	https://arxiv.org/abs/2404.05130
SecondaryResourceType	preprint