Enabling Privacy-Preserving Cyber Threat Detection with Federated Learning

Bibliographic Details
Main Authors Bi, Yu; Li, Yekai; Feng, Xuan; Mi, Xianghang
Format Journal Article
Language English
Published 2024-04-07
DOI 10.48550/arxiv.2404.05130

Abstract Despite strong performance and wide adoption, machine-learning-based security detection models (e.g., malware classifiers) are subject to concept drift and the evasive evolution of attackers, which makes up-to-date threat data a necessity. However, due to the enforcement of various privacy protection regulations (e.g., GDPR), it is becoming increasingly challenging or even prohibitive for security vendors to collect individual-relevant and privacy-sensitive threat datasets, e.g., SMS spam/non-spam messages from mobile devices. To address such obstacles, this study systematically profiles the (in)feasibility of federated learning (FL) for privacy-preserving cyber threat detection in terms of effectiveness, Byzantine resilience, and efficiency. This is made possible by building multiple threat datasets and threat detection models and, more importantly, by designing realistic and security-specific experiments. We evaluate FL on two representative threat detection tasks, namely SMS spam detection and Android malware detection. The results show that FL-trained detection models can achieve performance comparable to that of centrally trained counterparts. Also, most non-IID data distributions have either minor or negligible impact on model performance, while a highly skewed label-based non-IID distribution can incur non-negligible fluctuation and delay in FL training. Then, under a realistic threat model, FL turns out to be resistant to both data poisoning and model poisoning attacks. In particular, a practical data poisoning attack causes no more than a 0.14% loss in model accuracy. Regarding FL efficiency, a bootstrapping strategy turns out to be effective in mitigating the training delay observed in label-based non-IID scenarios.
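For readers unfamiliar with the terms used in the abstract, the following is a minimal illustrative sketch, not the authors' implementation, of two ideas it mentions: a label-based non-IID split of training data across clients, and the sample-weighted averaging of client model weights commonly known as FedAvg. The function names `label_skew_partition` and `fedavg`, the `skew` parameter, and the rule that sends a label to client `label % num_clients` are all assumptions made here for illustration; the record does not say which partitioning scheme or aggregation rule the paper uses.

```python
import random
from typing import List, Sequence, Tuple


def label_skew_partition(
    labels: Sequence[int], num_clients: int, skew: float, seed: int = 0
) -> List[List[int]]:
    """Split sample indices across clients with a label-based skew.

    Hypothetical scheme: with probability `skew`, a sample goes to the
    "home" client of its label (label % num_clients); otherwise it goes
    to a uniformly random client. skew=0.0 is roughly IID, skew=1.0 is
    fully label-partitioned.
    """
    rng = random.Random(seed)
    clients: List[List[int]] = [[] for _ in range(num_clients)]
    for idx, label in enumerate(labels):
        if rng.random() < skew:
            target = label % num_clients
        else:
            target = rng.randrange(num_clients)
        clients[target].append(idx)
    return clients


def fedavg(updates: Sequence[Tuple[Sequence[float], int]]) -> List[float]:
    """Average client weight vectors, weighted by local sample count.

    `updates` holds (weights, num_samples) pairs, one per client.
    """
    total = sum(n for _, n in updates)
    if total == 0:
        raise ValueError("no training samples reported by any client")
    averaged = [0.0] * len(updates[0][0])
    for weights, n in updates:
        for i, value in enumerate(weights):
            averaged[i] += value * n / total
    return averaged


if __name__ == "__main__":
    # Two clients: one trained on 1 sample, the other on 3 samples.
    print(fedavg([([1.0, 0.0], 1), ([3.0, 2.0], 3)]))  # [2.5, 1.5]
    # Fully skewed split of spam (1) / non-spam (0) labels over 2 clients.
    print(label_skew_partition([0, 1, 0, 1], num_clients=2, skew=1.0))
```

Under this sketch, raising `skew` toward 1.0 concentrates each label on a single client, which is the kind of setting where the abstract reports fluctuation and delay in FL training.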
Copyright http://arxiv.org/licenses/nonexclusive-distrib/1.0
Open Access Link https://arxiv.org/abs/2404.05130
Subject Terms Computer Science - Cryptography and Security