Privacy-preserving Federated Learning and its application to natural language processing

Bibliographic Details
Published in	Knowledge-based systems Vol. 268; p. 110475
Main Authors	Nagy, Balázs; Hegedűs, István; Sándor, Noémi; Egedi, Balázs; Mehmood, Haaris; Saravanan, Karthikeyan; Lóki, Gábor; Kiss, Ákos
Format	Journal Article
Language	English
Published	Elsevier B.V., 23.05.2023
Subjects	Federated Learning; Local differential privacy; NLP; Randomized response; String hashing

Summary:	State-of-the-art edge devices are capable not only of running inference with machine learning (ML) models but also of training them on the device with local data. When this local data is sensitive, privacy becomes a crucial property that must be addressed. This implies that sharing data with a server for training a model is undesirable and should be avoided. The Federated Learning (FL) approach can help in these situations; however, FL alone does not solve every challenge, especially when privacy is a major concern. We propose a privacy-preserving FL framework, which leverages the concepts of bitwise quantization, local differential privacy (LDP), and feature hashing for input representation in the collaborative training of ML models. In our approach, the local model updates are first quantized, then a randomized-response technique is applied to the resulting update vector. Although our proposed framework functions with arbitrary types of input features, we emphasize its usability with natural language data. The text input on the client side is encoded using a rolling-hash-based representation, which addresses both the high resource demands of embedding algorithms and the privacy concerns of sharing sensitive data. We evaluate our method on a sentiment analysis task using the IMDB Movie Reviews dataset, as well as on a rating prediction task using the MovieLens dataset augmented with additional movie keywords. We demonstrate that our approach is a feasible solution for private language processing tasks on edge devices without the use of resource-hungry language models or privacy-violating collection of client data.
ISSN:	0950-7051 1872-7409
DOI:	10.1016/j.knosys.2023.110475
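
Illustrative sketch (not taken from the article): the summary names three building blocks, namely a rolling-hash text encoding, quantization of the local update, and randomized response on the quantized vector. The Python below shows one plausible minimal form of each, using only the standard library. Every function name and parameter here (n-gram length, vector size, hash base and modulus, one-bit sign quantization, a per-bit epsilon) is an assumption made for illustration; the record does not state the authors' actual quantization scheme, hash parameters, or privacy accounting.

```python
import math
import random


def rolling_hash_features(text, n=3, dim=1024, base=257, mod=(1 << 61) - 1):
    """Count hashed byte n-grams using a polynomial rolling hash.

    Assumed encoding for illustration: each n-gram hash is folded into
    one of `dim` buckets, so no vocabulary or embedding table is needed.
    """
    vec = [0] * dim
    data = text.encode("utf-8")
    if len(data) < n:
        return vec
    high = pow(base, n - 1, mod)  # weight of the byte that leaves the window
    h = 0
    for b in data[:n]:
        h = (h * base + b) % mod
    vec[h % dim] += 1
    for i in range(n, len(data)):
        # Drop the oldest byte, shift the window, append the newest byte.
        h = ((h - data[i - n] * high) * base + data[i]) % mod
        vec[h % dim] += 1
    return vec


def quantize_sign(update):
    """One-bit quantization (an assumption): keep only the sign of each entry."""
    return [1 if u >= 0 else 0 for u in update]


def keep_probability(epsilon):
    """Probability of reporting the true bit under binary randomized response."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))


def randomized_response(bits, epsilon, rng):
    """Report each bit truthfully with probability e^eps / (1 + e^eps), else flip it."""
    p_keep = keep_probability(epsilon)
    return [b if rng.random() < p_keep else 1 - b for b in bits]


def debias_mean(reports, epsilon):
    """Server side: unbiased per-coordinate estimate of the fraction of true 1-bits."""
    p_keep = keep_probability(epsilon)
    n = len(reports)
    return [
        (sum(col) / n - (1.0 - p_keep)) / (2.0 * p_keep - 1.0)
        for col in zip(*reports)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    features = rolling_hash_features("a surprisingly good movie", dim=64)
    local_update = [0.4, -1.2, 0.0, -0.3]  # stand-in for a client's model update
    noisy_bits = randomized_response(quantize_sign(local_update), 1.0, rng)
    print(sum(features), noisy_bits)
```

In this sketch each client sends only the perturbed bits, and the server recovers an aggregate estimate with `debias_mean`; a smaller `epsilon` flips more bits, giving stronger privacy per bit and a noisier aggregate.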