Privacy-preserving Federated Learning and its application to natural language processing
Published in | Knowledge-based systems Vol. 268; p. 110475 |
---|---|
Main Authors | , , , , , , , |
Format | Journal Article |
Language | English |
Published | Elsevier B.V., 23.05.2023 |
Summary: | State-of-the-art edge devices can not only run inference with machine learning (ML) models but also train them on-device with local data. When this local data is sensitive, privacy becomes a crucial requirement. Sharing data with a server to train a model is therefore undesirable and should be avoided. Federated Learning (FL) can help in these situations; however, FL alone does not solve all challenges, especially when privacy is a major concern. We propose a privacy-preserving FL framework that leverages bitwise quantization, local differential privacy (LDP), and feature hashing for input representation in the collaborative training of ML models. In our approach, the local model updates are first quantized, and a randomized-response technique is then applied to the resulting update vector.
Although our proposed framework works with arbitrary types of input features, we emphasize its usability with natural language data. The text input on the client side is encoded using a rolling-hash-based representation, which addresses both the high resource demands of embedding algorithms and the privacy concerns of sharing sensitive data. We evaluate our method on a sentiment analysis task using the IMDB Movie Reviews dataset and on a rating prediction task using the MovieLens dataset augmented with additional movie keywords. We demonstrate that our approach is a feasible solution for private language processing tasks on edge devices without resource-hungry language models or privacy-violating collection of client data. |
---|---|
ISSN: | 0950-7051 1872-7409 |
DOI: | 10.1016/j.knosys.2023.110475 |
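
The client-side pipeline described in the summary (rolling-hash input encoding, quantization of the local update, then randomized response) can be illustrated with a minimal sketch. This is not the authors' implementation: the record gives no parameters, so the function names, the character n-gram window, the hash base and modulus, the 1-bit sign quantizer, and the per-coordinate privacy budget `epsilon` are all assumptions made for illustration.

```python
import math
import random


def rolling_hash_features(text, n=3, dim=64, base=257, mod=(1 << 61) - 1):
    """Hash every character n-gram of `text` into a fixed-size count vector.

    Uses a polynomial rolling hash so each new window costs O(1).
    All parameters are hypothetical; the source does not specify them.
    """
    vec = [0.0] * dim
    codes = [ord(c) for c in text.lower()]
    if len(codes) < n:
        return vec
    high = pow(base, n - 1, mod)  # weight of the character leaving the window
    h = 0
    for c in codes[:n]:  # hash of the first window
        h = (h * base + c) % mod
    vec[h % dim] += 1.0
    for i in range(n, len(codes)):  # slide the window one character at a time
        h = ((h - codes[i - n] * high) * base + codes[i]) % mod
        vec[h % dim] += 1.0
    return vec


def quantize_sign(update):
    """1-bit quantization (an assumed scheme): 1 if the coordinate is >= 0, else 0."""
    return [1 if u >= 0 else 0 for u in update]


def randomized_response(bits, epsilon, rng):
    """Keep each bit with probability e^eps / (1 + e^eps), otherwise flip it.

    This gives epsilon-local differential privacy per coordinate.
    """
    p_keep = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return [b if rng.random() < p_keep else 1 - b for b in bits]


def debias_mean(reports, epsilon):
    """Server side: unbiased estimate of the fraction of 1-bits per coordinate."""
    p = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    n = len(reports)
    dim = len(reports[0])
    return [
        (sum(r[j] for r in reports) / n - (1.0 - p)) / (2.0 * p - 1.0)
        for j in range(dim)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    features = rolling_hash_features("A surprisingly good movie", n=3, dim=16)
    local_update = [rng.uniform(-1, 1) for _ in features]  # stand-in for a gradient
    noisy_bits = randomized_response(quantize_sign(local_update), 1.0, rng)
    print(features)
    print(noisy_bits)
```

In this sketch the server never sees raw text or exact update values: it receives only perturbed bits and corrects for the known flip probability when averaging across clients. A smaller `epsilon` flips more bits, which strengthens privacy at the cost of a noisier aggregate.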