Impact of Data Perturbation for Statistical Disclosure Control on the Predictive Performance of Machine Learning Techniques
Published in | Journal of Data Science Vol. 23; no. 2; pp. 312 - 331 |
---|---|
Main Authors | , |
Format | Journal Article |
Language | English |
Published | 中華資料採礦協會, 01.04.2025 |
Subjects | |
Summary: | The rapid accumulation and release of data have fueled research across various fields. While numerous methods exist for data collection and storage, data distribution presents challenges, as some datasets are restricted, and certain subsets may compromise privacy if released unaltered. Statistical disclosure control (SDC) aims to maximize data utility while minimizing the disclosure risk, i.e., the risk of individual identification. A key SDC method is data perturbation, with General Additive Data Perturbation (GADP) and Copula General Additive Data Perturbation (CGADP) being two prominent approaches. Both leverage multivariate normal distributions to generate synthetic data while preserving statistical properties of the original dataset. Given the increasing use of machine learning for data modeling, this study compares the performance of various machine learning models on GADP- and CGADP-perturbed data. Using Monte Carlo simulations with three data-generating models and a real dataset, we evaluate the predictive performance and robustness of ten machine learning techniques under data perturbation. Our findings provide insights into the machine learning techniques that perform robustly on GADP- and CGADP-perturbed datasets, extending previous research that primarily focused on simple statistics such as means, variances, and correlations. |
---|---|
ISSN: | 1683-8602 1680-743X 1683-8602 |
DOI: | 10.6339/25-JDS1186 |