Crowdsourced data leaking user's privacy while using anonymization technique

Bibliographic Details
Published in	Mehran University research journal of engineering and technology Vol. 44; no. 2; pp. 93 - 116
Main Authors	Mirbahar, Naadiya Mirbahar; Kumar, Kamlesh; Laghari, Asif Ali; Khuhro, Mansoor Ahmed
Format	Journal Article
Language	English
Published	Mehran University of Engineering and Technology 01.04.2025
Subjects	anonymization; classification; Crowdsourcing; machine learning; Methods; Privacy; privacy leakage; Smart phones; Pakistan
ISSN	0254-7821 2413-7219
DOI	10.22581/muet1982.2954

Summary:	Because big educational data holds considerable value, many research institutes have collected large volumes of student behavioral data. To make full use of that value, the collected data may be shared with third parties, such as data experts worldwide. However, sharing may pose privacy risks to data owners, even though data collectors usually anonymize the data before crowdsourcing. To demonstrate that anonymization alone is insufficient to protect user privacy, we conducted an experimental study using offline and online behavioral traces collected through campus cards and smartphones. Our study shows that a student can be identified with high probability from anonymized behavioral and payment traces. The analysis shows that only ten features are sufficient to uniquely identify an individual: Transmission Control Protocol (TCP) synchronization attempts, content length, downlink traffic, last acknowledgement packet delay, uplink traffic, cell ID, base station ID, offline payment time (day, hour), online payment time (day, hour, minute), and point-of-sale ID (POS_ID). Five standard supervised learning classifiers were used to predict user identity: Extra Tree, Bagging, Decision Tree, k-Nearest Neighbors (KNN), and Random Forest. Their accuracies reached 99.99%, 99.95%, 99.02%, 98.84%, and 99.56%, respectively.
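The re-identification idea in the summary can be illustrated with a minimal sketch. This is not the authors' code or data: the traces are synthetic, only three of the ten features are imitated (cell ID, POS_ID, hour), and a hand-written 1-nearest-neighbor rule stands in for the KNN classifier. All function names, parameter values, and value ranges are assumptions made for illustration.

```python
import random

def make_traces(n_users=20, per_user=30, seed=0):
    """Build synthetic (features, pseudonym) records; every value is made up."""
    rng = random.Random(seed)
    rows = []
    for pseudonym in range(n_users):
        # Hypothetical habits: each pseudonymous user has a usual cell, POS and hour.
        home_cell = rng.randrange(50)
        fav_pos = rng.randrange(30)
        usual_hour = rng.randrange(7, 22)
        for _ in range(per_user):
            # About 10% of the time the user deviates from each habit.
            cell = home_cell if rng.random() < 0.9 else rng.randrange(50)
            pos = fav_pos if rng.random() < 0.9 else rng.randrange(30)
            hour = usual_hour + rng.choice([-1, 0, 0, 1])
            rows.append(((cell, pos, hour), pseudonym))
    rng.shuffle(rows)
    return rows

def distance(a, b):
    # IDs are categorical (a mismatch costs 1); the hour gap is scaled below 1.
    return (a[0] != b[0]) + (a[1] != b[1]) + abs(a[2] - b[2]) / 24

def predict(train, features):
    # 1-nearest-neighbor: return the label of the closest training record.
    return min(train, key=lambda row: distance(row[0], features))[1]

def reidentification_accuracy(rows, train_fraction=0.8):
    # Hold out the last 20% of the shuffled records as unseen traces.
    split = int(len(rows) * train_fraction)
    train, test = rows[:split], rows[split:]
    hits = sum(predict(train, feats) == label for feats, label in test)
    return hits / len(test)

rows = make_traces()
accuracy = reidentification_accuracy(rows)
print(f"re-identification accuracy on synthetic traces: {accuracy:.2%}")
```

Even with names removed, habitual combinations of location, point of sale, and time act as a fingerprint, so held-out records can be linked back to the same pseudonym. The accuracy printed here reflects only the synthetic generator and says nothing about the figures reported in the article.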