Toward a new approach to author profiling based on the extraction of statistical features

Bibliographic Details
Published in	Social network analysis and mining Vol. 11; no. 1; p. 59
Main Authors	Ouni, Sarra; Fkih, Fethi; Omri, Mohamed Nazih
Format	Journal Article
Language	English
Published	Vienna: Springer Vienna, 01.12.2021; Springer Nature B.V.
Subjects	Accuracy; Algorithms; Applications of Graph Theory and Complex Networks; Classification; Computer Science; Cybernetics; Data Mining and Knowledge Discovery; Datasets; Economics; Extraction; False information; Forensic medicine; Game Theory; Gender; Humanities; Law; Machine learning; Marketing; Methodology of the Social Sciences; Original Article; Performance evaluation; Perpetrators; Privacy; Profiles; Social and Behav. Sciences; Social media; Social networks; Software agents; Statistics for Social Sciences; User behavior; Bot detection; Twitter; Gender detection; Features extraction
Summary:	Recently, author profiling on social media and online platforms, which are characterized by huge volumes of data, has become a critical issue. It is of increasing interest in fields such as forensic medicine, security, marketing, and education. The main objective of author profiling is to identify the type of writer behind a message, in particular whether it is a human or a bot, since bots have a very strong presence on these platforms. Bots are tasked with drawing users' attention to specific events and are often used to disseminate incorrect or false information. In this work, we propose a new approach to detect these bots and to identify the gender of anonymous perpetrators on these social networks. Our approach, purely statistical, is based on digital features (APSF) extracted from users’ tweets and on the random forest technique. A total of 17 stylometry-based features were used to train the model. To assess the performance of our approach, we considered four standard measures: accuracy, precision, recall, and F1-score. The results obtained show that our approach gives the best performance for both the English and Spanish languages. For the English dataset, we achieved an accuracy of 92.45% for the bot detection task and 90.36% for gender classification; for the Spanish dataset, the corresponding accuracy values were 89.68% and 88.88%.
ISSN:	1869-5450 1869-5469
DOI:	10.1007/s13278-021-00768-6
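
Illustrative sketch: the summary describes count-based stylometric features extracted from tweets and an evaluation using accuracy, precision, recall, and F1-score. The Python below is a minimal sketch of those two ideas only. The feature names, the functions `stylometric_features` and `binary_metrics`, and the sample tweet are hypothetical; the record does not list the paper's 17 features, and the random forest training step is omitted here.

```python
import re


def stylometric_features(tweet: str) -> dict:
    """Compute a few illustrative count-based features for one tweet.

    These are assumed examples, not the 17 features used in the paper.
    """
    words = tweet.split()
    n_words = len(words)
    return {
        "n_chars": len(tweet),
        "n_words": n_words,
        "avg_word_len": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "n_hashtags": sum(w.startswith("#") for w in words),
        "n_mentions": sum(w.startswith("@") for w in words),
        "n_urls": len(re.findall(r"https?://\S+", tweet)),
        "n_uppercase": sum(c.isupper() for c in tweet),
    }


def binary_metrics(y_true, y_pred, positive="bot") -> dict:
    """Accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Hypothetical tweet and labels, for illustration only.
    print(stylometric_features("Hello @bob check #news https://t.co/x"))
    print(binary_metrics(["bot", "bot", "human", "human"],
                         ["bot", "human", "human", "bot"]))
```

In a full pipeline, one feature dictionary per user would be turned into a numeric vector and passed to a random forest classifier; the metrics function would then be applied to that classifier's predictions on held-out data.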