Toward a new approach to author profiling based on the extraction of statistical features

Bibliographic Details
Published in: Social Network Analysis and Mining, Vol. 11, no. 1, p. 59
Main Authors: Ouni, Sarra; Fkih, Fethi; Omri, Mohamed Nazih
Format: Journal Article
Language: English
Published: Vienna: Springer Vienna, 01.12.2021
Springer Nature B.V.
Summary: Author profiling on social media and other online platforms, which are characterized by huge volumes of data, has recently become a critical issue. It is of increasing interest in various fields related to forensic medicine, security, marketing, education, etc. The main objective of author profiling is to identify the type of writer behind a message: a human or a bot, the latter having a very strong presence. These bots are tasked with drawing users' attention to specific events and are often used to disseminate incorrect or false information. In this work, we propose a new approach to detect these bots and to determine the gender of anonymous authors on these social networks. Our approach, which is purely statistical, is based on digital features (APSF) extracted from users’ tweets and on the random forest technique. A total of 17 stylometry-based features were used to train the model. To assess the performance of our approach, we considered several standard measures, namely accuracy, precision, recall and F1-score. The results obtained show that our approach gives the best performance for both English and Spanish. For the English dataset, we achieved an accuracy of 92.45% for the bot detection task and 90.36% for gender classification; for the Spanish dataset, the corresponding accuracy values were 89.68% and 88.88%.
ISSN: 1869-5450, 1869-5469
DOI: 10.1007/s13278-021-00768-6
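
The summary describes a pipeline in which statistical (stylometric) features are extracted from each user's tweets, passed to a random forest, and evaluated with accuracy, precision, recall and F1-score. The sketch below illustrates only the two ends of that pipeline, under stated assumptions. The six features in `stylometric_features` are hypothetical examples chosen for illustration; the record does not list the 17 features used in the article. The function names and the `positive="bot"` label are likewise assumptions. The classifier step is omitted; in practice the feature vectors could be fed to any random forest implementation.

```python
import re
from statistics import mean


def stylometric_features(tweets):
    """Compute a few illustrative per-author statistics from a list of tweets.

    These features are hypothetical examples, not the article's feature set.
    """
    n = len(tweets)
    words = [w for t in tweets for w in t.split()]
    return {
        "mean_chars": mean(len(t) for t in tweets),
        "mean_words": len(words) / n,
        "url_rate": sum(len(re.findall(r"https?://\S+", t)) for t in tweets) / n,
        "mention_rate": sum(t.count("@") for t in tweets) / n,
        "hashtag_rate": sum(t.count("#") for t in tweets) / n,
        # Vocabulary richness: distinct lowercased tokens over all tokens.
        "type_token_ratio": len({w.lower() for w in words}) / len(words) if words else 0.0,
    }


def binary_metrics(y_true, y_pred, positive="bot"):
    """Accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


if __name__ == "__main__":
    sample = ["Check this http://a.b #news", "hello @bob hello"]
    print(stylometric_features(sample))
    print(binary_metrics(["bot", "bot", "bot", "human"],
                         ["bot", "bot", "human", "human"]))
```

Computing features per author rather than per tweet matches the summary's framing of profiling a writer from their tweets, but the aggregation level used in the article is not stated in this record.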