Levantine hate speech detection in Twitter

Bibliographic Details
Published in Social network analysis and mining Vol. 12; no. 1; p. 121
Main Authors AbdelHamid, Medyan; Jafar, Assef; Rahal, Yasser
Format Journal Article
Language English
Published Vienna Springer Vienna 01.12.2022
Springer Nature B.V
Summary: People use online social networks to express feelings and ideas, to communicate, and to share information. The freedom these networks offer also lets some users spread hate speech and insults. Detecting such content early is crucial for anticipating conflicts and could keep emotions from turning into actions or spreading widely. Work on hate speech detection in Arabic text is sparse compared with other languages such as English, and Arabic corpora of short texts in the Levantine dialect for hate speech are scarce as well. In this paper, we build our own dataset of Arabic tweets from Syria and its neighbors, annotated as Normal or Hate. For the traditional classifiers, we represent text by concatenating word embeddings with term frequency-inverse document frequency (TF-IDF) features. We apply multiple classification algorithms to our dataset, including Random Forest, Support Vector Machines, and three deep learning classifiers (AraBERT, ArabicBERT, and GigaBERT), to validate the effectiveness of our augmented dataset and of the different feature representations. The experimental results show that concatenating word embeddings and TF-IDF can improve classification performance, and that the deep learning classifiers outperform the traditional ones. Our best model, based on GigaBERT, significantly outperforms the other models, with an area under the ROC curve (AUC-ROC) of 94.6% and a macro F1-score of 0.81. We ran the same tests on several other datasets and obtained the best results on our dataset.
ISSN:1869-5450
1869-5469
DOI:10.1007/s13278-022-00950-4
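
Illustrative note: the feature concatenation mentioned in the summary (word embeddings joined with TF-IDF for the traditional classifiers) can be sketched roughly as follows. This is a minimal sketch, not the authors' implementation. The record does not say which embedding model, TF-IDF variant, or tokenizer was used, so the toy 2-dimensional vectors in TOY_EMBEDDINGS, the tf * (ln(N/df) + 1) weighting, the averaging of word vectors, and all function names here are assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy word vectors (2 dimensions). The embeddings actually
# used in the paper are not given in this record.
TOY_EMBEDDINGS = {
    "good": [1.0, 0.0],
    "day": [0.5, 0.5],
    "bad": [0.0, 1.0],
}
EMBEDDING_DIM = 2


def tfidf_vectors(docs):
    """Return (vocabulary, rows) with weights tf * (ln(N / df) + 1).

    This weighting is one common TF-IDF variant, assumed here.
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each token.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    rows = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        rows.append([
            (counts[tok] / total) * (math.log(n_docs / doc_freq[tok]) + 1.0)
            for tok in vocab
        ])
    return vocab, rows


def mean_embedding(doc):
    """Average the vectors of known tokens; all zeros if none are known."""
    known = [TOY_EMBEDDINGS[tok] for tok in doc if tok in TOY_EMBEDDINGS]
    if not known:
        return [0.0] * EMBEDDING_DIM
    return [sum(vec[i] for vec in known) / len(known)
            for i in range(EMBEDDING_DIM)]


def concat_features(docs):
    """Concatenate each document's mean embedding with its TF-IDF row."""
    _, tfidf_rows = tfidf_vectors(docs)
    return [mean_embedding(doc) + row for doc, row in zip(docs, tfidf_rows)]


# Three tiny pre-tokenized "tweets" (placeholder English tokens).
docs = [["good", "day"], ["bad", "day"], ["bad", "bad"]]
features = concat_features(docs)
```

Each row of `features` holds the embedding dimensions first and the TF-IDF dimensions after them, so it could be passed to a traditional classifier such as Random Forest or a Support Vector Machine.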