A Deep Dive into Multilingual Hate Speech Classification

Bibliographic Details
Published in	Machine Learning and Knowledge Discovery in Databases. Applied Data Science and Demo Track Vol. 12461; pp. 423 - 439
Main Authors	Aluru, Sai Saketh; Mathew, Binny; Saha, Punyajoy; Mukherjee, Animesh
Format	Book Chapter
Language	English
Published	Switzerland: Springer International Publishing AG, 2021
Series	Lecture Notes in Computer Science
Subjects	BERT; Classification; Embeddings; Hate speech; Multilingual
Summary:	Hate speech is a serious issue that is currently plaguing society and has been responsible for severe incidents such as the genocide of the Rohingya community in Myanmar. Social media has allowed people to spread such hateful content even faster. This is especially concerning for countries which lack hate speech detection systems. In this paper, using hate speech datasets in 9 languages from 16 different sources, we perform the first extensive evaluation of multilingual hate speech detection. We analyze the performance of different deep learning models in various scenarios. We observe that in the low-resource scenario, LASER embeddings with logistic regression perform best, whereas in the high-resource scenario, BERT-based models perform much better. We also observe that simple techniques, such as translating to English and using BERT, achieve competitive results in several languages. For cross-lingual classification, we observe that data from other languages seems to improve performance, especially in low-resource settings. Further, in the case of zero-shot classification, evaluation on the Italian and Portuguese datasets achieves good results. Our proposed framework could be used as an efficient solution for low-resource languages. These models could also act as good baselines for future multilingual hate speech detection tasks. Our code (Code: https://github.com/punyajoy/DE-LIMIT) and models (Models: https://huggingface.co/Hate-speech-CNERG) are available online. Warning: contains material that many will find offensive or hateful.
Bibliography:	S. S. Aluru and B. Mathew—Equal Contribution.
ISBN:	9783030676698 3030676692
ISSN:	0302-9743 1611-3349
DOI:	10.1007/978-3-030-67670-4_26