Investigating cross-lingual training for offensive language detection

Bibliographic Details
Published in: PeerJ Computer Science, Vol. 7, p. e559
Main Authors: Pelicon, Andraž; Shekhar, Ravi; Škrlj, Blaž; Purver, Matthew; Pollak, Senja
Format: Journal Article
Language: English
Published: San Diego, PeerJ Inc., 25.06.2021
Summary: Platforms that feature user-generated content (social media, online forums, newspaper comment sections etc.) have to detect and filter offensive speech within large, fast-changing datasets. While many automatic methods have been proposed and achieve good accuracies, most of these focus on the English language, and are hard to apply directly to languages in which few labeled datasets exist. Recent work has therefore investigated the use of cross-lingual transfer learning to solve this problem, training a model in a well-resourced language and transferring to a less-resourced target language; but performance has so far been significantly less impressive. In this paper, we investigate the reasons for this performance drop, via a systematic comparison of pre-trained models and intermediate training regimes on five different languages. We show that using a better pre-trained language model results in a large gain in overall performance and in zero-shot transfer, and that intermediate training on other languages is effective when little target-language data is available. We then use multiple analyses of classifier confidence and language model vocabulary to shed light on exactly where these gains come from and gain insight into the sources of the most typical mistakes.
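
The zero-shot cross-lingual transfer setup described in the summary can be illustrated with a short sketch: fine-tune a multilingual pre-trained language model on labelled source-language (e.g. English) data, then evaluate it directly on a target-language test set without any target-language training. This is a minimal sketch assuming the Hugging Face transformers and datasets libraries; the model name, file names, column names and hyperparameters are illustrative assumptions, not the authors' exact experimental configuration.

    # Minimal zero-shot cross-lingual transfer sketch (illustrative, not the paper's setup).
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)
    from datasets import load_dataset

    MODEL_NAME = "bert-base-multilingual-cased"  # any multilingual pre-trained LM

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

    def tokenize(batch):
        # Pad/truncate comments to a fixed length so they can be batched.
        return tokenizer(batch["text"], truncation=True,
                         padding="max_length", max_length=128)

    # Hypothetical CSV files with "text" and "label" columns: a well-resourced
    # English training set and a less-resourced target-language test set.
    data = load_dataset("csv", data_files={"train": "english_train.csv",
                                           "test": "target_language_test.csv"})
    data = data.map(tokenize, batched=True)

    args = TrainingArguments(output_dir="xling-offensive",
                             num_train_epochs=3,
                             per_device_train_batch_size=16)

    trainer = Trainer(model=model, args=args, train_dataset=data["train"])
    trainer.train()                        # fine-tune on the source language only
    print(trainer.evaluate(data["test"]))  # zero-shot evaluation on the target language

Intermediate training, as discussed in the paper, would correspond to an additional fine-tuning stage on labelled data from other (non-target) languages before the final evaluation step.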
ISSN: 2376-5992
DOI: 10.7717/peerj-cs.559