A survey of pre-processing techniques to improve short-text quality: a case study on hate speech detection on twitter

Pre-processing plays an essential role in disambiguating the meaning of short-texts, not only in applications that classify short-texts but also for clustering and anomaly detection. Pre-processing can have a considerable impact on overall system performance; however, it is less explored in the lite...

Full description

Saved in:
Bibliographic Details
Published inMultimedia tools and applications Vol. 80; no. 28-29; pp. 35239 - 35266
Main Authors Naseem, Usman, Razzak, Imran, Eklund, Peter W.
Format Journal Article
LanguageEnglish
Published New York Springer US 01.11.2021
Springer Nature B.V
Subjects
Online AccessGet full text

Cover

Loading…
More Information
Summary:Pre-processing plays an essential role in disambiguating the meaning of short-texts, not only in applications that classify short-texts but also for clustering and anomaly detection. Pre-processing can have a considerable impact on overall system performance; however, it is less explored in the literature in comparison to feature extraction and classification. This paper analyzes twelve different pre-processing techniques on three pre-classified Twitter datasets on hate speech and observes their impact on the classification tasks they support. It also proposes a systematic approach to text pre-processing to apply different pre-processing techniques in order to retain features without information loss. In this paper, two different word-level feature extraction models are used, and the performance of the proposed package is compared with state-of-the-art methods. To validate gains in performance, both traditional and deep learning classifiers are used. The experimental results suggest that some pre-processing techniques impact negatively on performance, and these are identified, along with the best performing combination of pre-processing techniques.
Bibliography:ObjectType-Case Study-2
SourceType-Scholarly Journals-1
content type line 14
ObjectType-Feature-4
ObjectType-Report-1
ObjectType-Article-3
ISSN:1380-7501
1573-7721
DOI:10.1007/s11042-020-10082-6