The study of the effect of preprocessing techniques for emotion detection on Amazon product review dataset

Bibliographic Details
Published in: Social network analysis and mining, Vol. 14, no. 1, p. 191
Format: Journal Article
Language: English
Published: Heidelberg, Springer Nature B.V., 23.09.2024
ISSN: 1869-5450, 1869-5469
DOI: 10.1007/s13278-024-01352-4

Summary: Emotion detection (ED) from noisy or unstructured text is a challenging and active area of research in natural language processing because such text contains irrelevant information such as repeated characters, slang, abbreviations, and acronyms. Text preprocessing is therefore essential to convert unstructured text into a structured format for any text classification task, and the performance of the classifier is strongly affected by the preprocessing techniques used. However, very few studies have evaluated the impact of these techniques on model performance. This paper therefore investigates the effect of 13 commonly used techniques, such as lowercasing, stemming, lemmatization, and stop-word removal, on the accuracy of ED classifiers. In our experiments we apply various machine learning (ML) and deep learning (DL) classifiers, such as logistic regression (LR), support vector machine (SVM), multinomial Naïve Bayes (MNB), decision tree (DT), random forest (RF), bidirectional LSTM (Bi-LSTM), and bidirectional encoder representations from transformers (BERT), to an Amazon product review dataset to analyze the effectiveness of these techniques. Our experimental results show that some preprocessing techniques help increase classifier accuracy while others have no significant impact on it. Our study also reveals that the effectiveness of these techniques depends on the type of classifier selected. We also evaluate combinations of the techniques, and our results show that an effective combination of techniques works better for the LR, DT, and Bi-LSTM models. Finally, based on our experimental results, the BERT model achieves the highest weighted F1-score of 97%.
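As a rough illustration of the kind of preprocessing steps the summary names (lowercasing, handling repeated characters and slang, stop-word removal), the sketch below chains a few of them in Python. It is not the authors' pipeline: the function names, the tiny stop-word list, and the slang map are all hypothetical and chosen only to keep the example self-contained.

```python
import re

# Hypothetical, deliberately tiny stop-word list (assumption, not from the paper).
STOP_WORDS = {"a", "an", "the", "is", "it", "this", "and", "to", "of"}

# Hypothetical slang/abbreviation map, for illustration only.
SLANG = {"gr8": "great", "luv": "love", "u": "you"}


def lowercase(text: str) -> str:
    """Convert the whole text to lower case."""
    return text.lower()


def collapse_repeats(text: str) -> str:
    """Reduce runs of three or more identical characters to two."""
    return re.sub(r"(.)\1{2,}", r"\1\1", text)


def expand_slang(text: str) -> str:
    """Replace whitespace-separated tokens found in the slang map."""
    return " ".join(SLANG.get(tok, tok) for tok in text.split())


def remove_stop_words(text: str) -> str:
    """Drop whitespace-separated tokens that appear in STOP_WORDS."""
    return " ".join(tok for tok in text.split() if tok not in STOP_WORDS)


def preprocess(text: str, steps) -> str:
    """Apply the given preprocessing steps in order."""
    for step in steps:
        text = step(text)
    return text


review = "This phone is SOOOOO gr8 and I luv it"
steps = [lowercase, collapse_repeats, expand_slang, remove_stop_words]
print(preprocess(review, steps))  # prints: phone soo great i love
```

Because each step is a separate function, single techniques or combinations can be switched on and off by changing the `steps` list, which mirrors the kind of per-technique and combined comparison the summary describes.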