The study of the effect of preprocessing techniques for emotion detection on Amazon product review dataset

Bibliographic Details
Published in	Social network analysis and mining Vol. 14; no. 1; p. 191
Format	Journal Article
Language	English
Published	Heidelberg: Springer Nature B.V., 23.09.2024
Subjects	Abbreviations; Accuracy; Acronyms; Algorithms; Anger; Bidirectionality; Classification; Data mining; Datasets; Decision trees; Deep learning; Effectiveness; Emotion recognition; Emotions; Impact analysis; Machine learning; Natural language processing; Performance evaluation; Preprocessing; Product reviews; Sentiment analysis; Slang; Social networks; Support vector machines; Unstructured data; Word sense disambiguation; Words; Words (language)
ISSN	1869-5450 1869-5469
DOI	10.1007/s13278-024-01352-4

Summary:	Emotion detection (ED) from noisy or unstructured text is a challenging and active area of research in natural language processing because such text contains irrelevant information such as repeated characters, slang, abbreviations, and acronyms. Text preprocessing is therefore essential to convert unstructured text into a structured format for any text classification task, and the choice of preprocessing techniques greatly affects classifier performance. However, very few studies have evaluated the impact of these techniques on model performance. This paper therefore investigates the effect of 13 commonly used techniques, such as lowercasing, stemming, lemmatization, and stop word removal, on the accuracy of ED classifiers. In our experiments we apply various machine learning (ML) and deep learning (DL) classifiers, such as logistic regression (LR), support vector machine (SVM), multinomial Naïve Bayes (MNB), decision tree (DT), random forest (RF), bidirectional LSTM (BiLSTM), and bidirectional encoder representations from transformers (BERT), to an Amazon product review dataset to analyze the effectiveness of these techniques. Our experimental results show that some preprocessing techniques increase classifier accuracy while others have no significant impact, and that the effectiveness of a technique depends on the type of classifier selected. We also evaluate combinations of techniques, and our results show that an effective combination works better for the LR, DT, and BiLSTM models. Finally, based on our experimental results, the BERT model achieves the highest weighted F1 score of 97%.
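To make the kind of preprocessing named in the summary concrete, the following is a minimal Python sketch of a toggleable pipeline covering lowercasing, repeated-character reduction, slang expansion, and stop word removal. It is an illustration only and not the authors' implementation: the function names, the `STEPS` registry, and the tiny `SLANG` and `STOP_WORDS` lexicons are all hypothetical, and the rule of collapsing runs of three or more characters to two is an assumed convention.

```python
import re

# Hypothetical mini-lexicons for illustration only; not taken from the paper.
SLANG = {"gr8": "great", "luv": "love", "u": "you"}
STOP_WORDS = {"the", "is", "a", "an", "and", "this", "it", "to", "of"}


def lowercase(text):
    """Convert the text to lowercase."""
    return text.lower()


def collapse_repeats(text):
    """Reduce runs of three or more identical characters to two."""
    return re.sub(r"(.)\1{2,}", r"\1\1", text)


def expand_slang(text):
    """Replace whitespace-separated tokens found in the slang lexicon."""
    return " ".join(SLANG.get(tok, tok) for tok in text.split())


def remove_stop_words(text):
    """Drop whitespace-separated tokens found in the stop word list."""
    return " ".join(tok for tok in text.split() if tok not in STOP_WORDS)


# Registry so that each technique can be switched on or off by name.
STEPS = {
    "lowercase": lowercase,
    "collapse_repeats": collapse_repeats,
    "expand_slang": expand_slang,
    "remove_stop_words": remove_stop_words,
}


def preprocess(text, enabled):
    """Apply the named steps to the text in the order given."""
    for name in enabled:
        text = STEPS[name](text)
    return text


if __name__ == "__main__":
    review = "This phone is SOOOO gr8"
    print(preprocess(review, ["lowercase"]))
    print(preprocess(review, list(STEPS)))
```

Passing a different `enabled` list for each run is one simple way to compare a single technique against a combination, as the summary describes. Note that order matters in this sketch: the slang and stop word lookups assume the text has already been lowercased.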