Multi-label emotion classification of Urdu tweets

Bibliographic Details
Published in	PeerJ. Computer science Vol. 8; p. e896
Main Authors	Ashraf, Noman, Khan, Lal, Butt, Sabur, Chang, Hsien-Tsung, Sidorov, Grigori, Gelbukh, Alexander
Format	Journal Article
Language	English
Published	United States, PeerJ Inc, 22.04.2022
Subjects	Algorithms; Analysis; Annotations; Artificial neural networks; Classification; Computational Linguistics; Data mining; Data Mining and Machine Learning; Data Science; Datasets; Decision trees; Deep learning; Emotion classification in Urdu; Emotion detection; Emotion recognition; Emotions; Language processing; Machine learning; Multi-label emotion detection; Natural language interfaces; Natural language processing; Neural networks; Optimization; Sentiment analysis; Social networks; United Kingdom
Summary:	Urdu is a widely used language in South Asia and worldwide. While similar datasets are available for English, we created the first multi-label emotion dataset for Urdu, consisting of 6,043 tweets in the Urdu Nastalíq script annotated with six basic emotions. A multi-label (ML) classification approach was adopted to detect emotions in Urdu text. The morphological and syntactic structure of Urdu makes multi-label emotion detection a challenging problem. In this paper, we build a set of baseline classifiers: machine learning algorithms (Random forest (RF), Decision tree (J48), Sequential minimal optimization (SMO), AdaBoostM1, and Bagging), deep-learning algorithms (Convolutional Neural Networks (1D-CNN), Long short-term memory (LSTM), and LSTM with CNN features), and a transformer-based baseline (BERT). We used a combination of text representations: stylometric-based features, pre-trained word embeddings, word-based n-grams, and character-based n-grams. The paper highlights the annotation guidelines, the dataset characteristics, and insights into the different methodologies used for Urdu-based emotion classification. We present our best results using micro-averaged F1, macro-averaged F1, accuracy, Hamming loss (HL), and exact match (EM) for all tested methods.
ISSN:	2376-5992
DOI:	10.7717/peerj-cs.896
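
The evaluation measures named in the summary can be illustrated with a short sketch. It is not the authors' evaluation code: the function name `multilabel_metrics` and the 0/1 label-matrix layout (one row per tweet, one column per emotion) are assumptions made for illustration. In particular, "accuracy" is computed here as the mean per-tweet Jaccard index, a common multi-label definition that the paper may or may not use.

```python
from typing import Dict, List

# Assumed layout: one row per tweet, one 0/1 column per emotion label.
LabelMatrix = List[List[int]]


def _f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_metrics(y_true: LabelMatrix, y_pred: LabelMatrix) -> Dict[str, float]:
    """Hypothetical helper computing common multi-label evaluation measures."""
    n_samples = len(y_true)
    n_labels = len(y_true[0])
    tp = [0] * n_labels
    fp = [0] * n_labels
    fn = [0] * n_labels
    mismatches = 0      # label slots where prediction and truth differ
    exact = 0           # tweets whose full label set is predicted correctly
    jaccard_sum = 0.0   # running sum of per-tweet Jaccard scores

    for t_row, p_row in zip(y_true, y_pred):
        if t_row == p_row:
            exact += 1
        inter = union = 0
        for j, (t, p) in enumerate(zip(t_row, p_row)):
            if t and p:
                tp[j] += 1
                inter += 1
            elif p:
                fp[j] += 1
            elif t:
                fn[j] += 1
            if t != p:
                mismatches += 1
            if t or p:
                union += 1
        # Assumption: two empty label sets count as a perfect match.
        jaccard_sum += inter / union if union else 1.0

    return {
        # Micro: pool the counts over all labels, then compute one F1.
        "micro_f1": _f1(sum(tp), sum(fp), sum(fn)),
        # Macro: compute F1 per label, then take the unweighted mean.
        "macro_f1": sum(_f1(a, b, c) for a, b, c in zip(tp, fp, fn)) / n_labels,
        "accuracy_jaccard": jaccard_sum / n_samples,
        "hamming_loss": mismatches / (n_samples * n_labels),
        "exact_match": exact / n_samples,
    }
```

Micro-averaging weights every label decision equally, so frequent emotions dominate the score, while macro-averaging gives each emotion the same weight regardless of how often it occurs. Exact match is the strictest of the five, since a tweet counts only if all of its labels are predicted correctly.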