Bangla Language Dialect Classification using Machine Learning

Bibliographic Details
Published in 2022 4th International Conference on Electrical, Computer & Telecommunication Engineering (ICECTE) pp. 1 - 4
Main Authors Tomal, Md Raihanul Islam; Kader, Tanveer; Masum, Abdul Kadar Muhammad; Chy, Md. Kalim Amzad
Format Conference Proceeding
Language English
Published IEEE 29.12.2022
Summary: Dialect classification is a complex task because dialects are variations of the same language. This paper classifies dialects based on local Bengali text. Classification becomes harder for a language variety that is rarely available in written form or stored in any way other than being spoken among local people. Natural language processing (NLP) requires a substantial amount of data, so this work focuses on building an enriched dataset of local Bangla dialects. The dataset covers two popular dialects, Chatgaiya and Pabna, each spoken by a large number of people. It comprises about 5000 entries, each annotated with its dialect. A five-step Exploratory Data Analysis (EDA) is carried out. Features are extracted using three techniques: CountVectorizer, Term Frequency-Inverse Document Frequency (TF-IDF), and Word2vec. Using this dataset, Bangla dialects are classified with machine learning algorithms such as Support Vector Machine (SVM), Multinomial Naïve Bayes (MNB), Logistic Regression (LR), Random Forest (RF), Decision Tree (DT), and K-Nearest Neighbor (KNN). The highest accuracy obtained was 96%.
DOI:10.1109/ICECTE57896.2022.10114552
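
The summary lists count-based features and Multinomial Naïve Bayes among the combinations evaluated. As a rough illustration of that one pairing, and not the authors' implementation, the sketch below builds bag-of-words counts and a Laplace-smoothed multinomial Naïve Bayes classifier in plain Python. The class name `CountNaiveBayes`, the whitespace tokenizer, the smoothing value `alpha=1.0`, and the placeholder tokens and labels (`dialect_A`, `dialect_B`) are all assumptions made for the example; none of them come from the paper's dataset.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: simple lowercase whitespace tokenization.
    return text.lower().split()


class CountNaiveBayes:
    """Bag-of-words counts with multinomial Naive Bayes (Laplace smoothing)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing constant (assumed value)

    def fit(self, texts, labels):
        self.class_doc_counts = Counter(labels)   # documents per class
        self.token_counts = defaultdict(Counter)  # token counts per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)
        self.total_docs = len(labels)
        return self

    def predict_one(self, text):
        # Tokens never seen in training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_doc_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / self.total_docs)
            total = sum(self.token_counts[label].values())
            denom = total + self.alpha * len(self.vocab)
            for t in tokens:
                count = self.token_counts[label][t]
                score += math.log((count + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    def predict(self, texts):
        return [self.predict_one(t) for t in texts]


# Synthetic placeholder tokens, NOT real dialect text.
train_texts = ["aa bb cc", "aa cc dd", "bb aa aa",
               "xx yy zz", "yy zz ww", "xx xx yy"]
train_labels = ["dialect_A"] * 3 + ["dialect_B"] * 3
test_texts = ["aa bb", "zz yy"]
test_labels = ["dialect_A", "dialect_B"]

model = CountNaiveBayes().fit(train_texts, train_labels)
predictions = model.predict(test_texts)
accuracy = sum(p == y for p, y in zip(predictions, test_labels)) / len(test_labels)
print(predictions, accuracy)
```

With real data, a library implementation such as scikit-learn's `CountVectorizer` paired with `MultinomialNB` would likely be preferable; the hand-written version here only makes the counting and smoothing steps visible.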