Using CNN and LSTM neural networks for Arkhangelsk dialect word identification and classification


Bibliographic Details
Published in Research Result. Theoretical and Applied Linguistics Vol. 10; no. 4; pp. 106-125
Main Authors Shurykina, Lyudmila S., Latukhina, Ekaterina A., Petrova, Tatiana V.
Format Journal Article
Language English
Published 30.12.2024

Summary: The study of dialects offers insight into the culture and history of a people, both of which are reflected in language. Dialectal vocabulary differs from standard vocabulary in meaning, pronunciation, word formation, and grammatical structure, primarily morphology. These patterns can also be observed in the Arkhangelsk dialects. The aim of this paper is to develop a dialect word classifier that identifies dialect words within a given text and assigns each to one of several predefined groups. The novelty of the research lies in the fact that no automated system for classifying dialect words based on Arkhangelsk dialect materials currently exists. The article describes the development of a neural network for dialect word identification and classification. Dialect words were drawn from dialect texts gathered during dialectological research conducted from the 1960s to the present day. LSTM (long short-term memory) and CNN (convolutional neural network) architectures are compared. One of the main challenges in dialect word classification is that the neural network must be trained on a limited amount of data. To mitigate this limitation, we investigate a bigram-based approach in addition to the unigram-based word encoding. The trained model that demonstrated the best results was integrated into our application for dialect word processing and analysis. A confusion matrix constructed for the best model shows the highest performance for the derivational class and the lowest for the lexical class.
ISSN: 2313-8912
DOI: 10.18413/2313-8912-2024-10-4-0-6
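
The summary mentions a bigram-based encoding used alongside the unigram-based word encoding. The record does not say how the authors build these encodings, so the following is only a minimal sketch under one assumption: that unigrams and bigrams are taken at the character level and mapped to integer ids before being fed to an embedding layer of a CNN or LSTM. The function names (`char_ngrams`, `build_vocab`, `encode`), the reserved ids 0 and 1, and the toy strings are all hypothetical and are not taken from the article.

```python
def char_ngrams(word, n):
    """Return the overlapping character n-grams of a word (n=1: unigrams, n=2: bigrams)."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def build_vocab(words, n):
    """Assign an integer id to every n-gram seen in `words`.

    Ids 0 and 1 are reserved for padding and unknown n-grams
    (an assumption of this sketch, not a detail from the article).
    """
    vocab = {"<pad>": 0, "<unk>": 1}
    for word in words:
        for gram in char_ngrams(word, n):
            vocab.setdefault(gram, len(vocab))
    return vocab


def encode(word, vocab, n, max_len):
    """Encode a word as a fixed-length list of n-gram ids, truncated or zero-padded."""
    ids = [vocab.get(gram, vocab["<unk>"]) for gram in char_ngrams(word, n)]
    ids = ids[:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


# Toy placeholder strings, not real dialect data.
training_words = ["abba"]
unigram_vocab = build_vocab(training_words, n=1)
bigram_vocab = build_vocab(training_words, n=2)

unigram_ids = encode("abba", unigram_vocab, n=1, max_len=6)
bigram_ids = encode("abba", bigram_vocab, n=2, max_len=4)
print(unigram_ids)
print(bigram_ids)
```

In a setup like this, the two id sequences could be passed to separate embedding layers or concatenated; which option the authors chose, if either, is not stated in the summary.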