Romanized Tunisian dialect transliteration using sequence labelling techniques

Bibliographic Details
Published in: Journal of King Saud University. Computer and information sciences, Vol. 34, no. 3, pp. 982-992
Main Authors: Younes, Jihene; Achour, Hadhemi; Souissi, Emna; Ferchichi, Ahmed
Format: Journal Article
Language: English
Published: Elsevier B.V., 01.03.2022
Summary: In recent years, social web users in Arabic countries have been resorting to dialects as a written language in their social exchanges. Arabic dialects derive from Modern Standard Arabic (MSA) and differ significantly from one country to another and from one region to another. The use of these dialects has led to increased interest within the NLP community in the specificities of such informal languages and in their automatic processing. In this work, we deal with the Tunisian dialect (TD) in particular. We address the automatic Latin-to-Arabic transliteration of TD language productions on the social web and propose an approach that models transliteration as a sequence labeling task. At the word level, we test several machine learning and deep learning techniques on real-world messages extracted from social networks. We experiment with and compare three transliteration models: a Conditional Random Fields-based model (CRF), a Bidirectional Long Short-Term Memory-based model (BLSTM), and a BLSTM-based model with CRF decoding (BLSTM-CRF). The results show that BLSTM-CRF performs best, correctly transliterating 96.78% of words. We also evaluate the BLSTM-CRF transliteration approach in context on a set of random TD messages extracted from the social web, obtaining a total error rate of 2.7%, 25% of which are context errors.
ISSN: 1319-1578, 2213-1248
DOI: 10.1016/j.jksuci.2020.03.008
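
The summary describes transliteration as a sequence labeling task scored by the share of correctly transliterated words. The sketch below illustrates one way such a framing could look: each Latin character receives a label, namely the Arabic string it produces (possibly empty), and the output word is the concatenation of the labels. It is not the authors' method. The toy training pairs, the character alignment, and the names `contexts`, `train`, `transliterate`, and `word_accuracy` are all assumptions made for illustration, and a simple most-frequent-label lookup stands in for the CRF, BLSTM, and BLSTM-CRF models named in the summary.

```python
from collections import Counter, defaultdict

# Hypothetical, hand-aligned toy data (not taken from the paper's corpus).
# Each Latin character is paired with the Arabic string it produces ("" = nothing).
TRAIN = [
    ("salam", ["س", "", "ل", "ا", "م"]),
    ("3lik", ["ع", "ل", "ي", "ك"]),
    ("7lib", ["ح", "ل", "ي", "ب"]),
    ("9alb", ["ق", "", "ل", "ب"]),
]


def contexts(word):
    """Return one (previous, current, next) character triple per character."""
    padded = "^" + word + "$"  # boundary markers for word start and end
    return [(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]


def train(pairs):
    """Count labels per context triple and per single character."""
    by_context = defaultdict(Counter)
    by_char = defaultdict(Counter)
    for word, labels in pairs:
        for ctx, label in zip(contexts(word), labels):
            by_context[ctx][label] += 1
            by_char[ctx[1]][label] += 1
    return by_context, by_char


def transliterate(word, model):
    """Label each character, then join the labels into the output word.

    Stand-in tagger: most frequent label for the context triple, backing off
    to the most frequent label for the character alone. A CRF or BLSTM would
    replace this lookup with a learned sequence model.
    """
    by_context, by_char = model
    out = []
    for ctx in contexts(word):
        if ctx in by_context:
            counts = by_context[ctx]
        elif ctx[1] in by_char:
            counts = by_char[ctx[1]]
        else:
            out.append(ctx[1])  # unseen character: copy it unchanged
            continue
        out.append(counts.most_common(1)[0][0])
    return "".join(out)


def word_accuracy(predictions, references):
    """Fraction of words whose transliteration matches the reference exactly."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


model = train(TRAIN)
predictions = [transliterate(w, model) for w in ("salam", "kalb")]
print(predictions, word_accuracy(predictions, ["سلام", "كلب"]))
```

The word "kalb" does not appear in the toy training data, so its first two characters are labeled through the single-character backoff. A learned sequence model would aim to handle such unseen contexts more robustly than this lookup does.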