Romanized Tunisian dialect transliteration using sequence labelling techniques

Bibliographic Details
Published in	Journal of King Saud University. Computer and information sciences, Vol. 34, no. 3, pp. 982-992
Main Authors	Younes, Jihene, Achour, Hadhemi, Souissi, Emna, Ferchichi, Ahmed
Format	Journal Article
Language	English
Published	Elsevier B.V., 01.03.2022
Subjects	Arabic transcription; BLSTM; CRF; Latin transcription; Machine learning; Natural Language Processing; Sequence labelling; Transliteration; Tunisian dialect
Summary:	In recent years, social web users in Arab countries have been resorting to dialects as a written language in their social exchanges. Arabic dialects derive from Modern Standard Arabic (MSA) and differ significantly from one country to another and from one region to another. The use of these dialects has led to increased interest within the NLP community in the specificities of such informal languages and in their automatic processing. In this work, we deal with the Tunisian dialect (TD) in particular. We address the automatic Latin-to-Arabic transliteration of TD language productions on the social web and propose an approach that models transliteration as a sequence labeling task. Several machine learning and deep learning techniques were tested at the word level, using real-world messages extracted from social networks. We experiment with and compare three transliteration models: a Conditional Random Fields-based model (CRF), a Bidirectional Long Short-Term Memory-based model (BLSTM), and a BLSTM-based model with CRF decoding (BLSTM-CRF). The results show that BLSTM-CRF leads to the best performance, reaching 96.78% of correctly transliterated words. We also evaluate the BLSTM-CRF transliteration approach in context on a set of random TD messages extracted from the social web. We obtained a total error rate of 2.7%, of which 25% are context errors.
ISSN:	1319-1578, 2213-1248
DOI:	10.1016/j.jksuci.2020.03.008
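
The sequence labeling framing mentioned in the summary can be illustrated with a minimal sketch. It is not the CRF, BLSTM, or BLSTM-CRF models evaluated in the article: it swaps in a context-free most-frequent-label baseline purely to show the task shape. The one-label-per-Latin-character scheme, the hand-aligned toy pairs, and all function names are assumptions made for illustration and do not come from the article.

```python
from collections import Counter, defaultdict

# Hypothetical, hand-aligned toy pairs (NOT data from the article).
# Each Latin character gets exactly one label: the Arabic string it
# produces, where "" means the character yields no written output.
TRAIN = [
    ("7lib", ["ح", "ل", "ي", "ب"]),
    ("3am",  ["ع", "ا", "م"]),
    ("bab",  ["ب", "ا", "ب"]),
    ("9alb", ["ق", "", "ل", "ب"]),
]


def train_unigram_tagger(pairs):
    """Map each Latin character to its most frequent label (no context)."""
    counts = defaultdict(Counter)
    for word, labels in pairs:
        if len(word) != len(labels):
            raise ValueError(f"misaligned pair: {word!r}")
        for ch, label in zip(word, labels):
            counts[ch][label] += 1
    return {ch: c.most_common(1)[0][0] for ch, c in counts.items()}


def transliterate(word, tagger):
    """Label every character, then join the labels into the output word.

    Characters never seen in training are copied through unchanged.
    """
    return "".join(tagger.get(ch, ch) for ch in word)


def word_accuracy(pairs, tagger):
    """Fraction of words whose full transliteration matches the reference."""
    correct = sum(
        transliterate(word, tagger) == "".join(labels)
        for word, labels in pairs
    )
    return correct / len(pairs)


if __name__ == "__main__":
    tagger = train_unigram_tagger(TRAIN)
    for word, labels in TRAIN:
        print(word, transliterate(word, tagger), "".join(labels))
    print("word accuracy:", word_accuracy(TRAIN, tagger))
```

In this toy setup the baseline always maps `a` to the same label, so it cannot tell a written long vowel from an unwritten short one. Labels that depend on neighbouring characters are the kind of ambiguity that context-aware sequence models such as CRF or BLSTM taggers are designed to capture.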