Automatic diacritization of Arabic text using recurrent neural networks

Bibliographic Details
Published in	International journal on document analysis and recognition Vol. 18, no. 2, pp. 183-197
Main Authors	Abandah, Gheith A.; Graves, Alex; Al-Shagoor, Balkees; Arabiyat, Alaa; Jamour, Fuad; Al-Taee, Majid
Format	Journal Article
Language	English
Published	Berlin/Heidelberg: Springer Berlin Heidelberg, 01.06.2015
Subjects	Computer Science; Image Processing and Computer Vision; Original Paper; Pattern Recognition; Recurrent neural networks; Deep neural networks; Machine learning; Long short-term memory; Automatic diacritization; Sequence transcription; Arabic text

More Information
Summary:	This paper presents a sequence transcription approach for the automatic diacritization of Arabic text. A recurrent neural network is trained to transcribe undiacritized Arabic text into fully diacritized sentences. We use a deep bidirectional long short-term memory network that builds high-level linguistic abstractions of text and exploits long-range context in both input directions. This approach differs from previous approaches in that no lexical, morphological, or syntactical analysis is performed on the data before it is processed by the network. Nonetheless, when the network output is post-processed with our error correction techniques, it achieves state-of-the-art performance, yielding average diacritic and word error rates of 2.09 % and 5.82 %, respectively, on samples from 11 books. For the LDC ATB3 benchmark, this approach reduces the diacritic error rate by 25 %, the word error rate by 20 %, and the last-letter diacritization error rate by 33 % relative to the best published results.
ISSN:	1433-2833, 1433-2825
DOI:	10.1007/s10032-015-0242-2