A Pointwise Approach for Vietnamese Diacritics Restoration

Bibliographic Details
Published in: 2012 International Conference on Asian Language Processing (IALP), pp. 189-192
Main Authors: Luu, T. A.; Yamamoto, K.
Format: Conference Proceeding
Language: English, Japanese
Published: IEEE, 01.11.2012
ISBN: 9781467361132, 1467361135
DOI: 10.1109/IALP.2012.18

Summary: The automatic insertion of diacritics in electronic texts is necessary for a number of languages, including French, Romanian, Croatian, Sindhi, and Vietnamese. When diacritics are removed from a word and the resulting string of characters is not a word, the diacritics are easy to recover. However, sometimes the resulting string is also a word, possibly with different grammatical properties or a different meaning, and this makes recovering the missing diacritics difficult for software as well as for human readers. This paper is the first to study automatic diacritic restoration in Vietnamese texts. Modern Vietnamese is a complex language with many diacritical marks, and white space does not always function as a word separator. This paper proposes a pointwise approach for automatically recovering missing diacritics, using three kinds of features for classification: n-grams of syllables, n-grams of syllable types, and dictionary word features. Our experiments show that the proposed method recovers diacritics with 94.7% accuracy.
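
The summary names three feature families but gives no detail, so the following Python sketch only illustrates how such features might be gathered for one syllable. It is not the authors' implementation. The categories returned by `syllable_type`, the one-syllable context window, the feature-name strings, and the toy lexicon are all assumptions made for illustration. "Pointwise" is taken here to mean that each syllable is classified independently from the diacritic-free input, without using the labels predicted for its neighbours.

```python
import unicodedata


def strip_diacritics(text):
    """Remove tone and vowel marks; map đ/Đ to d/D, which NFD does not decompose."""
    text = text.replace("đ", "d").replace("Đ", "D")
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def syllable_type(token):
    """Coarse token category (a hypothetical stand-in for 'syllable types')."""
    if token.isdigit():
        return "NUM"
    if token.isalpha():
        return "CAP" if token[0].isupper() else "LOW"
    return "OTHER"


def pointwise_features(syllables, i, dictionary=frozenset()):
    """Collect features for position i using only the diacritic-free input."""
    def at(j):
        # Positions outside the sentence are represented by a padding token.
        return syllables[j] if 0 <= j < len(syllables) else "<PAD>"

    prev, cur, nxt = at(i - 1), at(i), at(i + 1)
    feats = {
        # n-grams of syllables (unigram and the two surrounding bigrams)
        f"syl={cur.lower()}",
        f"syl-1|syl={prev.lower()}|{cur.lower()}",
        f"syl|syl+1={cur.lower()}|{nxt.lower()}",
        # n-grams of syllable types
        f"type={syllable_type(cur)}",
        f"type-1|type={syllable_type(prev)}|{syllable_type(cur)}",
        f"type|type+1={syllable_type(cur)}|{syllable_type(nxt)}",
    }
    # Dictionary word features: does a neighbouring bigram match a
    # diacritic-free dictionary entry?
    if f"{prev} {cur}".lower() in dictionary:
        feats.add("dict:left")
    if f"{cur} {nxt}".lower() in dictionary:
        feats.add("dict:right")
    return feats


# Toy lexicon (hypothetical), stored in diacritic-free form for lookup.
lexicon = {strip_diacritics(w) for w in ["việt nam", "hà nội"]}
tokens = strip_diacritics("Tôi sống ở Việt Nam").split()
features = pointwise_features(tokens, 3, lexicon)
```

The ambiguity described in the summary is visible in `strip_diacritics`: distinct syllables such as "mà" and "má" both reduce to "ma", so a classifier has to rely on context features like these to choose among the candidates.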