A Pointwise Approach for Vietnamese Diacritics Restoration
Published in | 2012 International Conference on Asian Language Processing (IALP) pp. 189 - 192 |
---|---|
Main Authors | , |
Format | Conference Proceeding |
Language | English Japanese |
Published | IEEE 01.11.2012 |
ISBN | 9781467361132 1467361135 |
DOI | 10.1109/IALP.2012.18 |
Summary: | Automatic insertion of diacritics in electronic texts is necessary for a number of languages, including French, Romanian, Croatian, Sindhi, and Vietnamese. When diacritics are removed from a word and the resulting string of characters is not a word, the diacritics are easy to recover. Sometimes, however, the resulting string is itself a word, possibly with different grammatical properties or a different meaning, which makes recovering the missing diacritics difficult for software and human readers alike. This paper is the first to study automatic diacritic restoration in Vietnamese texts. Modern Vietnamese is a complex language with many diacritical marks, and white space does not always function as a word separator. This paper proposes a pointwise approach for automatically recovering missing diacritics, using three kinds of features for classification: n-grams of syllables, n-grams of syllable types, and dictionary word features. Our experiments show that the proposed method recovers diacritics with 94.7% accuracy. |
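
The three feature families named in the summary can be illustrated with a minimal Python sketch. It is not the authors' implementation: the window size, the feature string format, the coarse `syllable_type` categories, and the dictionary lookup over adjacent syllable pairs are all assumptions made for illustration, since the abstract does not specify them.

```python
import unicodedata


def strip_diacritics(text):
    """Remove diacritics from Vietnamese text (the form a restorer receives)."""
    # 'đ'/'Đ' have no Unicode decomposition, so they are mapped by hand.
    text = text.replace("đ", "d").replace("Đ", "D")
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def syllable_type(syllable):
    """Hypothetical coarse syllable type; the paper's own types may differ."""
    if syllable.isdigit():
        return "NUM"
    if syllable.isalpha():
        return "CAP" if syllable[0].isupper() else "LOW"
    return "OTHER"


def pointwise_features(syllables, i, dictionary, window=2):
    """Features for classifying syllables[i] independently of other decisions.

    Assumes window >= 1 and a dictionary of lowercase, diacritic-free
    two-syllable words joined by a single space.
    """
    padded = ["<s>"] * window + list(syllables) + ["</s>"] * window
    c = i + window  # index of the target syllable inside `padded`
    feats = []
    # Unigrams of syllables (S) and of syllable types (T) in the window.
    for off in range(-window, window + 1):
        feats.append(f"S[{off}]={padded[c + off]}")
        feats.append(f"T[{off}]={syllable_type(padded[c + off])}")
    # Bigrams of adjacent syllables and of their types.
    for off in range(-window, window):
        left, right = padded[c + off], padded[c + off + 1]
        feats.append(f"S[{off},{off + 1}]={left}_{right}")
        feats.append(
            f"T[{off},{off + 1}]={syllable_type(left)}_{syllable_type(right)}"
        )
    # Dictionary features: does the target form a known word with a neighbour?
    for off in (-1, 0):
        pair = f"{padded[c + off]} {padded[c + off + 1]}".lower()
        feats.append(f"D[{off},{off + 1}]={int(pair in dictionary)}")
    return feats


syllables = strip_diacritics("Tôi học tiếng Việt").split()
features = pointwise_features(syllables, 2, {"tieng viet"})
```

In a pointwise setup, a feature list like this would be computed for each syllable position separately and passed to a classifier that predicts the diacritized form of that syllable. The choice of classifier is left open here.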