A Hybrid Approach to Word Segmentation of Vietnamese Texts
| Published in | Language and Automata Theory and Applications, Vol. 5196, pp. 240–249 |
|---|---|
| Main Authors | , , , |
| Format | Book Chapter |
| Language | English |
| Published | Springer Berlin / Heidelberg, Germany, 2008 |
| Series | Lecture Notes in Computer Science |
Summary: We present in this article a hybrid approach to automatically tokenize Vietnamese text. The approach combines finite-state automata, regular expression parsing, and a maximal-matching strategy augmented with statistical methods to resolve segmentation ambiguities. The Vietnamese lexicon in use is compactly represented by a minimal finite-state automaton. A text to be tokenized is first parsed into lexical phrases and other patterns using pre-defined regular expressions. The automaton is then deployed to build linear graphs corresponding to the phrases to be segmented. Applying the maximal-matching strategy to a graph yields all candidate segmentations of a phrase. An ambiguity resolver, which uses a smoothed bigram language model, then chooses the most probable segmentation of the phrase. The hybrid approach is implemented to create vnTokenizer, a highly accurate tokenizer for Vietnamese texts.
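To illustrate how the candidate-generation and ambiguity-resolution steps fit together, here is a minimal Python sketch. It is not the authors' vnTokenizer code: a plain set stands in for the minimal finite-state lexicon automaton, the lexicon and the bigram/unigram counts are invented toy data, add-one smoothing stands in for the paper's smoothing method, and all function names (`segmentations`, `bigram_logprob`, `tokenize`) are assumptions made for this example.

```python
import math

# Toy lexicon standing in for the paper's minimal finite-state automaton.
# Multi-syllable entries are single Vietnamese words ("học sinh" = "student").
LEXICON = {"học", "sinh", "học sinh", "giỏi", "là"}
MAX_SYLLABLES = 3  # length of the longest lexicon entry, in syllables

def segmentations(syllables):
    """Enumerate candidate segmentations of a syllable sequence.

    Walks the linear graph of syllable positions, matching lexicon entries
    longest-first; unknown single syllables are kept as words so every
    phrase has at least one segmentation.
    """
    if not syllables:
        yield []
        return
    for n in range(min(MAX_SYLLABLES, len(syllables)), 0, -1):
        word = " ".join(syllables[:n])
        if word in LEXICON or n == 1:
            for rest in segmentations(syllables[n:]):
                yield [word] + rest

# Toy counts; the paper's resolver uses a bigram model estimated from a
# segmented corpus. Add-one smoothing keeps this sketch simple.
BIGRAMS = {("học sinh", "giỏi"): 5, ("học", "sinh"): 1, ("sinh", "giỏi"): 1}
UNIGRAMS = {"học sinh": 6, "học": 2, "sinh": 2, "giỏi": 6, "là": 3}
V = len(UNIGRAMS)  # vocabulary size, used by the smoothing

def bigram_logprob(seg):
    """Log-probability of a segmentation under the smoothed bigram model."""
    score = 0.0
    for prev, word in zip(seg, seg[1:]):
        num = BIGRAMS.get((prev, word), 0) + 1
        den = UNIGRAMS.get(prev, 0) + V
        score += math.log(num / den)
    return score

def tokenize(phrase):
    """Return the most probable segmentation of a whitespace-split phrase."""
    return max(segmentations(phrase.split()), key=bigram_logprob)

print(tokenize("học sinh giỏi"))  # -> ['học sinh', 'giỏi']
```

For "học sinh giỏi", the generator produces both ['học sinh', 'giỏi'] and ['học', 'sinh', 'giỏi'], and the bigram scorer prefers the first. In the real system a minimal automaton shares common prefixes and suffixes across the entire lexicon, so membership tests stay fast and memory-compact, and the regular-expression pass would already have split raw text into phrases before this step runs.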
ISBN: 3540882812; 9783540882817
ISSN: 0302-9743; 1611-3349
DOI: 10.1007/978-3-540-88282-4_23