A Hybrid Approach to Word Segmentation of Vietnamese Texts
| Published in | Language and Automata Theory and Applications, Vol. 5196, pp. 240–249 |
|---|---|
| Main Authors | , , , |
| Format | Book Chapter |
| Language | English |
| Published | Springer Berlin / Heidelberg, Germany, 2008 |
| Series | Lecture Notes in Computer Science |
Summary: We present in this article a hybrid approach to automatically tokenize Vietnamese text. The approach combines finite-state automata, regular expression parsing, and a maximal-matching strategy augmented with statistical methods to resolve segmentation ambiguities. The Vietnamese lexicon in use is compactly represented by a minimal finite-state automaton. A text to be tokenized is first parsed into lexical phrases and other patterns using pre-defined regular expressions. The automaton is then deployed to build linear graphs corresponding to the phrases to be segmented. Applying the maximal-matching strategy to a graph yields all candidate segmentations of a phrase. An ambiguity resolver, which uses a smoothed bigram language model, then chooses the most probable segmentation of the phrase. The hybrid approach is implemented to create vnTokenizer, a highly accurate tokenizer for Vietnamese texts.
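To illustrate how the candidate-generation and ambiguity-resolution steps fit together, here is a minimal Python sketch. It is not the authors' vnTokenizer code: a plain set stands in for the minimal finite-state lexicon automaton, the lexicon and the bigram/unigram counts are invented toy data, add-one smoothing stands in for the paper's smoothing method, and all function names (`segmentations`, `bigram_logprob`, `tokenize`) are assumptions made for this example.

```python
import math

# Toy lexicon standing in for the paper's minimal finite-state automaton.
# Multi-syllable entries are single Vietnamese words ("học sinh" = "student").
LEXICON = {"học", "sinh", "học sinh", "giỏi", "là"}
MAX_SYLLABLES = 3  # length of the longest lexicon entry, in syllables

def segmentations(syllables):
    """Enumerate candidate segmentations of a syllable sequence.

    Walks the linear graph of syllable positions, matching lexicon entries
    longest-first; unknown single syllables are kept as words so every
    phrase has at least one segmentation.
    """
    if not syllables:
        yield []
        return
    for n in range(min(MAX_SYLLABLES, len(syllables)), 0, -1):
        word = " ".join(syllables[:n])
        if word in LEXICON or n == 1:
            for rest in segmentations(syllables[n:]):
                yield [word] + rest

# Toy counts; the paper's resolver uses a bigram model estimated from a
# segmented corpus. Add-one smoothing keeps this sketch simple.
BIGRAMS = {("học sinh", "giỏi"): 5, ("học", "sinh"): 1, ("sinh", "giỏi"): 1}
UNIGRAMS = {"học sinh": 6, "học": 2, "sinh": 2, "giỏi": 6, "là": 3}
V = len(UNIGRAMS)  # vocabulary size, used by the smoothing

def bigram_logprob(seg):
    """Log-probability of a segmentation under the smoothed bigram model."""
    score = 0.0
    for prev, word in zip(seg, seg[1:]):
        num = BIGRAMS.get((prev, word), 0) + 1
        den = UNIGRAMS.get(prev, 0) + V
        score += math.log(num / den)
    return score

def tokenize(phrase):
    """Return the most probable segmentation of a whitespace-split phrase."""
    return max(segmentations(phrase.split()), key=bigram_logprob)

print(tokenize("học sinh giỏi"))  # -> ['học sinh', 'giỏi']
```

For "học sinh giỏi", the generator produces both ['học sinh', 'giỏi'] and ['học', 'sinh', 'giỏi'], and the bigram scorer prefers the first. In the real system a minimal automaton shares common prefixes and suffixes across the entire lexicon, so membership tests stay fast and memory-compact, and the regular-expression pass would already have split raw text into phrases before this step runs.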
ISBN: 3540882812; 9783540882817
ISSN: 0302-9743; 1611-3349
DOI: 10.1007/978-3-540-88282-4_23