A Hybrid Approach to Word Segmentation of Vietnamese Texts

Bibliographic Details
Published in: Language and Automata Theory and Applications, Vol. 5196, pp. 240–249
Main Authors: Lê, Hồng Phương; Nguyễn, Thị Minh Huyền; Roussanaly, Azim; Hồ, Tường Vinh
Format: Book Chapter
Language: English
Published: Springer Berlin Heidelberg, Germany, 2008
Series: Lecture Notes in Computer Science
Summary: We present in this article a hybrid approach to automatically tokenize Vietnamese text. The approach combines finite-state automata, regular-expression parsing, and a maximal-matching strategy augmented by statistical methods to resolve segmentation ambiguities. The Vietnamese lexicon in use is compactly represented by a minimal finite-state automaton. A text to be tokenized is first parsed into lexical phrases and other patterns using predefined regular expressions. The automaton is then deployed to build linear graphs corresponding to the phrases to be segmented. The application of a maximal-matching strategy on a graph yields all candidate segmentations of a phrase. An ambiguity resolver, which uses a smoothed bigram language model, then chooses the most probable segmentation of the phrase. The hybrid approach is implemented to create vnTokenizer, a highly accurate tokenizer for Vietnamese texts.
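The pipeline in the summary (graph construction over a lexicon, enumeration of candidate segmentations, bigram-based disambiguation) can be illustrated with a short sketch. The Python code below is not the authors' vnTokenizer implementation: the lexicon, the counts, the function names, and the use of Laplace smoothing are all assumptions made for illustration, and the minimal automaton is replaced by a plain set lookup.

```python
import math

def build_graph(syllables, lexicon, max_len=4):
    """Linear graph over syllables: an edge (i, j) exists when
    syllables[i:j] forms a lexicon word. Single syllables are always
    allowed so that every phrase has at least one segmentation."""
    n = len(syllables)
    edges = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, min(i + max_len, n) + 1):
            if j == i + 1 or " ".join(syllables[i:j]) in lexicon:
                edges[i].append(j)
    return edges

def all_segmentations(syllables, edges):
    """Enumerate all candidate segmentations as paths through the graph."""
    n = len(syllables)
    def walk(i):
        if i == n:
            yield []
            return
        for j in edges[i]:
            word = " ".join(syllables[i:j])
            for rest in walk(j):
                yield [word] + rest
    return list(walk(0))

def bigram_logprob(words, bigram, unigram, vocab_size):
    """Laplace-smoothed bigram log-probability of one segmentation
    (a stand-in for the paper's smoothed bigram language model)."""
    score, prev = 0.0, "<s>"
    for w in words + ["</s>"]:
        num = bigram.get((prev, w), 0) + 1
        den = unigram.get(prev, 0) + vocab_size
        score += math.log(num / den)
        prev = w
    return score

def segment(phrase, lexicon, bigram, unigram):
    """Pick the most probable segmentation of a whitespace-split phrase."""
    syllables = phrase.split()
    edges = build_graph(syllables, lexicon)
    candidates = all_segmentations(syllables, edges)
    return max(candidates,
               key=lambda ws: bigram_logprob(ws, bigram, unigram, len(lexicon)))
```

A toy run on the classic ambiguity "học sinh học sinh học" ("students study biology"), with made-up counts favoring the intended reading:

```python
lexicon = {"học sinh", "sinh học"}
bigram = {("<s>", "học sinh"): 2, ("học sinh", "học"): 1, ("học", "sinh học"): 1}
unigram = {"<s>": 3, "học sinh": 2, "học": 1, "sinh học": 1}
print(segment("học sinh học sinh học", lexicon, bigram, unigram))
# -> ['học sinh', 'học', 'sinh học'] under these toy counts
```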
ISBN: 3540882812; 9783540882817
ISSN: 0302-9743; 1611-3349
DOI: 10.1007/978-3-540-88282-4_23