Two-Step Word Segmentation Without Prior Knowledge of a Small Amount of Text

Bibliographic Details
Published in	Journal of Japan Society for Fuzzy Theory and Intelligent Informatics Vol. 36; no. 1; pp. 582 - 588
Main Authors	TAKANO, Toshiaki, TOMOTSUGU, Katsuko, MURASE, Ryotaro, TAKASE, Haruhiko, MATSUSHITA, Shinya
Format	Journal Article
Language	Japanese
Published	Iizuka: Japan Society for Fuzzy Theory and Intelligent Informatics, 15.02.2024; Japan Science and Technology Agency
Subjects	English language; Languages; minority languages; natural language processing; Position measurement; Segmentation; Texts; unsupervised morphological analysis; word extraction; Words (language)
ISSN	1347-7986 1881-7203
DOI	10.3156/jsoft.36.1_582

Summary:	This study investigates a word segmentation method based on NPYLM (the nested Pitman-Yor language model) for minority-language texts. Applying conventional segmentation methods to minority-language text is difficult because little is known about the language and little text is available. NPYLM, one of the conventional methods, can segment text without prior knowledge but tends to over-segment when data are insufficient. In this paper, we propose a two-step NPYLM to reduce over-segmentation. First, an NPYLM is trained on the given text to obtain replacement candidates. Next, each candidate word is replaced by a distinct single character. Then, a second NPYLM is trained on the replaced text. Finally, we obtain a segmentation result with less over-segmentation. Experimental results show that the proposed method improves the F-measure (based on segmentation position) and the average word length, and thus reduces over-segmentation, for texts in English, Japanese, and a minority language. We conclude that the proposed method is effective across various languages.
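The four steps in the summary can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `train_and_segment` callable is a hypothetical interface standing in for NPYLM training, the candidate criterion in `select_candidates` (multi-character words seen at least twice) is an assumption because the summary does not state how candidates are chosen, and the placeholder characters are drawn from the Unicode Private Use Area on the assumption that the corpus does not already contain them.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical interface: train an unsupervised segmenter on raw lines and
# return each line as a list of words. A real NPYLM would be plugged in here.
Segmenter = Callable[[List[str]], List[List[str]]]

# Unicode Private Use Area, assumed to be absent from the input text.
PUA_START, PUA_END = 0xE000, 0xF8FF


def select_candidates(segmented: List[List[str]], min_len: int = 2,
                      min_count: int = 2) -> List[str]:
    """Assumed criterion: multi-character words seen at least min_count times."""
    counts = Counter(w for line in segmented for w in line if len(w) >= min_len)
    return [w for w, c in counts.most_common() if c >= min_count]


def two_step_segment(lines: List[str], train_and_segment: Segmenter,
                     min_len: int = 2, min_count: int = 2) -> List[List[str]]:
    # Step 1: the first model segments the raw text and yields candidates.
    first = train_and_segment(lines)
    candidates = select_candidates(first, min_len, min_count)
    if len(candidates) > PUA_END - PUA_START + 1:
        raise ValueError("more candidates than available placeholder characters")

    # Step 2: replace each candidate word with its own single character.
    to_char = {w: chr(PUA_START + i) for i, w in enumerate(candidates)}
    to_word = {c: w for w, c in to_char.items()}
    replaced = ["".join(to_char.get(w, w) for w in line) for line in first]

    # Step 3: the second model is trained on the shortened, replaced text.
    second = train_and_segment(replaced)

    # Step 4: expand the placeholders back into the original words.
    return [["".join(to_word.get(ch, ch) for ch in word) for word in line]
            for line in second]


def chunk_segmenter(lines: List[str], size: int = 2) -> List[List[str]]:
    """Toy stand-in (NOT NPYLM): cut each line into fixed-size chunks."""
    return [[line[i:i + size] for i in range(0, len(line), size)]
            for line in lines]


print(two_step_segment(["abcdab", "abcd"], chunk_segmenter))
# [['abcd', 'ab'], ['abcd']]
```

With the toy segmenter, the second pass sees each candidate as one character, so adjacent candidates such as "ab" and "cd" can be merged into a longer word, which mirrors how the replacement step is meant to counter over-segmentation.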