Two-Step Word Segmentation Without Prior Knowledge of a Small Amount of Text

Bibliographic Details
Published in Journal of Japan Society for Fuzzy Theory and Intelligent Informatics, Vol. 36, no. 1, pp. 582-588
Main Authors TAKANO, Toshiaki, TOMOTSUGU, Katsuko, MURASE, Ryotaro, TAKASE, Haruhiko, MATSUSHITA, Shinya
Format Journal Article
Language Japanese
Published Iizuka Japan Society for Fuzzy Theory and Intelligent Informatics 15.02.2024
Japan Science and Technology Agency
ISSN 1347-7986, 1881-7203
DOI 10.3156/jsoft.36.1_582

Summary: This study investigates a word segmentation method based on NPYLM (the nested Pitman-Yor language model) for minority-language texts. Applying conventional segmentation methods to such texts is difficult because little is known about the language and only a small amount of text is available. In particular, NPYLM, one of the conventional methods, can segment text without prior knowledge, but it tends to over-segment when the data are insufficient. In this paper, we propose a two-step NPYLM to reduce this over-segmentation. First, a first NPYLM is trained on the given text to obtain replacement candidates. Next, each candidate word is replaced by a distinct single character. Then, a second NPYLM is trained on the replaced text. Finally, we obtain a segmentation result with less over-segmentation. Experimental results show that the proposed method improves the F-measure (based on segmentation positions) and the average word length, that is, it reduces over-segmentation, for texts in English, Japanese, and a minority language. We conclude that the proposed method is effective across various languages.
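The two-step procedure in the summary can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: `segment` is a hypothetical stand-in for training and decoding with NPYLM, the candidate rule (a minimum frequency and a minimum length) is assumed because the summary does not say how candidates are chosen, and the replacement characters are taken from the Unicode private-use area on the assumption that the input text does not already contain them.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical interface standing in for NPYLM: it takes unsegmented
# sentences and returns one list of tokens per sentence.
Segmenter = Callable[[List[str]], List[List[str]]]

PUA_START, PUA_END = 0xE000, 0xF8FF  # Unicode private-use area


def two_step_segment(sentences: List[str], segment: Segmenter,
                     min_count: int = 2, min_len: int = 2) -> List[List[str]]:
    # Step 1: the first segmentation yields replacement candidates.
    first = segment(sentences)
    counts = Counter(tok for sent in first for tok in sent)
    # Assumed candidate rule: tokens that are long and frequent enough.
    candidates = [w for w, c in counts.items()
                  if len(w) >= min_len and c >= min_count]
    if len(candidates) > PUA_END - PUA_START + 1:
        raise ValueError("too many candidates for the private-use area")

    # Step 2: replace each candidate word by its own single character.
    to_char = {w: chr(PUA_START + i) for i, w in enumerate(candidates)}
    to_word = {ch: w for w, ch in to_char.items()}
    replaced = ["".join(to_char.get(tok, tok) for tok in sent)
                for sent in first]

    # Step 3: segment the replaced (shorter) text a second time.
    second = segment(replaced)

    # Step 4: expand the placeholders back into the original words.
    return [["".join(to_word.get(ch, ch) for ch in tok) for tok in sent]
            for sent in second]


def chunk2(sentences: List[str]) -> List[List[str]]:
    # Toy segmenter for demonstration only: it cuts every two characters,
    # mimicking a model that over-segments.
    return [[s[i:i + 2] for i in range(0, len(s), 2)] for s in sentences]


result = two_step_segment(["abcdabcd"], chunk2)
print(result)  # [['abcd', 'abcd']]
```

With the toy segmenter, the first pass produces the short tokens `ab` and `cd`; after each is replaced by one character, the second pass groups them, so the final words are longer. A real NPYLM would behave differently, but the data flow between the two passes is the same.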