Two-Step Word Segmentation Without Prior Knowledge of a Small Amount of Text
Published in | Journal of Japan Society for Fuzzy Theory and Intelligent Informatics Vol. 36; no. 1; pp. 582 - 588 |
---|---|
Main Authors | , , , , |
Format | Journal Article |
Language | Japanese |
Published | Iizuka : Japan Society for Fuzzy Theory and Intelligent Informatics, 15.02.2024 (Japan Science and Technology Agency) |
Subjects | |
ISSN | 1347-7986 1881-7203 |
DOI | 10.3156/jsoft.36.1_582 |
Summary: | This study investigates a word segmentation method based on NPYLM for minority-language texts. Applying conventional segmentation methods to text in a minority language is difficult because knowledge of the language is lacking and the amount of available text is small. In particular, NPYLM, one of the conventional methods, can segment text without prior knowledge, but it tends to over-segment when data are insufficient. In this paper, we propose a two-step NPYLM to reduce this over-segmentation. First, a first NPYLM is trained on the given text to obtain replacement candidates. Next, each candidate word is replaced with a distinct single character. Then, a second NPYLM is trained on the replaced text. Finally, we obtain a segmentation result with less over-segmentation. Experimental results show that the proposed method improves the F-measure (based on segmentation positions) and the average word length, and thus reduces over-segmentation, for texts in English, Japanese, and a minority language. We conclude that the proposed method performs effectively across various languages. |
---|---|
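The two-step procedure described in the summary can be outlined as follows. The sketch below is a minimal illustration, not the paper's implementation: `train_npylm` and `select_candidates` are hypothetical stand-ins for an NPYLM trainer and a candidate-selection rule (neither is specified in this record), and the use of Unicode private-use-area characters as placeholders is an assumption made purely for illustration.

```python
def two_step_segmentation(text, train_npylm, select_candidates):
    """Two-pass segmentation intended to reduce over-segmentation.

    train_npylm(text) -> list of segmented words (hypothetical trainer).
    select_candidates(words) -> iterable of candidate words to protect
    from re-segmentation in the second pass (hypothetical selection rule).
    """
    # Step 1: first NPYLM pass on the raw text.
    first_pass_words = train_npylm(text)

    # Choose replacement candidates from the first-pass vocabulary.
    candidates = list(select_candidates(first_pass_words))

    # Step 2: replace each candidate word with a distinct single character
    # (private-use-area code points, so they cannot collide with real text).
    placeholder = {w: chr(0xE000 + i) for i, w in enumerate(candidates)}
    replaced_text = text
    for word, ch in placeholder.items():
        replaced_text = replaced_text.replace(word, ch)

    # Second NPYLM pass on the replaced text.
    second_pass_words = train_npylm(replaced_text)

    # Map the placeholder characters back to the original words.
    reverse = {ch: w for w, ch in placeholder.items()}
    return ["".join(reverse.get(c, c) for c in word)
            for word in second_pass_words]
```

In this sketch, a candidate word that survives the replacement behaves as a single atomic symbol during the second training pass, so the second model cannot split it further; this is one plausible reading of how the replacement step counteracts over-segmentation on small corpora.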