Improving Domain-Specific NER in the Indonesian Language Through Domain Transfer and Data Augmentation
Published in | Journal of advanced computational intelligence and intelligent informatics Vol. 28; no. 6; pp. 1299 - 1312 |
---|---|
Main Authors | , , |
Format | Journal Article |
Language | English |
Published | Tokyo, Fuji Technology Press Co. Ltd, 01.11.2024 |
Summary: | Named entity recognition (NER) research usually focuses on general domains, and specific domains in languages other than English have rarely been explored. In Indonesian NER, the available resources for specific domains are scarce and small in scale. Building a large dataset is time-consuming and costly, whereas a small dataset is practical. Motivated by this circumstance, we contribute to specific-domain NER in the Indonesian language by providing a small-scale specific-domain NER dataset, IDCrossNER, which is created semi-automatically via automatic translation and projection from English, followed by manual correction for realistic Indonesian localization. Using this dataset, we perform the following analyses: (1) cross-domain transfer learning from general domains and specific-domain augmentation utilizing GPT models to improve performance on small-scale datasets, and (2) an evaluation of supervised approaches (i.e., in- and cross-domain learning) vs. GPT-4o on IDCrossNER. Our findings are as follows. (1) Cross-domain transfer learning is effective. However, on the general-domain side, the performance is more sensitive to the size of the pretrained language model (PLM) than to the size and quality of the general-domain source dataset; on the specific-domain side, the improvement from GPT-based data augmentation becomes significant when only limited source data and a small PLM are available. (2) The evaluation of GPT-4o on IDCrossNER demonstrates that it is a powerful tool for specific-domain Indonesian NER in a few-shot setting, although it underperforms in a zero-shot setting. Our dataset is publicly available at https://github.com/khairunnisaor/idcrossner. |
---|---|
ISSN: | 1343-0130, 1883-8014 |
DOI: | 10.20965/jaciii.2024.p1299 |