Improving Domain-Specific NER in the Indonesian Language Through Domain Transfer and Data Augmentation

Named entity recognition (NER) usually focuses on general domains. Specific domains beyond the English language have rarely been explored. In Indonesian NER, the available resources for specific domains are scarce and on small scales. Building a large dataset is time-consuming and costly, whereas a...

Full description

Saved in:

Bibliographic Details
Published in	Journal of advanced computational intelligence and intelligent informatics Vol. 28; no. 6; pp. 1299 - 1312
Main Authors	Khairunnisa, Siti Oryza, Chen, Zhousi, Komachi, Mamoru
Format	Journal Article
Language	English
Published	Tokyo Fuji Technology Press Co. Ltd 01.11.2024
Subjects	Data augmentation Datasets English language Learning Performance enhancement Performance evaluation
Online Access	Get full text

Cover

Loading…

More Information
Summary:	Named entity recognition (NER) usually focuses on general domains. Specific domains beyond the English language have rarely been explored. In Indonesian NER, the available resources for specific domains are scarce and on small scales. Building a large dataset is time-consuming and costly, whereas a small dataset is practical. Motivated by this circumstance, we contribute to specific-domain NER in the Indonesian language by providing a small-scale specific-domain NER dataset, IDCrossNER, which is semi-automatically created via automatic translation and projection from English with manual correction for realistic Indonesian localization. With the help of the dataset, we could perform the following analyses: (1) cross-domain transfer learning from general domains and specific-domain augmentation utilizing GPT models to improve the performance of small-scale datasets, and (2) an evaluation of supervised approaches (i.e., in- and cross-domain learning) vs. GPT-4o on IDCrossNER. Our findings include the following. (1) Cross-domain transfer learning is effective. However, on the general domain side, the performance is more sensitive to the size of the pretrained language model (PLM) than to the size and quality of the source dataset in the general domain; on the specific-domain side, the improvement from GPT-based data augmentation becomes significant when only limited source data and a small PLM are available. (2) The evaluation of GPT-4o on our IDCrossNER demonstrates that it is a powerful tool for specific-domain Indonesian NER in a few-shot setting, although it underperforms in prediction in a zero-shot setting. Our dataset is publicly available at https://github.com/khairunnisaor/idcrossner.
Bibliography:	ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 14
ISSN:	1343-0130 1883-8014
DOI:	10.20965/jaciii.2024.p1299