Towards Malay named entity recognition: an open-source dataset and a multi-task framework

Bibliographic Details
Published in: Connection Science Vol. 35; no. 1
Main Authors: Fu, Yingwen; Lin, Nankai; Yang, Zhihe; Jiang, Shengyi
Format: Journal Article
Language: English
Published: Abingdon, Taylor & Francis, 31.12.2023
Taylor & Francis Ltd
Summary: Named entity recognition (NER) is a key component of many natural language processing (NLP) applications. Most advanced NER research, however, has not been widely applied to low-resource languages such as Malay, because the methods are data-hungry. In this paper, we present a system for building a Malay NER dataset (MS-NER) of 20,146 sentences using labelled datasets of homologous languages and iterative optimisation. Additionally, we propose a multi-task framework, MTBR, to integrate boundary information more effectively for NER. Specifically, boundary detection is treated as an auxiliary task, and an enhanced Bidirectional Revision module with a gated ignoring mechanism performs conditional label transfer, which can reduce error propagation from the auxiliary task. We conduct extensive experiments on Malay, Indonesian, and English. Experimental results show that MTBR achieves competitive performance and tends to outperform multiple baselines. The constructed dataset and model will be made publicly available as a new, reliable benchmark for Malay NER.
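As an illustration of the auxiliary boundary-detection task mentioned in the summary, the sketch below derives type-free boundary labels from BIO-tagged NER labels. The function name `bio_to_boundary`, the B/I/E/S/O boundary scheme, and the example sentence are assumptions made for illustration. The record does not state which scheme the paper uses, and this is not the authors' implementation.

```python
from typing import List


def bio_to_boundary(tags: List[str]) -> List[str]:
    """Strip entity types from BIO tags, keeping only span boundaries.

    Hypothetical scheme (an assumption, not taken from the paper):
    "B" = first token of a multi-token entity, "I" = inside,
    "E" = last token, "S" = single-token entity, "O" = outside.
    """
    boundary = []
    for i, tag in enumerate(tags):
        if tag == "O":
            boundary.append("O")
            continue
        prefix, entity_type = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- with the same type.
        continues = next_tag == "I-" + entity_type
        if prefix == "B":
            boundary.append("B" if continues else "S")
        else:  # prefix == "I"
            boundary.append("I" if continues else "E")
    return boundary


# Illustrative tags for "Anwar Ibrahim tiba di Kuala Lumpur"
ner_tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"]
boundary_tags = bio_to_boundary(ner_tags)
```

In a multi-task setup, a tagger would typically be trained to predict both `ner_tags` and `boundary_tags` from a shared encoder. How MTBR transfers labels between the two tasks is described in the paper itself and is not reproduced here.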
ISSN: 0954-0091
1360-0494
DOI: 10.1080/09540091.2022.2159014