Novel data augmentation for named entity recognition

Bibliographic Details
Published in	International journal of speech technology Vol. 26; no. 4; pp. 869-878
Main Authors	Hemateja, Aluru V. N. M.; Kondakath, Gopikrishnan; Das, Susruta; Kothandaraman, Mohanaprasad; Shoba, S.; Pandey, Abhishek; Babu, Rajin; Jain, Abhinav
Format	Journal Article
Language	English
Published	New York: Springer US, 01.12.2023; Springer Nature B.V
Subjects	Algorithms; Artificial Intelligence; Data augmentation; Datasets; Engineering; Natural language processing; Performance enhancement; Recognition; Search engines; Sentences; Signal, Image and Speech Processing; Social Sciences; Synthetic data; Transformers; Augmentation; Seq2Seq; Named entity recognition (NER); RoBERTa; Sanity checker; Word2Vec; ELMo; BERT; Natural language processing (NLP)

More Information
Summary:	Named entity recognition (NER) is a crucial natural language processing (NLP) task used in applications such as voice assistants, search engines, and customer support. Available datasets often lack entities relevant to the use case, which makes them insufficient for training. Data augmentation is a method in which synthetic data is fabricated from existing data to enhance the existing dataset. Existing data augmentation methods do not consider the grammatical and logical correctness of the fabricated sentences, which decreases the performance of transformer-based NER models. This paper proposes a novel data augmentation method with a sanity checker that checks the correctness of the augmented sentences and produces augmented data that improves the performance of transformer-based NER models. When the proposed augmentation algorithm was tested on the CoNLL-2003 dataset, a significant increase in F1 score was observed: from 94.73% to 95.37% for BERT-based NER and from 94.13% to 95.14% for RoBERTa-based NER.
ISSN:	1381-2416 1572-8110
DOI:	10.1007/s10772-023-10055-8
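
Illustrative sketch (not from the paper): the summary describes fabricating sentences from existing data and passing them through a sanity checker before adding them to the training set. The record does not say how the authors generate or check sentences, so the code below is only a minimal augment-then-filter sketch under stated assumptions: sentences are BIO-tagged token lists, augmentation is plain same-type entity replacement from a hypothetical `gazetteer`, and `no_adjacent_duplicates` is a toy stand-in for whatever grammatical and logical checks the paper's sanity checker performs. All function and parameter names are hypothetical.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple

Sentence = List[Tuple[str, str]]  # (token, BIO tag) pairs


def entity_spans(sentence: Sentence) -> List[Tuple[int, int, str]]:
    """Return (start, end, type) for each BIO entity; end is exclusive."""
    spans, start, etype = [], None, None
    # A sentinel "O" token closes an entity that runs to the end of the sentence.
    for i, (_, tag) in enumerate(sentence + [("", "O")]):
        if tag.startswith("I-") and start is not None and tag[2:] == etype:
            continue  # still inside the current entity
        if start is not None:
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def augment(
    sentence: Sentence,
    gazetteer: Dict[str, List[List[str]]],  # assumption: entity type -> token lists
    sanity_check: Callable[[Sentence], bool],
    rng: random.Random,
) -> Optional[Sentence]:
    """Swap each entity for a same-type one; return None if the check rejects it."""
    out: Sentence = []
    cursor = 0
    for start, end, etype in entity_spans(sentence):
        out.extend(sentence[cursor:start])
        candidates = gazetteer.get(etype)
        if candidates:
            new_tokens = rng.choice(candidates)
            out.extend(
                (tok, ("B-" if j == 0 else "I-") + etype)
                for j, tok in enumerate(new_tokens)
            )
        else:
            out.extend(sentence[start:end])  # no replacement available; keep original
        cursor = end
    out.extend(sentence[cursor:])
    return out if sanity_check(out) else None


def no_adjacent_duplicates(sentence: Sentence) -> bool:
    """Toy placeholder check: reject sentences that repeat a token back to back."""
    tokens = [tok.lower() for tok, _ in sentence]
    return all(a != b for a, b in zip(tokens, tokens[1:]))
```

In such a pipeline, only sentences for which `augment` returns a non-`None` value would be appended to the training data; a realistic checker would replace the toy function with a model-based or rule-based judgment of correctness.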