Using Zero-Shot Transfer to Initialize azWikiNER, a Gold Standard Named Entity Corpus for the Azerbaijani Language

Bibliographic Details
Published in	Text, Speech, and Dialogue, pp. 305-317
Main Authors	Ibiyev, Kamran; Novak, Attila
Format	Book Chapter
Language	English
Published	Cham: Springer International Publishing
Series	Lecture Notes in Computer Science
Subjects	Azerbaijani; M-BERT; Named Entity Recognition; WikiAnn; XLM-RoBERTa

Summary:	Named Entity Recognition (NER) is one of the primary fields of Natural Language Processing, focused on identifying and classifying the entities in a given text. In this paper, we present a gold standard named entity dataset for Azerbaijani created from the Azerbaijani portion of WikiAnn, a ‘silver standard’ NER dataset generated from Wikipedia. In a zero-shot cross-lingual transfer scenario, we used an M-BERT-based NER model trained on the English OntoNotes corpus to add new entity types to the corpus. The output of the model was then hand-corrected. We evaluate the accuracy of the original WikiAnn corpus, the zero-shot performance of two models trained on the OntoNotes corpus, and two transformer-based NER models trained on the training part of the final corpus: one based on M-BERT and another based on XLM-RoBERTa. We release the corpus and the trained models to the public.
ISBN:	303083526X 9783030835262
ISSN:	0302-9743 1611-3349
DOI:	10.1007/978-3-030-83527-9_26