Using Zero-Shot Transfer to Initialize azWikiNER, a Gold Standard Named Entity Corpus for the Azerbaijani Language
Published in | Text, Speech, and Dialogue, pp. 305–317 |
---|---|
Main Authors | , |
Format | Book Chapter |
Language | English |
Published | Cham: Springer International Publishing |
Series | Lecture Notes in Computer Science |
Subjects | |
Summary: | Named Entity Recognition (NER) is one of the primary fields of Natural Language Processing, focused on analyzing and determining the entities in a given text. In this paper, we present a gold standard named entity dataset for Azerbaijani created from the Azerbaijani portion of WikiAnn, a ‘silver standard’ NER dataset generated from Wikipedia. In a zero-shot cross-lingual transfer scenario, we used an M-BERT-based NER model trained on the English OntoNotes corpus to add new entity types to the corpus. The output of the model was then hand-corrected. We evaluate the accuracy of the original WikiAnn corpus, the zero-shot performance of two models trained on the OntoNotes corpus, and two transformer-based NER models trained on the training part of the final corpus: one based on M-BERT and another based on XLM-RoBERTa. We release the corpus and the trained models to the public. |
---|---|
ISBN: | 303083526X 9783030835262 |
ISSN: | 0302-9743 1611-3349 |
DOI: | 10.1007/978-3-030-83527-9_26 |