Using Zero-Shot Transfer to Initialize azWikiNER, a Gold Standard Named Entity Corpus for the Azerbaijani Language

Bibliographic Details
Published in: Text, Speech, and Dialogue, pp. 305-317
Main Authors: Ibiyev, Kamran; Novak, Attila
Format: Book Chapter
Language: English
Published: Cham: Springer International Publishing
Series: Lecture Notes in Computer Science
Summary: Named Entity Recognition (NER) is one of the primary fields of Natural Language Processing, focused on analyzing and determining the entities in a given text. In this paper, we present a gold standard named entity dataset for Azerbaijani created from the Azerbaijani portion of WikiAnn, a ‘silver standard’ NER dataset generated from Wikipedia. In a zero-shot cross-lingual transfer scenario, we used an M-BERT-based NER model trained on the English Ontonotes corpus to add new entity types to the corpus. The output of the model was then hand-corrected. We evaluate the accuracy of the original WikiAnn corpus, the zero-shot performance of two models trained on the Ontonotes corpus, and two transformer-based NER models trained on the training part of the final corpus: one based on M-BERT and another based on XLM-RoBERTa. We release the corpus and the trained models to the public.
ISBN: 303083526X, 9783030835262
ISSN: 0302-9743, 1611-3349
DOI: 10.1007/978-3-030-83527-9_26