Named Entity Recognition for Icelandic: Annotated Corpus and Models

Named entity recognition (NER) can be a challenging task, especially in highly inflected languages where each entity can have many different surface forms. We have created the first NER corpus for Icelandic by annotating 48,371 named entities (NEs) using eight NE types, in a text corpus of 1 million...

Full description

Saved in:
Bibliographic Details
Published inStatistical Language and Speech Processing Vol. 12379; pp. 46 - 57
Main Authors Ingólfsdóttir, Svanhvít L., Guðjónsson, Ásmundur A., Loftsson, Hrafn
Format Book Chapter
LanguageEnglish
Published Switzerland Springer International Publishing AG 2020
Springer International Publishing
SeriesLecture Notes in Computer Science
Subjects
Online AccessGet full text

Cover

Loading…
More Information
Summary:Named entity recognition (NER) can be a challenging task, especially in highly inflected languages where each entity can have many different surface forms. We have created the first NER corpus for Icelandic by annotating 48,371 named entities (NEs) using eight NE types, in a text corpus of 1 million tokens. Furthermore, we have used the corpus to train three machine learning models: first, a CRF model that makes use of shallow word features and a gazetteer function; second, a perceptron model with shallow word features and externally trained word clusters; and third, a BiLSTM model with external word embeddings. Finally, we applied simple voting to combine the model outputs. The voting method obtains an \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_{1}$$\end{document} score of 85.79, gaining 1.89 points compared to the best performing individual model. The corpus and the models are publicly available.
ISBN:9783030594299
3030594297
ISSN:0302-9743
1611-3349
DOI:10.1007/978-3-030-59430-5_4