Entity Linking in 100 Languages

Bibliographic Details
Published in: arXiv.org
Main Authors: Botha, Jan A.; Shan, Zifei; Gillick, Daniel
Format: Paper
Language: English
Published: Ithaca: Cornell University Library, arXiv.org, 05.11.2020

More Information
Summary: We propose a new formulation for multilingual entity linking, where language-specific mentions resolve to a language-agnostic Knowledge Base. We train a dual encoder in this new setting, building on prior work with improved feature representation, negative mining, and an auxiliary entity-pairing task, to obtain a single entity retrieval model that covers 100+ languages and 20 million entities. The model outperforms state-of-the-art results from a far more limited cross-lingual linking task. Rare entities and low-resource languages pose challenges at this large scale, so we advocate for an increased focus on zero- and few-shot evaluation. To this end, we provide Mewsli-9, a large new multilingual dataset (http://goo.gle/mewsli-dataset) matched to our setting, and show how frequency-based analysis provided key insights for our model and training enhancements.
ISSN: 2331-8422
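
To make the summary's setup concrete, the sketch below shows a minimal dual-encoder retrieval model trained with in-batch softmax negatives. It is an illustrative toy under stated assumptions, not the paper's implementation: the bag-of-embeddings towers stand in for the paper's BERT-based mention and entity encoders, plain in-batch negatives stand in for its negative mining, and every name, dimension, and hyperparameter here is hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    """Two towers map mentions and entities into a shared embedding
    space; retrieval scores are dot products between the two sides."""
    def __init__(self, vocab_size: int, dim: int = 128):
        super().__init__()
        # Toy mean-pooled embedding towers (placeholders for BERT encoders).
        self.mention_tower = nn.EmbeddingBag(vocab_size, dim)
        self.entity_tower = nn.EmbeddingBag(vocab_size, dim)

    def forward(self, mention_ids, entity_ids):
        # L2-normalize so dot products behave like cosine similarities.
        m = F.normalize(self.mention_tower(mention_ids), dim=-1)
        e = F.normalize(self.entity_tower(entity_ids), dim=-1)
        return m, e

def in_batch_softmax_loss(m, e):
    # Each mention's positive is its paired entity; every other entity
    # in the batch serves as a negative (diagonal entries are positives).
    logits = m @ e.t()                    # [batch, batch] similarity matrix
    labels = torch.arange(m.size(0))      # index of each row's positive
    return F.cross_entropy(logits, labels)

# Toy usage: a batch of 4 (mention, entity) pairs over a 1000-id vocab.
model = DualEncoder(vocab_size=1000)
mentions = torch.randint(0, 1000, (4, 8))   # 8 token ids per mention context
entities = torch.randint(0, 1000, (4, 8))   # 8 token ids per entity description
m, e = model(mentions, entities)
loss = in_batch_softmax_loss(m, e)
loss.backward()

Because the entity tower is independent of the mention tower, entity embeddings can be precomputed once and queried with approximate nearest-neighbor search at inference time, which is what makes retrieval over tens of millions of entities tractable in a setup like the one the summary describes.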