Beyond the English Web: Zero-Shot Cross-Lingual and Lightweight Monolingual Classification of Registers
We explore cross-lingual transfer of register classification for web documents. Registers, that is, text varieties such as blogs or news are one of the primary predictors of linguistic variation and thus affect the automatic processing of language. We introduce two new register annotated corpora, Fr...
Saved in:
Main Authors | , , , , , , , , , |
---|---|
Format | Journal Article |
Language | English |
Published |
15.02.2021
|
Subjects | |
Online Access | Get full text |
Cover
Loading…
Summary: | We explore cross-lingual transfer of register classification for web
documents. Registers, that is, text varieties such as blogs or news are one of
the primary predictors of linguistic variation and thus affect the automatic
processing of language. We introduce two new register annotated corpora,
FreCORE and SweCORE, for French and Swedish. We demonstrate that deep
pre-trained language models perform strongly in these languages and outperform
previous state-of-the-art in English and Finnish. Specifically, we show 1) that
zero-shot cross-lingual transfer from the large English CORE corpus can match
or surpass previously published monolingual models, and 2) that lightweight
monolingual classification requiring very little training data can reach or
surpass our zero-shot performance. We further analyse classification results
finding that certain registers continue to pose challenges in particular for
cross-lingual transfer. |
---|---|
DOI: | 10.48550/arxiv.2102.07396 |