Canine: Pre-training an Efficient Tokenization-Free Encoder for Language Representation
Published in | Transactions of the Association for Computational Linguistics Vol. 10; pp. 73 - 91 |
---|---|
Main Authors | Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting |
Format | Journal Article |
Language | English |
Published | MIT Press, One Rogers Street, Cambridge, MA 02142-1209, USA, 31.01.2022 |
Summary: | Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model’s ability to adapt. In this paper, we present Canine, a neural encoder that operates directly on character sequences—without explicit tokenization or vocabulary—and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias. To use its finer-grained input effectively and efficiently, Canine combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. Canine outperforms a comparable mBERT model by 5.7 F1 on TyDi QA, a challenging multilingual benchmark, despite having fewer model parameters. |
ISSN: | 2307-387X |
DOI: | 10.1162/tacl_a_00448 |
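
The summary above describes the core mechanism: the model consumes raw characters, downsamples the sequence to reduce its length, and then applies a deep transformer stack to encode context. Below is a minimal illustrative sketch of that idea in PyTorch. It is not the paper's actual implementation: the hash-bucket vocabulary size, hidden size, layer count, and the use of a single strided convolution for downsampling are assumptions made only for this example.

```python
import torch
import torch.nn as nn


class CharDownsamplingEncoder(nn.Module):
    """Sketch: embed characters, downsample the sequence, run a deep transformer stack."""

    def __init__(self, num_char_buckets=16384, hidden=768, layers=12,
                 heads=12, downsample_rate=4):
        super().__init__()
        # Characters (e.g. Unicode codepoints hashed into buckets) are embedded
        # directly; there is no subword vocabulary.
        self.char_embed = nn.Embedding(num_char_buckets, hidden)
        # Downsampling: a strided convolution shortens the sequence by
        # `downsample_rate`, so the deep stack runs over fewer positions.
        self.downsample = nn.Conv1d(hidden, hidden,
                                    kernel_size=downsample_rate,
                                    stride=downsample_rate)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads,
                                           batch_first=True)
        # Deep transformer stack encodes context over the shortened sequence.
        self.deep_stack = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, char_ids):
        # char_ids: (batch, char_seq_len) integer character ids.
        x = self.char_embed(char_ids)                            # (B, L, H)
        x = self.downsample(x.transpose(1, 2)).transpose(1, 2)   # (B, L // rate, H)
        return self.deep_stack(x)


if __name__ == "__main__":
    model = CharDownsamplingEncoder()
    chars = torch.randint(0, 16384, (2, 128))  # 128 characters per example
    print(model(chars).shape)                  # torch.Size([2, 32, 768])
```

The point of the downsampling step is the efficiency trade-off the summary alludes to: self-attention cost grows quadratically with sequence length, so shortening a character sequence before the deep stack keeps character-level input affordable while the deep layers still model context over the compressed positions.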