Canine: Pre-training an Efficient Tokenization-Free Encoder for Language Representation
Published in | Transactions of the Association for Computational Linguistics Vol. 10; pp. 73 - 91 |
---|---|
Main Authors | Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting |
Format | Journal Article |
Language | English |
Published | MIT Press, One Rogers Street, Cambridge, MA 02142-1209, USA, 31.01.2022 |
Summary: | Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model’s ability to adapt. In this paper, we present Canine, a neural encoder that operates directly on character sequences—without explicit tokenization or vocabulary—and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias. To use its finer-grained input effectively and efficiently, Canine combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. Canine outperforms a comparable mBERT model by 5.7 F1 on TyDi QA, a challenging multilingual benchmark, despite having fewer model parameters. |
ISSN: | 2307-387X |
DOI: | 10.1162/tacl_a_00448 |
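
The summary above describes the core mechanism: the model consumes raw characters, downsamples the sequence to reduce its length, and then applies a deep transformer stack to encode context. Below is a minimal illustrative sketch of that idea in PyTorch. It is not the paper's actual implementation: the hash-bucket vocabulary size, hidden size, layer count, and the use of a single strided convolution for downsampling are assumptions made only for this example.

```python
import torch
import torch.nn as nn


class CharDownsamplingEncoder(nn.Module):
    """Sketch: embed characters, downsample the sequence, run a deep transformer stack."""

    def __init__(self, num_char_buckets=16384, hidden=768, layers=12,
                 heads=12, downsample_rate=4):
        super().__init__()
        # Characters (e.g. Unicode codepoints hashed into buckets) are embedded
        # directly; there is no subword vocabulary.
        self.char_embed = nn.Embedding(num_char_buckets, hidden)
        # Downsampling: a strided convolution shortens the sequence by
        # `downsample_rate`, so the deep stack runs over fewer positions.
        self.downsample = nn.Conv1d(hidden, hidden,
                                    kernel_size=downsample_rate,
                                    stride=downsample_rate)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads,
                                           batch_first=True)
        # Deep transformer stack encodes context over the shortened sequence.
        self.deep_stack = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, char_ids):
        # char_ids: (batch, char_seq_len) integer character ids.
        x = self.char_embed(char_ids)                            # (B, L, H)
        x = self.downsample(x.transpose(1, 2)).transpose(1, 2)   # (B, L // rate, H)
        return self.deep_stack(x)


if __name__ == "__main__":
    model = CharDownsamplingEncoder()
    chars = torch.randint(0, 16384, (2, 128))  # 128 characters per example
    print(model(chars).shape)                  # torch.Size([2, 32, 768])
```

The point of the downsampling step is the efficiency trade-off the summary alludes to: self-attention cost grows quadratically with sequence length, so shortening a character sequence before the deep stack keeps character-level input affordable while the deep layers still model context over the compressed positions.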