Linguistically Motivated Vocabulary Reduction for Neural Machine Translation from Turkish to English
The Prague Bulletin of Mathematical Linguistics. No. 108, 2017, pp. 331-342
Main Authors | , , , |
---|---|
Format | Journal Article |
Language | English |
Published | 31.07.2017 |
Summary: | The necessity of using a fixed-size word vocabulary in order to control the
model complexity in state-of-the-art neural machine translation (NMT) systems
is an important bottleneck on performance, especially for morphologically rich
languages. Conventional methods that aim to overcome this problem by using
sub-word or character-level representations rely solely on statistics and
disregard the linguistic properties of words, which leads to interruptions in
the word structure and causes semantic and syntactic losses. In this paper, we
propose a new vocabulary reduction method for NMT, which can reduce the
vocabulary of a given input corpus at any rate while also considering the
morphological properties of the language. Our method is based on unsupervised
morphology learning and can be, in principle, used for pre-processing any
language pair. We also present an alternative word segmentation method based on
supervised morphological analysis, which aids us in measuring the accuracy of
our model. We evaluate our method on a Turkish-to-English NMT task where the
input language is morphologically rich and agglutinative. We analyze different
representation methods in terms of translation accuracy as well as the semantic
and syntactic properties of the generated output. Our method obtains a
significant improvement of 2.3 BLEU points over the conventional vocabulary
reduction technique, showing that it can provide better accuracy in open
vocabulary translation of morphologically rich languages. |
---|---|
DOI: | 10.48550/arxiv.1707.09879 |