Source-Side Suffix Stripping for Bengali-to-English SMT

Bibliographic Details
Published in	2012 International Conference on Asian Language Processing (IALP), pp. 193-196
Main Authors	Haque, R.; Penkale, Sergio; Jiang, Jie; Way, Andy
Format	Conference Proceeding
Language	English
Published	IEEE 01.11.2012
Subjects	Accuracy; Computational linguistics; morphological segmentation; Morphology; Separation processes; statistical machine translation; Surface morphology; Training; Vocabulary
ISBN	9781467361132 1467361135
DOI	10.1109/IALP.2012.61

Summary:	Data sparseness is a well-known problem for statistical machine translation (SMT) when morphologically rich and highly inflected languages are involved. This problem becomes worse in resource-scarce scenarios where sufficient parallel corpora are not available for model training. Recent research has shown that morphological segmentation can be employed on either side of the translation pair to reduce data sparsity. In this work, we consider a highly inflected Indian language, Bengali, as the source side of the translation pair. This paper presents a study of morphological segmentation in SMT with a less explored translation pair, Bengali-to-English. We worked with a tiny training set available for this language pair. We employ a simple suffix-stripping method for lemmatizing inflected Bengali words. We show that our morphological suffix separation process significantly reduces data sparseness. We also show that an SMT model trained on suffix-stripped (source) training data significantly outperforms the state-of-the-art phrase-based SMT (PB-SMT) baseline.
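
The summary does not spell out the suffix-stripping procedure, so the following is only a minimal sketch of what a longest-match suffix separation step might look like. Everything in it is an assumption made for illustration: the suffix list is a handful of common Bengali inflectional endings rather than the inventory used in the paper, and the names `SUFFIXES`, `MIN_STEM_LEN`, `strip_suffix`, `segment_sentence`, and `vocabulary_size` are hypothetical.

```python
# Illustrative suffix list only (an assumption); NOT the paper's inventory.
SUFFIXES = ["গুলো", "দের", "কে", "রা", "টি", "টা", "ের", "র"]

# Assumed guard: never strip a suffix if the remaining stem would be
# shorter than this many code points.
MIN_STEM_LEN = 2


def strip_suffix(word, suffixes=SUFFIXES, min_stem_len=MIN_STEM_LEN):
    """Split `word` into (stem, suffix) using longest-match stripping.

    Returns (word, "") when no listed suffix applies.
    """
    for suffix in sorted(suffixes, key=len, reverse=True):
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            return stem, suffix
    return word, ""


def segment_sentence(sentence):
    """Rewrite a whitespace-tokenized sentence with suffixes separated.

    Each stripped suffix becomes its own token, marked with a leading "+".
    """
    tokens = []
    for word in sentence.split():
        stem, suffix = strip_suffix(word)
        tokens.append(stem)
        if suffix:
            tokens.append("+" + suffix)
    return " ".join(tokens)


def vocabulary_size(sentences):
    """Count distinct token types across whitespace-tokenized sentences."""
    return len({token for s in sentences for token in s.split()})


if __name__ == "__main__":
    # Toy word forms built from two stems ("boy", "teacher") for illustration.
    stems = ["ছেলে", "শিক্ষক"]
    endings = ["", "রা", "দের", "কে"]
    corpus = [stem + ending for stem in stems for ending in endings]

    segmented = [segment_sentence(s) for s in corpus]
    print("types before:", vocabulary_size(corpus))
    print("types after: ", vocabulary_size(segmented))
```

On a toy corpus like this one, the inflected forms of each stem collapse onto a shared stem token plus a small set of suffix tokens, which is the kind of reduction in distinct word types that the summary refers to as reducing data sparseness. A real system would need a linguistically motivated suffix inventory and would apply the same segmentation to the source side of both training and test data.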