Data augmentation techniques in natural language processing

Bibliographic Details
Published in: Applied Soft Computing, Vol. 132, p. 109803
Main Authors: Pellicer, Lucas Francisco Amaral Orosco; Ferreira, Taynan Maier; Costa, Anna Helena Reali
Format: Journal Article
Language: English
Published: Elsevier B.V., 01.01.2023

Summary: Data Augmentation (DA) methods – a family of techniques designed for the synthetic generation of training data – have shown remarkable results in various Deep Learning and Machine Learning tasks. Despite their widespread and successful adoption within the computer vision community, DA techniques designed for natural language processing (NLP) tasks have advanced much more slowly and have had limited success in achieving performance gains. As a consequence, with the exception of back-translation applied to machine translation tasks, these techniques have not been thoroughly explored by the wider NLP community. Recent research on the subject still lacks a proper practical understanding of the relationships among the various existing DA methods. The connection between DA methods and several important aspects of their outputs, such as lexical diversity and semantic fidelity, is also still poorly understood. In this work, we perform a comprehensive study of NLP DA techniques, comparing their relative performance under different settings. We analyze the quality of the synthetic data generated, evaluate its performance gains, and compare all of these aspects to previously existing DA procedures.

Highlights:
- This article compares Data Augmentation techniques for texts.
- This article demonstrates lexical diversity and semantic fidelity in techniques.
- Back Translation Algorithms and Paraphrasers exhibit similar behavior.
- LAMBADA Data Augmentation leads to greater diversity generation and low fidelity.
- With more data, heavy algorithms do not pay off compared to light ones.
ISSN:1568-4946
DOI:10.1016/j.asoc.2022.109803
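
For readers unfamiliar with back-translation – the one DA method the summary notes has seen wide adoption in machine translation – the following is a minimal sketch of the idea. The tiny lookup tables here are toy stand-ins for real machine-translation models (an assumption for illustration only; the article itself evaluates real systems):

```python
# Toy back-translation sketch: translate a sentence into a pivot
# language and back, producing a paraphrase to use as augmented
# training data. The dictionaries below are stand-ins for real MT
# models (hypothetical, for illustration only).

EN_TO_DE = {"the movie was great": "der Film war großartig"}
DE_TO_EN = {"der Film war großartig": "the film was fantastic"}

def back_translate(sentence: str) -> str:
    """Round-trip through the pivot language; unknown inputs pass through."""
    pivot = EN_TO_DE.get(sentence, sentence)
    return DE_TO_EN.get(pivot, sentence)

original = "the movie was great"
augmented = back_translate(original)
# The paraphrase keeps the meaning (semantic fidelity) while
# changing the surface form (lexical diversity) -- the two output
# properties the article measures.
```

With real MT models in place of the lookup tables, each round trip yields a label-preserving paraphrase, which is why back-translation is the baseline the article compares other DA techniques against.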