CEval: A Benchmark for Evaluating Counterfactual Text Generation


Bibliographic Details
Main Authors: Nguyen, Van Bach; Schlötterer, Jörg; Seifert, Christin
Format: Journal Article
Language: English
Published: 26.04.2024

Summary: (INLG 2024) Counterfactual text generation aims to minimally change a text such that it is classified differently. Judging advancements in method development for counterfactual text generation is hindered by the non-uniform usage of datasets and metrics in related work. We propose CEval, a benchmark for comparing counterfactual text generation methods. CEval unifies counterfactual and text quality metrics, and includes common counterfactual datasets with human annotations, standard baselines (MICE, GDBA, CREST), and the open-source language model LLAMA-2. Our experiments found no perfect method for generating counterfactual text: methods that excel at counterfactual metrics often produce lower-quality text, while LLMs with simple prompts generate high-quality text but struggle with counterfactual criteria. By making CEval available as an open-source Python library, we encourage the community to contribute more methods and maintain consistent evaluation in future work.
DOI: 10.48550/arxiv.2404.17475
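
As a rough illustration of the trade-off the summary describes, the sketch below computes two metrics of the kind such a benchmark unifies: label flip rate (does the edit change the classifier's prediction?) and a normalized token edit distance (how minimal is the edit?). This is a hypothetical Python sketch; the function names and metric definitions are assumptions for illustration, not CEval's actual API.

    from difflib import SequenceMatcher

    def flip_rate(labels_orig, labels_cf):
        # Fraction of counterfactuals whose predicted label differs
        # from the prediction on the original text.
        flips = sum(o != c for o, c in zip(labels_orig, labels_cf))
        return flips / len(labels_orig)

    def token_edit_distance(text_orig, text_cf):
        # Normalized token-level distance in [0, 1]; lower means a more
        # minimal edit. Illustrative choice, not CEval's exact metric.
        a, b = text_orig.split(), text_cf.split()
        return 1.0 - SequenceMatcher(None, a, b).ratio()

    # A one-word edit that flips a sentiment label scores well on both:
    # the label changes (flip rate 1.0) and the edit stays small (0.25).
    print(flip_rate(["positive"], ["negative"]))                              # 1.0
    print(token_edit_distance("the movie was great", "the movie was awful"))  # 0.25

A method can trade these off, which matches the summary's finding: a model that rewrites the whole text may flip the label reliably but score poorly on edit minimality and text quality.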