A Statistical Analysis of Summarization Evaluation Metrics Using Resampling Methods

Bibliographic Details
Published in	Transactions of the Association for Computational Linguistics, Vol. 9, pp. 1132–1146
Main Authors	Deutsch, Daniel; Dror, Rotem; Roth, Dan
Format	Journal Article
Language	English
Published	MIT Press, One Rogers Street, Cambridge, MA 02142-1209, USA, 27.10.2021
Subjects	Annotations; Automatic summarization; Bootstrapping; Confidence intervals; Correlation; Mathematical analysis; Permutations; Resampling; Simulation; Statistical analysis
Summary:	The quality of a summarization evaluation metric is quantified by calculating the correlation between its scores and human annotations across a large number of summaries. Currently, it is unclear how precise these correlation estimates are, or whether a difference between two metrics’ correlations reflects a true difference or is due to mere chance. In this work, we address these two problems by proposing methods for calculating confidence intervals and running hypothesis tests for correlations using two resampling methods, bootstrapping and permutation. After evaluating which of the proposed methods is most appropriate for summarization through two simulation experiments, we analyze the results of applying these methods to several different automatic evaluation metrics across three sets of human annotations. We find that the confidence intervals are rather wide, demonstrating high uncertainty in the reliability of automatic metrics. Further, although many metrics fail to show statistically significant improvements over ROUGE, two recent works, QAEval and BERTScore, do so in some evaluation settings.
ISSN:	2307-387X
DOI:	10.1162/tacl_a_00417
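
Illustration:	The summary names two resampling techniques, bootstrapping for confidence intervals and permutation for hypothesis tests on correlations. The sketch below shows one generic way each could look for a Pearson correlation between metric scores and human annotations. It is not the authors' procedure: the function names, the choice to resample individual paired observations, the percentile interval, and the z-score standardization before swapping are all assumptions made for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: report 0 when one side is constant
    return sxy / math.sqrt(sxx * syy)


def zscores(x):
    """Standardize scores so two metrics are on a comparable scale."""
    n = len(x)
    m = sum(x) / n
    sd = math.sqrt(sum((a - m) ** 2 for a in x) / n)
    if sd == 0:
        return [0.0] * n
    return [(a - m) / sd for a in x]


def bootstrap_ci(metric, human, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for corr(metric, human).

    Assumption: (metric, human) pairs are resampled with replacement.
    """
    rng = random.Random(seed)
    n = len(metric)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(pearson([metric[i] for i in idx],
                             [human[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


def permutation_test(metric_a, metric_b, human, n_permutations=1000, seed=0):
    """One-sided p-value for H1: corr(a, human) > corr(b, human).

    Assumption: under H0 the two metrics are exchangeable, so their
    standardized scores are swapped per observation with probability 0.5.
    """
    rng = random.Random(seed)
    a, b = zscores(metric_a), zscores(metric_b)
    observed = pearson(a, human) - pearson(b, human)
    count = 0
    for _ in range(n_permutations):
        pa, pb = [], []
        for u, v in zip(a, b):
            if rng.random() < 0.5:
                u, v = v, u
            pa.append(u)
            pb.append(v)
        if pearson(pa, human) - pearson(pb, human) >= observed:
            count += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (count + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Synthetic data only; these are not results from the paper.
    rng = random.Random(1)
    human = [rng.gauss(0, 1) for _ in range(100)]
    strong = [h + rng.gauss(0, 0.1) for h in human]
    weak = [h + rng.gauss(0, 3.0) for h in human]
    print("95% CI for weak metric:", bootstrap_ci(weak, human))
    print("p-value, strong > weak:", permutation_test(strong, weak, human))
```

A wide interval from bootstrap_ci signals an imprecise correlation estimate, and a small p-value from permutation_test suggests the gap between two metrics is unlikely to be due to chance alone. The units that are resampled matter in practice, and this sketch fixes one simple choice.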