MLSUM: The Multilingual Summarization Corpus

Bibliographic Details
Published in	arXiv.org
Main Authors	Scialom, Thomas; Dray, Paul-Alexis; Lamprier, Sylvain; Piwowarski, Benjamin; Staiano, Jacopo
Format	Paper
Language	English
Published	Ithaca: Cornell University Library, arXiv.org, 30.04.2020
Subjects	Datasets; Multilingualism

Summary:	We present MLSUM, the first large-scale MultiLingual SUMmarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five languages: French, German, Spanish, Russian, and Turkish. Together with English newspapers from the popular CNN/Daily Mail dataset, the collected data form a large-scale multilingual dataset that can enable new research directions for the text summarization community. We report cross-lingual comparative analyses based on state-of-the-art systems. These analyses highlight existing biases, which motivate the use of a multilingual dataset.
ISSN:	2331-8422