A Database of Slovak News Articles for Boilerplate Removal

Bibliographic Details
Published in: 2024 International Symposium ELMAR, pp. 251-254
Main Authors: Rabekova, Zuzana; Andicsova, Vanesa; Oravec, Milos; Pavlovicova, Jarmila; Hintos, Peter
Format: Conference Proceeding
Language: English
Published: IEEE, 16.09.2024
Summary: Boilerplate text removal is a crucial preprocessing step in many machine-learning tasks, such as text analysis or information extraction. This paper introduces a new database created for boilerplate removal tasks, comprising Slovak articles sourced from thirty distinct news platforms. We present the methodology used for data collection, labelling, and validation. We believe this dataset will facilitate advancements in boilerplate removal techniques and improve the quality and efficiency of various text processing tasks in Slovak language contexts and other low-resource languages.
ISSN: 2835-3781
DOI: 10.1109/ELMAR62909.2024.10694090