A Database of Slovak News Articles for Boilerplate Removal
Published in | 2024 International Symposium ELMAR pp. 251 - 254 |
---|---|
Main Authors | , , , , |
Format | Conference Proceeding |
Language | English |
Published | IEEE, 16.09.2024 |
Subjects | |
Summary: | Boilerplate text removal is a crucial preprocessing step in many machine-learning tasks, such as text analysis or information extraction. This paper introduces a new database created for boilerplate removal tasks, comprising Slovak articles sourced from thirty distinct news platforms. We present the methodology used for data collection, labelling, and validation. We believe this dataset will facilitate advancements in boilerplate removal techniques and improve the quality and efficiency of various text processing tasks in Slovak language contexts and other low-resource languages. |
ISSN: | 2835-3781 |
DOI: | 10.1109/ELMAR62909.2024.10694090 |