A Dataset for Content Error Detection in Web Archives

Bibliographic Details
Published in	2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL), pp. 349-350
Main Authors	Kiesel, Johannes; Hubricht, Fabienne; Stein, Benno; Potthast, Martin
Format	Conference Proceeding
Language	English
Published	IEEE 01.06.2019
Subjects	captcha; content error; crowdsourcing; dataset; error detection; loading; pop up; soft 404; web archive

More Information
Summary:	Archiving modern web pages is challenging, and a clear concept of possible errors is still missing. To further improve current web archiving technology, this paper introduces the concept of content errors, which refers to web pages whose archived versions have unexpected content different from their originals. This paper presents the first large-scale analysis of a web crawl of 10,000 pages, the Webis Web Archive 2017, for content errors. Using manual inspection and small annotation studies, we identified five classes of content errors and then annotated the entire crawl for these classes using crowdsourcing: error messages (4.5% of pages), pop-ups (3.9%), pages that largely consist of advertisements (1.1%), CAPTCHAs (0.8%), and loading indicators (0.5%). Combined, about 10% of pages are affected by content errors, which underlines the relevance of the problem.
DOI:	10.1109/JCDL.2019.00065