A Dataset for Content Error Detection in Web Archives

Bibliographic Details
Published in: 2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL), pp. 349-350
Main Authors: Kiesel, Johannes; Hubricht, Fabienne; Stein, Benno; Potthast, Martin
Format: Conference Proceeding
Language: English
Published: IEEE, 01.06.2019
Summary: Archiving modern web pages is challenging, and a clear concept of possible errors is still missing. To further improve current web archiving technology, this paper introduces the concept of content errors, which refers to web pages whose archived versions have unexpected content different from their originals. This paper presents the first large-scale analysis for content errors of a web crawl of 10,000 pages, the Webis Web Archive 2017. Using manual inspection and small annotation studies, we identified 5 different classes of content errors, and then annotated the entire crawl for these classes using crowdsourcing: error messages (4.5% of pages), pop-ups (3.9%), pages that largely consist of advertisements (1.1%), CAPTCHAs (0.8%), and loading indicators (0.5%). Combined, about 10% of pages are affected by content errors, which underlines the relevance of the problem.
DOI: 10.1109/JCDL.2019.00065
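
To make the figures in the summary concrete, the following minimal sketch converts the reported per-class percentages into approximate page counts and sums them. It assumes the crawl contains exactly 10,000 pages and that the percentages are rounded values; the names `PREVALENCE_PERCENT`, `approximate_counts`, and `upper_bound_percent` are illustrative and do not come from the paper or its dataset. The plain sum (10.8%) is only an upper bound on the share of affected pages, since a page may fall into more than one class, which would be consistent with the "about 10%" stated in the summary.

```python
# Per-class prevalence as reported in the summary (percent of pages).
PREVALENCE_PERCENT = {
    "error messages": 4.5,
    "pop-ups": 3.9,
    "advertisements": 1.1,
    "CAPTCHAs": 0.8,
    "loading indicators": 0.5,
}

# Assumption: the crawl holds exactly 10,000 pages.
TOTAL_PAGES = 10_000


def approximate_counts(prevalence, total_pages):
    """Convert percentages into approximate page counts."""
    return {
        label: round(total_pages * percent / 100)
        for label, percent in prevalence.items()
    }


def upper_bound_percent(prevalence):
    """Sum the class percentages.

    This is an upper bound on the share of affected pages,
    because classes may overlap on the same page.
    """
    return round(sum(prevalence.values()), 1)


counts = approximate_counts(PREVALENCE_PERCENT, TOTAL_PAGES)
bound = upper_bound_percent(PREVALENCE_PERCENT)

for label, count in counts.items():
    print(f"{label}: about {count} pages")
print(f"Upper bound on affected pages: {bound}%")
```

Under these assumptions, the largest class (error messages) corresponds to roughly 450 pages and the smallest (loading indicators) to roughly 50.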