A Dataset for Content Error Detection in Web Archives
Published in | 2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL) pp. 349 - 350 |
---|---|
Main Authors | , , , |
Format | Conference Proceeding |
Language | English |
Published | IEEE, 01.06.2019 |
Summary: | Archiving modern web pages is challenging, and a clear concept of possible errors is still missing. To further improve current web archiving technology, this paper introduces the concept of content errors, which occur when the archived version of a web page has unexpected content that differs from the original. This paper presents the first large-scale analysis of a web crawl of 10,000 pages, the Webis Web Archive 2017, for content errors. Using manual inspection and small annotation studies, we identified five classes of content errors and then annotated the entire crawl for these classes using crowdsourcing: error messages (4.5% of pages), pop-ups (3.9%), pages that largely consist of advertisements (1.1%), CAPTCHAs (0.8%), and loading indicators (0.5%). Combined, about 10% of pages are affected by content errors, which underlines the relevance of the problem. |
---|---|
DOI: | 10.1109/JCDL.2019.00065 |
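
The arithmetic behind the combined figure in the summary can be sketched as follows. This is only an illustration based on the rounded percentages quoted in the summary, not on the dataset itself: the page counts are approximations, and the helper names (`approximate_counts`, `union_bounds`) are hypothetical. Whether a page can carry more than one class is not stated in the record, so the sketch reports only a lower and an upper bound for the share of affected pages.

```python
# Shares reported in the summary (percent of pages per content error class).
CRAWL_SIZE = 10_000  # pages in the crawl, as stated in the summary

reported_shares = {
    "error messages": 4.5,
    "pop-ups": 3.9,
    "advertisements": 1.1,
    "CAPTCHAs": 0.8,
    "loading indicators": 0.5,
}


def approximate_counts(shares, total):
    """Convert rounded percentages into approximate page counts."""
    return {label: round(total * pct / 100) for label, pct in shares.items()}


def union_bounds(shares):
    """Bound the share of pages affected by at least one class.

    Lower bound: the largest single class (all others fully overlap it).
    Upper bound: the plain sum (no page carries two classes).
    """
    lower = max(shares.values())
    upper = round(sum(shares.values()), 1)
    return lower, upper


counts = approximate_counts(reported_shares, CRAWL_SIZE)
lower, upper = union_bounds(reported_shares)

for label, n in counts.items():
    print(f"{label}: about {n} pages")
print(f"affected pages: between {lower}% and {upper}%")
```

Under these assumptions the upper bound is the plain sum of the five shares, which is in line with the "about 10%" stated in the summary.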