Testing the validity of Wikipedia categories for subject matter labelling of open-domain corpus data

Bibliographic Details
Published in: Journal of information science, Vol. 48, no. 5, pp. 686-700
Main Authors: Aghaebrahimian, Ahmad; Stauder, Andy; Ustaszewski, Michael
Format: Journal Article
Language: English
Published: London, England, SAGE Publications, 01.10.2022
Bowker-Saur Ltd
Summary: The Wikipedia category system was designed to enable browsing and navigation of Wikipedia. It is also a useful resource for knowledge organisation and document indexing, especially using automatic approaches. However, it has received little attention as a resource for manual indexing. In this article, a hierarchical taxonomy of three-level depth is extracted from the Wikipedia category system. The resulting taxonomy is explored as a lightweight alternative to expert-created knowledge organisation systems (e.g. library classification systems) for the manual labelling of open-domain text corpora. Combining quantitative and qualitative data from a crowd-based text labelling study, the validity of the taxonomy is tested and the results quantified in terms of interrater agreement. While the usefulness of the Wikipedia category system for automatic document indexing is documented in the pertinent literature, our results suggest that at least the taxonomy we derived from it is not a valid instrument for manual subject matter labelling of open-domain text corpora.
ISSN: 0165-5515, 1741-6485
DOI: 10.1177/0165551520977438