Testing the validity of Wikipedia categories for subject matter labelling of open-domain corpus data

Bibliographic Details
Published in	Journal of Information Science, Vol. 48, no. 5, pp. 686-700
Main Authors	Aghaebrahimian, Ahmad; Stauder, Andy; Ustaszewski, Michael
Format	Journal Article
Language	English
Published	London, England: SAGE Publications, 01.10.2022; Bowker-Saur Ltd
Subjects	Documents; Domains; Encyclopedias; Indexing; Knowledge organization; Labeling; Qualitative analysis; Taxonomy; Corpus labelling; Wikipedia; taxonomy; social tagging
Summary:	The Wikipedia category system was designed to enable browsing and navigation of Wikipedia. It is also a useful resource for knowledge organisation and document indexing, especially using automatic approaches. However, it has received little attention as a resource for manual indexing. In this article, a hierarchical taxonomy of three-level depth is extracted from the Wikipedia category system. The resulting taxonomy is explored as a lightweight alternative to expert-created knowledge organisation systems (e.g. library classification systems) for the manual labelling of open-domain text corpora. Combining quantitative and qualitative data from a crowd-based text labelling study, the validity of the taxonomy is tested and the results quantified in terms of interrater agreement. While the usefulness of the Wikipedia category system for automatic document indexing is documented in the pertinent literature, our results suggest that at least the taxonomy we derived from it is not a valid instrument for manual subject matter labelling of open-domain text corpora.
ISSN:	0165-5515; 1741-6485
DOI:	10.1177/0165551520977438
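
The summary states that the study's results are quantified in terms of interrater agreement, but this record does not name the statistic used. As an illustration only, the sketch below computes Fleiss' kappa, one common agreement measure for several raters assigning one category per text. The choice of Fleiss' kappa, the function name `fleiss_kappa`, and the sample labels are assumptions made for this example and are not taken from the article.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of category labels.

    Every item must carry the same number of labels (one per rater).
    This is an illustrative helper, not the procedure used in the article.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_items == 0 or n_raters < 2:
        raise ValueError("need at least one item and two raters")
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of ratings")

    counts = [Counter(item) for item in ratings]
    categories = set().union(*counts)

    # Observed agreement: mean share of agreeing rater pairs per item.
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(p_items) / n_items

    # Chance agreement from the overall share of each category.
    total = n_items * n_raters
    p_expected = sum(
        (sum(c[cat] for c in counts) / total) ** 2 for cat in categories
    )

    if p_expected == 1.0:
        # Only one category was ever used, so kappa is undefined.
        return float("nan")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical labels: three raters assign a top-level category to four texts.
sample = [
    ["Science", "Science", "Science"],
    ["History", "History", "Arts"],
    ["Arts", "Arts", "Arts"],
    ["Science", "History", "Arts"],
]
print(round(fleiss_kappa(sample), 3))
```

Values near 1 indicate agreement well above chance, and values near 0 or below indicate agreement no better than chance. Low values of such a statistic would be in line with the summary's conclusion that the derived taxonomy is not a valid instrument for manual labelling.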