Testing the validity of Wikipedia categories for subject matter labelling of open-domain corpus data
Published in | Journal of Information Science, Vol. 48, no. 5, pp. 686-700 |
---|---|
Main Authors | , , |
Format | Journal Article |
Language | English |
Published | London, England; SAGE Publications; 01.10.2022; Bowker-Saur Ltd |
Summary: | The Wikipedia category system was designed to enable browsing and navigation of Wikipedia. It is also a useful resource for knowledge organisation and document indexing, especially using automatic approaches. However, it has received little attention as a resource for manual indexing. In this article, a hierarchical taxonomy of three-level depth is extracted from the Wikipedia category system. The resulting taxonomy is explored as a lightweight alternative to expert-created knowledge organisation systems (e.g. library classification systems) for the manual labelling of open-domain text corpora. Combining quantitative and qualitative data from a crowd-based text labelling study, the validity of the taxonomy is tested and the results quantified in terms of interrater agreement. While the usefulness of the Wikipedia category system for automatic document indexing is documented in the pertinent literature, our results suggest that at least the taxonomy we derived from it is not a valid instrument for manual subject matter labelling of open-domain text corpora. |
---|---|
ISSN: | 0165-5515, 1741-6485 |
DOI: | 10.1177/0165551520977438 |