Using wavelet analysis for text categorization in digital libraries: a first experiment with Strathprints

Bibliographic Details
Published in International journal on digital libraries Vol. 12; no. 1; pp. 3-12
Main Authors Darányi, Sándor; Wittek, Peter; Dobreva, Milena
Format Journal Article
Language English
Published Berlin/Heidelberg Springer-Verlag 01.07.2012
Springer Nature B.V
Summary: Digital libraries increasingly benefit from research on automated text categorization for improved access. Such research is typically carried out by means of standard test collections. In this article, we present a pilot experiment that replaces such test collections with a set of 6,000 objects from a real-world digital repository, indexed by Library of Congress Subject Headings, and test support vector machines in a supervised learning setting for their ability to reproduce the existing classification. To augment the standard approach, we introduce a combination of two novel elements: using functions for document content representation in Hilbert space, and adding extra semantics from lexical resources to the representation. Results suggest that wavelet-based kernels slightly outperformed traditional kernels when reconstructing the classification from abstracts, whereas traditional kernels slightly outperformed wavelet-based kernels on full-text documents, the latter outcome being due to word sense ambiguity. The practical implementation of our methodological framework enhances the analysis and representation of specific knowledge relevant to large-scale digital collections, in this case the thematic coverage of the collections. Representation of specific knowledge about digital collections is one of the basic elements of persistent archives, and a less studied one than the representation of digital objects and collections. Our research is an initial step in this direction: it develops the methodological approach further and demonstrates that text categorization can be applied to analyse the thematic coverage of digital repositories.
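Illustration: the contrast between a traditional kernel and a function-based kernel can be sketched in a few lines of Python. This is a toy sketch under stated assumptions, not the authors' implementation: the vocabulary, its semantic ordering, and the names `ORDERED_VOCAB`, `to_vector`, `linear_kernel` and `box_kernel` are all hypothetical, and a simple box function (the Haar scaling function, up to normalisation) stands in for whatever wavelet family the article uses. Each term is spread over `width` neighbouring positions in the ordering, and the kernel is the inner product of the resulting step functions, so documents that share no terms can still be similar when their terms are neighbours.

```python
from typing import Dict, List

# Hypothetical vocabulary, ordered so that related terms are adjacent.
# In practice such an ordering would be derived from a lexical resource;
# this one is made up for illustration.
ORDERED_VOCAB: List[str] = ["cat", "kitten", "dog", "puppy", "bond", "stock"]
POSITION: Dict[str, int] = {term: i for i, term in enumerate(ORDERED_VOCAB)}


def to_vector(tokens: List[str]) -> List[float]:
    """Raw term-frequency vector over the ordered vocabulary."""
    vec = [0.0] * len(ORDERED_VOCAB)
    for tok in tokens:
        if tok in POSITION:
            vec[POSITION[tok]] += 1.0
    return vec


def linear_kernel(u: List[float], v: List[float]) -> float:
    """Traditional kernel: plain dot product of term vectors."""
    return sum(a * b for a, b in zip(u, v))


def box_overlap(distance: int, width: int) -> float:
    """Normalised overlap of two boxes of length `width` whose left
    edges are `distance` positions apart (1 when aligned, 0 when disjoint)."""
    return max(0.0, 1.0 - abs(distance) / width)


def box_kernel(u: List[float], v: List[float], width: int = 2) -> float:
    """Inner product of the step functions obtained by replacing each
    term weight with a box of length `width` at the term's position."""
    total = 0.0
    for i, a in enumerate(u):
        if a == 0.0:
            continue
        for j, b in enumerate(v):
            total += a * b * box_overlap(i - j, width)
    return total


doc_a = to_vector(["cat", "cat", "stock"])
doc_b = to_vector(["kitten", "bond"])
print(linear_kernel(doc_a, doc_b))        # no shared terms
print(box_kernel(doc_a, doc_b, width=2))  # neighbouring terms overlap
```

With `width=1` the boxes never overlap across positions, so `box_kernel` reduces to `linear_kernel`. A kernel of this kind could be supplied to a support vector machine as a precomputed Gram matrix; that step is omitted here.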
ISSN: 1432-5012, 1432-1300
DOI: 10.1007/s00799-012-0079-y