Using wavelet analysis for text categorization in digital libraries: a first experiment with Strathprints

Bibliographic Details
Published in	International journal on digital libraries Vol. 12; no. 1; pp. 3 - 12
Main Authors	Darányi, Sándor; Wittek, Peter; Dobreva, Milena
Format	Journal Article
Language	English
Published	Berlin/Heidelberg Springer-Verlag 01.07.2012 Springer Nature B.V
Subjects	Artificial intelligence; Classification; Computer Science; Database Management; Digital libraries; Information Systems and Communication Service; Library collections; Wavelet transforms; Support vector machines; Wavelet analysis; Text categorization; Analogical information representation; Machine learning
Summary:	Digital libraries increasingly benefit from research on automated text categorization for improved access. Such research is typically carried out on standard test collections. In this article, we present a pilot experiment that replaces such test collections with a set of 6,000 objects from a real-world digital repository, indexed by Library of Congress Subject Headings, and tests support vector machines in a supervised learning setting for their ability to reproduce the existing classification. To augment the standard approach, we introduce a combination of two novel elements: using functions for document content representation in Hilbert space, and adding extra semantics from lexical resources to the representation. Results suggest that wavelet-based kernels slightly outperformed traditional kernels when reconstructing the classification from abstracts, whereas the reverse held for full-text documents, the latter outcome being due to word sense ambiguity. The practical implementation of our methodological framework enhances the analysis and representation of specific knowledge relevant to large-scale digital collections, in this case their thematic coverage. Representing specific knowledge about digital collections is one of the basic elements of persistent archives, and it is less studied than the representation of digital objects and collections. Our research is an initial step in this direction: it develops the methodological approach further and demonstrates that text categorization can be applied to analyse thematic coverage in digital repositories.
ISSN:	1432-5012; 1432-1300
DOI:	10.1007/s00799-012-0079-y