Using wavelet analysis for text categorization in digital libraries: a first experiment with Strathprints
Digital libraries increasingly benefit from research on automated text categorization for improved access. Such research is typically carried out by using standard test collections. In this paper we present a pilot experiment of replacing such test collections by a set of 6000 objects from a real-w...
Published in | International journal on digital libraries |
---|---|
Main Authors | , , |
Format | Journal Article |
Language | English |
Published | 2011 |
Subjects | |
Summary: | Digital libraries increasingly benefit from research on automated text categorization for improved access. Such research is typically carried out by using standard test collections. In this paper we present a pilot experiment of replacing such test collections by a set of 6000 objects from a real-world digital repository, indexed by Library of Congress Subject Headings, and test support vector machines in a supervised learning setting for their ability to reproduce the existing classification. To augment the standard approach, we introduce a combination of two novel elements: using functions for document content representation in Hilbert space, and adding extra semantics from lexical resources to the representation. Results suggest that wavelet-based kernels slightly outperformed traditional kernels on classification reconstruction from abstracts and vice versa from full-text documents, the latter outcome due to word sense ambiguity. The practical implementation of our methodological framework enhances the analysis and representation of specific knowledge relevant to large-scale digital collections, in this case the thematic coverage of the collections. Representation of specific knowledge about digital collections is one of the basic elements of the persistent archives and the less studied one (compared to representations of digital objects and collections). Our research is an initial step in this direction developing further the methodological approach and demonstrating that text categorisation can be applied to analyse the thematic coverage in digital repositories. |
---|---|
ISSN: | 1432-1300 1432-5012 |