Privileged Information for Hierarchical Document Clustering: A Metric Learning Approach


Bibliographic Details
Published in 2014 22nd International Conference on Pattern Recognition, pp. 3636-3641
Main Authors Marcondes Marcacini, Ricardo; Domingues, Marcos Aurelio; Hruschka, Eduardo R.; Oliveira Rezende, Solange
Format Conference Proceeding
Language English
Published IEEE 01.08.2014
Summary: Traditional hierarchical text clustering methods assume that documents are represented only by "technical information", i.e., keywords, phrases, expressions and named entities that can be directly extracted from the texts. However, in many scenarios there is additional, valuable information about the documents that is usually disregarded during the clustering task, such as user-validated tags, annotations and comments from experts, dictionaries and domain ontologies. Recently, Vapnik introduced a new learning paradigm, called LUPI (Learning Using Privileged Information), which allows the incorporation of this additional (privileged) information in a supervised learning setting. We investigated the incorporation of privileged information in an unsupervised setting. The key idea of our proposed approach is to extract important relationships among documents represented in the privileged information space and use them to learn a more accurate metric for text clustering in the technical information space. A thorough experimental evaluation indicates that the incorporation of privileged information through metric learning significantly improves hierarchical clustering accuracy.
ISSN: 1051-4651, 2831-7475
DOI: 10.1109/ICPR.2014.625
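
The key idea stated in the summary (use relationships found in the privileged space to shape the metric used in the technical space) can be illustrated with a minimal sketch. This is not the authors' algorithm, which the record does not describe in detail. It assumes a simple stand-in: documents are first grouped in the privileged space, and those groups drive a Fisher-style diagonal feature weighting in the technical space. The function names `learn_diagonal_metric` and `cluster_with_metric`, the parameter `n_groups`, and the toy data are all hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster


def learn_diagonal_metric(X_tech, X_priv, n_groups=2, eps=1e-9):
    """Learn per-feature weights for the technical space (illustrative only).

    Assumption: documents that fall in the same group in the privileged
    space should be close in the technical space.
    """
    # Step 1: group documents using only the privileged representation.
    Z_priv = linkage(X_priv, method="average", metric="euclidean")
    groups = fcluster(Z_priv, t=n_groups, criterion="maxclust")

    # Step 2: score each technical feature by between-group scatter
    # divided by within-group scatter (a Fisher-style ratio).
    overall_mean = X_tech.mean(axis=0)
    within = np.zeros(X_tech.shape[1])
    between = np.zeros(X_tech.shape[1])
    for g in np.unique(groups):
        members = X_tech[groups == g]
        group_mean = members.mean(axis=0)
        within += ((members - group_mean) ** 2).sum(axis=0)
        between += len(members) * (group_mean - overall_mean) ** 2

    weights = between / (within + eps)
    return weights / (weights.sum() + eps)  # normalize to sum to about 1


def cluster_with_metric(X_tech, weights, n_clusters):
    """Hierarchical clustering in the technical space under the learned metric."""
    # Scaling by sqrt(w) makes plain Euclidean distance equal to the
    # weighted distance sqrt(sum_j w_j * (x_j - y_j)^2).
    X_scaled = X_tech * np.sqrt(weights)
    Z = linkage(X_scaled, method="average", metric="euclidean")
    return fcluster(Z, t=n_clusters, criterion="maxclust")


# Toy data: technical feature 0 is informative, feature 1 is noise.
X_tech = np.array([[0.0, 5.0], [0.1, -5.0], [0.2, 0.0],
                   [1.0, 5.0], [1.1, -5.0], [1.2, 0.0]])
# Privileged information (for example, expert tags) separates the two topics.
X_priv = np.array([[0.0], [0.0], [0.0], [1.0], [1.0], [1.0]])

weights = learn_diagonal_metric(X_tech, X_priv, n_groups=2)
labels = cluster_with_metric(X_tech, weights, n_clusters=2)
```

In this toy case the learned weights suppress the noisy feature, so clustering in the technical space recovers the two topics even though the privileged data is not used at clustering time. A diagonal weighting is only one possible form of learned metric; a full Mahalanobis matrix or pairwise constraints would be alternatives.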