Keyqueries for Clustering and Labeling

In this paper we revisit the document clustering problem from an information retrieval perspective. The idea is to use queries as features in the clustering process that finally also serve as descriptive cluster labels “for free.” Our novel perspective includes query constraints for clustering and c...

Full description

Saved in:

Bibliographic Details
Published in	Information Retrieval Technology Vol. 9994; pp. 42 - 55
Main Authors	Gollub, Tim, Busse, Matthias, Stein, Benno, Hagen, Matthias
Format	Book Chapter
Language	English
Published	Switzerland Springer International Publishing AG 2016 Springer International Publishing
Series	Lecture Notes in Computer Science
Subjects	Artificial intelligence Head Noun Information retrieval Noun Phrase Retrieval Model Search Query Vector Space Model
Online Access	Get full text
ISBN	9783319480503 3319480502
ISSN	0302-9743 1611-3349
DOI	10.1007/978-3-319-48051-0_4

Cover

Loading…

More Information
Summary:	In this paper we revisit the document clustering problem from an information retrieval perspective. The idea is to use queries as features in the clustering process that finally also serve as descriptive cluster labels “for free.” Our novel perspective includes query constraints for clustering and cluster labeling that ensure consistency with a keyword-based reference search engine. Our approach combines different methods in a three-step pipeline. Overall, a query-constrained variant of k-means using noun phrase queries against an ESA-based search engine performs best. In the evaluation, we introduce a soft clustering measure as well as a freely available extended version of the Ambient dataset. We compare our approach to two often-used baselines, descriptive k-means and k-means plus χ2 $$\chi ^2$$ . While the derived clusters are of comparable high quality, the evaluation of the corresponding cluster labels reveals a great diversity in the explanatory power. In a user study with 49 participants, the labels generated by our approach are of significantly higher discriminative power, leading to an increased human separability of the computed clusters.
Bibliography:	Original Abstract: In this paper we revisit the document clustering problem from an information retrieval perspective. The idea is to use queries as features in the clustering process that finally also serve as descriptive cluster labels “for free.” Our novel perspective includes query constraints for clustering and cluster labeling that ensure consistency with a keyword-based reference search engine. Our approach combines different methods in a three-step pipeline. Overall, a query-constrained variant of k-means using noun phrase queries against an ESA-based search engine performs best. In the evaluation, we introduce a soft clustering measure as well as a freely available extended version of the Ambient dataset. We compare our approach to two often-used baselines, descriptive k-means and k-means plus χ2\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\chi ^2$$\end{document}. While the derived clusters are of comparable high quality, the evaluation of the corresponding cluster labels reveals a great diversity in the explanatory power. In a user study with 49 participants, the labels generated by our approach are of significantly higher discriminative power, leading to an increased human separability of the computed clusters.
ISBN:	9783319480503 3319480502
ISSN:	0302-9743 1611-3349
DOI:	10.1007/978-3-319-48051-0_4