Keyqueries for Clustering and Labeling

Bibliographic Details
Published in Information Retrieval Technology Vol. 9994; pp. 42-55
Main Authors Gollub, Tim; Busse, Matthias; Stein, Benno; Hagen, Matthias
Format Book Chapter
Language English
Published Switzerland Springer International Publishing AG 2016
Springer International Publishing
Series Lecture Notes in Computer Science
ISBN 9783319480503
3319480502
ISSN 0302-9743
1611-3349
DOI 10.1007/978-3-319-48051-0_4

Abstract In this paper we revisit the document clustering problem from an information retrieval perspective. The idea is to use queries as features in the clustering process that finally also serve as descriptive cluster labels “for free.” Our novel perspective includes query constraints for clustering and cluster labeling that ensure consistency with a keyword-based reference search engine. Our approach combines different methods in a three-step pipeline. Overall, a query-constrained variant of k-means using noun phrase queries against an ESA-based search engine performs best. In the evaluation, we introduce a soft clustering measure as well as a freely available extended version of the Ambient dataset. We compare our approach to two often-used baselines, descriptive k-means and k-means plus χ². While the derived clusters are of comparable high quality, the evaluation of the corresponding cluster labels reveals a great diversity in the explanatory power. In a user study with 49 participants, the labels generated by our approach are of significantly higher discriminative power, leading to an increased human separability of the computed clusters.
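The "k-means plus χ²" baseline named in the abstract can be illustrated with a minimal sketch. It assumes the cluster assignments have already been computed (for example by k-means) and scores each term by the standard χ² statistic of a 2x2 table of term occurrence against cluster membership. The function names, the toy documents, and the tie-breaking rule are illustrative assumptions; the record does not give the formulation used in the paper.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic of the 2x2 contingency table [[a, b], [c, d]].

    a: documents in the cluster that contain the term
    b: documents in the cluster that lack the term
    c: documents outside the cluster that contain the term
    d: documents outside the cluster that lack the term
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # the term (or the cluster) does not discriminate at all
    return n * (a * d - b * c) ** 2 / denominator


def label_clusters(docs, assignments, top_n=2):
    """Pick the top_n terms per cluster by chi-square score.

    docs: list of token lists; assignments: one cluster id per document.
    Assumption: only terms positively associated with a cluster are kept,
    and ties are broken alphabetically.
    """
    doc_terms = [set(tokens) for tokens in docs]
    vocabulary = sorted(set().union(*doc_terms))
    labels = {}
    for cluster in sorted(set(assignments)):
        scored = []
        for term in vocabulary:
            a = b = c = d = 0
            for terms, assigned in zip(doc_terms, assignments):
                inside, present = assigned == cluster, term in terms
                if inside and present:
                    a += 1
                elif inside:
                    b += 1
                elif present:
                    c += 1
                else:
                    d += 1
            if a * d > b * c:  # term is over-represented inside the cluster
                scored.append((chi_square(a, b, c, d), term))
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        labels[cluster] = [term for _, term in scored[:top_n]]
    return labels


# Toy data (hypothetical): two senses of the ambiguous query "jaguar".
docs = [
    ["jaguar", "car", "engine"],
    ["jaguar", "car", "dealer"],
    ["jaguar", "cat", "jungle"],
    ["jaguar", "cat", "prey"],
]
assignments = [0, 0, 1, 1]  # assumed output of a prior clustering step
labels = label_clusters(docs, assignments)
print(labels)
```

In this toy setting the term "jaguar" occurs in every document and therefore receives no score, which shows why a χ² label tends to be discriminative but need not work as a query against a search engine, the consistency property that the abstract's query constraints are meant to ensure.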
Author Busse, Matthias
Hagen, Matthias
Stein, Benno
Gollub, Tim
Author_xml – sequence: 1
  givenname: Tim
  surname: Gollub
  fullname: Gollub, Tim
– sequence: 2
  givenname: Matthias
  surname: Busse
  fullname: Busse, Matthias
– sequence: 3
  givenname: Benno
  surname: Stein
  fullname: Stein, Benno
– sequence: 4
  givenname: Matthias
  surname: Hagen
  fullname: Hagen, Matthias
  email: matthias.hagen@uni-weimar.de
ContentType Book Chapter
Copyright Springer International Publishing AG 2016
Copyright_xml – notice: Springer International Publishing AG 2016
DBID FFUUA
DEWEY 025.524
DOI 10.1007/978-3-319-48051-0_4
DatabaseName ProQuest Ebook Central - Book Chapters - Demo use only
DatabaseTitleList
DeliveryMethod fulltext_linktorsrc
Discipline Computer Science
Library & Information Science
EISBN 3319480510
9783319480510
EISSN 1611-3349
Editor Liu, Yiqun
Dou, Zhicheng
Zhao, Xin
Ma, Shaoping
Wen, Ji-Rong
Chang, Yi
Zhang, Min
Editor_xml – sequence: 1
  fullname: Liu, Yiqun
– sequence: 2
  fullname: Dou, Zhicheng
– sequence: 3
  fullname: Zhao, Xin
– sequence: 4
  fullname: Ma, Shaoping
– sequence: 5
  fullname: Wen, Ji-Rong
– sequence: 6
  fullname: Chang, Yi
– sequence: 7
  fullname: Zhang, Min
EndPage 55
ExternalDocumentID EBC6307032_39_54
EBC5595042_39_54
ISBN 9783319480503
3319480502
ISSN 0302-9743
IngestDate Tue Jul 29 20:16:08 EDT 2025
Thu May 29 17:26:44 EDT 2025
Wed May 28 23:39:59 EDT 2025
IsPeerReviewed true
IsScholarly true
LCCallNum QA75.5-76.95 QA76.9.D
Language English
LinkModel OpenURL
OCLC 964690617
PQID EBC5595042_39_54
PageCount 14
ParticipantIDs springer_books_10_1007_978_3_319_48051_0_4
proquest_ebookcentralchapters_6307032_39_54
proquest_ebookcentralchapters_5595042_39_54
PublicationCentury 2000
PublicationDate 2016
PublicationDateYYYYMMDD 2016-01-01
PublicationDate_xml – year: 2016
  text: 2016
PublicationDecade 2010
PublicationPlace Switzerland
PublicationPlace_xml – name: Switzerland
– name: Cham
PublicationSeriesSubtitle Information Systems and Applications, incl. Internet/Web, and HCI
PublicationSeriesTitle Lecture Notes in Computer Science
PublicationSeriesTitleAlternate Lect.Notes Computer
PublicationSubtitle 12th Asia Information Retrieval Societies Conference, AIRS 2016, Beijing, China, November 30 - December 2, 2016, Proceedings
PublicationTitle Information Retrieval Technology
PublicationYear 2016
Publisher Springer International Publishing AG
Springer International Publishing
Publisher_xml – name: Springer International Publishing AG
– name: Springer International Publishing
RelatedPersons Kleinberg, Jon M.
Mattern, Friedemann
Naor, Moni
Mitchell, John C.
Terzopoulos, Demetri
Steffen, Bernhard
Pandu Rangan, C.
Kanade, Takeo
Kittler, Josef
Weikum, Gerhard
Hutchison, David
Tygar, Doug
RelatedPersons_xml – sequence: 1
  givenname: David
  surname: Hutchison
  fullname: Hutchison, David
– sequence: 2
  givenname: Takeo
  surname: Kanade
  fullname: Kanade, Takeo
– sequence: 3
  givenname: Josef
  surname: Kittler
  fullname: Kittler, Josef
– sequence: 4
  givenname: Jon M.
  surname: Kleinberg
  fullname: Kleinberg, Jon M.
– sequence: 5
  givenname: Friedemann
  surname: Mattern
  fullname: Mattern, Friedemann
– sequence: 6
  givenname: John C.
  surname: Mitchell
  fullname: Mitchell, John C.
– sequence: 7
  givenname: Moni
  surname: Naor
  fullname: Naor, Moni
– sequence: 8
  givenname: C.
  surname: Pandu Rangan
  fullname: Pandu Rangan, C.
– sequence: 9
  givenname: Bernhard
  surname: Steffen
  fullname: Steffen, Bernhard
– sequence: 10
  givenname: Demetri
  surname: Terzopoulos
  fullname: Terzopoulos, Demetri
– sequence: 11
  givenname: Doug
  surname: Tygar
  fullname: Tygar, Doug
– sequence: 12
  givenname: Gerhard
  surname: Weikum
  fullname: Weikum, Gerhard
SSID ssj0001761366
ssj0002792
Score 2.0426838
SourceID springer
proquest
SourceType Publisher
StartPage 42
SubjectTerms Artificial intelligence
Head Noun
Information retrieval
Noun Phrase
Retrieval Model
Search Query
Vector Space Model
Title Keyqueries for Clustering and Labeling
URI http://ebookcentral.proquest.com/lib/SITE_ID/reader.action?docID=5595042&ppg=54
http://ebookcentral.proquest.com/lib/SITE_ID/reader.action?docID=6307032&ppg=54
http://link.springer.com/10.1007/978-3-319-48051-0_4
Volume 9994