Keyqueries for Clustering and Labeling

In this paper we revisit the document clustering problem from an information retrieval perspective. The idea is to use queries as features in the clustering process that finally also serve as descriptive cluster labels “for free.” Our novel perspective includes query constraints for clustering and c...

Bibliographic Details
Published in	Information Retrieval Technology Vol. 9994; pp. 42 - 55
Main Authors	Gollub, Tim; Busse, Matthias; Stein, Benno; Hagen, Matthias
Format	Book Chapter
Language	English
Published	Switzerland: Springer International Publishing AG, 2016
Series	Lecture Notes in Computer Science
Subjects	Artificial intelligence; Head Noun; Information retrieval; Noun Phrase; Retrieval Model; Search Query; Vector Space Model
ISBN	9783319480503; 3319480502
ISSN	0302-9743; 1611-3349
DOI	10.1007/978-3-319-48051-0_4

Abstract	In this paper we revisit the document clustering problem from an information retrieval perspective. The idea is to use queries as features in the clustering process that finally also serve as descriptive cluster labels “for free.” Our novel perspective includes query constraints for clustering and cluster labeling that ensure consistency with a keyword-based reference search engine. Our approach combines different methods in a three-step pipeline. Overall, a query-constrained variant of k-means using noun phrase queries against an ESA-based search engine performs best. In the evaluation, we introduce a soft clustering measure as well as a freely available extended version of the Ambient dataset. We compare our approach to two often-used baselines, descriptive k-means and k-means plus $\chi^2$. While the derived clusters are of comparable high quality, the evaluation of the corresponding cluster labels reveals a great diversity in the explanatory power. In a user study with 49 participants, the labels generated by our approach are of significantly higher discriminative power, leading to an increased human separability of the computed clusters.
Author	Busse, Matthias; Hagen, Matthias; Stein, Benno; Gollub, Tim
Author_xml	– sequence: 1 givenname: Tim surname: Gollub fullname: Gollub, Tim – sequence: 2 givenname: Matthias surname: Busse fullname: Busse, Matthias – sequence: 3 givenname: Benno surname: Stein fullname: Stein, Benno – sequence: 4 givenname: Matthias surname: Hagen fullname: Hagen, Matthias email: matthias.hagen@uni-weimar.de
ContentType	Book Chapter
Copyright	Springer International Publishing AG 2016
Copyright_xml	– notice: Springer International Publishing AG 2016
DEWEY	025.524
DatabaseName	ProQuest Ebook Central - Book Chapters - Demo use only
Discipline	Computer Science Library & Information Science
EISBN	3319480510 9783319480510
EISSN	1611-3349
Editor	Liu, Yiqun; Dou, Zhicheng; Zhao, Xin; Ma, Shaoping; Wen, Ji-Rong; Chang, Yi; Zhang, Min
Editor_xml	– sequence: 1 fullname: Liu, Yiqun – sequence: 2 fullname: Dou, Zhicheng – sequence: 3 fullname: Zhao, Xin – sequence: 4 fullname: Ma, Shaoping – sequence: 5 fullname: Wen, Ji-Rong – sequence: 6 fullname: Chang, Yi – sequence: 7 fullname: Zhang, Min
EndPage	55
IsPeerReviewed	true
IsScholarly	true
LCCallNum	QA75.5-76.95; QA76.9.D
OCLC	964690617
PageCount	14
PublicationCentury	2000
PublicationDate	2016
PublicationDateYYYYMMDD	2016-01-01
PublicationDate_xml	– year: 2016 text: 2016
PublicationDecade	2010
PublicationPlace	Switzerland
PublicationPlace_xml	– name: Switzerland – name: Cham
PublicationSeriesSubtitle	Information Systems and Applications, incl. Internet/Web, and HCI
PublicationSeriesTitle	Lecture Notes in Computer Science
PublicationSeriesTitleAlternate	Lect.Notes Computer
PublicationSubtitle	12th Asia Information Retrieval Societies Conference, AIRS 2016, Beijing, China, November 30 - December 2, 2016, Proceedings
PublicationTitle	Information Retrieval Technology
PublicationYear	2016
Publisher	Springer International Publishing AG; Springer International Publishing
Publisher_xml	– name: Springer International Publishing AG – name: Springer International Publishing
RelatedPersons	Kleinberg, Jon M.; Mattern, Friedemann; Naor, Moni; Mitchell, John C.; Terzopoulos, Demetri; Steffen, Bernhard; Pandu Rangan, C.; Kanade, Takeo; Kittler, Josef; Weikum, Gerhard; Hutchison, David; Tygar, Doug
RelatedPersons_xml	– sequence: 1 givenname: David surname: Hutchison fullname: Hutchison, David – sequence: 2 givenname: Takeo surname: Kanade fullname: Kanade, Takeo – sequence: 3 givenname: Josef surname: Kittler fullname: Kittler, Josef – sequence: 4 givenname: Jon M. surname: Kleinberg fullname: Kleinberg, Jon M. – sequence: 5 givenname: Friedemann surname: Mattern fullname: Mattern, Friedemann – sequence: 6 givenname: John C. surname: Mitchell fullname: Mitchell, John C. – sequence: 7 givenname: Moni surname: Naor fullname: Naor, Moni – sequence: 8 givenname: C. surname: Pandu Rangan fullname: Pandu Rangan, C. – sequence: 9 givenname: Bernhard surname: Steffen fullname: Steffen, Bernhard – sequence: 10 givenname: Demetri surname: Terzopoulos fullname: Terzopoulos, Demetri – sequence: 11 givenname: Doug surname: Tygar fullname: Tygar, Doug – sequence: 12 givenname: Gerhard surname: Weikum fullname: Weikum, Gerhard
SourceType	Publisher
StartPage	42
SubjectTerms	Artificial intelligence; Head Noun; Information retrieval; Noun Phrase; Retrieval Model; Search Query; Vector Space Model
Title	Keyqueries for Clustering and Labeling
URI	http://ebookcentral.proquest.com/lib/SITE_ID/reader.action?docID=5595042&ppg=54 http://ebookcentral.proquest.com/lib/SITE_ID/reader.action?docID=6307032&ppg=54 http://link.springer.com/10.1007/978-3-319-48051-0_4
Volume	9994