Large Language Models Enable Few-Shot Clustering

Bibliographic Details
Published in	Transactions of the Association for Computational Linguistics Vol. 12; pp. 321 - 333
Main Authors	Viswanathan, Vijay; Gashteovski, Kiril; Lawrence, Carolin; Wu, Tongshuang; Neubig, Graham
Format	Journal Article
Language	English
Published	MIT Press, One Broadway, 12th Floor, Cambridge, Massachusetts 02142, USA; 2024-04-05
ISSN	2307-387X
DOI	10.1162/tacl_a_00648

Abstract	Unlike traditional unsupervised clustering, semi-supervised clustering allows users to provide meaningful structure to the data, which helps the clustering algorithm to match the user’s intent. Existing approaches to semi-supervised clustering require a significant amount of feedback from an expert to improve the clusters. In this paper, we ask whether a large language model (LLM) can amplify an expert’s guidance to enable query-efficient, few-shot semi-supervised text clustering. We show that LLMs are surprisingly effective at improving clustering. We explore three stages where LLMs can be incorporated into clustering: before clustering (improving input features), during clustering (by providing constraints to the clusterer), and after clustering (using LLMs for post-correction). We find that incorporating LLMs in the first two stages routinely provides significant improvements in cluster quality, and that LLMs enable a user to make trade-offs between cost and accuracy to produce desired clusters. We release our code and LLM prompts for the public to use.
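Illustration	The "during clustering" stage named in the abstract, where constraints are handed to the clusterer, can be pictured with a small sketch. This is not the authors' released code. The function names, the toy oracle, and the example texts are hypothetical, and the oracle is only a stand-in for an LLM answering "do these two texts belong together?". The sketch collects must-link and cannot-link pairs, merges must-link pairs transitively, and counts how many constraints a candidate clustering breaks.

```python
from __future__ import annotations

from typing import Callable, Sequence


def collect_constraints(
    texts: Sequence[str],
    candidate_pairs: Sequence[tuple[int, int]],
    same_cluster: Callable[[str, str], bool],
) -> tuple[list[tuple[int, int]], list[tuple[int, int]]]:
    """Query the oracle once per pair and split answers into two lists."""
    must_link, cannot_link = [], []
    for i, j in candidate_pairs:
        if same_cluster(texts[i], texts[j]):
            must_link.append((i, j))
        else:
            cannot_link.append((i, j))
    return must_link, cannot_link


def must_link_groups(n: int, must_link: Sequence[tuple[int, int]]) -> list[int]:
    """Union-find over must-link pairs; returns one group id per item."""
    parent = list(range(n))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in must_link:
        parent[find(i)] = find(j)
    return [find(i) for i in range(n)]


def count_violations(
    labels: Sequence[int],
    must_link: Sequence[tuple[int, int]],
    cannot_link: Sequence[tuple[int, int]],
) -> int:
    """Number of pairwise constraints that a cluster labeling breaks."""
    broken_ml = sum(labels[i] != labels[j] for i, j in must_link)
    broken_cl = sum(labels[i] == labels[j] for i, j in cannot_link)
    return broken_ml + broken_cl


def toy_oracle(a: str, b: str) -> bool:
    """Hypothetical stand-in for an LLM: texts match if the first word matches."""
    return a.split()[0] == b.split()[0]
```

A constrained clusterer could use such a violation count as a penalty when assigning points. Which pairs to query, and how the prompt is phrased, are design choices the sketch leaves open.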
Author	Gashteovski, Kiril; Viswanathan, Vijay; Neubig, Graham; Lawrence, Carolin; Wu, Tongshuang
Author_xml	– sequence: 1 givenname: Vijay surname: Viswanathan fullname: Viswanathan, Vijay organization: Carnegie Mellon University, USA
– sequence: 2 givenname: Kiril surname: Gashteovski fullname: Gashteovski, Kiril organization: NEC Laboratories Europe, Germany
– sequence: 3 givenname: Kiril surname: Gashteovski fullname: Gashteovski, Kiril organization: Center for Advanced Interdisciplinary Research, Ss. Cyril and Methodius University of Skopje, North Macedonia
– sequence: 4 givenname: Carolin surname: Lawrence fullname: Lawrence, Carolin organization: NEC Laboratories Europe, Germany
– sequence: 5 givenname: Tongshuang surname: Wu fullname: Wu, Tongshuang organization: Carnegie Mellon University, USA
– sequence: 6 givenname: Graham surname: Neubig fullname: Neubig, Graham organization: Carnegie Mellon University, USA
EISSN	2307-387X
EndPage	333
IsDoiOpenAccess	true
IsOpenAccess	true
IsPeerReviewed	true
IsScholarly	true
OpenAccessLink	https://doaj.org/article/606566fe9c5a4d6b8f0b54302362de93
PageCount	13
PublicationDate	2024-04-05
StartPage	321
URI	https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00648 https://doaj.org/article/606566fe9c5a4d6b8f0b54302362de93
Volume	12