Large Language Models Enable Few-Shot Clustering
Published in | Transactions of the Association for Computational Linguistics Vol. 12; pp. 321 - 333 |
---|---|
Main Authors | Vijay Viswanathan, Kiril Gashteovski, Carolin Lawrence, Tongshuang Wu, Graham Neubig |
Format | Journal Article |
Language | English |
Published | One Broadway, 12th Floor, Cambridge, Massachusetts 02142, USA: MIT Press (The MIT Press), 05.04.2024 |
ISSN | 2307-387X |
DOI | 10.1162/tacl_a_00648 |
Abstract | Unlike traditional unsupervised clustering, semi-supervised clustering allows users to provide meaningful structure to the data, which helps the clustering algorithm to match the user’s intent. Existing approaches to semi-supervised clustering require a significant amount of feedback from an expert to improve the clusters. In this paper, we ask whether a large language model (LLM) can amplify an expert’s guidance to enable query-efficient, few-shot semi-supervised text clustering. We show that LLMs are surprisingly effective at improving clustering. We explore three stages where LLMs can be incorporated into clustering: before clustering (improving input features), during clustering (by providing constraints to the clusterer), and after clustering (using LLMs post-correction). We find that incorporating LLMs in the first two stages routinely provides significant improvements in cluster quality, and that LLMs enable a user to make trade-offs between cost and accuracy to produce desired clusters. We release our code and LLM prompts for the public to use. |
---|---|
Author | Gashteovski, Kiril; Viswanathan, Vijay; Neubig, Graham; Lawrence, Carolin; Wu, Tongshuang |
Author_xml | – sequence: 1 givenname: Vijay surname: Viswanathan fullname: Viswanathan, Vijay organization: Carnegie Mellon University, USA – sequence: 2 givenname: Kiril surname: Gashteovski fullname: Gashteovski, Kiril organization: NEC Laboratories Europe, Germany – sequence: 3 givenname: Kiril surname: Gashteovski fullname: Gashteovski, Kiril organization: Center for Advanced Interdisciplinary Research, Ss. Cyril and Methodius Uni. of Skopje, North Macedonia – sequence: 4 givenname: Carolin surname: Lawrence fullname: Lawrence, Carolin organization: NEC Laboratories Europe, Germany – sequence: 5 givenname: Tongshuang surname: Wu fullname: Wu, Tongshuang organization: Carnegie Mellon University, USA – sequence: 6 givenname: Graham surname: Neubig fullname: Neubig, Graham organization: Carnegie Mellon University, USA |
CitedBy_id | crossref_primary_10_1109_RBME_2024_3492381 crossref_primary_10_3390_su17051896 crossref_primary_10_1016_j_jretconser_2024_104078 crossref_primary_10_3390_math12182928 |
ContentType | Journal Article |
DBID | AAYXX CITATION DOA |
DOI | 10.1162/tacl_a_00648 |
DatabaseName | CrossRef DOAJ (Directory of Open Access Journals) |
DatabaseTitle | CrossRef |
DatabaseTitleList | CrossRef |
Database_xml | – sequence: 1 dbid: DOA name: DOAJ Directory of Open Access Journals (WRLC) url: https://www.doaj.org/ sourceTypes: Open Website |
DeliveryMethod | fulltext_linktorsrc |
EISSN | 2307-387X |
EndPage | 333 |
ExternalDocumentID | oai_doaj_org_article_606566fe9c5a4d6b8f0b54302362de93 10_1162_tacl_a_00648 tacl_a_00648.pdf |
GroupedDBID | AAFWJ ABUWG AFKRA AFPKN ALMA_UNASSIGNED_HOLDINGS ALSLI ARAPS BENPR BGLVJ CCPQU CPGLG CRLPW DWQXO EBS GROUPED_DOAJ HCIFZ JMNJE K7- M~E OJV OK1 PHGZT PIMPY RMI AAYXX CITATION PHGZM PQGLB PRQQA PUEGO |
ID | FETCH-LOGICAL-c372t-673ff4e5def825f8f1e48ecc873f449899b34a0a17f17b34f8da1e50bbd588633 |
IEDL.DBID | DOA |
ISSN | 2307-387X |
IngestDate | Wed Aug 27 01:13:01 EDT 2025 Thu Apr 24 22:53:39 EDT 2025 Tue Jul 01 03:28:36 EDT 2025 Thu Apr 10 09:09:01 EDT 2025 |
IsDoiOpenAccess | true |
IsOpenAccess | true |
IsPeerReviewed | true |
IsScholarly | true |
Language | English |
LinkModel | DirectLink |
MergedId | FETCHMERGED-LOGICAL-c372t-673ff4e5def825f8f1e48ecc873f449899b34a0a17f17b34f8da1e50bbd588633 |
Notes | 2024 |
OpenAccessLink | https://doaj.org/article/606566fe9c5a4d6b8f0b54302362de93 |
PageCount | 13 |
ParticipantIDs | crossref_citationtrail_10_1162_tacl_a_00648 mit_journals_10_1162_tacl_a_00648 crossref_primary_10_1162_tacl_a_00648 doaj_primary_oai_doaj_org_article_606566fe9c5a4d6b8f0b54302362de93 |
ProviderPackageCode | CITATION AAYXX |
PublicationCentury | 2000 |
PublicationDate | 2024-04-05 |
PublicationDateYYYYMMDD | 2024-04-05 |
PublicationDate_xml | – month: 04 year: 2024 text: 2024-04-05 day: 05 |
PublicationDecade | 2020 |
PublicationPlace | One Broadway, 12th Floor, Cambridge, Massachusetts 02142, USA |
PublicationPlace_xml | – name: One Broadway, 12th Floor, Cambridge, Massachusetts 02142, USA |
PublicationTitle | Transactions of the Association for Computational Linguistics |
PublicationYear | 2024 |
Publisher | MIT Press The MIT Press |
Publisher_xml | – name: MIT Press – name: The MIT Press |
SSID | ssj0001818062 |
Score | 2.4334476 |
SourceID | doaj crossref mit |
SourceType | Open Website Enrichment Source Index Database Publisher |
StartPage | 321 |
Title | Large Language Models Enable Few-Shot Clustering |
URI | https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00648 https://doaj.org/article/606566fe9c5a4d6b8f0b54302362de93 |
Volume | 12 |
hasFullText | 1 |
inHoldings | 1 |
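The abstract's "during clustering" stage, where an LLM provides constraints to the clusterer, can be illustrated with a small sketch. Everything below is an assumption made for illustration and is not taken from the authors' released code: `mock_llm_oracle` is a hypothetical stand-in for an LLM pairwise query, the one-dimensional "embeddings" are toy values, and the penalty-based k-means is a generic pairwise-constrained variant rather than the specific algorithm used in the paper.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[int, int]


def mock_llm_oracle(a: str, b: str) -> bool:
    """Hypothetical stand-in for an LLM query: 'same cluster?' if texts share a word."""
    return bool(set(a.lower().split()) & set(b.lower().split()))


def collect_constraints(
    texts: Sequence[str], pairs: Sequence[Pair], oracle: Callable[[str, str], bool]
) -> Tuple[List[Pair], List[Pair]]:
    """Ask the oracle about each candidate pair; return (must_link, cannot_link)."""
    must, cannot = [], []
    for i, j in pairs:
        (must if oracle(texts[i], texts[j]) else cannot).append((i, j))
    return must, cannot


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def constrained_kmeans(
    points: Sequence[Sequence[float]],
    k: int,
    must: Sequence[Pair],
    cannot: Sequence[Pair],
    weight: float = 10.0,
    iters: int = 10,
) -> List[int]:
    """k-means whose assignment cost adds `weight` per violated pairwise constraint."""
    centers = [list(points[i]) for i in range(k)]  # naive init: first k points
    labels = [-1] * len(points)                    # -1 means "not yet assigned"
    for _ in range(iters):
        for i, p in enumerate(points):
            def cost(c: int) -> float:
                penalty = 0
                for a, b in must:
                    if i in (a, b):
                        other = labels[b if a == i else a]
                        if other != -1 and other != c:
                            penalty += 1  # must-link partner sits elsewhere
                for a, b in cannot:
                    if i in (a, b) and labels[b if a == i else a] == c:
                        penalty += 1      # cannot-link partner sits here
                return sq_dist(p, centers[c]) + weight * penalty
            labels[i] = min(range(k), key=cost)
        for c in range(k):
            members = [points[i] for i in range(len(points)) if labels[i] == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


texts = ["reset my password", "order a pizza", "pizza delivery time", "track my meal"]
points = [[0.0], [10.0], [4.0], [9.0]]  # toy 1-D embeddings, assumed for illustration
must_link, cannot_link = collect_constraints(texts, [(1, 2), (0, 2)], mock_llm_oracle)
unconstrained = constrained_kmeans(points, 2, [], [])
constrained = constrained_kmeans(points, 2, must_link, cannot_link, weight=100.0)
```

In this toy setup the third text lies geometrically closer to the first, so plain k-means groups them together; the oracle's must-link and cannot-link answers pull it into the cluster with the other pizza-related text. Raising or lowering `weight` trades off trust in the oracle against trust in the embeddings.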