Clustering Algorithms and RAG Enhancing Semi-Supervised Text Classification with Large LLMs

Bibliographic Details
Main Authors	Zhong, Shan; Zeng, Jiahao; Yu, Yongxin; Lin, Bohong
Format	Journal Article
Language	English
Published	09.11.2024
Subjects	Computer Science - Computation and Language; Computer Science - Learning

Summary:	This paper introduces an innovative semi-supervised learning approach for text classification, addressing the challenge of abundant data but limited labeled examples. Our methodology integrates few-shot learning with retrieval-augmented generation (RAG) and conventional statistical clustering, enabling effective learning from a minimal number of labeled instances while generating high-quality labeled data. To the best of our knowledge, we are the first to incorporate RAG alongside clustering in text data generation. Our experiments on the Reuters and Web of Science datasets demonstrate state-of-the-art performance, with few-shot augmented data alone producing results nearly equivalent to those achieved with fully labeled datasets. Notably, accuracies of 95.41% and 82.43% were achieved for complex text document classification tasks, where the number of categories can exceed 100.
DOI:	10.48550/arxiv.2411.06175