Weakly supervised prototype topic model with discriminative seed words: modifying the category prior by self-exploring supervised signals

Dataless text classification, i.e., a new paradigm of weakly supervised learning, refers to the task of learning with unlabeled documents and a few predefined representative words of categories, known as seed words. The recent generative dataless methods construct document-specific category priors...

Bibliographic Details
Published in	Soft computing (Berlin, Germany) Vol. 27; no. 9; pp. 5397-5410
Main Authors	Li, Ximing; Wang, Bing; Wang, Yue; Ouyang, Jihong; Garg, Harish; Thanh, Dang N. H.
Format	Journal Article
Language	English
Published	Berlin/Heidelberg: Springer Berlin Heidelberg, 01.05.2023
Subjects	Artificial Intelligence; Computational Intelligence; Control; Data Analytics and Machine Learning; Engineering; Mathematical Logic and Foundations; Mechatronics; Robotics; Topic modeling; Dataless text classification; Prototype scheme; Seed words; Category prior

Summary:	Dataless text classification, a new paradigm of weakly supervised learning, refers to the task of learning from unlabeled documents and a few predefined representative words for each category, known as seed words. Recent generative dataless methods construct document-specific category priors from seed word occurrences only; however, such category priors often carry very limited and even noisy supervised signals. To remedy this problem, we propose a novel formulation of the category prior. First, for each document, we estimate its label membership degree not only by counting seed word occurrences, but also by using a novel prototype scheme, which captures pseudo-nearest neighboring categories. Second, for each label, we consider prior knowledge of its frequency in the corpus, which is also discriminative for classification. By incorporating the proposed category prior into the previous generative dataless method, we obtain a novel generative dataless method, namely the Weakly Supervised Prototype Topic Model (WSPTM). Experimental results on real-world datasets demonstrate that WSPTM outperforms the existing baseline methods.
ISSN:	1432-7643 1433-7479
DOI:	10.1007/s00500-022-07771-9
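
Illustrative note: the summary describes a document-specific category prior built from seed word occurrences and then adjusted by how frequent each label is in the corpus. The Python sketch below shows one simple way such a prior could be computed. It is not the formulation from the article: the function names, the additive smoothing constant, and the multiplicative use of label frequencies are all assumptions made for illustration, and the prototype scheme is left out because the summary does not define it.

```python
from collections import Counter


def seed_count_prior(tokens, seed_words, smoothing=0.1):
    """Smoothed, normalized seed-word counts per category for one document.

    Hypothetical helper: `smoothing` is an assumed additive constant that
    keeps categories with no seed-word hits from getting zero mass.
    """
    counts = Counter(tokens)
    raw = {
        category: smoothing + sum(counts[word] for word in words)
        for category, words in seed_words.items()
    }
    total = sum(raw.values())
    return {category: value / total for category, value in raw.items()}


def reweight_by_label_frequency(prior, label_freq):
    """Multiply each category weight by an assumed label frequency, then renormalize.

    `label_freq` is supplied by the caller here; how the article estimates
    it from the corpus is not stated in the summary.
    """
    weighted = {category: p * label_freq[category] for category, p in prior.items()}
    total = sum(weighted.values())
    return {category: value / total for category, value in weighted.items()}


# Toy data, invented for this example only.
seed_words = {"sports": ["game", "team"], "politics": ["vote", "election"]}
doc = "the team won the game after the vote".split()

prior = seed_count_prior(doc, seed_words)
adjusted = reweight_by_label_frequency(prior, {"sports": 0.2, "politics": 0.8})
```

In this toy document the seed-count prior favors "sports" (two seed hits against one), while the assumed label frequencies shift the adjusted prior toward "politics". This is the kind of correction a corpus-level frequency signal can contribute when seed word counts alone are sparse or noisy.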