Weakly supervised prototype topic model with discriminative seed words: modifying the category prior by self-exploring supervised signals

Bibliographic Details
Published in Soft computing (Berlin, Germany) Vol. 27; no. 9; pp. 5397-5410
Main Authors Li, Ximing, Wang, Bing, Wang, Yue, Ouyang, Jihong, Garg, Harish, Thanh, Dang N. H.
Format Journal Article
Language English
Published Berlin/Heidelberg Springer Berlin Heidelberg 01.05.2023
Summary: Dataless text classification, i.e., a new paradigm of weakly supervised learning, refers to the task of learning with unlabeled documents and a few predefined representative words of categories, known as seed words. The recent generative dataless methods construct document-specific category priors by using seed word occurrences only; however, such category priors often contain very limited and even noisy supervised signals. To remedy this problem, in this paper, we propose a novel formulation of category prior. First, for each document, we consider its label membership degree by not only counting seed word occurrences, but also using a novel prototype scheme, which captures pseudo-nearest neighboring categories. Second, for each label, we consider its frequency prior knowledge of the corpus, which is also a discriminative knowledge for classification. By incorporating the proposed category prior into the previous generative dataless method, we suggest a novel generative dataless method, namely Weakly Supervised Prototype Topic Model (WSPTM). The experimental results on real-world datasets demonstrate that WSPTM outperforms the existing baseline methods.
ISSN: 1432-7643, 1433-7479
DOI: 10.1007/s00500-022-07771-9
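
The summary describes a document-specific category prior built from seed word occurrences and combined with corpus-level label frequency knowledge. The Python sketch below is only an illustration of that general idea, not the formulation used in the article: the function names, the additive smoothing, the averaging used to estimate label frequencies, and the multiplicative combination are all assumptions, and the prototype scheme is not modeled because this record gives no details of it.

```python
from collections import Counter


def seed_count_prior(tokens, seed_words, smoothing=1.0):
    """Document-specific category prior from smoothed seed word counts.

    Assumption: plain additive smoothing followed by normalization.
    """
    counts = Counter(tokens)
    raw = {
        category: smoothing + sum(counts[word] for word in words)
        for category, words in seed_words.items()
    }
    total = sum(raw.values())
    return {category: value / total for category, value in raw.items()}


def label_frequency_prior(corpus, seed_words, smoothing=1.0):
    """Corpus-level label frequencies, estimated here (an assumption)
    by averaging the document-specific priors over all documents."""
    sums = {category: 0.0 for category in seed_words}
    for tokens in corpus:
        doc_prior = seed_count_prior(tokens, seed_words, smoothing)
        for category, prob in doc_prior.items():
            sums[category] += prob
    return {category: total / len(corpus) for category, total in sums.items()}


def combined_prior(tokens, corpus, seed_words, smoothing=1.0):
    """Multiply the document prior by the label frequency prior and
    renormalize. The multiplicative form is a hypothetical choice."""
    doc_prior = seed_count_prior(tokens, seed_words, smoothing)
    freq_prior = label_frequency_prior(corpus, seed_words, smoothing)
    raw = {c: doc_prior[c] * freq_prior[c] for c in seed_words}
    total = sum(raw.values())
    return {c: value / total for c, value in raw.items()}


if __name__ == "__main__":
    # Toy seed words and corpus, invented purely for illustration.
    seed_words = {"sports": ["game", "team"], "politics": ["vote", "election"]}
    corpus = [
        ["the", "team", "won", "the", "game"],
        ["election", "vote", "count"],
        ["team", "game", "season"],
    ]
    print(combined_prior(corpus[0], corpus, seed_words))
```

With smoothing, a document that contains no seed words still receives a nonzero prior for every category, so the label frequency term decides its prior in this sketch.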