Weakly supervised prototype topic model with discriminative seed words: modifying the category prior by self-exploring supervised signals
Dataless text classification, i.e., a new paradigm of weakly supervised learning, refers to the task of learning with unlabeled documents and a few predefined representative words of categories, known as seed words . The recent generative dataless methods construct document-specific category priors...
Saved in:
Published in | Soft computing (Berlin, Germany) Vol. 27; no. 9; pp. 5397 - 5410 |
---|---|
Main Authors | , , , , , |
Format | Journal Article |
Language | English |
Published |
Berlin/Heidelberg
Springer Berlin Heidelberg
01.05.2023
|
Subjects | |
Online Access | Get full text |
Cover
Loading…
Summary: | Dataless text classification,
i.e.,
a new paradigm of weakly supervised learning, refers to the task of learning with unlabeled documents and a few predefined representative words of categories, known as
seed words
. The recent generative dataless methods construct document-specific category priors by using seed word occurrences only; however, such category priors often contain very limited and even noisy supervised signals. To remedy this problem, in this paper, we propose a novel formulation of category prior. First, for each document, we consider its label membership degree by not only counting seed word occurrences, but also using a novel
prototype scheme
, which captures pseudo-nearest neighboring categories. Second, for each label, we consider its frequency prior knowledge of the corpus, which is also a discriminative knowledge for classification. By incorporating the proposed category prior into the previous generative dataless method, we suggest a novel generative dataless method, namely Weakly Supervised Prototype Topic Model. The experimental results on real-world datasets demonstrate that W
sptm
outperforms the existing baseline methods. |
---|---|
ISSN: | 1432-7643 1433-7479 |
DOI: | 10.1007/s00500-022-07771-9 |