OSP-Class: Open Set Pseudo-labeling with Noise Robust Training for Text Classification

Bibliographic Details
Published in: 2022 IEEE International Conference on Big Data (Big Data), pp. 5520 - 5529
Main Authors: Kim, Dohyung; Koo, Jahwan; Kim, Ung-Mo
Format: Conference Proceeding
Language: English
Published: IEEE, 17.12.2022
Summary: Although text classification performance has greatly improved with the capability of deep learning based language models, the cost of building labeled training data remains a significant burden. To address this issue, pseudo-labeling based weakly supervised text classification models have attracted significant attention. However, prior works on weakly supervised classification usually assume that each unlabeled sample belongs to one and only one of the target classes. Such an assumption may not be appropriate or realistic in open set scenarios, where the unlabeled data are obtained via a coarse collection process such as crowdsourcing. This paper studies weakly supervised text classification under open set scenarios, relaxing the closed set assumption and allowing for out-of-label-space data. We propose OSP-Class (Open Set Pseudo-labeling with noise robust training for text Classification), a simple framework that combines multiple existing techniques to build a category-specific denoising model that can filter out texts containing unknown-class data, by jointly combining (1) category vocabulary expansion with a pre-trained BERT masked language model, (2) binary pseudo-labeling with weighted random oversampling, and (3) noise robust training with the NCE+MAE loss function. Our model achieves the highest F1 score, averaging 92.49%, in the open set environment. We demonstrate that the proposed method achieves comparable, and with some minor settings slightly improved, performance even though it does not use fundamentally innovative or different techniques.
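As a rough illustration of the third component, the following is a minimal PyTorch sketch of a combined NCE+MAE loss in the Active Passive Loss style (Ma et al., 2020). The weighting coefficients alpha and beta are assumptions for illustration; the abstract does not specify the paper's settings.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NCEandMAE(nn.Module):
    """Noise-robust loss combining Normalized Cross Entropy and MAE.

    A sketch only: the alpha/beta defaults are assumed, not taken from the paper.
    """
    def __init__(self, num_classes: int, alpha: float = 1.0, beta: float = 1.0):
        super().__init__()
        self.num_classes = num_classes
        self.alpha = alpha
        self.beta = beta

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        log_probs = F.log_softmax(logits, dim=1)            # (batch, num_classes)
        one_hot = F.one_hot(targets, self.num_classes).float()

        # NCE: cross entropy against the true label, normalized by the
        # sum of cross entropies against every possible label.
        ce_all = -log_probs                                 # CE w.r.t. each class
        nce = (one_hot * ce_all).sum(dim=1) / ce_all.sum(dim=1)

        # MAE between the predicted distribution and the one-hot target;
        # for a one-hot target this equals 2 * (1 - p_y).
        mae = (log_probs.exp() - one_hot).abs().sum(dim=1)

        return (self.alpha * nce + self.beta * mae).mean()

# Usage on pseudo-labeled (and therefore possibly noisy) training batches:
criterion = NCEandMAE(num_classes=4)
loss = criterion(torch.randn(8, 4), torch.randint(0, 4, (8,)))

The first component, category vocabulary expansion via BERT's masked language model, could be sketched as follows. This shows the general technique only, not necessarily the paper's exact procedure: a category seed word is masked in context and BERT's top MLM predictions are taken as candidate vocabulary.

from transformers import pipeline

# Hypothetical sketch: candidate category words from BERT's MLM head.
fill_mask = pipeline("fill-mask", model="bert-base-uncased", top_k=10)

def expand_vocab(sentence: str, seed: str) -> list[str]:
    masked = sentence.replace(seed, fill_mask.tokenizer.mask_token, 1)
    return [pred["token_str"].strip() for pred in fill_mask(masked)]

print(expand_vocab("The match ended in a dramatic soccer victory.", "soccer"))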
DOI: 10.1109/BigData55660.2022.10020273