Enhancing Binary Classification by Modeling Uncertain Boundary in Three-Way Decisions (Extended Abstract)

Bibliographic Details
Published in: 2018 IEEE 34th International Conference on Data Engineering (ICDE), pp. 1827-1828
Main Authors: Yuefeng Li, Libiao Zhang, Yue Xu, Yiyu Yao, Raymond Y. K. Lau, Yutong Wu
Format Conference Proceeding
Language English
Published IEEE 01.04.2018
Summary: Text classification techniques play a crucial role in identifying relevant texts in large data sets, e.g., texts related to online crimes such as cyberbullying, terrorist recruiting, propaganda, or attack planning. Supervised deep learning has brought about breakthroughs in processing multimedia data; however, there has been no good practical way to exploit this opportunity for text classification, because acquiring and maintaining a massive number of training examples is too expensive for a large number of categories (e.g., the Yahoo! taxonomy contains nearly 300,000 categories, and the Library of Congress Subject Headings (LCSH) contain 394,070 subjects). Therefore, the question of how to learn effectively from a sparse or small set of training examples is crucial for the true success of text classification. Semi-supervised approaches have been proposed for this challenge; they usually use a pair of, or several, existing classifiers to extend a small training set. However, the extracted pseudo training samples are uncertain because they are determined by a machine rather than by people. In addition, the massive volume and high variability of text data create challenging issues such as scalability and complicated relations between words. Existing classifiers have two fundamental performance issues: overlook and overload. Overlook means that some objects relevant to a class have been omitted, whereas overload means that some objects assigned to a class are actually not relevant to that class. The two issues are even more serious in two cases: (1) large uncertain boundary, where the decision boundary between two classes includes many mixed examples (e.g., relevant and nonrelevant documents together), and (2) unbalanced classes, where one class (e.g., information about terrorist attacks) is much smaller than another class (e.g., normal descriptions).
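As an illustration of the two issues just defined, the following minimal Python sketch counts overlooked and overloaded objects for a single class. It assumes that overlook corresponds to relevant objects the classifier missed and overload to nonrelevant objects it accepted; the function name `overlook_overload` and the toy labels are hypothetical and do not come from the paper.

```python
def overlook_overload(y_true, y_pred):
    """Count overlooked and overloaded objects for the positive class.

    y_true, y_pred: equal-length sequences of 0/1 labels (1 = relevant).
    Assumption: overlook = relevant objects that were omitted,
                overload = assigned objects that are not relevant.
    """
    pairs = list(zip(y_true, y_pred))
    overlook = sum(1 for t, p in pairs if t == 1 and p == 0)
    overload = sum(1 for t, p in pairs if t == 0 and p == 1)
    return overlook, overload


# Toy, made-up labels: 3 relevant and 5 nonrelevant documents.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0]
counts = overlook_overload(y_true, y_pred)
print(counts)
```

With an unbalanced class like the one in this toy example, even a few overlooked documents remove a large share of the relevant class, which is one way to read the second case described above.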
We propose a three-way decision model [1] for dealing with the uncertain boundary to improve text classification performance, based on rough set techniques and a centroid solution. The model aims to understand the uncertain boundary by partitioning the training samples into three regions (the positive, boundary, and negative regions) using two main boundary vectors created from the labeled positive and negative training subsets, respectively, and then resolves the objects in the boundary region using two derived boundary vectors produced according to the structure of the boundary region. Four decision rules are derived from the training process and applied to incoming documents for more precise classification. Experimental results on the standard data sets RCV1 and Reuters-21578 show that the use of boundary vectors is very effective and efficient for dealing with uncertainties of the decision boundary, and that the proposed model significantly improves the performance of binary text classification in terms of F1 measure and AUC compared with six other popular baseline models.
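The abstract does not specify how the boundary vectors or decision rules are computed, so the following Python sketch only illustrates the general idea of a three-way partition. It is not the authors' method: it assumes plain class centroids stand in for the two main boundary vectors, scores each document by the difference of its cosine similarities to the two centroids, and uses hypothetical thresholds `alpha` and `beta` to separate the positive, boundary, and negative regions.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def three_way_partition(docs, c_pos, c_neg, alpha=0.2, beta=-0.2):
    """Assign each document index to the positive, boundary, or negative region.

    Assumption (not from the paper): score = cos(d, c_pos) - cos(d, c_neg);
    score >= alpha -> positive, score <= beta -> negative, else boundary.
    """
    regions = {"positive": [], "boundary": [], "negative": []}
    for i, d in enumerate(docs):
        score = cosine(d, c_pos) - cosine(d, c_neg)
        if score >= alpha:
            regions["positive"].append(i)
        elif score <= beta:
            regions["negative"].append(i)
        else:
            regions["boundary"].append(i)
    return regions


# Toy, made-up term-weight vectors for labeled training documents.
pos_train = [[3.0, 0.0, 1.0], [2.0, 1.0, 0.0]]
neg_train = [[0.0, 3.0, 1.0], [1.0, 2.0, 0.0]]
c_pos, c_neg = centroid(pos_train), centroid(neg_train)

docs = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [1.0, 1.0, 1.0]]
regions = three_way_partition(docs, c_pos, c_neg)
print(regions)
```

In this toy run, the third document is equally similar to both centroids and therefore lands in the boundary region, which is the kind of mixed example the model is designed to handle separately.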
ISSN:2375-026X
DOI:10.1109/ICDE.2018.00271