Generating Balanced Classifier-Independent Training Samples from Unlabeled Data

Bibliographic Details
Published in: Advances in Knowledge Discovery and Data Mining, pp. 266-281
Main Authors: Park, Youngja; Qi, Zijie; Chari, Suresh N.; Molloy, Ian M.
Format: Book Chapter
Language: English
Published: Berlin, Heidelberg: Springer Berlin Heidelberg
Series: Lecture Notes in Computer Science

Summary: We consider the problem of generating balanced training samples from an unlabeled data set with an unknown class distribution. Random sampling works well when the data is balanced but is highly ineffective for unbalanced data. Other approaches, such as active learning and cost-sensitive learning, are also suboptimal: they are classifier-dependent and require misclassification costs and labeled samples. We propose a new strategy for generating training samples that is independent of both the underlying class distribution of the data and the classifier that will be trained on the labeled data. Our methods are iterative and can be seen as variants of active learning, in which semi-supervised clustering is applied at each iteration to perform biased sampling from the clusters. Several strategies are provided to estimate the underlying class distributions in the clusters and to increase the balance of the training samples. Experiments with both highly skewed and balanced data from the UCI repository and a private data set show that our algorithm produces much more balanced samples than random sampling or uncertainty sampling. Furthermore, our sampling strategy is substantially more efficient than active learning methods. The experiments also confirm that, with more balanced training data, classifiers trained on our samples outperform classifiers trained with random sampling or active learning.
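The clustering-guided biased sampling described in the summary can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: it substitutes plain k-means (implemented here with NumPy only) for the paper's semi-supervised clustering, and it biases selection toward small clusters on the assumption that small clusters are more likely to contain minority-class points. All function names and parameters (`balanced_sample`, `n_rounds`, `batch_size`) are hypothetical.

```python
import numpy as np

def _kmeans_labels(X, k, rng, iters=20):
    """Minimal Lloyd's k-means; returns one cluster label per row of X.
    Stands in for the paper's semi-supervised clustering step."""
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # squared Euclidean distance from every point to every center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def balanced_sample(X, n_rounds=3, batch_size=20, n_clusters=2, seed=0):
    """Iteratively pick unlabeled points for labeling, oversampling small
    clusters (assumed more likely to hold minority-class examples)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    remaining = np.arange(len(X))   # indices not yet selected
    selected = []
    for _ in range(n_rounds):
        if len(remaining) <= n_clusters:
            break
        labels = _kmeans_labels(X[remaining], n_clusters, rng)
        sizes = np.maximum(np.bincount(labels, minlength=n_clusters), 1)
        weights = 1.0 / sizes[labels]       # inverse-size bias per point
        weights /= weights.sum()
        k = min(batch_size, len(remaining))
        picks = rng.choice(len(remaining), size=k, replace=False, p=weights)
        selected.extend(remaining[picks].tolist())
        remaining = np.delete(remaining, picks)
    return selected
```

In each round the inverse-size weights give every cluster roughly equal total selection probability, so a rare class concentrated in a small cluster is sampled far more often than its population fraction would suggest under uniform random sampling.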
ISBN: 3642302165, 9783642302169
ISSN: 0302-9743, 1611-3349
DOI: 10.1007/978-3-642-30217-6_23