Online feature selection for high-dimensional class-imbalanced data

Bibliographic Details
Published in	Knowledge-based systems Vol. 136; pp. 187-199
Main Authors	Zhou, Peng; Hu, Xuegang; Li, Peipei; Wu, Xindong
Format	Journal Article
Language	English
Published	Amsterdam: Elsevier B.V; Elsevier Science Ltd, 15.11.2017
Subjects	Algorithms; Class imbalance; Data mining; Datasets; Feature extraction; Fraud; High dimensional; Neighborhood rough set; Online data bases; Online feature selection; Set theory
Summary:	When tackling high dimensionality in data mining, online feature selection, which deals with features arriving one by one over time, offers advantages over traditional feature selection methods. However, in real-world applications such as fraud detection and medical diagnosis, the data is both high-dimensional and highly class-imbalanced; that is, some classes have many more instances than others. Under such class imbalance, existing online feature selection algorithms usually ignore the small classes, which can be the important ones in these applications. Learning from high-dimensional, class-imbalanced data in an online manner is therefore a challenge. Motivated by this, we first formalize the problem of online streaming feature selection for class-imbalanced data and then present an efficient online feature selection framework based on the dependency between condition features and decision classes. We also propose a new algorithm, Online Feature Selection based on the Dependency in K nearest neighbors, called K-OFSD. Drawing on Neighborhood Rough Set theory, K-OFSD uses the information of nearest neighbors to select relevant features that yield higher separability between the majority class and the minority class. Finally, experimental studies on seven high-dimensional, class-imbalanced data sets show that our algorithm can achieve better performance than traditional feature selection methods given the same number of features, and than state-of-the-art online streaming feature selection algorithms in an online manner.
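
The summary describes selecting streaming features by a dependency measured over k nearest neighbors. The following is only a minimal sketch of that general idea and not the authors' K-OFSD: it assumes, for illustration, that an instance counts as consistent when all of its k nearest neighbors (Euclidean distance, self excluded) share its class, that dependency is the fraction of consistent instances, and that an arriving feature is kept only when it strictly raises that fraction. The names `knn_dependency` and `stream_select` and the toy data are hypothetical.

```python
import numpy as np

def knn_dependency(X, y, k=2):
    """Fraction of instances whose k nearest neighbors all share their label.

    Assumed stand-in for a neighborhood-based dependency measure.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise Euclidean distances in the current feature subspace.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # an instance is not its own neighbor
    nn = np.argsort(dist, axis=1, kind="stable")[:, :k]
    consistent = (y[nn] == y[:, None]).all(axis=1)
    return float(consistent.mean())

def stream_select(X, y, k=2):
    """Greedy pass over features in arrival order (assumed acceptance rule)."""
    selected, best = [], 0.0
    for j in range(X.shape[1]):  # features arrive one at a time
        candidate = selected + [j]
        dep = knn_dependency(X[:, candidate], y, k)
        if dep > best:  # keep the feature only if dependency improves
            selected, best = candidate, dep
    return selected, best

# Toy imbalanced data: 8 majority instances, 3 minority instances.
y = np.array([0] * 8 + [1] * 3)
informative = np.array([0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 5.0, 5.1, 5.2])
noisy = np.array([0, 2, 4, 6, 8, 10, 12, 14, 1, 7, 13], dtype=float)
X = np.column_stack([informative, noisy])

selected, best = stream_select(X, y, k=2)
print(selected, best)
```

In this toy run the first feature separates the two classes on its own, so the second, which interleaves minority and majority instances, lowers the dependency and is discarded. The sketch does no redundancy analysis, so the result depends on arrival order.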
ISSN:	0950-7051, 1872-7409
DOI:	10.1016/j.knosys.2017.09.006