A subspace ensemble framework for classification with high dimensional missing data

Real world classification tasks may involve high dimensional missing data. The traditional approach to handling the missing data is to impute the data first, and then apply the traditional classification algorithms on the imputed data. This method first assumes that there exist a distribution or fea...

Full description

Saved in:
Bibliographic Details
Published inMultidimensional systems and signal processing Vol. 28; no. 4; pp. 1309 - 1324
Main Authors Gao, Hang, Jian, Songlei, Peng, Yuxing, Liu, Xinwang
Format Journal Article
LanguageEnglish
Published New York Springer US 01.10.2017
Springer Nature B.V
Subjects
Online AccessGet full text

Cover

Loading…
More Information
Summary:Real world classification tasks may involve high dimensional missing data. The traditional approach to handling the missing data is to impute the data first, and then apply the traditional classification algorithms on the imputed data. This method first assumes that there exist a distribution or feature relations among the data, and then estimates missing items with existing observed values. A reasonable assumption is a necessary guarantee for accurate imputation. The distribution or feature relations of data, however, is often complex or even impossible to be captured in high dimensional data sets, leading to inaccurate imputation. In this paper, we propose a complete-case projection subspace ensemble framework, where two alternative partition strategies, namely bootstrap subspace partition and missing pattern-sensitive subspace partition, are developed for incomplete datasets with even missing patterns and uneven missing patterns, respectively. Multiple component classifiers are then separately trained in these subspaces. After that, a final ensemble classifier is constructed by a weighted majority vote of component classifiers. In the experiments, we demonstrate the effectiveness of the proposed framework over eight high dimensional UCI datasets. Meanwhile, we apply the two proposed partition strategies over data sets with different missing patterns. As indicated, the proposed algorithm significantly outperforms existing imputation methods in most cases.
ISSN:0923-6082
1573-0824
DOI:10.1007/s11045-016-0393-4