Uncertainty-based optimal sample selection for big data

Bibliographic Details
Published in: IEEE Access, Vol. 11, p. 1
Main Authors: Ajmal, S.; Ashfaq, R.A.R.
Format: Journal Article
Language: English
Published: Piscataway: IEEE (The Institute of Electrical and Electronics Engineers, Inc.), 01.01.2023
Summary: Instance selection (IS) is a primary data reduction technique in knowledge discovery in databases (KDD) that maximizes the performance (or generalization ability) of a classification algorithm. Its purpose is to reduce the size of the original dataset by eliminating unnecessary instances while maintaining the predictive performance of the classifier in mining tasks. However, constructing a reduced sample set carries a high computational cost, which renders well-known IS methods inapplicable to very large datasets. To this end, an uncertainty-based IS framework (called UBOSS) is proposed that can be combined with any IS algorithm and aims to efficiently select a set of optimal samples while minimizing the computational cost associated with large-scale data. The proposed methodology comprises three main steps. First, it uses an IS method to identify the patterns of representative and unrepresentative samples in the original dataset. Next, an uncertainty-based selector formulates fuzzy samples using a classifier whose output is a membership vector corresponding to each training sample. Finally, this process is integrated with a divide-and-conquer strategy to obtain the final representative sample set, which yields improved prediction accuracy and low computation time. Experiments are conducted on six large-scale datasets, and the framework is compared with a baseline method built on three well-known IS algorithms: CNN, IB3, and DROP3. The results indicate that the proposed methodology outperforms the baselines in run time and search-space size while maintaining final classification accuracy.
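The summary describes the general pattern of the framework: score each training sample by the uncertainty of a classifier's membership vector, keep the fuzzy (boundary) samples, and apply the selector chunk-by-chunk in a divide-and-conquer fashion. A minimal sketch of that pattern is below; it is not the published UBOSS algorithm. The function names (`uboss_like`, `select_chunk`), the inverse-distance membership formula, and the margin-based uncertainty score are all illustrative assumptions standing in for the paper's classifier and selector.

```python
import numpy as np

def membership_vector(x, protos, labels, classes):
    """Inverse-distance class membership for sample x (illustrative formula)."""
    d = []
    for c in classes:
        pc = protos[labels == c]
        d.append(np.inf if len(pc) == 0 else np.linalg.norm(pc - x, axis=1).min())
    w = 1.0 / (np.array(d) + 1e-9)   # closer class => larger membership
    return w / w.sum()

def uncertainty(m):
    """Margin-based uncertainty: a small top-two gap marks a fuzzy (boundary) sample."""
    s = np.sort(m)[::-1]
    return 1.0 - (s[0] - s[1])

def select_chunk(X, y, classes, keep_ratio):
    """Keep the most uncertain samples of one chunk (leave-one-out memberships)."""
    n = len(X)
    scores = np.empty(n)
    for i in range(n):
        mask = np.ones(n, dtype=bool)
        mask[i] = False   # exclude the sample itself from the prototype set
        scores[i] = uncertainty(membership_vector(X[i], X[mask], y[mask], classes))
    k = max(1, int(n * keep_ratio))
    return np.argsort(scores)[::-1][:k]   # indices of the k fuzziest samples

def uboss_like(X, y, n_chunks=4, keep_ratio=0.5, seed=0):
    """Divide-and-conquer: run the selector on each chunk, then union the picks."""
    classes = np.unique(y)
    order = np.random.default_rng(seed).permutation(len(X))
    selected = []
    for chunk in np.array_split(order, n_chunks):
        idx = select_chunk(X[chunk], y[chunk], classes, keep_ratio)
        selected.extend(chunk[idx].tolist())
    return np.array(sorted(selected))
```

Because each chunk is scored independently, the quadratic cost of the leave-one-out scoring applies only within a chunk rather than across the whole dataset, which is the efficiency argument the summary makes for the divide-and-conquer step.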
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2022.3233598