Uncertainty-based optimal sample selection for big data

Bibliographic Details
Published in: IEEE Access, Vol. 11, p. 1
Main Authors: Ajmal, S.; Ashfaq, R.A.R.
Format: Journal Article
Language: English
Published: Piscataway: IEEE (The Institute of Electrical and Electronics Engineers, Inc.), 01.01.2023
Summary: Instance selection (IS) is a primary data reduction technique in knowledge discovery in databases (KDD) that maximizes the performance (or generalization ability) of a classification algorithm. Its purpose is to reduce the size of the original dataset by eliminating unnecessary instances while maintaining the predictive performance of the classifier in mining tasks. However, constructing a reduced sample set carries a high computational cost, which renders well-known IS methods inapplicable to very large datasets. To this end, an uncertainty-based IS framework (called UBOSS) is proposed that can be combined with any IS algorithm and aims to efficiently select a set of optimal samples while minimizing the computational cost associated with large-scale data. The proposed methodology comprises three main steps. First, it uses an IS method to identify the patterns of representative and unrepresentative samples in the original dataset. Next, an uncertainty-based selector formulates fuzzy samples using a classifier whose output is a membership vector corresponding to each training sample. Finally, this process is integrated with a divide-and-conquer strategy to obtain the final representative sample set, which yields improved prediction accuracy and low computation time. Experiments are conducted on six large-scale datasets, and the framework is compared with a baseline method built on three well-known IS algorithms: CNN, IB3, and DROP3. The results indicate that the proposed methodology outperforms the baselines in run time and search-space size while maintaining final classification accuracy.
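The summary describes the general pattern of the framework: score each training sample by the uncertainty of a classifier's membership vector, keep the fuzzy (boundary) samples, and apply the selector chunk-by-chunk in a divide-and-conquer fashion. A minimal sketch of that pattern is below; it is not the published UBOSS algorithm. The function names (`uboss_like`, `select_chunk`), the inverse-distance membership formula, and the margin-based uncertainty score are all illustrative assumptions standing in for the paper's classifier and selector.

```python
import numpy as np

def membership_vector(x, protos, labels, classes):
    """Inverse-distance class membership for sample x (illustrative formula)."""
    d = []
    for c in classes:
        pc = protos[labels == c]
        d.append(np.inf if len(pc) == 0 else np.linalg.norm(pc - x, axis=1).min())
    w = 1.0 / (np.array(d) + 1e-9)   # closer class => larger membership
    return w / w.sum()

def uncertainty(m):
    """Margin-based uncertainty: a small top-two gap marks a fuzzy (boundary) sample."""
    s = np.sort(m)[::-1]
    return 1.0 - (s[0] - s[1])

def select_chunk(X, y, classes, keep_ratio):
    """Keep the most uncertain samples of one chunk (leave-one-out memberships)."""
    n = len(X)
    scores = np.empty(n)
    for i in range(n):
        mask = np.ones(n, dtype=bool)
        mask[i] = False   # exclude the sample itself from the prototype set
        scores[i] = uncertainty(membership_vector(X[i], X[mask], y[mask], classes))
    k = max(1, int(n * keep_ratio))
    return np.argsort(scores)[::-1][:k]   # indices of the k fuzziest samples

def uboss_like(X, y, n_chunks=4, keep_ratio=0.5, seed=0):
    """Divide-and-conquer: run the selector on each chunk, then union the picks."""
    classes = np.unique(y)
    order = np.random.default_rng(seed).permutation(len(X))
    selected = []
    for chunk in np.array_split(order, n_chunks):
        idx = select_chunk(X[chunk], y[chunk], classes, keep_ratio)
        selected.extend(chunk[idx].tolist())
    return np.array(sorted(selected))
```

Because each chunk is scored independently, the quadratic cost of the leave-one-out scoring applies only within a chunk rather than across the whole dataset, which is the efficiency argument the summary makes for the divide-and-conquer step.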
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2022.3233598