Investigating the impact of data normalization on classification performance

Bibliographic Details
Published in Applied Soft Computing Vol. 97; p. 105524
Main Authors Singh, Dalwinder; Singh, Birmohan
Format Journal Article
Language English
Published Elsevier B.V. 01.12.2020
Summary: Data normalization is a pre-processing approach in which the data is either scaled or transformed so that each feature contributes equally. The success of machine learning algorithms depends on the quality of the data used to obtain a generalized predictive model of the classification problem. The importance of data normalization for improving data quality, and subsequently the performance of machine learning algorithms, has been presented in many studies. However, such work is lacking for feature selection and feature weighting approaches, a current research trend in machine learning for improving performance. Therefore, this study investigates the impact of fourteen data normalization methods on classification performance considering the full feature set, feature selection, and feature weighting. In this paper, we also present a modified Ant Lion optimization that searches for feature subsets and the best feature weights along with the parameter of the Nearest Neighbor Classifier. Experiments are performed on 21 publicly available real and synthetic datasets, and results are analyzed based on accuracy, the percentage of features reduced, and runtime. The results show that no single method outperforms the others. Therefore, we suggest a set of the best and the worst methods, combining the normalization procedure and empirical analysis of results. The better performers are z-Score and Pareto Scaling for the full feature set and feature selection, and tanh and its variant for feature weighting. The worst performers are Mean Centered, Variable Stability Scaling, and Median and Median Absolute Deviation methods, along with un-normalized data.
• The impact of data normalization on classification performance is investigated empirically.
• Full feature set, feature selection and feature weighting are used for empirical analysis.
• A modified Ant Lion Optimization algorithm is presented for searching optimal solutions.
• A set of best and worst normalization methods are identified and recommended.
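
For orientation, three of the methods named in the summary (Mean Centered, z-Score, and Pareto Scaling) can be sketched per feature column as follows. This is a minimal illustration using the commonly cited textbook definitions, not the authors' implementation. The function names, the sample column, and the choice of population standard deviation are all assumptions made for this example.

```python
import math
import statistics


def mean_center(column):
    """Mean Centered: subtract the column mean; the spread is left unchanged."""
    mu = statistics.fmean(column)
    return [x - mu for x in column]


def z_score(column):
    """z-Score: subtract the mean, then divide by the standard deviation."""
    mu = statistics.fmean(column)
    # Assumption: population standard deviation; a sample estimate is also common.
    sigma = statistics.pstdev(column)
    # A constant column has sigma == 0 and would raise ZeroDivisionError here.
    return [(x - mu) / sigma for x in column]


def pareto_scale(column):
    """Pareto Scaling: like z-Score, but divide by the square root of the std."""
    mu = statistics.fmean(column)
    sigma = statistics.pstdev(column)
    return [(x - mu) / math.sqrt(sigma) for x in column]


# Hypothetical feature column with mean 5 and population std 2.
feature = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

centered = mean_center(feature)
standardized = z_score(feature)
pareto = pareto_scale(feature)
```

Mean centering only shifts a feature, so features on large scales still dominate a distance-based classifier such as Nearest Neighbor, whereas z-Score and Pareto Scaling also shrink the spread. That difference is one plausible reading of why the summary ranks these methods differently.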
ISSN: 1568-4946, 1872-9681
DOI: 10.1016/j.asoc.2019.105524