Investigating the impact of data normalization on classification performance

Bibliographic Details
Published in	Applied Soft Computing, Vol. 97, p. 105524
Main Authors	Singh, Dalwinder; Singh, Birmohan
Format	Journal Article
Language	English
Published	Elsevier B.V., 01.12.2020
Subjects	Ant lion optimization; Data normalization; Feature selection; Feature weighting; k-NN classifier

More Information
Summary:	Data normalization is a pre-processing approach in which the data is either scaled or transformed so that each feature contributes equally. The success of machine learning algorithms depends on the quality of the data used to obtain a generalized predictive model of the classification problem. Many studies have shown the importance of data normalization for improving data quality and, subsequently, the performance of machine learning algorithms. However, such work is lacking for feature selection and feature weighting approaches, a current research trend in machine learning for improving performance. Therefore, this study investigates the impact of fourteen data normalization methods on classification performance, considering the full feature set, feature selection, and feature weighting. In this paper, we also present a modified Ant Lion Optimization algorithm that searches for feature subsets and the best feature weights along with the parameter of the Nearest Neighbor classifier. Experiments are performed on 21 publicly available real and synthetic datasets, and the results are analyzed in terms of accuracy, percentage of features reduced, and runtime. The results show that no single method outperforms the others. Therefore, we suggest a set of the best and the worst methods by combining the normalization procedure with an empirical analysis of the results. The better performers are z-Score and Pareto Scaling for the full feature set and feature selection, and tanh and its variant for feature weighting. The worst performers are the Mean Centered, Variable Stability Scaling, and Median and Median Absolute Deviation methods, along with un-normalized data.
• The impact of data normalization on classification performance is investigated empirically.
• Full feature set, feature selection and feature weighting are used for empirical analysis.
• A modified Ant Lion Optimization algorithm is presented for searching optimal solutions.
• A set of best and worst normalization methods is identified and recommended.
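
As a rough illustration of some of the normalization methods named in the summary, the following Python sketch applies common textbook definitions to a single feature column. The function names, the use of the population standard deviation, and the simplified tanh form (mean and standard deviation in place of robust estimators, with a 0.01 scaling constant) are assumptions made for illustration; the exact formulas and parameters used in the article may differ.

```python
import math
import statistics


def mean_center(values):
    """Mean Centered: subtract the feature mean; the spread is unchanged."""
    mu = statistics.fmean(values)
    return [x - mu for x in values]


def z_score(values):
    """z-Score: zero mean and unit (population) standard deviation.

    Assumes the feature is not constant (standard deviation > 0).
    """
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(x - mu) / sigma for x in values]


def pareto_scale(values):
    """Pareto Scaling: like z-Score, but divide by the square root of
    the standard deviation, so large-variance features are damped less."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(x - mu) / math.sqrt(sigma) for x in values]


def median_mad(values):
    """Median and Median Absolute Deviation (MAD) normalization.

    Assumes the MAD is non-zero.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(x - med) for x in values])
    return [(x - med) / mad for x in values]


def tanh_scale(values):
    """Simplified tanh normalization mapping values into (0, 1).

    Assumption: mean and standard deviation are used as the location
    and scale estimates, with the commonly quoted constant 0.01.
    """
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [0.5 * (math.tanh(0.01 * (x - mu) / sigma) + 1.0) for x in values]


if __name__ == "__main__":
    # Hypothetical feature column, not taken from the article's datasets.
    feature = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
    for scale in (mean_center, z_score, pareto_scale, median_mad, tanh_scale):
        print(scale.__name__, scale(feature))
```

Each function would be applied per feature before a distance-based classifier such as k-NN, since unscaled features with large ranges otherwise dominate the distance computation.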
ISSN:	1568-4946, 1872-9681
DOI:	10.1016/j.asoc.2019.105524