Interpretable linear dimensionality reduction based on bias-variance analysis
| Published in | *Data Mining and Knowledge Discovery*, Vol. 38, No. 4, pp. 1713–1781 |
| --- | --- |
| Main Authors | |
| Format | Journal Article |
| Language | English |
| Published | New York: Springer US, 01.07.2024 (Springer Nature B.V.) |
Summary: One of the central issues of several machine learning applications on real data is the choice of the input features. Ideally, the designer should select a small number of relevant, nonredundant features that preserve the complete information contained in the original dataset, with little collinearity among them. This procedure helps mitigate problems like overfitting and the curse of dimensionality, which arise when dealing with high-dimensional problems. On the other hand, it is not desirable to simply discard some features, since they may still contain information that can be exploited to improve results. Instead, *dimensionality reduction* techniques are designed to limit the number of features in a dataset by projecting them into a lower-dimensional space, possibly considering all the original features. However, the projected features resulting from the application of dimensionality reduction techniques are usually difficult to interpret. In this paper, we seek to design a principled dimensionality reduction approach that maintains the interpretability of the resulting features. Specifically, we propose a bias-variance analysis for linear models and leverage these theoretical results to design an algorithm, *Linear Correlated Features Aggregation* (LinCFA), which aggregates groups of continuous features with their average if their correlation is "sufficiently large". In this way, all features are considered, the dimensionality is reduced, and the interpretability is preserved. Finally, we provide numerical validations of the proposed algorithm, both on synthetic datasets to confirm the theoretical results and on real datasets to show some promising applications.
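To make the aggregation idea described in the summary concrete, here is a minimal Python sketch: greedily group continuous features whose pairwise Pearson correlation exceeds a threshold, then replace each group with its average. The fixed threshold `tau` is a hypothetical stand-in for the bias-variance criterion the paper derives; this is an illustration of the idea under that simplifying assumption, not the authors' LinCFA implementation.

```python
import numpy as np

def aggregate_correlated_features(X, tau=0.9):
    """Greedy correlation-based aggregation (illustrative sketch).

    Groups features whose Pearson correlation with a seed feature is at
    least `tau`, then replaces each group with its column-wise average.
    `tau` is a hand-picked stand-in for the bias-variance threshold
    derived in the paper.
    """
    corr = np.corrcoef(X, rowvar=False)       # pairwise Pearson correlations
    unassigned = set(range(X.shape[1]))
    groups = []
    while unassigned:
        seed = min(unassigned)                # next free feature as group seed
        group = sorted(j for j in unassigned if corr[seed, j] >= tau)
        unassigned -= set(group)              # seed's self-correlation is 1, so it is included
        groups.append(group)
    # One aggregated (and still interpretable) feature per group: the mean.
    Z = np.column_stack([X[:, g].mean(axis=1) for g in groups])
    return Z, groups

# Two latent factors, each observed through three noisy copies.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
X = latent[:, [0, 0, 0, 1, 1, 1]] + 0.2 * rng.normal(size=(500, 6))
Z, groups = aggregate_correlated_features(X, tau=0.8)
print(groups)    # expected grouping: [[0, 1, 2], [3, 4, 5]]
print(Z.shape)   # (500, 2)
```

The paper's contribution is a principled answer to when a correlation is "sufficiently large"; in this sketch that decision is collapsed into the single hyperparameter `tau`.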
ISSN: 1384-5810 (print); 1573-756X (electronic)
DOI: 10.1007/s10618-024-01015-0