A Novel Framework for Fast Feature Selection Based on Multi-Stage Correlation Measures

Bibliographic Details
Published in: Machine Learning and Knowledge Extraction, Vol. 4, no. 1, pp. 131-149
Main Authors: Garcia-Ramirez, Ivan-Alejandro; Calderon-Mora, Arturo; Mendez-Vazquez, Andres; Ortega-Cisneros, Susana; Reyes-Amezcua, Ivan
Format: Journal Article
Language: English
Published: Basel, MDPI AG, 01.03.2022
Summary: Datasets with thousands of features pose a challenge for many existing learning methods because of the well-known curse of dimensionality. Moreover, irrelevant and redundant features in a dataset can degrade the performance of any model on which training and inference are attempted. In addition, in large datasets, managing features manually tends to be impractical. This explains the growing interest, throughout the Machine Learning literature, in developing frameworks for the automatic discovery and removal of useless features. For this reason, in this paper we propose a novel framework for selecting relevant features in supervised datasets, based on a cascade of methods designed with speed and precision in mind. The framework consists of a novel combination of Approximated and Simulated Annealing versions of the Maximal Information Coefficient (MIC) to generalize the simple linear relation between features. The process is performed in a series of steps, applying the MIC algorithms and cutoff strategies to remove irrelevant and redundant features. The framework is also designed to achieve a balance between accuracy and speed. To test the performance of the proposed framework, a series of experiments is conducted on a large battery of datasets, from SPECTF Heart to Sonar data. The results show the balance of accuracy and speed that the proposed framework can achieve.
ISSN: 2504-4990
DOI: 10.3390/make4010007
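
Illustration: The summary describes a cascade that first removes irrelevant features and then removes redundant ones, each step using a dependence score and a cutoff. The sketch below shows that general two-stage pattern only; it is not the authors' implementation. As a loudly stated assumption, it swaps in the absolute Pearson correlation for the MIC variants named in the summary, and the function names (`abs_pearson`, `select_features`) and both cutoff values are hypothetical choices made for this example.

```python
import numpy as np


def abs_pearson(x, y):
    """Stand-in dependence score in [0, 1].

    Assumption: the summary names MIC variants; absolute Pearson
    correlation is used here only to keep the sketch dependency-free.
    """
    x = x - x.mean()
    y = y - y.mean()
    denom = np.sqrt((x @ x) * (y @ y))
    return 0.0 if denom == 0 else abs(float(x @ y) / denom)


def select_features(X, y, relevance_cutoff=0.2, redundancy_cutoff=0.9,
                    score=abs_pearson):
    """Two-stage filter: drop irrelevant features, then redundant ones.

    Both cutoffs are hypothetical values chosen for illustration.
    """
    n_features = X.shape[1]

    # Stage 1 (relevance): keep features whose score against the
    # target reaches the cutoff, ordered from most to least relevant.
    relevance = np.array([score(X[:, j], y) for j in range(n_features)])
    candidates = [j for j in np.argsort(-relevance)
                  if relevance[j] >= relevance_cutoff]

    # Stage 2 (redundancy): walk the candidates greedily and keep one
    # only if it is not too strongly related to a feature already kept.
    selected = []
    for j in candidates:
        if all(score(X[:, j], X[:, k]) < redundancy_cutoff for k in selected):
            selected.append(int(j))
    return selected


# Synthetic data: column 0 is informative, column 1 is a near-duplicate
# of column 0, and column 2 is pure noise.
rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=n)
X = np.column_stack([
    signal,
    signal + 0.01 * rng.normal(size=n),
    rng.normal(size=n),
])
y = signal + 0.1 * rng.normal(size=n)

selected = select_features(X, y)
print(selected)
```

On this synthetic data the noise column should fail the relevance cutoff, and only one of the two near-duplicate columns should survive the redundancy step. Passing a different `score` callable (for example, an MIC estimator) would change the measure without changing the cascade structure.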