Single and Ensemble Based Filters in Environmental Data

ABSTRACT Researchers rely on species distribution models (SDMs) to establish a correlation between species occurrence records and environmental data. These models offer insights into the ecological and evolutionary aspects of the subject. Feature selection (FS) aims to choose useful interlinked feat...

Full description

Saved in:

Bibliographic Details
Published in	Expert systems Vol. 42; no. 7
Main Authors	Cherif, Yousra, Idri, Ali
Format	Journal Article
Language	English
Published	Oxford Blackwell Publishing Ltd 01.07.2025
Subjects	Accuracy Datasets Decision trees Ensemble learning environmental data Feature selection machine learning and classification Multivariate analysis multivariate filters Performance measurement redstart birds species distribution models univariate filters wheatear birds
Online Access	Get full text
ISSN	0266-4720 1468-0394
DOI	10.1111/exsy.70076

Cover

Loading…

More Information
Summary:	ABSTRACT Researchers rely on species distribution models (SDMs) to establish a correlation between species occurrence records and environmental data. These models offer insights into the ecological and evolutionary aspects of the subject. Feature selection (FS) aims to choose useful interlinked features or remove unnecessary and redundant ones and make the induced model easier to understand. Although feature selection plays a crucial role in SDMs, only a limited number of studies in the literature have addressed it with several key shortcomings such as lack of the use of multivariate techniques, lack of comparison between the univariate and the multivariate filters, and absence of a comparison between the ensemble univariate and multivariate filters. Therefore, this study presents a rigorous empirical evaluation consisting of assessing and comparing six filter‐based univariate feature selection methods using two thresholds with two multivariate techniques, as well as four classifiers: Extreme Gradient boosting (XGB), Random Forest (RF), Decision Tree (DT), and Light gradient‐boosting machine (LGBM). Furthermore, the current study proposes a novel approach for ensemble construction consisting of evaluating the applications of ensemble learning using 40% of features ranked by means of Borda Count and Reciprocal Rank (univariate filter ensembles) as well as the fusion‐based and the intersection‐based ensembles (multivariate filter ensembles). Moreover, we evaluated and compared the performances of univariate and multivariate techniques with their ensembles. Similarly, we evaluated and compared the performances of the best ensemble techniques across datasets. The empirical evaluations involve several techniques, such as the 5‐fold cross‐validation method, the Scott Knott (SK) test, and Borda Count. In addition, we used three performance metrics (accuracy, Kappa, and F1‐score). Experiments showed that Consistency‐based subset selection in conjunction with RF outperformed all other univariate and multivariate FS techniques with an accuracy value of 91.63% across all datasets. However, Fisher score trained with RF was the best choice when considering the number of features. Moreover, the univariate or multivariate based ensembles, in general, outperformed their singles. In addition, when comparing the univariate and multivariate ensembles, the fusion‐based ensemble outperformed all other ensembles achieving an accuracy of 91.77% when using RF across datasets. Nevertheless, in terms of performance and number of features, the ensemble constructed using Reciprocal Rank performed better than all other FS techniques regardless of the classifier used. It achieved an accuracy of 91.61% across datasets when using RF.
Bibliography:	The authors received no specific funding for this work. Funding ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 14
ISSN:	0266-4720 1468-0394
DOI:	10.1111/exsy.70076