Examining characteristics of predictive models with imbalanced big data
Published in | Journal of Big Data, Vol. 6, No. 1, pp. 1–21
Main Authors |
Format | Journal Article
Language | English
Published | Cham: Springer International Publishing (Springer Nature B.V.; SpringerOpen), 31.07.2019
Summary | High class imbalance between majority and minority classes in datasets can skew the performance of Machine Learning algorithms and bias predictions in favor of the majority (negative) class. This bias, for cases where the minority (positive) class is of greater interest and the occurrence of false negatives is costlier than false positives, may result in adverse consequences. Our paper presents two case studies, each utilizing a unique, combined approach of Random Undersampling and Feature Selection to investigate the effect of class imbalance on big data analytics. Random Undersampling is used to generate six class distributions ranging from balanced to moderately imbalanced, and Feature Importance is used as our Feature Selection method. Classification performance was reported for the Random Forest, Gradient-Boosted Trees, and Logistic Regression learners, as implemented within the Apache Spark framework. The first case study utilized a training dataset and a test dataset from the ECBDL’14 bioinformatics competition. The training and test datasets contain about 32 million and 2.9 million instances, respectively. For the first case study, Gradient-Boosted Trees obtained the best results, with either a feature set of 60 or the full set, and a negative-to-positive ratio of either 45:55 or 40:60. The second case study, unlike the first, included training data from one source (POST dataset) and test data from a separate source (Slowloris dataset), where POST and Slowloris are two types of Denial of Service attacks. The POST dataset contains about 1.7 million instances, while the Slowloris dataset contains about 0.2 million. For the second case study, Logistic Regression obtained the best results, with a feature set of 5 and any of the following negative-to-positive ratios: 40:60, 45:55, 50:50, 65:35, and 75:25. We conclude that combining Feature Selection with Random Undersampling improves the classification performance of learners trained on imbalanced big data from different application domains.
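The Random Undersampling step described in the summary can be illustrated with a minimal PySpark sketch, since the experiments ran on Apache Spark. This is a sketch of the general technique, not the authors' code: the binary `label` column (1.0 marking the positive, minority class), the placeholder Parquet path, and the helper `random_undersample` are assumptions for illustration.

```python
# Minimal PySpark sketch of Random Undersampling to a target class ratio.
# Assumed (not from the paper): a binary "label" column with 1.0 = positive
# (minority) class, and a hypothetical Parquet input path.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rus-sketch").getOrCreate()
train = spark.read.parquet("train.parquet")  # hypothetical training data

def random_undersample(df, neg_to_pos, seed=42):
    """Keep every positive instance and randomly subsample negatives so the
    negative-to-positive ratio approximates neg_to_pos (e.g. 45/55)."""
    pos = df.filter(df.label == 1.0)   # minority (positive) class
    neg = df.filter(df.label == 0.0)   # majority (negative) class
    target_neg = pos.count() * neg_to_pos          # negatives to retain
    fraction = min(1.0, target_neg / neg.count())  # Bernoulli sampling fraction
    return pos.unionByName(
        neg.sample(withReplacement=False, fraction=fraction, seed=seed))

# Five of the six class distributions are named in the summary. Because
# DataFrame.sample draws rows independently, the achieved ratio is close to,
# not exactly, the target; at this scale that is usually acceptable.
ratios = [(40, 60), (45, 55), (50, 50), (65, 35), (75, 25)]
distributions = {f"{n}:{p}": random_undersample(train, n / p)
                 for n, p in ratios}
```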
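A companion sketch for the Feature Importance-based Feature Selection step and the three learners named in the summary, using Spark MLlib's Python API, continues from the `train` DataFrame above. The `VectorSlicer`-based top-k slicing, the column names, and k = 60 (one feature-set size from the first case study) are illustrative assumptions rather than the authors' exact pipeline.

```python
# Sketch of Feature Importance as the Feature Selection method, then fitting
# the three learners compared in the paper. Continues from `train` above;
# assumes all non-"label" columns are numeric feature columns.
from pyspark.ml.feature import VectorAssembler, VectorSlicer
from pyspark.ml.classification import (RandomForestClassifier, GBTClassifier,
                                       LogisticRegression)

feature_cols = [c for c in train.columns if c != "label"]
assembled = VectorAssembler(inputCols=feature_cols,
                            outputCol="features").transform(train)

# Rank features by Random Forest's impurity-based importances.
importances = RandomForestClassifier(
    labelCol="label", featuresCol="features").fit(assembled).featureImportances

k = 60  # e.g. the 60-feature set from the first case study
top_k = sorted(range(len(feature_cols)), key=lambda i: importances[i],
               reverse=True)[:k]

# Keep only the top-k features, then fit the three Spark MLlib learners.
reduced = VectorSlicer(inputCol="features", outputCol="selected",
                       indices=top_k).transform(assembled)
learners = {
    "RandomForest": RandomForestClassifier(labelCol="label",
                                           featuresCol="selected"),
    "GradientBoostedTrees": GBTClassifier(labelCol="label",
                                          featuresCol="selected"),
    "LogisticRegression": LogisticRegression(labelCol="label",
                                             featuresCol="selected"),
}
models = {name: est.fit(reduced) for name, est in learners.items()}
```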
ISSN | 2196-1115
DOI | 10.1186/s40537-019-0231-2