A statistical analysis of intrinsic bias of network security datasets for training machine learning mechanisms
Published in | Annales des télécommunications Vol. 77; no. 7-8; pp. 555-571 |
---|---|
Main Authors | , , , , |
Format | Journal Article |
Language | English |
Published | Cham, Springer International Publishing, 01.08.2022, Springer Nature B.V |
Subjects | |
Summary: | Machine learning mechanisms for network intrusion detection systems lack accurate evaluation, comparison, and deployment due to the scarcity of well-constructed datasets. In this paper, we propose a statistical analysis of the features contained in four highly used security datasets. We conclude that the analyzed datasets should not be used as a benchmark for creating novel anomaly-based mechanisms for intrusion detection systems. The analyzed datasets introduce a biased classification since features are over-correlated, and most of the features are capable of making a complete distinction between normal and attack flows. Our proposed methodology analyzes the correlation among features instead of checking for redundant values or data imbalance. The results align with the performance of three machine learning techniques. We show that biased classification occurs due to a significant difference between attack and normal data. The syntactically generated features are statistically different between normal and attack classes, which implies overfitting in the machine learning approaches. |
---|---|
ISSN: | 0003-4347 1958-9395 |
DOI: | 10.1007/s12243-021-00904-5 |
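
The summary describes two checks: whether features are over-correlated, and whether a single feature can completely separate normal from attack flows. The sketch below illustrates both ideas on toy data. It is not the authors' methodology: the function names, the 0.9 correlation threshold, the use of Pearson correlation, the range-overlap test, and the feature values are all assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlated_pairs(features, threshold=0.9):
    """Return feature-name pairs whose absolute correlation meets the threshold.

    `features` maps a feature name to its list of per-flow values.
    The threshold is an illustrative assumption, not a value from the paper.
    """
    return [
        (a, b)
        for a, b in combinations(sorted(features), 2)
        if abs(pearson(features[a], features[b])) >= threshold
    ]


def separating_features(features, labels):
    """Return features whose value ranges for normal (0) and attack (1) flows do not overlap."""
    result = []
    for name in sorted(features):
        normal = [v for v, y in zip(features[name], labels) if y == 0]
        attack = [v for v, y in zip(features[name], labels) if y == 1]
        if max(normal) < min(attack) or max(attack) < min(normal):
            result.append(name)
    return result


# Hypothetical toy flows: three normal (label 0) followed by three attack (label 1).
toy_features = {
    "duration": [1.0, 2.0, 3.0, 10.0, 11.0, 12.0],
    "bytes": [10.0, 20.0, 30.0, 100.0, 110.0, 120.0],
    "ttl": [64.0, 128.0, 64.0, 128.0, 64.0, 128.0],
}
toy_labels = [0, 0, 0, 1, 1, 1]

pairs = correlated_pairs(toy_features)
separators = separating_features(toy_features, toy_labels)
print(pairs)
print(separators)
```

In this toy data, a feature flagged by `separating_features` would let even a trivial threshold rule classify every flow correctly, which is the kind of dataset-induced bias the summary warns about.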