A statistical analysis of intrinsic bias of network security datasets for training machine learning mechanisms

Bibliographic Details
Published in	Annales des télécommunications, Vol. 77, no. 7-8, pp. 555-571
Main Authors	Silva, João Vitor V., de Oliveira, Nicollas R., Medeiros, Dianne S. V., Lopez, Martin Andreoni, Mattos, Diogo M. F.
Format	Journal Article
Language	English
Published	Cham: Springer International Publishing, 01.08.2022; Springer Nature B.V.
Subjects	Circuits; Classification; Communications Engineering; Computer Communication Networks; Datasets; Engineering; Information and Communication; Information Systems and Communication Service; Intrusion detection systems; Machine learning; Networks; R & D/Technology Policy; Security; Signal, Image and Speech Processing; Statistical analysis; Hypothesis testing; Network security; Statistics

Summary:	Machine learning mechanisms for network intrusion detection systems lack accurate evaluation, comparison, and deployment due to the scarcity of well-constructed datasets. In this paper, we propose a statistical analysis of the features contained in four highly used security datasets. We conclude that the analyzed datasets should not be used as a benchmark for creating novel anomaly-based mechanisms for intrusion detection systems. The analyzed datasets introduce a biased classification since features are over-correlated, and most of the features are capable of making a complete distinction between normal and attack flows. Our proposed methodology analyzes the correlation among features instead of checking for redundant values or data imbalance. The results align with the performance of three machine learning techniques. We show that biased classification occurs due to a significant difference between attack and normal data. The syntactically generated features are statistically different between normal and attack classes, which implies overfitting in the machine learning approaches.
ISSN:	0003-4347 1958-9395
DOI:	10.1007/s12243-021-00904-5
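
Illustrative sketch (not taken from the paper): the summary describes two checks, correlation among features and a statistical difference between normal and attack flows. The Python below shows one possible way to run both on a toy table. The feature names, the sample values, the 0.9 correlation threshold, and the choice of the two-sample Kolmogorov-Smirnov statistic are all assumptions made for illustration; the record does not say which tests or thresholds the authors used.

```python
from bisect import bisect_right
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant feature: correlation is undefined, treated as 0 here
    return cov / sqrt(var_x * var_y)


def highly_correlated_pairs(features, threshold=0.9):
    """Return feature-name pairs whose absolute correlation reaches the threshold."""
    return [
        (a, b)
        for a, b in combinations(sorted(features), 2)
        if abs(pearson(features[a], features[b])) >= threshold
    ]


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for v in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, v) / len(a)
        cdf_b = bisect_right(b, v) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Hypothetical toy flows: three normal, three attack (values invented for the example).
features = {
    "duration": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "bytes_sent": [10.0, 21.0, 29.0, 41.0, 50.0, 61.0],
    "dst_port": [80.0, 443.0, 80.0, 22.0, 443.0, 80.0],
}
labels = ["normal", "normal", "normal", "attack", "attack", "attack"]

correlated = highly_correlated_pairs(features)

# Per-feature class separation: a statistic of 1.0 means the two classes do not overlap.
separation = {}
for name, values in features.items():
    normal = [v for v, lab in zip(values, labels) if lab == "normal"]
    attack = [v for v, lab in zip(values, labels) if lab == "attack"]
    separation[name] = ks_statistic(normal, attack)

print("Highly correlated pairs:", correlated)
print("KS statistic per feature:", separation)
```

A feature whose statistic equals 1.0 separates the two classes on its own, which is the kind of property the summary identifies as a source of biased classification. On real data, a library routine such as `scipy.stats.ks_2samp` would also supply a p-value.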