A statistical analysis of intrinsic bias of network security datasets for training machine learning mechanisms

Bibliographic Details
Published in: Annales des télécommunications, Vol. 77, no. 7-8, pp. 555-571
Main Authors: Silva, João Vitor V.; de Oliveira, Nicollas R.; Medeiros, Dianne S. V.; Lopez, Martin Andreoni; Mattos, Diogo M. F.
Format: Journal Article
Language: English
Published: Cham, Springer International Publishing, 01.08.2022
Springer Nature B.V.
Summary: Machine learning mechanisms for network intrusion detection systems lack accurate evaluation, comparison, and deployment due to the scarcity of well-constructed datasets. In this paper, we propose a statistical analysis of the features contained in four highly used security datasets. We conclude that the analyzed datasets should not be used as a benchmark for creating novel anomaly-based mechanisms for intrusion detection systems. The analyzed datasets introduce a biased classification since features are over-correlated, and most of the features are capable of making a complete distinction between normal and attack flows. Our proposed methodology analyzes the correlation among features instead of checking for redundant values or data imbalance. The results align with the performance of three machine learning techniques. We show that biased classification occurs due to a significant difference between attack and normal data. The syntactically generated features are statistically different between normal and attack classes, which implies overfitting in the machine learning approaches.
ISSN: 0003-4347, 1958-9395
DOI: 10.1007/s12243-021-00904-5
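
The summary describes two checks: whether features are over-correlated, and whether a single feature can completely separate normal from attack flows. The sketch below illustrates both ideas on toy data. It is not the authors' method: the record does not say which correlation measure they use, so Pearson correlation is an assumption here, as are the 0.9 threshold, the function names, and the feature names and values.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant feature: correlation is undefined, treated as 0 here
    return cov / sqrt(var_x * var_y)


def over_correlated_pairs(features, threshold=0.9):
    """Return (name_a, name_b, r) for feature pairs with |r| >= threshold.

    The threshold is a hypothetical cut-off chosen for illustration.
    """
    pairs = []
    for a, b in combinations(sorted(features), 2):
        r = pearson(features[a], features[b])
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return pairs


def separating_features(features, labels):
    """Return features whose value ranges for normal (0) and attack (1) do not overlap."""
    result = []
    for name, values in sorted(features.items()):
        normal = [v for v, y in zip(values, labels) if y == 0]
        attack = [v for v, y in zip(values, labels) if y == 1]
        if not normal or not attack:
            continue  # need both classes to compare ranges
        if max(normal) < min(attack) or max(attack) < min(normal):
            result.append(name)
    return result


# Toy flows (made-up values): the first three are normal, the last three are attacks.
features = {
    "duration": [1, 2, 3, 10, 11, 12],
    "bytes": [10, 20, 30, 100, 110, 120],
    "ttl": [64, 128, 64, 128, 64, 128],
}
labels = [0, 0, 0, 1, 1, 1]

correlated = over_correlated_pairs(features)
separating = separating_features(features, labels)
```

In this toy data, `bytes` and `duration` are flagged as over-correlated, and each of them separates the two classes on its own, while `ttl` does neither. A classifier trained on such data could score well by thresholding a single feature, which is the kind of bias the summary warns about.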