Concept Drift Detection in Phishing Using Autoencoders

Bibliographic Details
Published in: Machine Learning and Metaheuristics Algorithms, and Applications, Vol. 1366, pp. 208-220
Main Authors: Menon, Aditya Gopal; Gressel, Gilad
Format: Book Chapter
Language: English
Published: Singapore: Springer Singapore, 2021
Series: Communications in Computer and Information Science
ISBN: 9811604185, 9789811604188
ISSN: 1865-0929, 1865-0937
DOI: 10.1007/978-981-16-0419-5_17

Summary: When machine learning models are built on non-stationary data, their performance naturally degrades over time because of concept drift, that is, shifts in the underlying distribution of the data. A common solution is to retrain the model, which can be expensive both in obtaining new labeled data and in compute time. Traditionally, many approaches to concept drift detection operate on streaming data. However, drift is also prevalent in semi-stationary data such as web data, social media, and any data set generated from human behavior. Changing web technology causes concept drift in the website data used by phishing detection models. In this work, we create “Autoencoder Drift Detection” (ADD), an unsupervised drift detection mechanism suited to semi-stationary data. We use the reconstruction error of the autoencoder as a proxy for detecting concept drift. We apply ADD to a phishing detection data set that contains drift because it was collected over one year. We also show that ADD is competitive, within ±24%, with popular streaming drift detection algorithms on benchmark drift datasets. The average accuracy on the phishing data set is 0.473 without drift detection and rises to 0.648 with ADD.
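
The idea of using reconstruction error as a drift proxy can be illustrated with a small sketch. This is not the chapter's ADD implementation: as a stand-in for a trained neural autoencoder it uses a one-unit, tied-weight linear autoencoder (equivalent to the first principal component), and the names `fit_linear_autoencoder`, `reconstruction_error`, `drift_detected`, the toy data, and the mean-plus-`k`-standard-deviations threshold are all assumptions made for illustration.

```python
import math
import random
import statistics


def fit_linear_autoencoder(rows, iters=50):
    """Fit a 1-unit, tied-weight linear autoencoder by power iteration.

    Assumption: a linear stand-in replaces the neural autoencoder; its
    optimum spans the first principal component of the reference data.
    """
    d = len(rows[0])
    mean = [sum(r[j] for r in rows) / len(rows) for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    w = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        codes = [sum(c[j] * w[j] for j in range(d)) for c in centered]
        new_w = [sum(z * c[j] for z, c in zip(codes, centered)) for j in range(d)]
        norm = math.sqrt(sum(v * v for v in new_w))
        w = [v / norm for v in new_w]
    return mean, w


def reconstruction_error(row, model):
    """Squared error between a row and its encode/decode round trip."""
    mean, w = model
    c = [x - m for x, m in zip(row, mean)]
    code = sum(ci * wi for ci, wi in zip(c, w))                 # encode
    return sum((ci - code * wi) ** 2 for ci, wi in zip(c, w))   # decode and compare


def drift_detected(reference_errors, batch, model, k=3.0):
    """Flag drift when the batch's mean error exceeds a reference threshold.

    Assumption: the threshold (mean + k * stdev of reference errors) is an
    illustrative rule, not the decision rule used in the chapter.
    """
    threshold = statistics.mean(reference_errors) + k * statistics.stdev(reference_errors)
    batch_error = statistics.mean([reconstruction_error(r, model) for r in batch])
    return batch_error > threshold


rng = random.Random(0)


def sample(slope, n):
    """Toy 2-D data: the second feature is slope * first feature plus noise."""
    rows = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        rows.append([x, slope * x + rng.gauss(0.0, 0.05)])
    return rows


reference = sample(2.0, 500)                 # data the model was built on
model = fit_linear_autoencoder(reference)
reference_errors = [reconstruction_error(r, model) for r in reference]

same_concept = sample(2.0, 200)              # new batch, same relationship
drifted = sample(-2.0, 200)                  # new batch, changed relationship

print("same concept flagged:", drift_detected(reference_errors, same_concept, model))
print("drifted batch flagged:", drift_detected(reference_errors, drifted, model))
```

No labels are used anywhere in the check, which is what makes this style of detection unsupervised: only the features of the new batch are compared against what the model learned to reconstruct.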