Concept Drift Detection in Phishing Using Autoencoders

Bibliographic Details
Published in	Machine Learning and Metaheuristics Algorithms, and Applications, Vol. 1366, pp. 208-220
Main Authors	Menon, Aditya Gopal; Gressel, Gilad
Format	Book Chapter
Language	English
Published	Singapore: Springer Singapore, 2021
Series	Communications in Computer and Information Science
Subjects	Autoencoders; Concept drift; Machine learning; Phishing
ISBN	9811604185 9789811604188
ISSN	1865-0929 1865-0937
DOI	10.1007/978-981-16-0419-5_17

Summary:	When machine learning models are built on non-stationary data, their performance naturally degrades over time due to concept drift, that is, shifts in the underlying distribution of the data. A common solution is to retrain the model, which can be expensive both in obtaining new labeled data and in compute time. Traditionally, many approaches to concept drift detection operate on streaming data. However, drift is also prevalent in semi-stationary data such as web data, social media, and any data set generated from human behavior. Changing web technology causes concept drift in the website data used by phishing detection models. In this work, we create “Autoencoder Drift Detection” (ADD), an unsupervised drift detection mechanism suitable for semi-stationary data. We use the reconstruction error of the autoencoder as a proxy to detect concept drift. We use ADD to detect drift in a phishing detection data set that contains drift because it was collected over one year. We also show that ADD is competitive, within ±24%, with popular streaming drift detection algorithms on benchmark drift datasets. The average accuracy on the phishing data set is 0.473 without drift detection and increases to 0.648 with ADD.
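
Illustration: the summary states that ADD uses an autoencoder's reconstruction error as a proxy for concept drift. The sketch below shows that general idea only and is not the authors' implementation. This record does not give the network architecture, the data, or the decision rule, so everything here is an assumption: a linear autoencoder fit in closed form with SVD (equivalent to PCA) stands in for a trained neural autoencoder, the data are synthetic, and the names `fit_linear_autoencoder`, `reconstruction_error`, `drift_detected`, and the `n_sigmas` threshold are hypothetical.

```python
import numpy as np


def fit_linear_autoencoder(X, k):
    """Fit a k-dimensional linear autoencoder in closed form (equivalent to PCA)."""
    mean = X.mean(axis=0)
    # Rows of vt are principal directions; the top k serve as tied
    # encoder/decoder weights.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]


def reconstruction_error(X, mean, components):
    """Per-sample mean squared error after encoding and decoding X."""
    centered = X - mean
    reconstructed = centered @ components.T @ components
    return ((centered - reconstructed) ** 2).mean(axis=1)


def drift_detected(reference_errors, batch_errors, n_sigmas=3.0):
    """Assumed rule: flag drift when the batch's mean error exceeds the
    reference mean by more than n_sigmas reference standard deviations."""
    threshold = reference_errors.mean() + n_sigmas * reference_errors.std()
    return bool(batch_errors.mean() > threshold)


# Synthetic data: 10 features generated from 2 latent factors plus noise.
rng = np.random.default_rng(0)
basis_old = rng.normal(size=(2, 10))  # feature structure before drift
basis_new = rng.normal(size=(2, 10))  # feature structure after drift


def sample(basis, n):
    latent = rng.normal(size=(n, 2))
    return latent @ basis + 0.05 * rng.normal(size=(n, 10))


train = sample(basis_old, 500)
held_out = sample(basis_old, 200)       # same distribution, unseen by the model
same_batch = sample(basis_old, 200)     # later batch without drift
drifted_batch = sample(basis_new, 200)  # later batch with drift

mean, components = fit_linear_autoencoder(train, k=2)
reference_errors = reconstruction_error(held_out, mean, components)

same_flag = drift_detected(
    reference_errors, reconstruction_error(same_batch, mean, components)
)
drift_flag = drift_detected(
    reference_errors, reconstruction_error(drifted_batch, mean, components)
)
print("no-drift batch flagged:", same_flag)
print("drifted batch flagged:", drift_flag)
```

Because the detector needs only reconstruction errors and no labels, it is unsupervised in the same sense the summary describes; a flagged batch would be the cue to collect labels and retrain the downstream classifier.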