Classification of Landing and Distribution Domains Using Whois' Text Mining

Detection of drive-by-download attack has gained a focus in security research since the attack has turned into the most popular and serious threat to web infrastructure. The attack exploits vulnerabilities in web browsers and their extensions for unnoticeably downloading malicious software. Often, t...

Full description

Saved in:
Bibliographic Details
Published in2017 IEEE Trustcom/BigDataSE/ICESS pp. 1 - 8
Main Authors Tran Phuong Thao, Yamada, Akira, Murakami, Kosuke, Urakawa, Jumpei, Sawaya, Yukiko, Kubota, Ayumu
Format Conference Proceeding
LanguageEnglish
Published IEEE 01.08.2017
Subjects
Online AccessGet full text

Cover

Loading…
More Information
Summary:Detection of drive-by-download attack has gained a focus in security research since the attack has turned into the most popular and serious threat to web infrastructure. The attack exploits vulnerabilities in web browsers and their extensions for unnoticeably downloading malicious software. Often, the victim is sent through a long chain of redirection operations in order to take down the offending pages. Concretely, the attack is triggered when a user visits a benign webpage that is compromised by the attacker (called landing page) and is inserted some malicious code inside. The user is then automatically redirected to an actual page that installs malware on the user's computer (called distribution page) without his/her consent or knowledge. While there is a large body of works targeting on detection of drive-by download attack, there is little attention on the redirection which is a crucial characteristic of the attack. In this paper, for the first time, we propose an approach to the classification of landing and distribution domains which are important components forming the head and tail of a redirection chain in the attack. The methodology in our approach is to use machine learning for text mining on the registered information of the domains called whois. We intensively implemented our approach with six popular supervised learning algorithms, compared the results and concluded that Linear-based Support Vector Machine and CART algorithm-based Decision Tree are the best models for our dataset which respectively give 98.55% and 99.28% of accuracy, 97.78% and 98.95% of F1 score, 98.35% and 99.45% of average precision.
ISSN:2324-9013
DOI:10.1109/Trustcom/BigDataSE/ICESS.2017.213