Adjusting lexical features of actual proxy logs for intrusion detection


Bibliographic Details
Published in Journal of information security and applications Vol. 50; p. 102408
Main Author Mimura, Mamoru
Format Journal Article
Language English
Published Elsevier Ltd 01.02.2020
Summary: Modern HTTP-based malware imitates benign traffic to evade detection. To detect unseen malicious traffic, we previously proposed a linguistic detection method for proxy logs. This method automatically extracts words as feature vectors using natural language processing techniques and discriminates between benign and malicious traffic. The previous method generates a corpus from all the extracted words, including trivial ones. To generate a discriminative feature representation, the corpus has to be summarized effectively. In actual proxy logs, benign traffic is dominant and crowds out the malicious feature representation, so the two classes are imbalanced. Moreover, a malicious paragraph might be mixed with some benign proxy logs. Therefore, the previous method is not accurate in a practical environment. This paper demonstrates that our previous method is not effective on actual proxy logs because of this imbalance. To mitigate the imbalance, our method adjusts the lexical features of actual proxy logs based on word importance. Unlike traditional sampling techniques, our method does not adjust the number of samples in each class. We performed cross-validation and timeline analysis with pcap files captured from Exploit Kit and with actual proxy logs. The experimental results show that our method can detect unseen malicious traffic in actual proxy logs. Moreover, we examine the effect of mixing in benign logs at various proportions. The best F-measure reaches 0.95 in the timeline analysis.
ISSN: 2214-2126
DOI: 10.1016/j.jisa.2019.102408
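
The summary describes adjusting lexical features by word importance rather than resampling the classes. This record does not say how importance is measured, so the following is only a minimal sketch under stated assumptions: raw token frequency stands in for importance, and the function names (`tokenize`, `top_words`, `build_vocabulary`, `adjust`), the parameter `k`, and the `<unk>` placeholder are all hypothetical and not taken from the paper.

```python
import re
from collections import Counter


def tokenize(log_line):
    """Split a proxy-log line into lowercase alphanumeric tokens."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", log_line.lower()) if t]


def top_words(lines, k):
    """Return up to k of the most frequent tokens in lines.

    Assumption: frequency is used here as a stand-in for word importance.
    """
    counts = Counter(tok for line in lines for tok in tokenize(line))
    return {word for word, _ in counts.most_common(k)}


def build_vocabulary(benign_lines, malicious_lines, k):
    """Take up to k words from each class separately.

    Selecting per class keeps the dominant benign class from
    filling the whole vocabulary with its own words.
    """
    return top_words(benign_lines, k) | top_words(malicious_lines, k)


def adjust(log_line, vocabulary, unknown="<unk>"):
    """Replace every token outside the vocabulary with a placeholder."""
    return [t if t in vocabulary else unknown for t in tokenize(log_line)]
```

Because the vocabulary is capped per class, the number of log lines in each class is left untouched; only the set of words that reach the feature extractor changes.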