Adjusting lexical features of actual proxy logs for intrusion detection
Published in | Journal of information security and applications Vol. 50; p. 102408 |
---|---|
Format | Journal Article |
Language | English |
Published | Elsevier Ltd, 01.02.2020 |
Summary: | Modern HTTP-based malware imitates benign traffic to evade detection. To detect unseen malicious traffic, we previously proposed a linguistic-based detection method for proxy logs. This method automatically extracts words as feature vectors using natural language techniques and discriminates between benign and malicious traffic. The previous method generates a corpus from all the extracted words, including trivial ones. To generate a discriminative feature representation, the corpus has to be effectively summarized. In actual proxy logs, benign traffic is dominant and crowds out the malicious feature representation, so the benign and malicious classes are imbalanced. Moreover, a malicious paragraph might be mixed with some benign proxy logs. Therefore, the previous method does not achieve good accuracy in practical environments. This paper demonstrates that our previous method is not effective on actual proxy logs because of the imbalance. To mitigate the imbalance, our method adjusts the lexical features of actual proxy logs based on word importance. Unlike traditional sampling techniques, our method does not adjust the number of samples in each class. We performed cross-validation and timeline analysis with pcap files captured from Exploit Kit and with actual proxy logs. The experimental results show that our method can detect unseen malicious traffic in actual proxy logs. Moreover, we examine the effect of mixing in benign logs at various proportions. The best F-measure reaches 0.95 in the timeline analysis. |
---|---|
ISSN: | 2214-2126 |
DOI: | 10.1016/j.jisa.2019.102408 |
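
The summary describes adjusting lexical features by word importance rather than resampling the classes. The record does not say how importance is computed, so the following is only a minimal sketch of the general idea, not the authors' method: it assumes per-class line frequency as a stand-in for word importance and keeps the same number of top words from each class, so that the sheer volume of benign logs cannot crowd malicious words out of the vocabulary. The function names (`tokenize`, `top_words`, `build_vocabulary`, `filter_line`) and the tokenizer's delimiter rule are hypothetical.

```python
import re
from collections import Counter


def tokenize(log_line):
    """Split a proxy-log line into lowercase alphanumeric tokens.

    Assumption: any non-alphanumeric character acts as a delimiter.
    """
    return [t for t in re.split(r"[^A-Za-z0-9]+", log_line.lower()) if t]


def top_words(lines, k):
    """Return the k words that occur in the most lines of one class.

    Assumption: line frequency is used as a simple proxy for word importance.
    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter()
    for line in lines:
        counts.update(set(tokenize(line)))  # count each word once per line
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]


def build_vocabulary(benign_lines, malicious_lines, k):
    """Take up to k words per class, regardless of how many lines each class has."""
    words = set(top_words(benign_lines, k)) | set(top_words(malicious_lines, k))
    return sorted(words)


def filter_line(log_line, vocabulary):
    """Keep only the tokens of a log line that belong to the adjusted vocabulary."""
    vocab = set(vocabulary)
    return [t for t in tokenize(log_line) if t in vocab]
```

Because the vocabulary is capped per class instead of per corpus, the number of log lines in each class is left untouched, which is the contrast with sampling techniques that the summary draws.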