System API Vectorization for Malware Detection

Bibliographic Details
Published in	IEEE Access Vol. 11; p. 1
Main Authors	Shin, Kyounga, Lee, Yunho, Lim, Jungho, Kang, Honggoo, Lee, Sangjin
Format	Journal Article
Language	English
Published	Piscataway IEEE 01.01.2023 The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects	Application programming interface; Application programming interfaces; Artificial intelligence; Computer viruses; Data collection; Data models; Data processing; Heuristic algorithms; Malware; Mathematical models; N-gram statistic vector; Optimization; Performance tests; Probability; system API; Time complexity; Training; vectorization; Word2Vec
Summary:	Data is essential to the performance of artificial intelligence (AI) based malware detection models. System APIs, which allocate operating system resources, are important for identifying malicious behaviors. However, few studies have examined the data used in AI malware detection models. Existing studies have overlooked the collection of benign data, which is as important as malware data, and the data characterization of system APIs. As an optimization method for data-driven artificial intelligence, this paper studies data collection, purification, preprocessing, and vectorization for EXE files and system APIs. The objectivity of the data was ensured by using global data, and a more robust model could be created by collecting benign data from VirusTotal. By analyzing the weight distribution according to the order of system API execution, we identified that major malicious behaviors occurred at the beginning of execution. We found the optimal API length and optimal dimension (number of features). Finally, the accuracy of the N-gram model ranged from 97.62 to 95.73, and that of the Word2Vec model ranged from 97.44 to 95.89. In the generalization performance test, which used data from a source different from that of the training data, we confirmed that N-gram was affected by the quantity of training data, and Word2Vec was affected by data similarity. This study systematizes the entire procedure of AI data processing for malware detection and is the first study to compare and analyze statistical vectors and word embeddings based on the characteristics of system APIs.
ISSN:	2169-3536
DOI:	10.1109/ACCESS.2023.3276902
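
Illustrative sketch (not taken from the paper): the Summary describes turning system API call sequences into N-gram statistic vectors with a bounded API length and a fixed number of feature dimensions. The Python below shows one plausible reading of that idea. The function names, the default values for `n`, `max_len`, and `dim`, and the sample API traces are all assumptions made for illustration; the paper's actual preprocessing, optimal length, and optimal dimension are not given in this record.

```python
from collections import Counter
from itertools import islice


def api_ngrams(calls, n=2, max_len=100):
    """Return the n-grams (as tuples) of the first `max_len` API calls.

    Truncating to the head of the trace mirrors the Summary's observation
    that major malicious behavior occurs early in execution; the value of
    `max_len` here is an arbitrary placeholder.
    """
    head = list(islice(calls, max_len))
    return [tuple(head[i:i + n]) for i in range(len(head) - n + 1)]


def build_vocab(traces, n=2, max_len=100, dim=4):
    """Keep the `dim` most frequent n-grams across all traces as features."""
    counts = Counter()
    for trace in traces:
        counts.update(api_ngrams(trace, n, max_len))
    return [gram for gram, _ in counts.most_common(dim)]


def vectorize(trace, vocab, n=2, max_len=100):
    """Map one trace to a count vector with one entry per vocabulary n-gram."""
    counts = Counter(api_ngrams(trace, n, max_len))
    return [counts[gram] for gram in vocab]


if __name__ == "__main__":
    # Hypothetical traces; the API names are only examples.
    traces = [
        ["CreateFileW", "WriteFile", "CloseHandle"],
        ["VirtualAlloc", "WriteProcessMemory", "CreateRemoteThread"],
    ]
    vocab = build_vocab(traces, n=2, max_len=100, dim=4)
    for trace in traces:
        print(vectorize(trace, vocab))
```

A Word2Vec variant would replace the count vector with learned embeddings of each API name; that step is omitted here because it depends on a third-party library and on training choices this record does not describe.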