Feature Extraction Methods and Classification for Malware Incident News
Published in | 2023 IEEE International Conference on Cryptography, Informatics, and Cybersecurity (ICoCICs) pp. 115 - 120 |
---|---|
Main Authors | , , , |
Format | Conference Proceeding |
Language | English |
Published | IEEE, 22.08.2023 |
Summary: | Data mining has attracted considerable interest recently, including for unstructured data. One commonly discussed task is automatic classification using machine learning methods, since the large volume of data is the main obstacle to manual classification. However, many people still have difficulty determining the right combination of feature extraction and classification methods, so we provide suggestions on combinations that can produce better accuracy in text classification. This research compares several feature extraction methods: Bag-of-Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), and Word2Vec, focusing on the Skip-gram model. It also uses several classification methods: Support Vector Machine (SVM), Decision Tree, Logistic Regression, Gaussian Naive Bayes, K-Nearest Neighbor, Neural Network, Random Forest, and Doc2Vec. The dataset consists of two hundred articles crawled from several web blogs, labeled manually and split into two classes: malware incident news and non-malware incident news. The dataset quality was also measured using an open-source Python library known as "Cleanlab". |
---|---|
DOI: | 10.1109/ICoCICs58778.2023.10276685 |
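
To illustrate the kind of comparison the summary describes, the following is a minimal sketch that pairs two of the named feature extraction methods (BoW and TF-IDF) with one of the named classifiers (K-Nearest Neighbor). It is not the authors' pipeline: the toy corpus, the function names, the whitespace tokenizer, the smoothed IDF formula, and the use of cosine similarity are all assumptions made for illustration, and the record does not state which implementations or settings the paper used.

```python
import math
from collections import Counter

# Invented toy corpus (NOT the paper's dataset).
# Label 1 = malware incident news, 0 = non-malware incident news.
TRAIN = [
    ("ransomware attack encrypts hospital files", 1),
    ("new trojan malware steals banking passwords", 1),
    ("company releases quarterly earnings report", 0),
    ("city council approves new park budget", 0),
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def build_vocab(docs):
    """Sorted list of every distinct token in the training documents."""
    return sorted({tok for doc in docs for tok in tokenize(doc)})


def bow_vector(text, vocab):
    """Bag-of-Words: raw count of each vocabulary term in the text."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocab]


def idf_weights(docs, vocab):
    """Smoothed IDF, ln((1 + n) / (1 + df)) + 1 (one common variant, assumed here)."""
    n = len(docs)
    doc_sets = [set(tokenize(d)) for d in docs]
    return [
        math.log((1 + n) / (1 + sum(term in s for s in doc_sets))) + 1
        for term in vocab
    ]


def tfidf_vector(text, vocab, idf):
    """TF-IDF: term count scaled by the term's IDF weight."""
    return [tf * w for tf, w in zip(bow_vector(text, vocab), idf)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(vec, train_vecs, labels, k=1):
    """Majority label among the k training vectors most similar to vec."""
    ranked = sorted(
        zip(train_vecs, labels),
        key=lambda pair: cosine(vec, pair[0]),
        reverse=True,
    )
    top_labels = [label for _, label in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]


docs = [text for text, _ in TRAIN]
labels = [label for _, label in TRAIN]
vocab = build_vocab(docs)
idf = idf_weights(docs, vocab)

# Each extractor maps raw text to a feature vector over the same vocabulary.
extractors = {
    "BoW": lambda text: bow_vector(text, vocab),
    "TF-IDF": lambda text: tfidf_vector(text, vocab, idf),
}

query = "malware attack steals hospital passwords"
predictions = {}
for name, extract in extractors.items():
    train_vecs = [extract(d) for d in docs]
    predictions[name] = knn_predict(extract(query), train_vecs, labels, k=1)

print(predictions)
```

Keeping the classifier fixed while swapping the extractor is one simple way to isolate the effect of the feature representation. A fuller comparison along the lines of the summary would also vary the classifier and evaluate on held-out labeled articles rather than a single query.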