The Performance Comparison of the Classifiers According to Binary Bow, Count Bow and Tf-Idf Feature Vectors for Malware Detection

Young Man Kwon; So Hee Jun; Won Mo Gal; Myung Jae Lim

doi:10.14419/ijet.v7i3.33.18515


2018-08-29

https://doi.org/10.14419/ijet.v7i3.33.18515
Keywords: Malware Detection, Feature Selection, Machine Learning, BOW (Bag of Words), TF-IDF
Abstract

In this paper, we compare the performance of classifiers trained on Binary BOW, Count BOW and TF-IDF feature vectors for malware detection. The features are opcodes extracted from PE files. For the comparison, we measured the AUC score of five classifiers: DT, KNN, MLP, MNB and SVM. Based on the results, we recommend the neural network (MLP) and the instance-based model (KNN), because they show high AUC scores and accuracy regardless of the unbalanced dataset and the feature vector. Among classical classifiers, we recommend DT, because it also achieves a high AUC score and accuracy under the same conditions. SVM requires robust scaling to handle outliers and the unbalanced dataset. MNB requires the N-gram technique to improve its AUC score.
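As a rough illustration of the three feature vectors compared above, the following minimal Python sketch builds Binary BOW, Count BOW and TF-IDF vectors from opcode sequences. The opcode samples, the function names and the variable names are hypothetical and are not taken from the paper. The TF-IDF variant shown (smoothed IDF followed by L2 normalization of each row) is an assumption modeled on the documented default behavior of scikit-learn's TfidfTransformer, which the reference list mentions; the paper's exact settings are not stated in the abstract.

```python
import math
from collections import Counter

# Hypothetical opcode sequences, one per PE file (illustrative only).
samples = [
    ["push", "mov", "call", "mov", "ret"],
    ["mov", "xor", "xor", "jmp"],
    ["push", "call", "ret"],
]

# Fixed, sorted vocabulary so every sample maps to the same vector layout.
vocab = sorted({op for seq in samples for op in seq})


def count_bow(seq):
    """Count BOW: how many times each vocabulary opcode occurs in the sample."""
    counts = Counter(seq)
    return [counts[op] for op in vocab]


def binary_bow(seq):
    """Binary BOW: 1 if the opcode occurs at least once, otherwise 0."""
    return [1 if c > 0 else 0 for c in count_bow(seq)]


def tfidf(seqs):
    """TF-IDF with smoothed IDF and L2 row normalization (assumed variant)."""
    n = len(seqs)
    counts = [count_bow(s) for s in seqs]
    # Document frequency: number of samples that contain each opcode.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
    idf = [math.log((1 + n) / (1 + d)) + 1 for d in df]
    rows = []
    for row in counts:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(x * x for x in weighted)) or 1.0
        rows.append([x / norm for x in weighted])
    return rows


count_vectors = [count_bow(s) for s in samples]
binary_vectors = [binary_bow(s) for s in samples]
tfidf_vectors = tfidf(samples)
```

Each of the three matrices could then be passed to a classifier such as DT, KNN, MLP, MNB or SVM and scored by AUC. Count BOW keeps frequency information that Binary BOW discards, while TF-IDF down-weights opcodes that appear in most samples.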
How to Cite
Kwon, Y. M., Jun, S. H., Gal, W. M., & Lim, M. J. (2018). The Performance Comparison of the Classifiers According to Binary Bow, Count Bow and Tf-Idf Feature Vectors for Malware Detection. International Journal of Engineering & Technology, 7(3.33), 15-22. https://doi.org/10.14419/ijet.v7i3.33.18515
Received date: 2018-08-28

Accepted date: 2018-08-28

Published date: 2018-08-29
