k-NN improvement to data analysis

  • Abstract

    Classifying big data is a topical problem. Among the many ways to classify data, the k Nearest Neighbors (k-NN) algorithm has become a popular tool for data scientists. In this paper we examine several modifications of the k-NN algorithm that achieve better accuracy and lower CPU time than the standard k-NN algorithm when classifying test observations. To make the modifications faster than standard k-NN, we use a methodology that splits the input dataset into n folds and combines this split with input data transformations. In each run, one fold is held out as the test subset and the remaining folds are used for training; the process is executed n times, once per fold. The proposed methodology then selects the pair of subsets (training and test) that produces the highest accuracy.
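
The fold-based search described in the abstract can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names `knn_predict` and `best_fold_split`, the Euclidean distance, the majority vote, the shuffling seed, and the synthetic two-cluster data are all assumptions made for the example. The input data transformations mentioned in the abstract are not specified there, so they are left out.

```python
import random
from collections import Counter


def knn_predict(train_X, train_y, query, k):
    """Return the majority label among the k training points nearest to query."""
    # Squared Euclidean distance gives the same neighbour ordering as Euclidean.
    dists = sorted(
        ((sum((a - b) ** 2 for a, b in zip(x, query)), label)
         for x, label in zip(train_X, train_y)),
        key=lambda pair: pair[0],
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def best_fold_split(X, y, n_folds=5, k=3, seed=0):
    """Hold out each of n_folds folds once; return the most accurate split.

    Returns (accuracy, test_indices, train_indices) for the best fold.
    Hypothetical helper: the selection rule is an assumption based on the abstract.
    """
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    # Fold i takes every n_folds-th shuffled index, starting at position i.
    folds = [indices[i::n_folds] for i in range(n_folds)]

    best = None
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        train_X = [X[j] for j in train_idx]
        train_y = [y[j] for j in train_idx]
        correct = sum(
            knn_predict(train_X, train_y, X[j], k) == y[j] for j in test_idx
        )
        accuracy = correct / len(test_idx)
        if best is None or accuracy > best[0]:
            best = (accuracy, test_idx, train_idx)
    return best


# Synthetic data (assumption): two well-separated Gaussian clusters in 2-D.
rng = random.Random(42)
X = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(30)]
X += [(rng.gauss(5, 0.5), rng.gauss(5, 0.5)) for _ in range(30)]
y = [0] * 30 + [1] * 30

accuracy, test_idx, train_idx = best_fold_split(X, y, n_folds=5, k=3)
print(f"best fold accuracy: {accuracy:.2f} "
      f"({len(train_idx)} training / {len(test_idx)} test observations)")
```

Each test observation is compared against every training observation, so the sketch makes no attempt at the CPU-time savings the paper reports; it only shows the hold-out loop and the selection of the best-scoring split.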


  • Keywords

    Classification Problems; Data Analysis; K Nearest Neighbors (K-NN); Machine Learning.

Article ID: 29803
DOI: 10.14419/ijet.v8i4.29803
