Content-based prediction: big data sampling perspective

  • Authors

    • Waleed Albattah Department of Information Technology, College of Computer, Qassim University, Saudi Arabia
    • Saleh Albahli Department of Information Technology, College of Computer, Qassim University, Saudi Arabia
    2019-12-15
    https://doi.org/10.14419/ijet.v8i4.30150
  • Sampling Techniques, Sampling with Replacement, Reservoir, Big Data, Machine Learning, Classifier, SVM, Random Forest.
  • Today, data are generated in volumes on the order of terabytes or even petabytes, and processing data at this scale efficiently and effectively is extremely challenging. Existing research studies nevertheless apply machine learning algorithms by loading the entire training dataset into the computer’s main memory (RAM). This becomes a problem as the data grow over time and no longer fit within a single machine’s memory, so most conventional models and hardware cannot support them. Inspired by current research studies, this paper discusses the benefits of two sampling techniques that can be used with machine learning models: (1) sampling with replacement and (2) reservoir sampling. In this study, 40 experiments were performed on a video dataset of more than 40 GB in size, using random sampling to reduce the number of data instances by 50% of the original data. Each of the four combinations of sampling technique and machine learning classifier (SVM and random forest) was repeated for ten rounds. The results indicate that SVM and random forest are very competitive classifiers in terms of accuracy, and importance scores are given for all ten rounds of each combination.
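The two sampling techniques named in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `sample_with_replacement` and `reservoir_sample` are hypothetical, the reservoir variant shown is the classic Algorithm R, and the integer records stand in for the video instances used in the study.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def sample_with_replacement(items: List[T], k: int, rng: random.Random) -> List[T]:
    """Draw k items uniformly at random; the same item may be drawn more than once.

    Assumption: `items` supports random access (for example, a list of file paths).
    """
    return [items[rng.randrange(len(items))] for _ in range(k)]


def reservoir_sample(stream: Iterable[T], k: int, rng: random.Random) -> List[T]:
    """Keep a uniform sample of k distinct positions from a stream (Algorithm R).

    The stream is read once and only k items are held in memory at any time.
    """
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # both endpoints inclusive
            if j < k:
                reservoir[j] = item
    return reservoir


# Hypothetical stand-in data: 1,000 record identifiers reduced by 50%.
rng = random.Random(42)
records = list(range(1000))
half = len(records) // 2

with_replacement = sample_with_replacement(records, half, rng)
reservoir = reservoir_sample(iter(records), half, rng)
print(len(with_replacement), len(reservoir))
```

The practical difference is memory use: the reservoir version never needs the full dataset in RAM, whereas the with-replacement version as written assumes the items (or at least an index of them) can be addressed directly.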

     

     

  • How to Cite

    Albattah, W., & Albahli, S. (2019). Content-based prediction: big data sampling perspective. International Journal of Engineering & Technology, 8(4), 627-635. https://doi.org/10.14419/ijet.v8i4.30150