Document Categorization Using Decision Tree: Preliminary Study

  • Abstract

    This preliminary study proposes a text-mining-based technique for document classification and an algorithm that automatically assigns each document to a folder according to its content, while also supporting sentiment analysis and summarization of the data sets. The objectives of this paper are to identify an efficient text mining classification technique that achieves the highest accuracy in classifying documents into folders, to extract valuable information from context-based terms that the algorithm can use to perform automatic classification, and to evaluate the classification technique. The methodology comprises five modules: 1) document collection, 2) pre-processing, 3) Term Frequency-Inverse Document Frequency (TF-IDF), 4) classification technique and algorithm, and 5) evaluation and visualization of the classification result. The proposed framework uses TF-IDF together with a decision tree: TF-IDF ranks the terms from most frequent to least frequent, while the decision tree decides which folder each document belongs to.
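    The TF-IDF and decision tree stages described above can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions: the TF-IDF variant (relative term frequency times ln(N/df)), the tree that splits on whether a term has a positive TF-IDF weight using Gini impurity, the regex tokenizer standing in for the pre-processing module, and the sample documents and folder names (`finance`, `sports`), which are hypothetical. All function names are illustrative.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Stand-in for the pre-processing module: lowercase, letters only.
    return re.findall(r"[a-z]+", text.lower())

def fit_idf(token_lists):
    # idf(t) = ln(N / df(t)); one common variant, assumed here.
    n = len(token_lists)
    df = Counter(t for tokens in token_lists for t in set(tokens))
    return {t: math.log(n / df[t]) for t in df}

def tfidf_vector(tokens, idf):
    # Relative term frequency times idf; terms unseen in training are ignored.
    tf = Counter(tokens)
    return {t: (c / len(tokens)) * idf[t] for t, c in tf.items() if t in idf}

def gini(labels):
    # Gini impurity of a list of folder labels.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(rows, labels, max_depth=3):
    # Leaf: a folder name. Inner node: (term, subtree_if_present, subtree_if_absent).
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or max_depth == 0:
        return majority
    best = None
    for term in sorted({t for r in rows for t in r}):
        present = [i for i, r in enumerate(rows) if r.get(term, 0.0) > 0.0]
        absent = [i for i, r in enumerate(rows) if r.get(term, 0.0) <= 0.0]
        if not present or not absent:
            continue  # this term does not separate the documents
        weighted = sum(len(part) * gini([labels[i] for i in part])
                       for part in (present, absent))
        score = weighted / len(rows)
        if best is None or score < best[0]:
            best = (score, term, present, absent)
    if best is None:
        return majority
    _, term, present, absent = best

    def subtree(idx):
        return build_tree([rows[i] for i in idx],
                          [labels[i] for i in idx], max_depth - 1)

    return (term, subtree(present), subtree(absent))

def predict(tree, row):
    # Walk from the root until a leaf (a folder name) is reached.
    while isinstance(tree, tuple):
        term, if_present, if_absent = tree
        tree = if_present if row.get(term, 0.0) > 0.0 else if_absent
    return tree

# Hypothetical training documents and the folders they belong to.
docs = [
    "invoice payment due bank",
    "bank loan interest payment",
    "football match score goal",
    "team wins match final goal",
]
folders = ["finance", "finance", "sports", "sports"]

token_lists = [tokenize(d) for d in docs]
idf = fit_idf(token_lists)
rows = [tfidf_vector(t, idf) for t in token_lists]
tree = build_tree(rows, folders)

def classify(text):
    # Vectorize a new document with the training idf, then walk the tree.
    return predict(tree, tfidf_vector(tokenize(text), idf))

# Terms of the first document ranked by TF-IDF weight, highest first.
ranked = sorted(rows[0].items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
print(classify("bank payment overdue"))
```

    Note that the sketch ranks terms by TF-IDF weight rather than by raw frequency, so a term that occurs in every document receives weight zero and is never chosen as a split. A library implementation of both stages would normally replace this hand-written version in practice.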




  • Keywords

    Data mining; Unstructured data; Decision tree; Text mining; Term frequency-inverse document frequency.





Article ID: 26907
DOI: 10.14419/ijet.v7i4.34.26907

Copyright © 2012-2015 Science Publishing Corporation Inc. All rights reserved.