Comparative Study of Document Clustering Algorithms

  • Abstract

    Text clustering is a data mining technique of growing importance in current research. Document clustering applies text clustering to group documents by topic. The choice of words used to represent each document is important to ensure that documents are assigned to the correct group. This study applies and compares three clustering methods (hierarchical clustering, k-means and k-medoids) to identify which produces the best result in document clustering. The three methods are applied to 60 sports articles covering four different types of sports. K-medoids clustering produced the worst result, while k-means clustering was found to be more sensitive to general words. Hierarchical clustering is therefore deemed the more stable method for producing a meaningful result in document clustering analysis.
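
    As an illustration only, one of the three compared methods, k-medoids, might be sketched as below. The sketch assumes whitespace tokenisation, raw term counts and cosine distance, none of which are stated in the abstract. The four toy documents, the function names and the initial medoids are hypothetical and are not taken from the study's 60 articles.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term counts (assumed representation, whitespace tokens)."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - dot / norm if norm else 1.0


def assign(dist, medoids):
    """Label each document with the index of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
        for i in range(len(dist))
    ]


def k_medoids(dist, initial_medoids, max_iter=100):
    """Alternate between assignment and medoid update until medoids stop changing."""
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for c, old in enumerate(medoids):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # keep the old medoid if a cluster empties out
                new_medoids.append(old)
                continue
            # The new medoid is the member with the smallest total distance
            # to the other members of its cluster.
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members))
            )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return assign(dist, medoids), medoids


# Hypothetical toy corpus: two football texts and two tennis texts.
docs = [
    "football goal striker goal",
    "football striker penalty",
    "tennis serve racket serve",
    "tennis racket net",
]
vectors = [tf_vector(d) for d in docs]
dist = [
    [0.0 if i == j else cosine_distance(vectors[i], vectors[j])
     for j in range(len(docs))]
    for i in range(len(docs))
]

labels, medoids = k_medoids(dist, initial_medoids=[0, 2])
print(labels, medoids)
```

    Because the algorithm only needs a distance matrix, the same matrix could in principle be passed to a hierarchical method, which is one simple way to compare methods on equal terms.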


  • Keywords

    document clustering; text mining; hierarchical clustering; k-means; k-medoids.




Article ID: 20816
DOI: 10.14419/ijet.v7i4.11.20816

Copyright © 2012-2015 Science Publishing Corporation Inc. All rights reserved.