A semantic approach for text document clustering using frequent itemsets and WordNet


  • Harsha Patil, MANIT, Bhopal, India
  • Ramjeevan Singh Thakur, MANIT, Bhopal, India






Document Clustering, Frequent Item Sets, Semantic, Similarity Measures, WordNet.


Document clustering is an unsupervised method for grouping documents into clusters on the basis of their similarity. A document is assigned to a specific cluster according to its membership score, which is calculated by a membership function. Many traditional clustering algorithms, however, rely only on the bag-of-words (BOW) representation, which ignores the semantic similarity between a document and a cluster. In this research, we consider the semantic association between a cluster and a text document when calculating the document's membership score for that cluster. Several researchers are working on semantic aspects of document clustering to improve clustering performance, and external knowledge bases such as WordNet, Wikipedia, and Lucene are used for this purpose. The proposed approach exploits WordNet to improve the cluster membership function. The experimental results show that the proposed semantic framework significantly improves clustering quality.
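The abstract does not give the membership function itself, so the following is only a minimal sketch of the general idea, not the authors' formulation. It assumes each cluster is described by a frequent itemset of terms and compares a purely lexical (BOW) score with a score that also credits terms sharing a concept. The `TOY_CONCEPTS` table is a hypothetical stand-in for a WordNet lookup, and all function names are illustrative.

```python
from typing import Dict, Iterable, Set

# Hypothetical stand-in for WordNet: words sharing a label are treated as
# semantically equivalent. A real system would query WordNet instead.
TOY_CONCEPTS: Dict[str, str] = {
    "car": "vehicle",
    "automobile": "vehicle",
    "doctor": "physician",
    "physician": "physician",
}


def concept(word: str) -> str:
    """Map a word to its concept label; unknown words map to themselves."""
    return TOY_CONCEPTS.get(word, word)


def bow_membership(doc_terms: Iterable[str], itemset: Iterable[str]) -> float:
    """Fraction of the cluster's itemset that appears literally in the document."""
    doc: Set[str] = set(doc_terms)
    items: Set[str] = set(itemset)
    if not items:
        return 0.0
    return len(items & doc) / len(items)


def semantic_membership(doc_terms: Iterable[str], itemset: Iterable[str]) -> float:
    """Fraction of the cluster's itemset whose concept appears in the document."""
    doc_concepts: Set[str] = {concept(t) for t in doc_terms}
    items: Set[str] = set(itemset)
    if not items:
        return 0.0
    matched = sum(1 for t in items if concept(t) in doc_concepts)
    return matched / len(items)


def assign(doc_terms: Iterable[str], clusters: Dict[str, Set[str]]) -> str:
    """Return the label of the cluster with the highest semantic membership score."""
    doc = list(doc_terms)
    return max(clusters, key=lambda label: semantic_membership(doc, clusters[label]))


if __name__ == "__main__":
    doc = ["automobile", "engine", "repair"]
    clusters = {"autos": {"car", "engine"}, "health": {"doctor", "hospital"}}
    print(bow_membership(doc, clusters["autos"]))       # 0.5
    print(semantic_membership(doc, clusters["autos"]))  # 1.0
    print(assign(doc, clusters))                        # autos
```

In this toy case the document mentions "automobile" rather than "car", so the lexical score misses half of the cluster's itemset while the concept-aware score matches all of it, which is the kind of gap a semantic membership function is meant to close.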



