Data labeling method based on Cluster similarity using Rough Entropy for Categorical Data Clustering


  • B. Suresh Kumar
  • Dr. H. Venkateswara Reddy
  • Dr. S. Viswanadha Raju
  • G Vijay Kanth





Categorical Data, Clustering, Data Labeling, Outlier, Entropy, Rough set.


Data mining is one of the fastest-growing areas of current research. Clustering is recognized as an efficient way to group data, and many researchers have used data labeling to improve its efficiency. Data labeling assigns each unlabeled data point to the cluster it is most similar to. Applying data labeling in the categorical domain is harder than in the numerical domain: the difference between two numerical data points is easy to compute, but there is no equally direct way to measure the difference between two categorical data points. Data labeling on categorical data therefore remains a challenging problem and is quite complex to implement. The proposed method addresses this problem. A sample of the data is taken and divided into sliding windows; a standard clustering algorithm is then applied to one sliding window to partition it into clusters. A rough membership entropy function is used to measure the similarity between unlabeled data points and labeled data points. The proposed method has two important features: 1) data points are moved into their proper clusters, which yields good-quality clusters, and 2) the method executes with high efficiency. In this paper the proposed method is applied to the KDD Cup99 data sets, and the results show it to be appreciably more efficient than earlier work.
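The labeling step described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that sliding windows are consecutive, non-overlapping chunks of the sample, and that similarity is the classical rough membership value |[x]_a ∩ C| / |[x]_a| summed over the attributes a. The abstract does not define the entropy part of the function, so it is left out here. The function names and the toy records are hypothetical.

```python
def sliding_windows(data, size):
    """Split the sample into consecutive, non-overlapping windows (assumed layout)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def rough_membership(point, cluster, universe, attr):
    """Share of labeled objects that match `point` on `attr` and belong to `cluster`."""
    same_in_universe = sum(1 for obj in universe if obj[attr] == point[attr])
    if same_in_universe == 0:
        # Attribute value never seen among the labeled points.
        return 0.0
    same_in_cluster = sum(1 for obj in cluster if obj[attr] == point[attr])
    return same_in_cluster / same_in_universe


def label_point(point, clusters):
    """Return the best-matching cluster name and the per-cluster similarity scores."""
    # The universe is the set of already labeled points (the clustered window).
    universe = [obj for members in clusters.values() for obj in members]
    scores = {
        name: sum(rough_membership(point, members, universe, a)
                  for a in range(len(point)))
        for name, members in clusters.items()
    }
    return max(scores, key=scores.get), scores


# Toy labeled window (made-up records, not taken from KDD Cup99).
clusters = {
    "c1": [("tcp", "http", "SF"), ("tcp", "http", "SF"), ("tcp", "smtp", "SF")],
    "c2": [("udp", "dns", "SF"), ("udp", "dns", "REJ")],
}

label, scores = label_point(("tcp", "http", "REJ"), clusters)
print(label, scores)
```

In this toy run the unlabeled point matches cluster c1 on two of its three attributes and cluster c2 on one, so it is assigned to c1. Points in the remaining windows would be labeled the same way, one window at a time.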



