# Data labeling method based on Cluster similarity using Rough Entropy for Categorical Data Clustering

## DOI:

https://doi.org/10.14419/ijet.v7i4.6.20239

## Published:

2018-09-25

## Keywords:

Categorical Data, Clustering, Data Labeling, Outlier, Entropy, Rough set.

## Abstract

Data mining has become one of the growing areas of current research that deal with data. Clustering is recognized as an efficient methodology for grouping data, and many researchers have used data labeling to improve its efficiency. A labeling method places each data point into the proper cluster based on its similarity to the points already there. In the categorical domain, data labeling is harder to apply than in the numerical domain: the difference between two numerical data points is easy to compute, whereas the difference between two categorical data points is not. Data labeling on categorical data therefore remains a challenging problem and is quite complex to implement. The proposed methodology addresses this problem. A sample of the data is taken and divided into sliding windows; a standard clustering algorithm is then applied to one sliding window to partition it into clusters. A rough membership entropy function is used to measure the similarity between unlabeled data points and labeled data points. The proposed methodology has two important features: 1) data points are moved into their proper clusters, which yields good-quality clusters, and 2) the methodology executes with a high efficiency rate. In this paper, the proposed methodology is applied to the KDD Cup99 data set, and the results show that it is appreciably more efficient than earlier work.
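The abstract does not give the exact rough membership entropy formula, so the following is only a minimal Python sketch of the general idea under stated assumptions, not the authors' implementation. It assumes Pawlak's rough membership of an attribute value in a cluster (the share of records carrying that value that fall in the cluster), and it assumes that each attribute is weighted by one minus the normalized Shannon entropy of those memberships. The function names, the weighting scheme, and the sample records are all hypothetical.

```python
from math import log2


def sliding_windows(data, size):
    """Split the sampled records into consecutive windows of `size` records."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def rough_membership(clusters, attr, value):
    """For each cluster C, return |[x]_a ∩ C| / |[x]_a|, where [x]_a is the
    set of labeled records whose attribute `attr` equals `value`."""
    counts = [sum(1 for rec in c if rec[attr] == value) for c in clusters]
    total = sum(counts)
    if total == 0:
        return [0.0] * len(clusters)
    return [n / total for n in counts]


def entropy(probs):
    """Shannon entropy (base 2) of a discrete distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)


def label_point(point, clusters):
    """Return the index of the most similar cluster, or None when no
    attribute value of `point` occurs in any cluster (outlier candidate).

    Assumption: an attribute contributes its rough membership, weighted by
    1 - normalized entropy, so values concentrated in one cluster count more.
    """
    k = len(clusters)
    max_h = log2(k) if k > 1 else 1.0
    scores = [0.0] * k
    for attr, value in enumerate(point):
        mu = rough_membership(clusters, attr, value)
        if sum(mu) == 0:
            continue  # unseen value carries no evidence for any cluster
        weight = 1.0 - entropy(mu) / max_h
        for i in range(k):
            scores[i] += weight * mu[i]
    best = max(range(k), key=scores.__getitem__)
    return best if scores[best] > 0 else None


# Hypothetical clusters, as if produced by clustering the first window.
# The records are made-up (protocol, service, flag) triples.
clusters = [
    [("tcp", "http", "SF"), ("tcp", "http", "SF")],
    [("udp", "dns", "SF"), ("udp", "dns", "REJ")],
]

print(label_point(("tcp", "http", "REJ"), clusters))
print(label_point(("icmp", "ecr_i", "S0"), clusters))
```

In this sketch, a record from a later window is labeled with the cluster that scores highest, and a record whose values never appear in any cluster is returned as `None` so that it can be treated separately, for example as an outlier candidate.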

