Outliers in Data Mining: Approaches and Detection

  • Authors

    • Deepti Mishra
    • Devpriya Soni
    2018-12-13
    https://doi.org/10.14419/ijet.v7i4.39.23930
  • Data Mining, Knowledge discovery, Outliers, Outlier Detection, Overfitting
  • This paper studies outliers: objects that differ markedly from the rest of the stored data. Outlier detection is currently an active research topic in data mining, and detecting outliers in a set of patterns is a pertinent problem in the field. Outlier mining is the problem of detecting unseen events, abnormal data, and exceptions. Outliers also affect the outcomes and analysis of data: their presence can make results confusing, and the patterns derived from the data may be neither authentic nor precise. This effect is the focus of the review. The paper describes some common categories of outliers. The remainder of the paper briefly discusses data mining, outliers and their categories, data mining techniques for outlier detection, applications that support outlier detection in a data set, and approaches to outlier detection.

  • How to Cite

    Mishra, D., & Soni, D. (2018). Outliers in Data Mining: Approaches and Detection. International Journal of Engineering & Technology, 7(4.39), 189-198. https://doi.org/10.14419/ijet.v7i4.39.23930