Distributed Mining of Outliers from Large Multi-Dimensional Databases

  • Authors

    • K. Ashesh
    • Dr. G. Appa Rao
    2018-09-27
    https://doi.org/10.14419/ijet.v7i4.7.20564
  • Distributed outlier detection, outlier detection, hierarchical index tree.
  • A data point in a given dataset is considered an outlier when it is distant from its nearest neighbours, so the definition rests on a distance measure. Detecting outliers in distributed environments is challenging. Many approaches to mining outliers in such environments exist, but a faster and more efficient one is desired. In this paper we employ a novel index tree that is hierarchical in nature. Its hierarchical structure enables space pruning, while its clustering property speeds up the search for the neighbours of a given data point. Its time complexity is linear in the size of the dataset and the number of dimensions. On top of the hierarchical tree (Hi-tree), nearest-neighbour search avoids unnecessary computations and prunes unpromising points. We propose an algorithm named Distributed Mining of Outliers using Hi-tree (DMOH). The index tree also lends itself to parallel processing. We built a prototype application as a proof of concept. Our empirical study showed the efficiency of the proposed algorithm on top of the Hi-tree.
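
    To make the distance-based notion of an outlier concrete, the following is a minimal sketch that scores each point by its mean distance to its k nearest neighbours and reports the highest-scoring points. It is a brute-force baseline with quadratic cost, not the Hi-tree index or the DMOH algorithm, whose details are not given here; the function names, the parameters k and n, and the sample points are assumptions made for illustration only.

```python
import math


def knn_outlier_scores(points, k):
    """Return, for each point, its mean distance to its k nearest neighbours.

    Brute force: every point is compared with every other point. This is an
    illustrative baseline, not the Hi-tree search described in the abstract.
    """
    scores = []
    for i, p in enumerate(points):
        # Distances from p to all other points, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


def top_outliers(points, k, n):
    """Return the indices of the n points with the largest k-NN distance score."""
    scores = knn_outlier_scores(points, k)
    order = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return order[:n]


# Hypothetical sample data: four points close together and one far away.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(top_outliers(points, k=2, n=1))  # prints [4]
```

    Each point here is compared against every other point, which is the kind of unnecessary computation that an index structure with space pruning is meant to avoid.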

     

     

  • How to Cite

    Ashesh, K., & Appa Rao, G. (2018). Distributed Mining of Outliers from Large Multi-Dimensional Databases. International Journal of Engineering & Technology, 7(4.7), 292-296. https://doi.org/10.14419/ijet.v7i4.7.20564