Scalable density based spatial clustering with integrated one-class SVM for noise reduction
Keywords: DBSCAN, One-Class SVM, Noise Reduction, Clustering, Spark.
Information extraction from data is one of the key necessities for data analysis. The unlabeled nature of most data calls for unsupervised analysis methods, which tend to be computationally complex. This paper presents NRDBSCAN, a modified variant of the density-based spatial clustering technique DBSCAN that integrates a one-class SVM, a machine learning classifier, for noise reduction. An analysis of DBSCAN shows that it depends heavily on accurate thresholds, without which it yields suboptimal results. However, identifying accurate threshold settings is unattainable in practice. Noise is one of the major side effects of this threshold gap. The proposed work reduces noise by integrating a machine learning classifier into the operational structure of DBSCAN. Further, the proposed technique is parallelized on the Spark architecture, which increases its scalability and its ability to handle large amounts of data. Experiments and comparisons with similar techniques indicate high scalability and high homogeneity in the resulting clusters.
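The threshold sensitivity described above can be illustrated with a minimal sketch of plain DBSCAN. This is not NRDBSCAN: the one-class SVM step and the Spark parallelization are not shown, and the function names, the toy points, and the two `eps` values are assumptions chosen only for illustration.

```python
from math import dist

NOISE = -1  # label used for points that belong to no cluster


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i], including i."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Plain DBSCAN: one label per point, NOISE (-1) for unclustered points."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may still be claimed later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:  # j is a core point: keep expanding
                queue.extend(j_neighbours)
    return labels


# Toy data (hypothetical): two tight groups, one point near the first group,
# and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (2.5, 0.5), (5, 5)]

for eps in (1.1, 1.6):
    labels = dbscan(points, eps, min_pts=3)
    print(f"eps={eps}: {labels.count(NOISE)} noise point(s)")
```

With these toy points, the tighter radius (`eps=1.1`) leaves both outlying points as noise, while the looser radius (`eps=1.6`) absorbs the point at (2.5, 0.5) into the first cluster and leaves only (5, 5) as noise. A small change in the threshold therefore changes which points are discarded, which is the gap the abstract proposes to close with a classifier.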