Analysis of large volume data processing using clustering algorithms

Big data refers to large datasets characterized by velocity, variety, and volume. When a dataset has a limited number of clusters, low dimensionality, and a small number of data points, traditional clustering algorithms can be used. In the internet age, however, data is growing very fast, and existing clustering algorithms no longer give acceptable results in terms of time complexity and space complexity. There is therefore a need for a new approach to clustering large volumes of data with low time and space complexity. This paper applies MapReduce and the Hadoop framework to different clustering algorithms: k-means, canopy clustering, and a proposed algorithm. The analysis shows that processing a large volume of data incurs low time and space complexity when compared with a small volume of data.


Introduction
As data grows in volume, variety, and velocity, existing clustering algorithms take more time to produce results. To produce results in less time and with less memory, one option is parallel programming. MapReduce is a programming model for processing large datasets in parallel. MapReduce together with HDFS, commonly known as Hadoop, can be used to handle big data. Once a file is placed in HDFS, it can be read any number of times.

MapReduce
MapReduce is a framework, patented by Google, that supports processing large datasets in parallel across Hadoop clusters. A MapReduce program divides the data and merges the intermediate results. A MapReduce job can be implemented in any language [3]. It has two phases, map and reduce, implemented by the functions map() and reduce().

Map phase
The input to this phase is divided into chunks. By default, splitting is done by the Hadoop Distributed File System (HDFS). The chunk size is configurable. The input to MapReduce takes the form of key-value records [3]. The map function takes a key and a value as input and produces a list of intermediate key-value pairs.

Reduce phase
After the map function completes, the intermediate values that share the same output key are combined to produce the final result. Like the map function, the reduce function runs in parallel, and each reduce call works on a different output key [3].
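
The two phases can be illustrated with a minimal in-memory sketch in Python. This is not Hadoop code: the names map_fn, reduce_fn, and run_mapreduce are hypothetical, the shuffle step is simulated with a dictionary, and the word-count job is chosen only for illustration.

```python
from collections import defaultdict

def map_fn(key, value):
    """Map: take one (key, value) record and emit intermediate (key, value) pairs.
    Here the value is a line of text and each word is emitted with a count of 1."""
    for word in value.split():
        yield word.lower(), 1

def reduce_fn(key, values):
    """Reduce: combine all intermediate values that share one output key."""
    return key, sum(values)

def run_mapreduce(records, map_fn, reduce_fn):
    intermediate = defaultdict(list)
    # Map phase: apply map_fn to every input record.
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            # Shuffle (simulated): group intermediate values by output key.
            intermediate[out_key].append(out_value)
    # Reduce phase: one reduce call per distinct output key.
    return dict(reduce_fn(k, vs) for k, vs in intermediate.items())

# Input records are (key, value) pairs; the key here is just a line number.
records = [(0, "big data needs parallel processing"),
           (1, "parallel processing of big data")]
counts = run_mapreduce(records, map_fn, reduce_fn)
```

On a real cluster the map and reduce calls would run on different nodes; the sketch only shows how the data flows between the two phases.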

Existing system approach
Clustering is a typical example of unsupervised learning. It is a simple approach to grouping data points or objects; the groups are called clusters. The data points within a cluster are more similar to one another than to those in other clusters.

k-means clustering algorithm
The k-means clustering algorithm is simple and easy to understand. The steps involved in this algorithm are:
Step 1: Randomly select the centroids and place them in the space; these serve as the temporary means of the clusters.
Step 2: Calculate the Euclidean distance between each data point and each cluster centroid, and assign each data point to the centroid at minimum distance.
Step 3: Recalculate the centroid of each cluster and replace the previous centroid with it.
Step 4: If no data point was reassigned, go to the next step; otherwise go to Step 2.
Step 5: End.
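
The steps above can be sketched in plain Python as follows. This is a minimal illustration, not the implementation evaluated in this paper: the function name kmeans and the parameters max_iter and seed are hypothetical, and the initial centroids are assumed to be drawn from the data points themselves.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd-style k-means over a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the temporary centroids (assumption).
    centroids = rng.sample(points, k)
    assignment = [None] * len(points)
    for _ in range(max_iter):
        # Step 2: assign each point to the centroid at minimum Euclidean distance.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop when no data point changed cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignment

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
```

Because the starting centroids are random, different seeds can lead to different results on less well-separated data, which is the limitation discussed next.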

Limitations
Some of the drawbacks of the existing k-means algorithm identified through the literature survey are:
1) A review of uncertainty handling formalisms, by A. Hunter and S. Parsons [6]. In this paper the computation time is reduced, but the initial centroids are selected randomly.
2) An overview from a database perspective, by M. S. Chen, J. Han, and P. S. Yu [4]. In this paper the authors propose an initial-centroid algorithm to avoid random selection of centroids.
3) Efficient k-mean clustering algorithm for reducing the time complexity, by D. Napoleon and P. Ganga Lakshmi. The authors state that reducing the time complexity is expensive for high-dimensional datasets [3].
4) Overcoming the Defects of k-Means Clustering by using Canopy Clustering Algorithm, by Ambika S. and Kavitha G. [1]. This work avoids random selection of centroids by using the canopy clustering algorithm.

Proposed system
The main aim of the proposed system is to find the initial centroids, and hence the value of K, for the k-means clustering algorithm, and to study its space complexity and time complexity on the Hadoop and MapReduce platform. The modules used in the proposed system are:
1) Big data
2) Canopy clustering algorithm
3) k-means clustering algorithm
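
This section does not spell out how k-means is expressed as a MapReduce job. One common formulation, given here only as an assumption, is that the map step assigns each point to its nearest centroid and the reduce step averages the points of each cluster. The sketch below simulates a single iteration in memory; the names kmeans_map, kmeans_reduce, and kmeans_iteration are hypothetical and no Hadoop API is used.

```python
import math
from collections import defaultdict

def kmeans_map(point, centroids):
    """Map: emit (index of nearest centroid, point) for one input record."""
    nearest = min(range(len(centroids)),
                  key=lambda c: math.dist(point, centroids[c]))
    return nearest, point

def kmeans_reduce(cluster_id, members):
    """Reduce: average all points that share one cluster id into a new centroid."""
    return cluster_id, tuple(sum(x) / len(members) for x in zip(*members))

def kmeans_iteration(points, centroids):
    # Shuffle (simulated): group the mapped points by cluster id.
    groups = defaultdict(list)
    for p in points:
        cluster_id, value = kmeans_map(p, centroids)
        groups[cluster_id].append(value)
    # Centroids that received no points are left unchanged.
    updated = list(centroids)
    for cluster_id, members in groups.items():
        _, updated[cluster_id] = kmeans_reduce(cluster_id, members)
    return updated

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centroids = kmeans_iteration(points, [(1.0, 1.0), (9.0, 9.0)])
```

A driver would repeat this iteration, passing the updated centroids back in, until they stop changing.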

Big data
Big data is the term used to describe a collection of data that is huge in size and yet growing exponentially with time; it is characterized by the dimensions velocity, variety, and volume.

Canopy clustering algorithm
The result of this algorithm is a set of canopies, which provide the cluster centers for the given dataset.

K-means clustering algorithm
The execution time of the k-means clustering algorithm is O(nkdi), where n is the number of data points, k is the number of clusters, d is the number of dimensions, and i is the number of iterations needed to converge. When n and d increase, the algorithm becomes time-consuming or impractical. To overcome this, the canopy clustering algorithm, also called a pre-clustering algorithm, is used. In the proposed system, the output of the canopy clustering algorithm is given as input to the k-means clustering algorithm.
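
To give a feel for how the O(nkdi) term grows, the short sketch below simply multiplies the four factors. The function name kmeans_distance_ops is hypothetical. The values of n and d follow the dataset described under Result and analysis (one lakh records with twelve attributes), while k = 10 and i = 20 are assumed values chosen only for illustration.

```python
def kmeans_distance_ops(n, k, d, i):
    """Rough operation count for k-means: each of i iterations compares
    n points against k centroids over d dimensions, i.e. n * k * d * i."""
    return n * k * d * i

# n and d follow the dataset used in this paper; k and i are hypothetical.
base = kmeans_distance_ops(n=100_000, k=10, d=12, i=20)
doubled = kmeans_distance_ops(n=200_000, k=10, d=12, i=20)
ratio = doubled / base  # doubling n doubles the estimated work
```

The count is linear in each factor, so a sequential run scales directly with n and d, which is the motivation for distributing the work.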

Canopy k-means clustering algorithm
Input: Dataset
Output: Number of clusters
The algorithm uses two threshold values, T1 and T2, where T1 is the loose distance, T2 is the tight distance, and T1 > T2. The steps involved in the canopy clustering algorithm are:
Step 1: Randomly select a data point from the dataset as a canopy center.
Step 2: Find the distance from the canopy center to all other points in the dataset.
Step 3: If the calculated distance is less than T1, put the data point into the canopy.
Step 4: Remove from the dataset all points whose distance is less than T2.
Step 5: Repeat Steps 1 to 4 until the dataset becomes empty.
Step 6: Feed the output as input to the k-means clustering algorithm.
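
The steps above can be sketched in plain Python as follows. This is a single-machine illustration under stated assumptions, not the MapReduce implementation evaluated in this paper: the names canopy_centers and kmeans_from_centers, the seed parameter, and the threshold values in the usage example are all hypothetical, and Euclidean distance is assumed.

```python
import math
import random

def canopy_centers(points, t1, t2, seed=0):
    """Canopy clustering with loose threshold t1 and tight threshold t2.
    Returns a list of (center, members) pairs."""
    if not 0 < t2 < t1:
        raise ValueError("thresholds must satisfy 0 < t2 < t1")
    rng = random.Random(seed)
    remaining = list(points)
    canopies = []
    while remaining:                      # Step 5: repeat until the dataset is empty
        center = rng.choice(remaining)    # Step 1: random canopy center
        members, kept = [], []
        for p in remaining:
            dist = math.dist(center, p)   # Step 2: distance to the canopy center
            if dist < t1:                 # Step 3: within the loose distance
                members.append(p)
            if dist >= t2:                # Step 4: drop points within the tight distance
                kept.append(p)
        canopies.append((center, members))
        remaining = kept
    return canopies

def kmeans_from_centers(points, centers, max_iter=100):
    """Step 6: run k-means starting from the canopy centers, so k = len(centers)."""
    centroids = list(centers)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:  # no reassignment: converged
            break
        assignment = new_assignment
        for c in range(len(centroids)):
            group = [p for p, a in zip(points, assignment) if a == c]
            if group:
                centroids[c] = tuple(sum(x) / len(group) for x in zip(*group))
    return centroids, assignment

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
canopies = canopy_centers(points, t1=4.0, t2=2.0)
initial_centers = [center for center, _ in canopies]
centroids, labels = kmeans_from_centers(points, initial_centers)
```

The number of canopies fixes k, so the caller supplies only T1 and T2 rather than a cluster count; choosing those thresholds well is left open here.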

Result and analysis
[Figure labels: Big Data; K-Mean Clustering Algorithm; Canopy Clustering Algorithm]

Data is growing in terms of volume, variety, and velocity. The behavior of each clustering algorithm is analyzed on the MapReduce and Hadoop platform, which uses parallel processing. Here we considered simulated social data of one lakh (100,000) records with twelve attributes. Figures 5 and 6 and Table 1 show that, as the dataset size increases, the time taken and the space complexity of the k-means clustering algorithm remain low and constant. Figures 7 and 8 and Table 2 show the same result for the canopy clustering algorithm. Figures 9 and 10 and Table 3 show that the proposed approach has lower space complexity than both the canopy and k-means clustering algorithms, and that the value of K need not be supplied manually.

Conclusion
In this paper we have studied the existing k-means and canopy clustering algorithms for big data using the MapReduce and Hadoop platform. We have also proposed a new technique in which the canopy algorithm is applied to the big data and its output is given as the initial centers (and hence the value of k) to the k-means clustering algorithm, running on the MapReduce and Hadoop framework, which uses parallel processing.