k-NN improvements for data analysis

The problem of classifying big data is a topical one. There are many ways to classify data, but k Nearest Neighbors (k-NN) has become a popular tool among data scientists. In this paper we examine several modifications of the k Nearest Neighbors algorithm that achieve better efficiency, in terms of accuracy and CPU time, when classifying test observations than the standard k Nearest Neighbors algorithm. To make the modifications faster than standard k-NN, we use a methodology that splits the input dataset into n folds and combines this with input data transformations. On each execution of the process, one of the folds is held out as a test subset and the remaining folds are used for training; the process is executed n times. The proposed methodology searches for the pair of subsets that produces the highest accuracy.


Introduction
Data analysis has grown in relevance over the last 20 years. The task of classifying a dataset with a large number of observations is one of the major research topics in big data analysis. The k-Nearest Neighbors method determines to which class a new observation belongs by finding its closest neighbors whose classes are known [11]. The nearest neighbors are determined by their distance to the new observation. Although it is mainly used to solve classification problems, k-NN is also commonly applied to image recognition, text categorization, object recognition, and more [3], [5], [6], [9]. Data analysis aims to develop efficient computational techniques that improve with experience for analyzing vast, complex data sets such as complex biological data [5], [8], [13]. Different approaches and algorithms (artificial neural networks, k-Nearest Neighbors, random forests, support vector machines) have been investigated [17]. Fraud detection techniques, for example in the healthcare domain, are continuously evolving and are put into practice in many business fields: user behavior is monitored in order to detect and prevent suspicious or undesirable activity. Various machine learning techniques for fraud detection are reviewed in [14]. The classic k-NN can be applied in customer relationship processes to filter prospective buyers of a particular product or service more efficiently, by classifying them as buyers or non-buyers [1]. Different modifications based on the classic k-NN method have been proposed in the literature [3], [4], [18].

Methodology
The process of classifying big data sets passes through several stages [12]. First, the data set is prepared, and this is done in one of two ways. In the first, the main set is divided into two subsets, called training and test. The model is constructed (trained) on the training subset and then checked on the test subset to determine how well it predicts the observations in it; the test subset is unknown to the model. In the second approach, the given set is divided into three subsets: training, test and validation. The validation subset does not change throughout the process, while the training and test subsets are treated as in the first case; observations may move from the training to the test subset and vice versa. The aim is to find a model, built on the training subset, with the best performance on the test subset. The model found is then validated on the validation subset. Observations from the validation subset are new to the built model, so the properties of the built model can be evaluated on them. In this paper we propose several algorithmic modifications of the k Nearest Neighbors method for obtaining a high-quality k-NN model that achieves better efficiency, in terms of accuracy and F1 score, when classifying test observations than the standard k-NN algorithm. To validate the results obtained from the proposed k-NN modifications and to assess their performance, their predictions are tested on publicly available datasets and compared with the standard k-NN algorithm. In this investigation we use the Python technology stack. The predictive power of a model is estimated or verified through existing criteria; different criteria have different degrees of relevance for different classification methods. The most commonly used criterion is implemented through the score command in Python, and it compares the actual and predicted values of the dependent variable.
This criterion can be used for any subset in a classification task: the training, test or validation subset, or the entire (total) set. However, such a criterion only has real strength and significance when applied to observations unknown to the model; it is not recommended to apply it to the training subset, since the model is built on that same subset. Model evaluation is performed in the third stage of the modelling process. The numeric score value is a number between zero and one, and the goal is to find a model whose score on the test (or validation) subset is close to one. During the second stage, the parameters of the classification method are selected so that the constructed model shows high predictive power. The objectives of the study are to demonstrate a variety of k-NN algorithms with different data preprocessing, and to describe their experimental properties. Because we apply the method of nearest neighbors in this paper, we describe the methodology for its application in the terminology used. The information technology we work with is the Anaconda environment with Python 2.7. The k Nearest Neighbors method builds a model based on observations that are pre-classified, i.e. already divided into classes [7], [10]. The built model should predict to which class each new observation belongs, based on the classes of the nearest neighbors of that observation. For example, if the existing observations fall into two classes and a new observation arrives, the k observations nearest to the new one are found (k is the number of neighbors and is determined in advance), the classes of these nearest neighbors are examined, and the new observation is assigned to the predominant class. To implement the algorithm, the set of observations is divided into two subsets: a training subset and a test subset.
In addition, in some situations, even when the training set is "big enough," finding nearest neighbors can be done very quickly.
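The standard two-subset workflow described above can be sketched as follows. This is a minimal illustration using the current scikit-learn API and the built-in Iris dataset as a stand-in (the paper itself works in Python 2.7, where some module paths differ); the parameter values are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Divide the main set into a training and a test subset;
# test_size determines the sizes of the two subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Build (train) the model on the training subset ...
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# ... and check it on the test subset, which is unknown to the model.
accuracy = knn.score(X_test, y_test)
print(accuracy)
```

The score value lies between zero and one, and the closer it is to one on the test subset, the higher the predictive power of the built model.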

Algorithms
First Approach (Algorithm 1). We divide the main set with the train_test_split command into a training and a test subset; the parameter test_size determines the sizes of the two subsets. We choose the number of neighbors, i.e. the value of k, train the model on the training subset and check it on the test subset. We run the algorithm several times; in each execution the results differ, because each execution divides the training and test sets differently. The algorithm described above is the standard one, most commonly used to classify big data sets; it appears in many Internet applications and is detailed in a number of books and web resources.

Second Approach (Algorithm 2). The basis of this algorithm is the idea that the value of k changes, as well as the observations that form the training and test subsets. We organize a loop that divides the main set as in Algorithm 1, and remember the value of k that produces the highest score on the test subset. As a result, we determine the number of neighbors k* for which the model reaches the highest predictive power.

Third Approach (Algorithm 3). The main set is divided into two subsets, training and test, as in Algorithm 1. Next, we organize a loop over the number of neighbors k. As a result, for the already divided set, we find the best value k*, at which the model reaches the highest score on the test subset. We run the algorithm several times; in each execution the results differ because the subsets are different.

Fourth Approach (Algorithm 4). In this algorithm, we first organize a loop over k to determine the number of neighbors. Within the loop, we divide the main set into training and test subsets using the idea developed and described in detail in Ivanov and Tanov's textbook [12]; that is, the data are reorganized differently before the model is constructed.
In short, this is achieved by the command kf = KFold(len(X), n_folds=6, shuffle=True), which divides the base set X into n_folds equal parts (in this case n_folds = 6). The division depends on the logical variable shuffle. When this variable is true, the observations are randomly assigned to the n_folds parts; each execution of the KFold command then produces a different division, and hence a different model with different predictive power. If the variable is false, the observations retain their order from the main set in each subset, and each subsequent execution of the KFold command produces the same results, because the division of the base set does not change.

Fifth Approach (Algorithm 5). The goal here is to divide the main set into subsets and to choose the number of neighbors so that we obtain the highest score on the test subset. In this algorithm, we begin by dividing the main set with the command kf = KFold(len(X), n_folds=6, shuffle=True). We organize a loop, repeated n_folds times, to determine a training and a test subset.
Within this loop, we define an inner loop over the number of neighbors. As a result, we find a division of the main set and a number of neighbors for which the score is highest.
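The nested-loop structure of Algorithm 5 (of which Algorithms 2-4 can be seen as structurally simpler variants) can be sketched as below. This is an illustrative sketch only: it uses the modern scikit-learn KFold signature (n_splits instead of the older n_folds argument used in the paper's Python 2.7 code), the Iris dataset as a stand-in, and an arbitrary range of k.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Divide the main set X into n_folds equal parts (shuffle=True gives
# a random assignment of observations to the parts).
kf = KFold(n_splits=6, shuffle=True, random_state=0)

best = (None, None, -1.0)  # (fold index, k, score)

# Outer loop: each of the n_folds divisions in turn serves as the
# training/test pair. Inner loop: the number of neighbors k.
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    for k in range(1, 11):
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X[train_idx], y[train_idx])
        score = knn.score(X[test_idx], y[test_idx])
        if score > best[2]:
            best = (fold, k, score)

best_fold, best_k, best_score = best
print(best_fold, best_k, best_score)
```

The pair (division, k) with the highest test score is kept as the result, exactly as described for Algorithm 5 above.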

Experiments and discussions
In this section we conduct experiments to compare the performance of the data classification algorithms described above. The algorithms are applied to the data sets listed in Table 1. The data sets are pre-classified, i.e. the observations are divided into classes. With each algorithm we build a model that we test on the test subset; it can then be used to classify new observations whose classes are unknown. We ran the experiments on the Anaconda and Python 2.7 software platform on a 1.81 GHz Pentium® Dual CPU computer. To estimate the effectiveness of the considered algorithms, we compare the success rate of each, presented via the score coefficient obtained with the score command and the F1-score coefficient from the classification report.
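The two success measures used in the comparison can be obtained as sketched below; the dataset and parameters are illustrative stand-ins, and the modern scikit-learn module paths are used rather than the Python 2.7 ones.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)
acc = knn.score(X_test, y_test)                    # the score coefficient
f1 = f1_score(y_test, y_pred, average="weighted")  # the F1-score coefficient
print(classification_report(y_test, y_pred))       # per-class precision/recall/F1
```

The score coefficient is the fraction of test observations whose predicted class matches the actual class, while the F1 coefficient combines precision and recall.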
At the core of all algorithms is the standard k-NN procedure KNeighborsClassifier, described and used in Python's sklearn.neighbors library (see [16]). The main commands are as follows: the command defining the model, knn = KNeighborsClassifier(n_neighbors=...); the training command knn.fit(X_train, y_train), applied to the training subset (X_train, y_train); and the command determining the score coefficient, knn.score(X_test, y_test), applied to the test subset (X_test, y_test). The results of applying the described algorithms are presented in Table 2; the knn.score() values reached by each algorithm are given in the corresponding columns of Table 2. There are many similar investigations in the literature. For example, the liver data set has been analyzed in [2], where the k-NN classifier is used both with all features and with a feature selection approach; the accuracies achieved by these two algorithms are 69.08% and 75.04%, respectively. The accuracy reported in [15] for classification of the liver dataset is 69.58%. Our results in Table 2 show that the best accuracy, 82.3%, is achieved by Algorithm 4.
From the experiments conducted and their analysis, we can draw conclusions. Algorithms 4 and 5, based on the proposed modifications of the k nearest neighbors method (named the Mk-NN modification), reach a higher value (reliability) of the score coefficient, i.e. higher matching rates between predicted classes and actual observation classes. To compare algorithms 4 and 5 directly, it is sufficient to choose the variable shuffle = False, which makes the execution of the command kf = KFold(len(X), n_folds=6, shuffle=False) deterministic: in each execution, the division of the set X into n_folds equal parts is the same for both algorithms 4 and 5.
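The determinism of the division with shuffle=False can be checked with a small sketch (modern KFold signature, toy data as a stand-in for the main set X):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(24).reshape(12, 2)       # a toy "main set" of 12 observations

kf = KFold(n_splits=6, shuffle=False)  # deterministic division, order preserved

# Two executions of the division produce identical test folds,
# so Algorithms 4 and 5 see exactly the same n_folds parts.
first = [tuple(test) for _, test in kf.split(X)]
second = [tuple(test) for _, test in kf.split(X)]
print(first == second)
```

With shuffle=True and no fixed random seed, by contrast, each execution would yield a different division.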
Under these conditions we perform an additional experiment on the same data sets; the results can be seen in Table 3. Table 3 shows that the results of the two algorithms coincide, i.e. it does not matter whether we first determine the number of neighbors or first divide the main set into training and test subsets.

Conclusion
In this paper we compare different algorithms for applying the k-NN method that strive to achieve better efficiency in classifying test observations. The methodology shows that the algorithms achieve better values of the controlled score parameters. The pair of train/test subsets that provides the highest score, in terms of correctly predicted test observations against all test observations, is selected for the final modelling of the dataset in question. Experiments were carried out comparing modifications of the k nearest neighbors algorithm for classifying big data sets. Algorithms 4 and 5 clearly stand out for the greater reliability of the built models.