Analyzing Climate Variability in Malaysia Using Association Rule Mining

Previous surveys have shown that data mining can be used for climate prediction, with clustering and classification being the techniques most often applied to build models that predict changes in the climate. Unlike climate change, climate variability refers to fluctuations in climate that occur from year to year. This study examines the effectiveness of Association Rule Mining (ARM) techniques in predicting climate variability events in Malaysia. This report explains how patterns in climate data are discovered using ARM and how the extracted patterns are used to predict climate variability. It also presents a framework that describes how ARM generates rules and extracts patterns from the data, and how those rules and patterns are used to develop a model for predicting climate variability events.


Introduction
Knowledge Discovery in Databases consists of several processes, one of the main ones being data mining [1], which aims to discover previously unknown information in massive amounts of data. Data mining differs from standard statistical methods in that it can be programmed to find hidden knowledge without relying on prior information about the data; the unknown knowledge that exists in the data is obtained through data mining tasks [2]. Data mining comprises several techniques that can be applied individually or in combination to perform sophisticated processes. Classification is a systematic approach to building a classification model from a set of input data [3]. It is used to assign the data in a database to classes based on certain criteria [4]. The algorithms commonly used for classification include neural networks and decision trees [5]. Clustering is a technique in which the data are divided into groups of similar objects; each cluster groups data that share certain features [6]. Clustering algorithms include K-means and hierarchical clustering [7]. ARM is often used to find patterns in itemsets. It aims to extract interesting relationships, frequent patterns and correlations among the sets of items in a data repository [8]. ARM has been applied to many real-world problems, such as finding patterns in documents [9][10], predicting floods [11], analyzing trends in social networks [12] and monitoring elderly people [13]. Much research has examined the suitability of data mining for climate prediction. The most commonly used techniques are ARM, classification and clustering, and these methods are widely studied in the search for the most appropriate and accurate techniques for modeling climate forecasts.
Such a search is needed because climate forecasting is a complex analysis involving many weather and climate elements, such as rainfall, wind speed, temperature and humidity [14]. ARM is capable of identifying the relations that exist within datasets. Of the several ARM algorithms, Apriori and FP-Growth are the most popular [15][16]. Many researchers have used ARM to develop models for climate prediction.
Models for climate prediction have been developed by applying ARM to historical data [17][18]. In classification and clustering, the data are divided into specific groups. ARM is different: patterns and rules are obtained from relationships discovered within the database [19]. This makes the method useful for climate prediction, because climate data consist of many elements and factors.

Association Rule Mining
In ARM, rules are measured by their support and confidence values. For a rule A → B, where A and B are itemsets, the support of A is measured as:

$$\mathrm{Supp}(A) = \frac{\text{number of transactions containing } A}{T}$$

Supp(A) shows how often the itemset occurs relative to the total number of transactions (T).
Confidence is the proportion of the transactions containing A that also contain B. The confidence value is calculated as:

$$\mathrm{Confidence}(A \rightarrow B) = \frac{\mathrm{Supp}(A \cup B)}{\mathrm{Supp}(A)}$$

Therefore, ARM extracts the rules that satisfy:

$$\mathrm{Supp}(A \cup B) \geq minsup \quad \text{and} \quad \mathrm{Confidence}(A \rightarrow B) \geq minconf$$

where minsup and minconf are the corresponding support and confidence thresholds.
The lift value of the rules is also analyzed in this study; it is another ARM parameter that can be used in the analysis. Lift is calculated as:

$$\mathrm{Lift}(A \rightarrow B) = \frac{\mathrm{Confidence}(A \rightarrow B)}{\text{Expected Confidence}}$$

where the expected confidence is Supp(B), the confidence the rule would have if A and B were independent. A lift greater than 1 therefore indicates that A and B occur together more often than independence would suggest. In this way, the lift value tells how much more likely the consequent of a generated rule is when its antecedent holds.
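As a minimal sketch of how support, confidence and lift are computed, the Python snippet below evaluates the three measures on a handful of made-up transactions. The item names and the transactions are hypothetical and serve only to illustrate the formulas; they are not taken from the study's data. The expected confidence is taken to be Supp(B), the usual definition.

```python
# Hypothetical toy transactions: each set holds the discretized conditions
# observed in one month. Item names are illustrative, not from the study.
transactions = [
    {"temp_low", "humidity_high", "rain_days_high"},
    {"temp_low", "humidity_high", "rain_days_high"},
    {"temp_high", "humidity_low", "rain_days_low"},
    {"temp_low", "humidity_low", "rain_days_high"},
    {"temp_high", "humidity_high", "rain_days_low"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Supp(A u B) / Supp(A); assumes the antecedent occurs at least once."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """Confidence(A -> B) / Supp(B), with Supp(B) as the expected confidence."""
    return (confidence(antecedent, consequent, transactions)
            / support(consequent, transactions))


A, B = {"temp_low"}, {"rain_days_high"}
print("support   :", support(A | B, transactions))
print("confidence:", confidence(A, B, transactions))
print("lift      :", lift(A, B, transactions))
```

In this toy data every month with a low temperature also has many rain days, so the rule has a confidence of 1 and a lift above 1. The helpers do not guard against an antecedent or consequent that never occurs, which would cause a division by zero.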

Climate Variability Prediction Framework (CVPF)
The CVPF is developed to extract rules from climate data variables and to cluster the rules according to their characteristics. The framework (see Figure 1) consists of Stage (i): Data Processing, Stage (ii): Rule Analysis and Stage (iii): Prediction Model. Data cleaning and data normalization are carried out in Stage (i). In data cleaning, inaccurate data are identified and then replaced, modified or deleted from the dataset. The cleaned data are then passed to the normalization process, in which the dataset is organized by columns and tables to reduce data redundancy and thereby improve data integrity. When handling time series data, the values of each attribute are divided into a number of sub-ranges during normalization. A small number of sub-ranges is usually chosen because it produces fewer columns in the output data, which in turn improves computational efficiency. The larger the number of sub-ranges, the more output attributes are generated during normalization, and this reduces the efficiency of the ARM algorithm used in the analysis stage. In Stage (ii), the data are analyzed using ARM, and meaningful rules are extracted from the datasets using the FP-Growth algorithm. FP-Growth is known as one of the most popular and fastest association rule algorithms because it uses a compact data structure (the FP-tree) and eliminates the need for repeated database scans [20]. Because the information in the dataset is greatly compressed in this structure, FP-Growth is used in this study to extract rules and patterns for the prediction model. The relevant rules and patterns are identified, and the selected rules are then grouped into clusters in the rule-based clustering process.
All significant rules are clustered according to their features and similarity. In Stage (iii), a prediction model is built from the results of Stage (ii). The model is then tested using historical climate data, and its results are evaluated to measure the accuracy of the developed model.
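Stages (i) and (ii) can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation used in the study: the monthly records are made up, the equal-width binning is only one possible way of dividing values into sub-ranges, and a brute-force itemset enumeration stands in for FP-Growth (it finds the same frequent itemsets on small data but without FP-Growth's efficiency). The thresholds of 15% support and 70% confidence are the ones reported in the Results section.

```python
from itertools import combinations

# Hypothetical monthly records (made-up values, not the study's data).
records = [
    {"temperature": 27.1, "humidity": 81.0, "rain_days": 22},
    {"temperature": 27.4, "humidity": 80.5, "rain_days": 21},
    {"temperature": 28.9, "humidity": 73.0, "rain_days": 12},
    {"temperature": 27.2, "humidity": 82.0, "rain_days": 23},
    {"temperature": 29.1, "humidity": 72.5, "rain_days": 10},
    {"temperature": 28.7, "humidity": 74.0, "rain_days": 14},
]


def discretize(records, n_bins=2):
    """Stage (i): equal-width binning into items such as 'temperature=bin0'."""
    bounds = {k: (min(r[k] for r in records), max(r[k] for r in records))
              for k in records[0]}
    transactions = []
    for r in records:
        items = set()
        for key, value in r.items():
            lo, hi = bounds[key]
            width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant columns
            index = min(int((value - lo) / width), n_bins - 1)
            items.add(f"{key}=bin{index}")
        transactions.append(items)
    return transactions


def frequent_itemsets(transactions, min_support):
    """Stage (ii): brute-force search for itemsets with support >= min_support."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            supp = sum(set(candidate) <= t for t in transactions) / n
            if supp >= min_support:
                frequent[frozenset(candidate)] = supp
                found = True
        if not found:
            break  # no frequent itemset of this size, so none of a larger size
    return frequent


def association_rules(frequent, min_confidence):
    """Return (antecedent, consequent, support, confidence, lift) tuples."""
    rules = []
    for itemset, supp in frequent.items():
        for k in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), k):
                a = frozenset(antecedent)
                b = itemset - a
                conf = supp / frequent[a]
                if conf >= min_confidence:
                    rules.append((a, b, supp, conf, conf / frequent[b]))
    return rules


transactions = discretize(records, n_bins=2)
rules = association_rules(frequent_itemsets(transactions, 0.15), 0.70)
for a, b, supp, conf, lift in sorted(rules, key=lambda r: -r[4])[:5]:
    print(sorted(a), "->", sorted(b), f"supp={supp:.2f} conf={conf:.2f} lift={lift:.2f}")
```

Raising `n_bins` produces more distinct items per attribute, which enlarges the search space. This is the efficiency trade-off described for the number of sub-ranges.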

Results and Discussion
Historical weather data for Petaling Jaya, Selangor, are used in this study. The data were collected from the Institute of Climate Change, The National University of Malaysia. The monthly datasets for 2013 to 2015 consist of humidity, temperature, wind speed, rainfall and number of rain days. During the experiment, the generated rules were observed. Several support and confidence values were tried, and a support value of 15% with a confidence value of 70% yielded the most significant and meaningful patterns. The details of the ARM experiment, based on the number of rules produced and their lift and confidence values, are shown in Table 1. Table 1 shows that ARM extracted more rules from the 2013 and 2014 data. In both years, the extracted rules give more information on the relationships between variables and produce more significant patterns for use in the prediction model. The lift value indicates the strength of a rule: rules with higher lift are more reliable for prediction. The results show the rules and patterns extracted from the dataset, and the significant rules give detailed climate-related information that can be used to predict seasonal changes in Malaysia. In Table 2, the generated rules show the features in the itemset that predict rainfall and rainy days. In Table 4, significant rules that show detailed features of rainy days were chosen, and these rules show that:
• In 2013, the number of rainy days is >=21.6, and the related features during the rainy days are a temperature of <27.62℃, a wind speed of <1.06 m/s, a humidity of <82.3% and a total rainfall of <527.80 mm.
• In 2014, the number of rainy days based on the generated rules is >=21.4. During the rainy season, the related variables recorded are a temperature of <27.46℃, a wind speed of <0.94 m/s, a humidity of <82.5% and a rainfall of <624.00 mm.
• In 2015, the rules show that rain falls on 17 to 20 days, and the related variables during the rainy days are a temperature of <28.64℃, a humidity of <74.619%, a wind speed of <1.2 m/s and a rainfall of <437.96 mm.
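To illustrate how such a rule might be applied, the sketch below checks whether a monthly observation satisfies the conditions of the 2013 rainy-day rule. The thresholds are the ones listed above; the function name, the field names and the two sample months are hypothetical and are not part of the study.

```python
# Upper bounds from the 2013 rainy-day rule reported in the text.
# The dictionary keys are illustrative field names, not the study's.
RULE_2013 = {
    "temperature_c": 27.62,
    "wind_speed_ms": 1.06,
    "humidity_pct": 82.3,
    "rainfall_mm": 527.80,
}


def matches_rule(record, rule):
    """True when every variable in the rule is strictly below its threshold."""
    return all(record[name] < limit for name, limit in rule.items())


# Hypothetical monthly observations, made up for illustration.
wet_month = {"temperature_c": 27.1, "wind_speed_ms": 0.9,
             "humidity_pct": 81.0, "rainfall_mm": 450.0}
dry_month = {"temperature_c": 28.9, "wind_speed_ms": 1.3,
             "humidity_pct": 73.0, "rainfall_mm": 180.0}

for label, month in [("wet_month", wet_month), ("dry_month", dry_month)]:
    print(label, "matches the 2013 rule:", matches_rule(month, RULE_2013))
```

Because the reported conditions are upper bounds only, a match is best read as an indication that the month resembles the rainy-season pattern rather than as a confirmed prediction.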
From the results, rules and patterns with similar features are identified and used to predict events caused by climate variability. In Figure 2, the generated rules show similar variable values that can be used to predict the rainy season. For all three years, the patterns show that the rainy period lasts between 19 and 21 days. From this pattern, we can conclude that Petaling Jaya is expected to receive a large amount of rainfall, from 440 mm to 620 mm, during the rainy season. The high rainfall during this period could put the public at risk, especially of flash floods. Prediction of a long rainy period can therefore be used by the authorities to anticipate events such as flash floods and landslides. They can take precautionary actions such as ensuring that river water levels are safe and maintaining a good drainage system in Petaling Jaya.
Meanwhile, in Figure 3, the resulting rules show the amount of rainfall received decreasing from 350 mm to only 160 mm during the period. This pattern indicates that a shorter rainy period reduces the amount of rainfall in the area. The shortage of rainfall during this period affects the public water supply, and this information can therefore be used to predict the dry season.
From the analysis conducted, a large number of rules and patterns were extracted from the dataset using ARM. In this experiment, rules and patterns were generated using FP-Growth, and the results were analyzed to identify significant patterns and rules. Based on the analysis, the seasons are predicted from the climate features indicated by the rules. The rules show significant features that can be used to identify types of climate, and Table 4 summarizes the climate features identified in this study.

Conclusion
The aim of this study is to show that ARM can extract significant rules and patterns from climate data. The study also shows that the patterns and rules produced by ARM can be applied to construct a prediction model. The significance of the patterns and association rules is measured by high confidence and lift values. The analysis shows that the rainy and dry seasons can be determined from the rules produced by ARM. However, the study is limited by its use of monthly data. More detailed results could be obtained if more detailed climate data, such as daily data, were used in future work. Future work will concentrate on applying the clustering method to the rules and patterns produced by ARM. In the clustering method, each cluster will be based on the rules' characteristics and behavior. The prediction model will be built on the outcome of the association rule-based clustering method and tested with historical data to measure its ability and accuracy in predicting climate.