Empirical Bayesian Binary Classification Forests Using Bootstrap Prior

In this paper, we present a new method called Empirical Bayesian Random Forest (EBRF) for the binary classification problem. The prior ingredient for the method was obtained using the bootstrap prior technique. EBRF explicitly addresses the low-accuracy problem of the Random Forest (RF) classifier that arises when the number of relevant input variables is small relative to the total number of input variables. The improvement was achieved by replacing the arbitrary subsample variable size with an empirical Bayesian estimate. An illustration of the proposed and existing methods was performed using five high-dimensional microarray datasets drawn from colon, breast, lymphoma and Central Nervous System (CNS) cancer tumours. Results from the data analysis revealed that EBRF provides appreciably higher accuracy, sensitivity, specificity and Area Under the Receiver Operating Characteristic Curve (AUC) than RF in most of the datasets used.


Introduction
Recent advancement in technology has made the collection of big datasets, referred to as high-dimensional data in statistical parlance, possible [1]. High-dimensional data, popularly described as the "large p, small n" syndrome, often arise in most areas of research, especially in genomic studies [2][3]. Several techniques for handling high-dimensional data have been proposed in different areas of research. The methodologies differ from each other, but their universal standpoint is to find a way to analyse high-dimensional data better. [4] identified the need for developing robust methods for high-dimensional data. Classical methods like ordinary least squares, logistic regression, etc. often break down due to an ill-conditioned design matrix when p ≫ n. [2] described two major approaches for analysing high-dimensional data, namely: modification of classical (n > p) procedures to accommodate high-dimensional data, or development of a new strategy. Modification of procedures involves moving from complex models to simple models by selecting relevant subsets of the variables. Single classifiers such as Support Vector Machine (SVM), Linear Discriminant Analysis (LDA), or Naïve Bayes have been used for handling high-dimensional data [5][6][7]. However, in recent times, ensemble algorithms have been shown to be a better alternative to single classifiers, especially when multi-modal variables are grouped [3]. [3], among others, claimed that Random Forest (RF) produced the highest accuracies in most scientific applications.
Random Forest (RF), developed by [8], is an ensemble statistical learning method designed to improve the predictive accuracy of decision trees. It is one of the most popular ensemble algorithms and has been applied in many fields. It is widely applicable because of its distribution-free assumption, modelling of nonlinear effects, computational speed and direct applicability to high-dimensional datasets [9]. It also allows model interpretation via variable importance measures, which makes it preferable to other black-box models like Support Vector Machine (SVM) and Artificial Neural Network (ANN) [2]. The random forest algorithm involves selecting subsets of the training dataset as well as subsets of the variable space to build classification trees (CART, [10]). Among these strengths of RF lies a weakness: the determination of the relevant input variables to be used at each splitting step [11][12]. Random subset selection of variables leads to a hypergeometric probability model. The hypergeometric probability of drawing a relevant variable decreases with the number of relevant variables in the predictor space, and this reduction in turn decreases RF accuracy. There have been many improvements on RF over the years, especially concerning the random subset of input variables. [13] made one of the earliest improvements on RF after the original paper in 2001. He questioned the use of the Gini Index (GI) criterion proposed by [10] for selecting variables at the splitting stage and replaced GI with the Gain Ratio (GR). The improvement only worked in some specific situations, such as low-dimensional data. Following the drawback observed in the work of [13], [14] proposed Meta Random Forests. The idea behind their algorithm is to use random forests themselves as base classifiers for building ensembles; Meta Random Forests are developed using bagging and boosting approaches. The performances of those two new models were tested and compared with the original random forest algorithm, and among the models considered, the bagged random forest produced the best results. [15] improved RF by using a combination of an attribute evaluator method and an instance filter method. Their approach involves preliminary variable selection before applying RF, on the premise that selecting the subset of variables before applying RF increases the chance of choosing relevant, informative variables. Their framework can be regarded as a filter approach, which is prone to false-negative and false-positive issues [16]. A false positive is not as crucial as a false negative in this scenario, since a second-stage subset selection is embedded in RF. A false negative, however, is an important issue: it is the case where relevant variable(s) have been dropped at the preliminary feature selection stage. [15] used three different filter methods, namely Correlation-based Feature Selection (CFS), Symmetrical Uncertainty (SU) and GR; their performance analysis showed equal strength for the three methods. Also, their approach violates the fundamental principle of RF, as subset selection is already embedded in it. Apart from non-probabilistic modifications of RF, Bayesian approaches have also been proposed. One of the first Bayesian approaches to RF is Bayesian Additive Regression Trees (BART) [17][18]. Bayesian methods are an emerging solution to most real-life problems because they model uncertainty in parameters [1]. As RF follows from ensembles of CART, BART is an ensemble of Bayesian CART [19].
BART is similar in spirit to boosting but motivated by RF. BART provides appealing results on low-dimensional data but fails in handling high-dimensional data. [1] illustrated the computational inefficiency of BART, which arises because a full Bayesian probability modelling scheme is used for building each tree. Also, BART is more a modification of boosting than of RF, as tree priors are specified such that trees with low information about the classification of each class are boosted. The full modelling of the decision tree structure in BART leads to slow computation. [1] addressed the issue by providing a full probabilistic model for the sum of trees rather than for each individual tree. Their approach is motivated by Bayesian model averaging and is thus referred to as Bayesian Additive Regression Trees using Bayesian Model Averaging (BART-BMA). Their method is faster than BART, but the accuracy observed in a drug discovery example is no different from RF and BART; this implies that BART-BMA only improves algorithmic time as well as interpretability. A simplified approach to Bayesian random forests, called Bayesian Forest (BF), was proposed by [20]. Their approach focuses on modifying the training sample selection rather than the input variable selection: traditional RF uses the bootstrap procedure of [21], while BF uses the Bayesian bootstrap of [22]. They showed that BF is not better than RF except in improving the interpretability of RF. They further extended BF to the Empirical Bayesian Forest (EBF), motivated by building hierarchical modelling stages of RF. Empirical Bayes is a well-known framework for fast approximate Bayesian inference [23][24]. They only use EBF to produce an approximate estimate for BF in situations where the full BF could not be achieved, and they conclude that EBF is not better than BF and that both are in turn not better than RF in most applications reviewed. The various Bayesian modifications, as well as the non-probabilistic (frequentist) approaches, fail to handle the flaws of RF, especially for high-dimensional data. Therefore, in this paper, we present an improved random forest classifier for binary class data with an update on the splitting stage and sample selection. Specifically, we replaced the hypergeometric probability weights with empirical Bayes weights. The Bayes weight is motivated by the posterior density of the hypergeometric probability of selecting any relevant variable. The Bayesian inference for the approach was driven by hybridizing the bootstrap prior technique of [24] with the empirical Bayes approach.

Random Forest (RF)
Given a training dataset $D = \{(y_i, x_{i1}, x_{i2}, \dots, x_{ip}),\ i = 1, 2, \dots, n\}$, where $y_i$ is a binary outcome that assumes values $y_i = 0, 1$ and $x_i$ is the vector of $p$ input variables, the random forest algorithm automatically decides on the splitting variables and splitting points by partitioning the response into $R_1, R_2, \dots, R_M$ regions. The closest form of model that RF assumes is

$$f(x) = \sum_{m=1}^{M} c_m I(x \in R_m), \qquad (1)$$

where $c_m$ is a constant in region $R_m$. Estimating $c_m$ requires the computation of an impurity function; for the classification case, the commonly used impurity functions are the Misclassification Error Rate (MER), Gini Index, and deviance [8]. Random Forest (RF) then updates the trees built from (1) in two steps: (i) bootstrapping the training dataset $B$ times to obtain a total of $B$ trees; (ii) subsampling $m < p$ variables without replacement at each split step in each tree. Thus, if we denote (1) by $\mathfrak{I}(\hat{c}_m : x \in R_m)$, the RF model is

$$\hat{f}_{RF}(x) = \frac{1}{B} \sum_{b=1}^{B} \mathfrak{I}_b(\hat{c}_m : x \in R_m). \qquad (2)$$

RF has two tuning parameters, the number of trees $B$ and the number of subsampled variables $m$. [8] suggested using at least $B = 200$ and $m = \sqrt{p}$, and established that RF is highly sensitive to $m$. He suggested using cross-validation to choose $m$, but at the expense of computation time. Also, arbitrarily increasing $m$ increases the correlation between adjacent trees, thereby reducing the accuracy of RF; likewise, reducing $m$ increases accuracy but introduces bias. In the face of this dilemma, we introduce here a data-dependent Bayesian approach called Empirical Bayes [24-27; 6] for estimating $m$.
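To make the scale of this subsampling problem concrete, the short sketch below evaluates the hypergeometric probability of a split seeing a relevant variable using `scipy.stats.hypergeom`; the counts `p = 2000` and `theta = 5` mirror the worked example in the next section and are purely illustrative.

```python
from scipy.stats import hypergeom

p = 2000              # total number of input variables (illustrative)
theta = 5             # number of truly relevant variables (illustrative)
m = round(p ** 0.5)   # Breiman's default subsample size, m = sqrt(p) ~ 45

# X ~ Hypergeometric(population p, theta relevant, m draws): the number
# of relevant variables appearing in one random subsample of size m.
X = hypergeom(M=p, n=theta, N=m)

p_none = X.pmf(0)     # chance a split sees no relevant variable at all
print(f"P(no relevant variable in subsample) = {p_none:.3f}")      # ~0.89
print(f"P(at least one relevant variable)    = {1 - p_none:.3f}")  # ~0.11
```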

Empirical Bayesian Random Forest (EBRF)
The likelihood of randomly selecting any $k$ relevant variables in a random subset of total size $m$ is the hypergeometric

$$L(k \mid \theta) = \frac{\binom{\theta}{k}\binom{p-\theta}{m-k}}{\binom{p}{m}}, \qquad (3)$$

where $\theta$ is the number of relevant variables and also the parameter of interest, and $k$ is the sample realization of $\theta$. [28] defined a discrete $(p, \alpha, \beta)$ conjugate distribution as a particular case of the Polya or beta-binomial distribution [29]. Thus, for a hypergeometric likelihood with $k$ target outcomes, the $(p, \alpha, \beta)$ conjugate posterior distribution of $\theta - k$ is beta-binomial with parameters $(p - m, \alpha + k, \beta + m - k)$. (4)

The Bayesian estimate $\hat{\theta}_\Theta$ of $\theta$ is given by the posterior mean of (4),

$$\hat{\theta}_\Theta = k + \frac{(p - m)(\alpha + k)}{\alpha + \beta + m + 1}, \qquad (6)$$

where $\hat{\theta}_\Theta$ is the posterior estimate of the number of relevant variables. Moving from (6), the empirical Bayesian approach here implies $k = \hat{\theta}$, the data-driven estimate, and (6) is redefined as

$$\hat{\theta}_\Theta = \hat{\theta} + \frac{(p - m)(\alpha + \hat{\theta})}{\alpha + \beta + m + 1}. \qquad (7)$$

From (7) we then obtain $\hat{m}_\Theta = \hat{\theta}_\Theta$, so that a subsample of size $\hat{m}_\Theta$ contains the relevant variables. To complete the prior specification, the prior estimate $\hat{\theta}$ of the number of relevant variables is obtained by fitting a hypergeometric distribution to the data. The parameters $(p, \alpha, \beta)$ were fixed at $(p, \frac{p}{2}, \frac{p}{2})$; this prior specification fixes the initial probability that a variable is relevant at half of the entire variable space. The posterior relevance probability is then denoted by $w$. The remaining steps of RF then follow with $m = \hat{\theta}_\Theta$. After selecting the $\hat{\theta}_\Theta$ variables, the impurity function can be obtained using the weighted Gini index

$$G_m^{w} = (1 - w)\sum_{c} \hat{p}_{mc}(1 - \hat{p}_{mc}),$$

where $\hat{p}_{mc}$ is the estimated class probability at each node $m$. A variable with weight $w \to 1$ will correspond to a variable with minimal unweighted Gini index and is therefore useful for the further splitting step. If, on the other hand, $w \to 0$, the variable is not useful and is consequently expected to yield a maximal unweighted Gini index; in this case, the proposed weighted Gini index returns the unweighted Gini index, so that the variable is dropped at the splitting stage. The idea behind this is to control the mixture behaviour of the hypergeometric distribution [30], in which the dominant category determines the estimated category probabilities. RF fails to close this gap by specifying $m = \sqrt{p}$: for example, if $p = 2000$ then $m \approx 45$, which implies taking a random sample of 45 variables to be used in each split. The hypergeometric probability of selecting any relevant variable out of, say, five relevant features is then approximately 0.11. This implies that at each splitting step there is about an 89% chance of selecting no relevant feature. This high probability can be attributed to the small number of relevant variables; RF thus implicitly assumes that the entire predictor or input space is reasonably populated with relevant features. One might think that increasing the subsample size would increase the chance of selecting a relevant variable. This is indeed true, but it would also increase the correlation between adjacent trees, undermining the primary objective of RF. This is the dilemma at the forefront of RF which we tackle in this research.
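A minimal sketch of the subsample-size update follows, assuming the posterior-mean form reconstructed in (6)-(7) and the prior setting $\alpha = \beta = p/2$ described above; the function name and the plug-in value `k_hat` are ours for illustration, not the authors' implementation.

```python
def eb_subsample_size(p: int, m: int, k_hat: float) -> int:
    """Posterior-mean estimate of the number of relevant variables,
    used as the EBRF subsample size m_hat (a sketch of Eq. (7)).

    p     -- total number of input variables
    m     -- initial subsample size (e.g. round(sqrt(p)))
    k_hat -- empirical estimate of the relevant-variable count from a
             preliminary hypergeometric fit (an assumption here)
    """
    alpha = beta = p / 2  # prior: half the variable space is relevant
    theta_hat = k_hat + (p - m) * (alpha + k_hat) / (alpha + beta + m + 1)
    return max(1, round(theta_hat))


# Example: p = 2000 variables, default m = 45, and a preliminary fit
# suggesting k_hat = 5 relevant variables in a subsample.
print(eb_subsample_size(p=2000, m=45, k_hat=5))
```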

Datasets
Five microarray cancer datasets were used to compare the performance of RF and EBRF. The datasets covered colon cancer, breast cancer, lymphoma cancer and CNS cancer.

1.) Colon data:
[31] first analyzed the data to identify biomarkers for colon cancer in 62 subjects based on 2000 gene expression profiles. Two distinct groups are identified: 40 tumorous samples and 22 normal samples.

2.) CNS data:
The Central Nervous System Embryonal Tumour data were analyzed by [32] to identify biomarkers for CNS tumour in 34 subjects based on 7128 gene expression profiles. Two distinct groups are identified: 25 classic (C) samples and 9 desmoplastic (D) samples.

3.) Breast data (ER):
[33] first analyzed the data to identify biomarkers for breast cancer in 49 subjects based on 7129 gene expression profiles. Two distinct groups are identified: 25 negative Estrogen Receptor (ER−) and 24 positive Estrogen Receptor (ER+) samples.

4.) Breast data (prognosis):
[34] analyzed the data to identify biomarkers for breast cancer in 168 patients based on 2905 gene expression profiles. Two distinct groups are identified; Good: 111 patients with no event after five years of diagnosis, and Poor: 57 patients with early metastasis.

5.) Lymphoma data:
[35] analyzed the data to identify biomarkers for lymphoma cancer in 77 subjects based on 6817 gene expression profiles. Two distinct groups are identified: 58 Diffuse Large B-cell Lymphoma (DLBCL) and 19 Follicular Lymphoma (FL) samples.

Performance Comparison
The performance criteria used to compare the two methods are sensitivity, specificity, accuracy, balanced accuracy and Area Under the Receiver Operating Characteristic curve (AUC). The metrics were computed using the confusion matrix shown in Table 1, where TN represents True Negative, FP is False Positive, FN represents False Negative, and TP is True Positive. Also, N* is the total predicted negative and P* represents the total predicted positive; similarly, N is the total actual negative while P is the total actual positive. T represents the total number of observations, equivalent to

$$T = TN + FP + FN + TP.$$

Here, negative means normal cells while positive means tumour cells. The class-specific and overall classification metrics used can be defined as follows [7,36]:

$$Sensitivity = \frac{TP}{P}, \quad Specificity = \frac{TN}{N}, \quad Accuracy = \frac{TP + TN}{T}, \quad Balanced\ Accuracy = \frac{1}{2}\left(\frac{TP}{P} + \frac{TN}{N}\right).$$
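For concreteness, a small sketch that computes these metrics directly from the confusion-matrix counts of Table 1; the counts in the usage example are hypothetical.

```python
def classification_metrics(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Compute the metrics above from confusion-matrix counts."""
    pos, neg = tp + fn, tn + fp       # actual positives P, negatives N
    total = pos + neg                 # T = TN + FP + FN + TP
    sensitivity = tp / pos            # true positive rate
    specificity = tn / neg            # true negative rate
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": (tp + tn) / total,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Hypothetical counts for illustration:
print(classification_metrics(tn=18, fp=4, fn=6, tp=34))
```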

Results and Discussion
In this section, we illustrate the application of Empirical Bayesian Random Forest (EBRF) on five published real datasets. Table 1 presents the datasets, a subset of the 22 datasets in the "datamicroarray" package in R [37]. The performance metrics were computed from 10-fold train/test cross-validation. The metrics used are accuracy, balanced accuracy, sensitivity, specificity, and area under the receiver operating characteristic curve (AUC) [36]. For each of the five datasets, 10 independent train/test splits were generated by randomly selecting 9/10 of the data as a training set and the remaining 1/10 as a test set; repeating this 10 times gave 10 × 10 = 100 train/test splits. Based on each training set, each method was used to predict the corresponding test set and was evaluated by its predictive performance. A similar approach was used in [5; 37-40]. Fig. 1 presents the detailed comparison of $P(k \geq 1)$, the probability of selecting at least one relevant variable, for the various datasets. Table 4 shows the performance of the two methods across the five datasets used. On average, the update on RF largely improved the overall predictive performance; this large improvement can be attributed to the optimal number of subsampled variables used by EBRF. EBRF also enhanced the class-specific performance, which is an essential metric in medicine. On average, EBRF is as sensitive as it is specific: it can correctly identify the presence of disease at least 78% of the time as well as correctly identify the absence of disease at least 80% of the time. Also, the false alarm rate (1 − specificity) for EBRF is approximately half that of RF (Fig. 2).
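The evaluation protocol above can be approximated with scikit-learn as a stand-in for the authors' implementation: an EBRF-style variant is mimicked simply by passing an empirical Bayes estimate to `max_features`, and a synthetic "large p, small n" dataset replaces the microarray data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder high-dimensional data; the paper uses five microarray
# datasets instead.
X, y = make_classification(n_samples=100, n_features=2000,
                           n_informative=5, random_state=1)

m_default = round(np.sqrt(X.shape[1]))  # RF default, ~45
m_eb = 965                              # stand-in EB estimate (illustrative)

for name, m in [("RF", m_default), ("EBRF-style", m_eb)]:
    clf = RandomForestClassifier(n_estimators=200, max_features=m,
                                 random_state=1)
    scores = cross_val_score(clf, X, y, cv=10, scoring="roc_auc")
    print(f"{name:11s} mean AUC = {scores.mean():.3f}")
```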

Conclusion
In this paper, an attempt is made to revise the Random Forest (RF) algorithm by updating the variable subsampling method it uses. We replaced the arbitrary subsample variable size with an optimal empirical Bayes estimate. The results from the performance analysis revealed the high predictive strength of the new method. In almost all the datasets used, the new method largely improved the overall and class-specific accuracy of predicting the disease outcome. Also, a relatively lower false alarm rate was achieved with the new approach in all the datasets used.