class: center, middle, inverse, title-slide # Regularization Methods in Random Forests ## First PhD Project of 2019 ### Bruna Wundervald, PhD Candidate in Statistics ### February, 2019 --- # Outline 1. Introduction 2. Tree-based models - Trees - Random Forests 3. Regularization - Regularization in Random Forests (RRF) - Guided Regularization in Random Forests (GRRF) 4. Exploring the GRRF model - Evaluating the GRRF - Extending the GRRF 5. Conclusions and Final Remarks --- class: inverse, middle, center # 1. Introduction --- # 1. Introduction - In real life regression problems, predictors can be hard or even economically expensive to obtain. - A common issue when dealing with any kind of regression modeling is the subset selection. - Subset selection `\(\neq\)` Shrinkage/regularization <img src="img/arrow.png" width="8%" height="8%" style="display: block; margin: auto;" /> <p style="text-align: center;"><b> In shrinkage, the idea is to 'shrink' the regression coefficients towards zero. </b></p> - Shrinkage *can* lead to variable selection. - In the common usage of regularization, the model is parametric and there are regression coefficients to be 'shrunken'. --- # 1. Introduction - For tree-based methods, there is not yet a standard regularization procedure well established in the literature. - Options exist, but they still need to be evaluated and popularized. - The main goals of this presentation are to understand and explore regularization approaches for trees and random forests, focusing in the one proposed in (Deng and Runger, 2012). --- class: inverse, middle, center # 2. Tree-based models --- # 2. Tree-based models ## Trees Consider a variable of interest `\(Y_i \in \mathbb{R}\)` and `\(\mathbf{x} = (x_{i1},\dots, x_{ip})'\)` the set of predictor features, `\(1 \leq i \leq p\)`. A statistical framework for non-parametric regression characterizes their relationship as `\begin{equation} y_i = f_0(\mathbf{x_i}) + \epsilon_i, \thinspace \epsilon_i \stackrel{iid}\sim N(0, \sigma^2), \end{equation}` where `\(f_0\)` is an unknown regression function. > A regression tree is a flexible non-parametric method that be can applied to both regression and classification problems, based on the estimation of a series of binary conditional splitting statements (`if-else`) on the predictors space in order to create prediction rules. --- ## Trees Each rule has the form: `\(x_j > x_{j,th}\)`, where `\(x_j\)` describes the value of the feature at `\(j\)` and `\(x_{j,th}\)` is the decision cut point. The corresponding regression model predicts `\(Y\)` with a constant `\(c_m\)` in each splitted region `\(R_m\)`, or `\begin{equation} \hat f(\mathbf{x_i}) = \sum_{m = 1}^{M} c_m I\{\mathbf{x_i} \in R_m \}, \end{equation}` where `\(\mathbf{x}\)` represents the set of predictor variables. In a regression scenario, the prediction is a constant value, usually the mean of `\(y\)` in each region and the minimized measure is the residual sum of squares, given by `\begin{equation} RSS_{tree} = \sum_{j = 1}^{J} \sum_{i \in R_j} (y_i - \hat y_{R_j})^2 \end{equation}` where `\(\hat y_{R_j}\)` is the mean response for the observations in the *j*th region of the predictors space. --- # 2. Tree-based models ## Random Forests A simple way to reduce variance in tree models is to take many training sets from the population, build a separate prediction for each tree and average the their final predictions, resulting a equation of the form `\begin{equation} \hat f_{avg}(\mathbf{x}) = \frac{1}{B} \sum_{b=1}^{B} \hat f^{(b)}(\mathbf{x}), \end{equation}` where B is the number of training sets and `\(\hat f^b(x)\)` are the estimated functions. To produce many training sets for trees, bootstrap resampling is a very popular choice . The method that consists in growing B regression trees using B sampled sets and averaging the predictions is called bagging ensemble (Hastie, Tibshirani, and Friedman, 2009) --- ## Random Forests Random forests (Breiman, 2001) are a combination of trees that improves the bagging method by decorrelating the trees, by considering only a sample `\(m\)` of the predictors as splitting the candidate for each tree. - The variables are selected randomly and `\(m\)` is chosen as approximately `\(\sqrt p\)`, with `\(p\)` being the total of available features. This slight change prevents the strong predictors to always appear on the top of the trees. Thereby, the average of the trees using the random forest algorithm will be less variable and, hence, more reliable (Hastie, Tibshirani, and Friedman, 2009). --- ## Random Forests - One characteristic of the random forest model is that feature importance metrics can be obtained in a very easy way: - At each split in a tree, the improvement in the splitting criterion is the importance measure attributed to the respective variable. - Those values are accumulated over all of the trees in the forest, resulting in the final measure of importance for the random forest model. - Variable importance measures facilitate feature selection in random forests, but it does not work like an regularization. - Tree-based methods have an unwanted behaviour when in the presence of highly correlated variables: the importance values are split between the correlated features. --- class: inverse, middle, center # 3. Regularization --- # 3. Regularization In general, regularized regression consists in estimating a penalized function of the form `\begin{equation} \underset{f \in H}{min} \Big[ \sum_{i = 1}^{N} L(y_i, f(x_i)) + \lambda J(f) \Big ], \end{equation}` where `\(L(y, f(x))\)` is the chosen loss function, `\(J(f)\)` is a penalty functional, and `\(H\)` is a space of functions on which `\(J(f)\)` is defined (Hastie, Tibshirani, and Friedman, 2009). Regularization techniques produce models that are more interpretable and can have the same or very similar prediction error as the full model. Also, regularization comes in hand when we need to deal with many correlated features, because it is robust enough to not be influenced by them. --- # 3. Regularization ## Regularization in random forests The usual regularization methods are mainly dedicated to parametric models. For the tree-based models, an option is the one presented in (Deng and Runger, 2012), where the authors **penalize the variable importance values** of each variable for each tree when building a random forest. The simple version of method consists in applying a penalization parameter for each spliting. The regularized information gain `\(Gain_{R}(X_i, v)\)` is written as <img src="img/eq_gain.png" width="58%" height="58%" style="display: block; margin: auto;" /> where `\(F\)` represents the set of indices used in the previous nodes and `\(\lambda \in (0, 1]\)` is the penalization applied to the splitting. When `\(i \notin F\)` (the variable is new to the tree), the `\(i_{th}\)` feature is penalized by splitting at the node `\(v\)`. --- ## Guided regularization in random forests In (Deng and Runger, 2012) the authors also introduce the Guided Regularized Random Forests (GRRF), that leverages the importance scores calculated from an standard random forest model based on all the training data. The scores are used **to guide the next steps** of the regularized forest trough the use a new parameter `\(\gamma\)`. First, a normalized importance measure is obtained with `\begin{equation} Imp_{i}^{'} = \frac{Imp_i}{max_{j=1}^{P} Imp_j}, \end{equation}` where `\(Imp_i\)` is the importance from an standard random forest model and `\(0 \leq Imp_{i}^{'} \leq 1\)` and `\(P\)` is the number of columns of the features matrix. --- ## Guided regularization in random forests Now, instead of using the same penalization coefficient to each variable, the penalization now depend on the importance measures obtained previously, that is <img src="img/eq_gain2.png" width="58%" height="58%" style="display: block; margin: auto;" /> where `\(\lambda_i \in (0, 1]\)` is the coefficient for `\(X_i (i \in \{1,...P\})\)`, that depends on the results of a standard random forest. We have that `\begin{equation} \lambda_i = (1 - \gamma)\lambda_0 + \gamma Imp'_{i}, \end{equation}` being that `\(\lambda_0\)` represents the baseline regularization parameter and `\(\gamma \in [0, 1]\)`. In this fashion, a larger `\(\gamma\)` leads to a smaller `\(\lambda_i\)` and, thus, a larger penalty on `\(Gain(X_i, v)\)` when `\(X_i\)` has not been used in the nodes prior to node v. --- class: inverse, middle, center # 4. Exploring the GRRF model --- # 4. Exploring the GRRF model ## Evaluating the GRRF We tested both methods in simulated data. Using the model equation proposed in (Friedman, 1991) , first we simulated a response variable `\(Y\)` and its relationship to a matrix of predictors `\(\mathbf{X}\)` as `\begin{equation} y_i = 10 sin(\pi x_{i1} x_{i2}) + 20 (x_{i3} - 0.5)^{2} + 10 x_{i4} + 5 x_{i5} + \epsilon_i, \thinspace \epsilon_i \stackrel{iid}\sim N(0, \sigma^2), \end{equation}` where `\(\mathbf{X} \in [0, 1]\)`, meaning that the predictors were randomly drawn from a standard Uniform distribution. So, `\(y\)` is related to each covariable as: --- <div class="figure" style="text-align: center"> <img src="img/friedman.png" alt="Relationship between y and each simulated covariable" width="70%" /> <p class="caption">Relationship between y and each simulated covariable</p> </div> To this group of 5 variables, we also introduced a set of 30 noisy features, with many different levels of correlation, presented in the next slide. --- <div class="figure" style="text-align: center"> <img src="img/corrs_fried.png" alt="Correlations between the noisy covariables." width="3400" /> <p class="caption">Correlations between the noisy covariables.</p> </div> The simulated data was split into train (75%) and test (25%), so we can assess the accuracy of the model in data not seen in the training phase. --- ## Evaluating the GRRF A package for the statistical software `R` called `RRF` was created by the authors of the RRF and GRRF methodologies. We used this package in combination with the `randomForest` and the `rBayesianOptimization` packages to test four model in this data: - a standard random forest model; - a regularized random forest, with the `\(\lambda\)` parameter set to 0.8; - a guided regularized random forest, with `\(\lambda\)` set to 1 and `\(\gamma\)` set to 0.9; - a version of the GRRF with the parameters set to the result of a Bayesian optimisation in both of the parameters; the optimisation was conducted in the combination of the MSR calculated in the test set and the final number of features in the model as we were hoping to find `\begin{equation} \min\limits_{\gamma, \lambda \rightarrow (0, 1]^{2}}\{ MSR_{\gamma, \lambda} + 0.8 \times (\text{# of variables})_{\gamma, \lambda} \}. \end{equation}` *Bayesian optimisation: uses Bayesian models based on Gaussian processes to predict good tuning parameters, by creating a a relationship between the minimized measure and the parameters.* --- <table class="table table-condensed table-hover" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption>Results of the four models fit to the Friedman simulated data added with noisy variables dataset.</caption> <thead> <tr> <th style="text-align:left;"> Model </th> <th style="text-align:right;"> Number of Variables </th> <th style="text-align:right;"> MSR </th> <th style="text-align:left;"> MSR + 0.8 # of Variables </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Random Forest </td> <td style="text-align:right;"> 35 </td> <td style="text-align:right;"> 17.34 </td> <td style="text-align:left;width: 4cm; "> <span style="display: inline-block; direction: rtl; border-radius: 4px; padding-right: 2px; background-color: lightyellow; width: 100.00%">45.34</span> </td> </tr> <tr> <td style="text-align:left;"> Regularized Random Forest </td> <td style="text-align:right;"> 35 </td> <td style="text-align:right;"> 17.30 </td> <td style="text-align:left;width: 4cm; "> <span style="display: inline-block; direction: rtl; border-radius: 4px; padding-right: 2px; background-color: lightyellow; width: 99.91%">45.30</span> </td> </tr> <tr> <td style="text-align:left;"> Guided Regularized Random Forest </td> <td style="text-align:right;"> 17 </td> <td style="text-align:right;"> 17.01 </td> <td style="text-align:left;width: 4cm; "> <span style="display: inline-block; direction: rtl; border-radius: 4px; padding-right: 2px; background-color: lightgreen; width: 67.51%">30.61</span> </td> </tr> <tr> <td style="text-align:left;"> GRRF with Bayesian Optimization </td> <td style="text-align:right;"> 18 </td> <td style="text-align:right;"> 17.36 </td> <td style="text-align:left;width: 4cm; "> <span style="display: inline-block; direction: rtl; border-radius: 4px; padding-right: 2px; background-color: lightyellow; width: 70.05%">31.76</span> </td> </tr> </tbody> </table> The regularization parameters are not the performing the way we expect since: - the RRF model does not even get any regularization, - the GRRF still uses 17 variables, - the Bayesian optimised model is not doing well either. In terms of mean squared residuals, the models are very similar, but the GRRF stands out. Overall, we recognize this model as being the best one, since it produced an MSR of 17.01 with the smallest set of features. --- # 4. Exploring the GRRF model ## Evaluating the GRRF Consider now a real gene expression dataset, for which the purpose is to predict a continuous variable about race horses based on their gene expression. This case is particular interesting for a breeding laboratory, since it's very expensive to obtain all of the gene variables. As the dataset is highly-dimensional, with more than 48000 available predictors and 835 observations, we choose to evaluate the behaviour of the regularization parameters only in the first 1000 variables. We compared different configuration of the RRF and GRRF models by varying the `\(\gamma\)` and `\(\lambda\)` parameters from 0.01 to 0.99: - For each model, we calculated the mean squared residuals and the number of selected variables in the test set (25%), - The number of selected variables was determined by the ones with final importance value `\(>\)` 0. --- <div class="figure" style="text-align: center"> <img src="img/comp_fried.png" alt="Comparison of the number of selected variables and mean squared errors in the GRRF and RRF when varying their regularization parameters" width="80%" height="90%" /> <p class="caption">Comparison of the number of selected variables and mean squared errors in the GRRF and RRF when varying their regularization parameters</p> </div> --- ## Evaluating the GRRF - Results - When `\(\gamma\)` increases in the GRRF, the model gets more regularized, but the MSR does not change much: - A model with less variables is capable of keeping the same quality of bigger models. - The RRF model selects less variables when the value of `\(\lambda\)` is smaller but its MSR behaves almost inversely, decreasing when more variables are used in the model. - The MSR stabilizes after a certain value of `\(\lambda\)`. In general, we observe that the regularization parameters are not working as well as they should. --- # 4. Exploring the GRRF model ## Extending the GRRF We propose and evaluate an extension of the method GRRF: - the extension is given by the change in the regularization criteria. First, we used the real gene expression data calculate the marginal correlations of all of the predictors and the response. Out of the whole original set of variables, we selected the ones with at least some correlation to the response in the train set, for which the cut point was determined as, for `\(\mathbf{x} = (x_{i1},\dots, x_{i48910})'\)`, `\begin{equation} |corr(\mathbf{x_j}, \mathbf{y})| > 0.16, \thinspace \text{for} \thinspace j = 1,\dots,48910. \end{equation}` This produced a subset of 2626, reducing significantly the dimension of the data. With this new smaller dataset, we run the GRRF model 100 times, using `\(\lambda = 0.8\)` and `\(\gamma = 0.9\)`. The models allowed us to obtain a mean importance measure for each variable, used to create a plot of the importance against their marginal correlations to `\(y\)`. --- <div class="figure" style="text-align: center"> <img src="img/vars_corrs.png" alt="Marginal correlations of each predictor and `\(y\)` versus the mean importance calculated in 100 re-runs of the standard GRRF model." width="75%" height="75%" /> <p class="caption">Marginal correlations of each predictor and `\(y\)` versus the mean importance calculated in 100 re-runs of the standard GRRF model.</p> </div> The line at 0.5 in the `\(x\)` axis highlights the intense increase in the importance behaviour after this point. --- ## Extending the GRRF With that noted, we propose a new weighting formula for the predictors as <img src="img/eq_gain3.png" width="66%" height="65%" style="display: block; margin: auto;" /> where `\(|corr(\mathbf{x_i}, \mathbf{y})| \in [0,1]\)` and the new parameter `\(\tau \in [0,1]\)`, but `\(\tau\)` it's not expected to be bigger than 0.5 for an actual regularization. **The idea is to give way less weight to variables that are just slightly correlated to the response, which is done by setting `\(\tau\)` to a small value.** We fitted two more models 100 times: - a second model using as the variable weighting the absolute values of the marginal correlations of `\(\mathbf{X}\)` to `\(y\)`, - a third model considering the extension proposed, using `\(\lambda = 0.8\)`, `\(\gamma = 0.9\)` and `\(\tau = 0.2\)`. --- <div class="figure" style="text-align: center"> <img src="img/weightings.png" alt="Comparison of the MSR (evaluated in the test set) densities and histograms of the number of selected variables for 100 re-runs of the three models: (i) in green, the standard GRRF; (ii) in red, using simply the correlation as the variable weighting; (iii) using the methodology proposed." width="105%" height="105%" /> <p class="caption">Comparison of the MSR (evaluated in the test set) densities and histograms of the number of selected variables for 100 re-runs of the three models: (i) in green, the standard GRRF; (ii) in red, using simply the correlation as the variable weighting; (iii) using the methodology proposed.</p> </div> --- ## Extending the GRRF - Results - The model that used only the correlation has the smallest MSRs calculated in the test, but also a lot of selected variables. - The standard GRRF model has bigger values for the MSRs, as well as it is using less variables, but is still not so regularized as it could be. - The third model resulted in almost the same density of MSRs as the standard. GRRF but the number of selected variables is way below the GRRF in most of the cases. <p style="text-align: center;"><b> This model can keep the MSR perfomance of an GRRF with parameters set to `\(\lambda = 0.8\)` and `\(\gamma = 0.9\)` and retain less variables, improving the regularization of the model without having poor predictions. </b> --- class: inverse, middle, center # 5. Conclusions and Final Remarks --- # 5. Conclusions and Final Remarks - **The two regularization methods proposed in (Deng and Runger, 2012) do not perform a very strict regularization as it might be desired.** - Both techniques achieved a 'limit' of the least number of variables they can select on the simulated and real datasets used. - These regularization methods would not suffise in situations where variable selection is the key point of the analysis. - An extension of the variable weighting in the GRRF was suggested: > The extension consists in giving priority to variables highly correlated to the response, having a baseline threshold to decide which are those variables. --- # 5. Conclusions and Final Remarks - We need to notice that categorical predictor variables were not addressed here, demonstrating that such details still need to be determined. - Next steps include compare the method with variable selection, increase the size of model reruns and using cross-validation. Code at: `https://github.com/brunaw/regularization-rf` --- class: center, middle ## Acknowledgments This work was supported by a Science Foundation Ireland Career Development Award grant number: 17/CDA/4695 <img src="img/SFI_logo.jpg" width="50%" height="40%" style="display: block; margin: auto;" /> --- # Bibliography <p><cite><a id='bib-Breiman2001'></a><a href="#cite-Breiman2001">Breiman, L.</a> (2001). “Random Forests”. In: <em>Machine Learning</em>. ISSN: 1098-6596. DOI: <a href="https://doi.org/10.1017/CBO9781107415324.004">10.1017/CBO9781107415324.004</a>. eprint: arXiv:1011.1669v3.</cite></p> <p><cite><a id='bib-guided'></a><a href="#cite-guided">Deng, H. and G. C. Runger</a> (2012). “Gene selection with guided regularized random forest”. In: <em>CoRR</em> abs/1209.6425. eprint: 1209.6425. URL: <a href="http://arxiv.org/abs/1209.6425">http://arxiv.org/abs/1209.6425</a>.</cite></p> <p><cite><a id='bib-Friedman1991'></a><a href="#cite-Friedman1991">Friedman, J. H.</a> (1991). “Rejoinder: Multivariate Adaptive Regression Splines”. In: <em>The Annals of Statistics</em>. ISSN: 0090-5364. DOI: <a href="https://doi.org/10.1214/aos/1176347973">10.1214/aos/1176347973</a>. eprint: arXiv:1306.3979v1.</cite></p> <p><cite><a id='bib-HastieTrevor'></a><a href="#cite-HastieTrevor">Hastie, T, R. Tibshirani and J. Friedman</a> (2009). “The Elements of Statistical Learning”. In: <em>Elements</em> 1, pp. 337–387. ISSN: 03436993. DOI: <a href="https://doi.org/10.1007/b94608">10.1007/b94608</a>. eprint: 1010.3003. URL: <a href="http://www.springerlink.com/index/10.1007/b94608">http://www.springerlink.com/index/10.1007/b94608</a>.</cite></p> --- class: center, middle, inverse # Thanks! <img src= "https://s3.amazonaws.com/kleebtronics-media/img/icons/github-white.png", width="50", height="50", align="middle"> <b> [@brunaw](https://github.com/brunaw)