class: center, middle, inverse, title-slide

# Regularization Methods in Random Forests
## 39th Conference on Applied Statistics in Ireland
### Bruna Wundervald, Andrew Parnell & Katarina Domijan
### May, 2019

---
class: middle

# Outline

1. Motivation
2. Tree-based models
  - Trees
  - Random Forests
3. Regularization in Random Forests
  - Guided Regularization in Random Forests (GRRF)
4. Applying the GRRF
  - Simulated data
  - Real data
5. Conclusions and Final Remarks

---
class: inverse, middle, center

# 1. Motivation

---
# 1. Motivation

- Predictors can be hard or economically expensive to obtain.
- Feature selection `\(\neq\)` shrinkage/regularization, which 'shrinks' the regression coefficients towards zero.
- For tree-based methods, there is not yet a well-established regularization procedure in the literature.
- We are interested in tree models when there are many more predictors than observations:
  - Simulated data
  - Real data

<div class="figure" style="text-align: center">
<img src="img/dim.png" alt="Figure 1. Big P, small n" width="30%" />
<p class="caption">Figure 1. Big P, small n</p>
</div>

---
class: inverse, middle, center

# 2. Tree-based models

---
# 2. Tree-based models
## Trees

<div class="figure" style="text-align: center">
<img src="img/trees.png" alt="Figure 2. Example of a decision tree." width="50%" />
<p class="caption">Figure 2. Example of a decision tree.</p>
</div>

---
## Trees

Consider a continuous variable of interest `\(Y_i \in \mathbb{R}\)` and `\(\mathbf{x} = (x_{i1},\dots, x_{ip})'\)` the set of predictor features, `\(i = 1 \dots n\)`.

- Each estimated rule has the form `\(x_j > x_{j,th}\)`, where `\(x_j\)` is the value of the *j*th feature and `\(x_{j,th}\)` is the decision cut point.
- The model predicts `\(Y\)` with a constant `\(c_m\)` in each split region `\(R_m\)`, usually the mean of `\(y\)`, or
`\begin{equation} \hat f(\mathbf{x_i}) = \sum_{m = 1}^{M} c_m I\{\mathbf{x_i} \in R_m \}, \end{equation}`
where `\(\mathbf{x}\)` represents the set of predictor variables.
- The minimized measure is the residual sum of squares, given by
`\begin{equation} RSS_{tree} = \sum_{j = 1}^{J} \sum_{i \in R_j} (y_i - \hat y_{R_j})^2, \end{equation}`
where `\(\hat y_{R_j}\)` is the mean response in the *j*th region of the predictors' space.

---
## Random Forests

- A random forest is an average of many trees grown on bootstrap samples.
- It is a simple way to reduce the variance of tree models:
  - take many training sets from the population with bootstrap resampling,
  - fit a separate model to each sample, and
  - average their final predictions, resulting in
`\begin{equation} \hat f_{avg}(\mathbf{x}) = \frac{1}{B} \sum_{b=1}^{B} \hat f^{(b)}(\mathbf{x}) \end{equation}`
(Hastie, Tibshirani, and Friedman, 2009)

<b>Variable importance:</b> the improvement in the splitting criterion (RSS) for each variable

- These values are accumulated over all of the trees
- This facilitates feature selection in random forests
- Unwanted behavior in the presence of highly correlated variables: the importance is split between the correlated features (see the sketch on the next slide)

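
---
## Random Forests (R sketch)

A minimal sketch of the importance-splitting behavior, assuming the `randomForest` package; the toy data and variable names below are illustrative, not taken from our study.

```r
library(randomForest)

# Toy regression data: x2 is a nearly exact copy of the true predictor x1
set.seed(2019)
n  <- 200
x1 <- runif(n)
x2 <- x1 + rnorm(n, sd = 0.05)   # highly correlated with x1
x3 <- runif(n)                   # pure noise
y  <- 10 * x1 + rnorm(n)

fit <- randomForest(x = data.frame(x1, x2, x3), y = y)

# IncNodePurity accumulates the RSS reduction over all trees;
# it tends to be shared between the correlated x1 and x2
importance(fit)
```
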

---
class: inverse, middle, center

# 3. Regularization

---
# 3. Regularization

- Regularized regression consists of estimating a penalized function of the form
`\begin{equation} \underset{f \in H}{min} \Big[ \sum_{i = 1}^{N} L(y_i, f(x_i)) + \lambda J(f) \Big ], \end{equation}`
where `\(L(y, f(x))\)` is the chosen loss function, `\(J(f)\)` is a penalty functional, and `\(H\)` is a space of functions on which `\(J(f)\)` is defined (Hastie, Tibshirani, and Friedman, 2009).
- It produces models that are more parsimonious and have a prediction error similar to that of the full model.
- It is usually robust enough not to be influenced by correlated variables.

---
## Regularization in Random Forests

- One option is presented in (Deng and Runger, 2012):
  - The authors **penalise the gain (RSS reduction)** of each variable in each tree when building a random forest.
- The main idea is to weigh the gain of each variable with
`$$\begin{equation} Gain_{R}(X_i, v) = \begin{cases} \lambda_i Gain(X_i, v), i \notin F \text{ and} \\ Gain(X_i, v), i \in F, \end{cases} \end{equation}$$`
where `\(F\)` represents the set of indices used in the previous nodes and `\(\lambda_i \in (0, 1]\)` is the penalization applied to the splitting.
- A new variable only gets picked if its gain is **very** high.

---
## How is `\(\lambda_i\)` chosen?

The Guided Regularized Random Forest (GRRF) proposes the regularization parameter `\(\lambda_i\)` as
`\begin{equation} \lambda_i = (1 - \gamma)\lambda_0 + \gamma Imp'_{i}, \end{equation}`
where `\(\lambda_0\)` is the baseline regularization parameter, `\(\gamma \in [0, 1]\)`, and `\(Imp'_{i}\)` is a standardized importance measure obtained from a standard random forest.

- Larger `\(\gamma\)` = smaller `\(\lambda_i\)` = larger penalty on `\(Gain(X_i, v)\)`
- <b>We generalize the method to</b>
$$ \lambda_i = (1 - \gamma) \lambda_0(v) + \gamma g(X_i), $$
where `\(g(X_i)\)` is some function of the predictors and `\(\lambda_0(v)\)` can depend on some characteristic of the tree.
- This gives us more flexibility regarding the weighting of the gains of each variable.

---
class: inverse, middle, center

# 4. Applying the GRRF

---
## Methods: models

The following results use the GRRF with three main configurations (see the R sketch on the next slide):

1. Fixing `\(\gamma = 0.9\)`, `\(\lambda_0(v) = 1\)` and `\(g(X_i) = Imp_{i}^{'}\)`, or
`$$\lambda_i = (1 - 0.9) 1 + 0.9 Imp_{i}^{'}$$`
2. Fixing `\(\gamma = 1\)`, `\(\lambda_0(v) = 0\)` and `\(g(X_i) = |corr(X_i, y)|\)`, or
`$$\lambda_i = (1 - 1) 0 + 1 |corr(X_i, y)|$$`
3. Fixing `\(\gamma = 0.9\)`, `\(\lambda_0(v) = 0\)` and
`\(\begin{equation} g(X_i) = \begin{cases} |corr(X_i, y)| Imp_{i}^{'} \textbf{, if } |corr(X_i, y)| > 0.5 \text{ and} \\ Imp_{i}^{'} 0.2 \textbf{, if } |corr(X_i, y)| \leq 0.5 \end{cases} \end{equation}\)`, or
`$$\begin{equation} \lambda_i = \begin{cases} (1 - 0.9) 0 + 0.9 Imp_{i}^{'} |corr(X_i, y)| \textbf{, if } |corr(X_i, y)| > 0.5 \\ (1 - 0.9) 0 + 0.9 Imp_{i}^{'} 0.2 \textbf{, if } |corr(X_i, y)| \leq 0.5 \\ \end{cases} \end{equation}$$`

In the third configuration we <b>weigh down</b> the variables that are only weakly correlated with the response, using the Spearman correlation.

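
---
## Methods: models (R sketch)

A minimal sketch of the three penalty configurations, assuming the `RRF` R package and its `coefReg`/`flagReg` arguments; the helper `grrf_coef()` and all object names are illustrative, not necessarily the exact implementation behind our results.

```r
library(RRF)

# Per-variable penalties lambda_i for the three configurations,
# given a numeric predictor matrix X and response y (illustrative names)
grrf_coef <- function(X, y, gamma = 0.9) {
  rf   <- RRF(X, y, flagReg = 0)          # ordinary (unregularized) forest
  imp  <- rf$importance[, "IncNodePurity"]
  imp  <- imp / max(imp)                  # standardized importance Imp'
  corr <- abs(apply(X, 2, cor, y = y, method = "spearman"))

  list(
    m1 = (1 - gamma) * 1 + gamma * imp,               # 1st configuration
    m2 = corr,                                        # 2nd: gamma = 1, lambda_0 = 0
    m3 = gamma * imp * ifelse(corr > 0.5, corr, 0.2)  # 3rd configuration
  )
}

# e.g. fit the third model with the penalized gains
# coefs <- grrf_coef(X, y)
# fit3  <- RRF(X, y, flagReg = 1, coefReg = coefs$m3)
```
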

---
## Methods: simulating data

Using the model equation proposed in (Friedman, 1991), we simulated a response variable `\(Y\)` and its relationship to a matrix of predictors `\(\mathbf{X}\)` as
`\begin{equation} y_i = 10 \sin(\pi x_{i1} x_{i2}) + 20 (x_{i3} - 0.5)^{2} + 10 x_{i4} + 5 x_{i5} + \epsilon_i, \thinspace \epsilon_i \stackrel{iid}\sim N(0, \sigma^2), \end{equation}`
where `\(x_{ij} \in [0, 1]\)`, so the predictors were randomly drawn from a standard Uniform distribution.

- This creates nonlinear relationships and interactions between the response and the predictors.
- To the five true predictors we added:
  - 25 variables correlated with one of the true predictors (randomly selected);
  - 30 variables drawn from a Normal distribution, with a random mean and standard deviation (pure noise).
- **50** different datasets of the same form were simulated and split into train (75%) and test (25%) sets (a data-generation sketch follows on the next slide).

**Why should we care about correlated predictors?**

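
---
## Methods: simulating data (R sketch)

A minimal sketch of one simulated dataset; the sample size, the noise levels of the correlated copies, and the object names are illustrative assumptions rather than the exact values used in our study.

```r
set.seed(2019)
n <- 250

# Five true Friedman (1991) predictors, Uniform(0, 1)
X <- matrix(runif(n * 5), ncol = 5)
y <- 10 * sin(pi * X[, 1] * X[, 2]) + 20 * (X[, 3] - 0.5)^2 +
  10 * X[, 4] + 5 * X[, 5] + rnorm(n)

# 25 noisy copies of randomly chosen true predictors
X_corr <- sapply(1:25, function(i) X[, sample(5, 1)] + rnorm(n, sd = 0.1))

# 30 pure-noise variables, each with a random mean and standard deviation
X_noise <- sapply(1:30, function(i) rnorm(n, mean = runif(1, -5, 5),
                                          sd = runif(1, 0.5, 3)))

dat <- data.frame(y = y, X, X_corr, X_noise)
```
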

---

<div class="figure" style="text-align: center">
<img src="img/rf_comparison.png" alt="Figure 3. Mean variable importance in a Random Forest applied to the 50 datasets, with the correlated variables and without them." width="80%" />
<p class="caption">Figure 3. Mean variable importance in a Random Forest applied to the 50 datasets, with the correlated variables and without them.</p>
</div>

A clear split of the importance between the correlated variables in a standard Random Forest! **Misleading when we need to select variables.**

---
## Applying the models: simulated data

<div class="figure" style="text-align: center">
<img src="img/sim_corr_results.png" alt="Figure 4. Root mean squared errors and counts of final variables in each fit using the simulated data" width="100%" />
<p class="caption">Figure 4. Root mean squared errors and counts of final variables in each fit using the simulated data</p>
</div>

---
## Results: simulated data

- All models selected a small number of variables.
- Fewer variables for the third method: the ideal scenario when we need regularization.

**Given the correlated variables, which were selected by the models?**

<table class="table table-condensed table-hover" style="width: auto !important; margin-left: auto; margin-right: auto;">
<caption>Table 1. Proportion of selected variables for each model that were true predictors or correlated with true predictors.</caption>
<thead>
<tr>
<th style="text-align:left;"> Model </th>
<th style="text-align:left;"> Proportion </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;"> 3rd: Gamma, importance scores and correlation </td>
<td style="text-align:left;width: 4cm; "> <span style="display: inline-block; direction: rtl; border-radius: 4px; padding-right: 2px; background-color: lightgreen; width: 100.00%">0.933</span> </td>
</tr>
<tr>
<td style="text-align:left;"> 2nd: Correlation </td>
<td style="text-align:left;width: 4cm; "> <span style="display: inline-block; direction: rtl; border-radius: 4px; padding-right: 2px; background-color: lightyellow; width: 98.61%">0.920</span> </td>
</tr>
<tr>
<td style="text-align:left;"> 1st: Using lambda and gamma </td>
<td style="text-align:left;width: 4cm; "> <span style="display: inline-block; direction: rtl; border-radius: 4px; padding-right: 2px; background-color: lightyellow; width: 94.96%">0.886</span> </td>
</tr>
</tbody>
</table>

- **We avoided the correlated predictors' issue!**

---
## Applying the models: real data

- Goal: predict the log of the best race distance for bred racehorses

<div class="figure" style="text-align: center">
<img src="img/horse.jpg" alt="Figure 5. Racehorses." width="80%" />
<p class="caption">Figure 5. Racehorses.</p>
</div>

---
## Applying the models: real data

- Predictors: trinary SNP variables and general variables including sex, inbreeding, and region.
- Main issues:
  - Big P, small n!
  - Predictors are **correlated**,
  - Each variable is **very** expensive to obtain.
- Data:
  - Originally 48,910 predictors and 835 observations
  - The dataset was first filtered to the predictors with at least some (Spearman) correlation with the response (> 0.15), as sketched on the next slide
  - This resulted in 3,582 predictors
  - The filtered dataset was split into 50 different train (75%) and test (25%) sets.

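
---
## Applying the models: real data (R sketch)

A minimal sketch of the pre-filtering and splitting steps, assuming the SNP predictors sit in a numeric matrix `X` and the log best race distance in a vector `y`; the object names, and the use of the absolute Spearman correlation for the filter, are assumptions for illustration.

```r
# Keep only predictors with |Spearman correlation| > 0.15 with the response
spearman   <- abs(apply(X, 2, cor, y = y, method = "spearman"))
X_filtered <- X[, spearman > 0.15]

# One of the 50 random train (75%) / test (25%) splits
set.seed(1)
train_id <- sample(nrow(X_filtered), size = floor(0.75 * nrow(X_filtered)))
X_train  <- X_filtered[train_id, ];  y_train <- y[train_id]
X_test   <- X_filtered[-train_id, ]; y_test  <- y[-train_id]
```
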

---
## Results: real data

<div class="figure" style="text-align: center">
<img src="img/real_data_results.png" alt="Figure 6. Root mean squared errors and counts of final variables in each fit using the real data" width="100%" />
<p class="caption">Figure 6. Root mean squared errors and counts of final variables in each fit using the real data</p>
</div>

---
## Results: real data

<table>
<caption>Table 2. Summary of the number of selected variables in each model.</caption>
<thead>
<tr>
<th style="text-align:left;"> Model </th>
<th style="text-align:right;"> Max. </th>
<th style="text-align:right;"> Min. </th>
<th style="text-align:right;"> Mean </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;"> 3rd: Gamma, importance scores and correlation </td>
<td style="text-align:right;width: 4cm; "> 41 </td>
<td style="text-align:right;"> 30 </td>
<td style="text-align:right;"> 34.44 </td>
</tr>
<tr>
<td style="text-align:left;"> 2nd: Correlation </td>
<td style="text-align:right;width: 4cm; "> 1799 </td>
<td style="text-align:right;"> 1785 </td>
<td style="text-align:right;"> 1791.22 </td>
</tr>
<tr>
<td style="text-align:left;"> 1st: Using lambda and gamma </td>
<td style="text-align:right;width: 4cm; "> 118 </td>
<td style="text-align:right;"> 92 </td>
<td style="text-align:right;"> 106.54 </td>
</tr>
</tbody>
</table>

<table>
<caption>Table 3. Summary of root mean squared errors for each model.</caption>
<thead>
<tr>
<th style="text-align:left;"> Model </th>
<th style="text-align:right;"> Max. </th>
<th style="text-align:right;"> Min. </th>
<th style="text-align:right;"> Mean </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;"> 3rd: Gamma, importance scores and correlation </td>
<td style="text-align:right;width: 4cm; "> 0.254 </td>
<td style="text-align:right;"> 0.201 </td>
<td style="text-align:right;"> 0.233 </td>
</tr>
<tr>
<td style="text-align:left;"> 2nd: Correlation </td>
<td style="text-align:right;width: 4cm; "> 0.245 </td>
<td style="text-align:right;"> 0.192 </td>
<td style="text-align:right;"> 0.221 </td>
</tr>
<tr>
<td style="text-align:left;"> 1st: Using lambda and gamma </td>
<td style="text-align:right;width: 4cm; "> 0.251 </td>
<td style="text-align:right;"> 0.191 </td>
<td style="text-align:right;"> 0.226 </td>
</tr>
</tbody>
</table>

---
## What if we have a limited number of variables to use?

<div class="figure" style="text-align: center">
<img src="img/mean_imp.png" alt="Figure 7. Top 15 mean importance for the variables of each model" />
<p class="caption">Figure 7. Top 15 mean importance for the variables of each model</p>
</div>

---
## Using the top 15 variables of each model

- We ran a standard Random Forest model using the 15 most important variables from each model (see the sketch on the next slide)
- Which model gives the best predictions if we are limited in the number of variables we can use?

<table class="table table-condensed table-hover" style="width: auto !important; margin-left: auto; margin-right: auto;">
<caption>Table 4. Summary of root mean squared error in the test set when using the top 15 variables of each model in the 50 datasets.</caption>
<thead>
<tr>
<th style="text-align:left;"> Model </th>
<th style="text-align:right;"> Max. </th>
<th style="text-align:right;"> Min. </th>
<th style="text-align:left;"> Mean </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;"> 3rd: Gamma, importance scores and correlation </td>
<td style="text-align:right;width: 4cm; "> 0.2404 </td>
<td style="text-align:right;"> 0.1845 </td>
<td style="text-align:left;"> <span style="display: inline-block; direction: rtl; border-radius: 4px; padding-right: 2px; background-color: lightgreen; width: 96.55%">0.2157</span> </td>
</tr>
<tr>
<td style="text-align:left;"> 2nd: Correlation </td>
<td style="text-align:right;width: 4cm; "> 0.2527 </td>
<td style="text-align:right;"> 0.1872 </td>
<td style="text-align:left;"> <span style="display: inline-block; direction: rtl; border-radius: 4px; padding-right: 2px; background-color: lightyellow; width: 100.00%">0.2234</span> </td>
</tr>
<tr>
<td style="text-align:left;"> 1st: Using lambda and gamma </td>
<td style="text-align:right;width: 4cm; "> 0.2458 </td>
<td style="text-align:right;"> 0.1913 </td>
<td style="text-align:left;"> <span style="display: inline-block; direction: rtl; border-radius: 4px; padding-right: 2px; background-color: lightyellow; width: 97.76%">0.2184</span> </td>
</tr>
</tbody>
</table>

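
---
## Using the top 15 variables (R sketch)

A minimal sketch of this comparison, assuming the `randomForest` package; `mean_importance`, the train/test matrices, and the response vectors are illustrative names for objects built in the previous steps, not our exact pipeline.

```r
library(randomForest)

# 15 variables with the highest mean importance in a given regularized model
top15 <- names(sort(mean_importance, decreasing = TRUE))[1:15]

# Standard random forest restricted to those variables
rf_top <- randomForest(x = X_train[, top15], y = y_train)

# Test-set root mean squared error
pred <- predict(rf_top, newdata = X_test[, top15])
sqrt(mean((y_test - pred)^2))
```
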

---
class: inverse, middle, center

# 5. Conclusions and Final Remarks

---
# 5. Conclusions and Final Remarks

- Variable selection in Random Forests is still a topic to be explored.
- **The GRRF, as proposed in (Deng and Runger, 2012), does not perform regularization as strictly as desired.**
- The third model performed well on the real dataset:
  - It selected far fewer variables,
  - It kept a prediction power similar to that of the bigger models,
  - The selected variables were not, overall, correlated.
- For categorical predictor variables, correlation might not be the best measure to use.
- Next steps include proposing new ways of regularizing the trees:
  - Potentially considering the depth of each tree
  - Limiting the maximum number of variables to select

`Code: https://github.com/brunaw/regularization-rf`

---
class: center, middle

## Acknowledgements

This work was supported by a Science Foundation Ireland Career Development Award, grant number 17/CDA/4695.

<img src="img/SFI_logo.jpg" width="50%" height="40%" style="display: block; margin: auto;" />

---
# Bibliography

<p><cite>Breiman, L. (2001). “Random Forests”. In: <em>Machine Learning</em> 45.1, pp. 5–32. DOI: <a href="https://doi.org/10.1023/A:1010933404324">10.1023/A:1010933404324</a>.</cite></p>

<p><cite><a id='bib-guided'></a><a href="#cite-guided">Deng, H. and G. C. Runger</a> (2012). “Gene selection with guided regularized random forest”. In: <em>CoRR</em> abs/1209.6425. eprint: 1209.6425. URL: <a href="http://arxiv.org/abs/1209.6425">http://arxiv.org/abs/1209.6425</a>.</cite></p>

<p><cite><a id='bib-Friedman1991'></a><a href="#cite-Friedman1991">Friedman, J. H.</a> (1991). “Rejoinder: Multivariate Adaptive Regression Splines”. In: <em>The Annals of Statistics</em>. ISSN: 0090-5364. DOI: <a href="https://doi.org/10.1214/aos/1176347973">10.1214/aos/1176347973</a>.</cite></p>

<p><cite><a id='bib-HastieTrevor'></a><a href="#cite-HastieTrevor">Hastie, T., R. Tibshirani, and J. Friedman</a> (2009). <em>The Elements of Statistical Learning</em>. Springer. DOI: <a href="https://doi.org/10.1007/b94608">10.1007/b94608</a>. URL: <a href="http://www.springerlink.com/index/10.1007/b94608">http://www.springerlink.com/index/10.1007/b94608</a>.</cite></p>

---
class: center, middle, inverse

# Thanks!

<img src="https://s3.amazonaws.com/kleebtronics-media/img/icons/github-white.png" width="50" height="50" align="middle"> <b>[@brunaw](https://github.com/brunaw)</b>
