class: center, middle, inverse, title-slide

# Regularization Methods in Random Forests
## 39th Conference on Applied Statistics in Ireland
### Bruna Wundervald, Andrew Parnell & Katarina Domijan
### May, 2019

---
class: middle

# Outline

1. Motivation
2. Tree-based models
  - Trees
  - Random Forests
3. Regularization in Random Forests
  - Guided Regularization in Random Forests (GRRF)
4. Applying the GRRF
  - Simulated data
  - Real data
5. Conclusions and Final Remarks

---
class: inverse, middle, center

# 1. Motivation

---
# 1. Motivation

- Predictors can be hard or economically expensive to obtain.
- Feature selection `\(\neq\)` shrinkage/regularization, which 'shrinks' the regression coefficients towards zero.
- For tree-based methods, there is not yet a well-established regularization procedure in the literature.
- We are interested in tree models when there are many more predictors than observations:
  - Simulated data
  - Real data

<div class="figure" style="text-align: center">
<img src="img/dim.png" alt="Figure 1. Big P, small n" width="30%" />
<p class="caption">Figure 1. Big P, small n</p>
</div>

---
class: inverse, middle, center

# 2. Tree-based models

---
# 2. Tree-based models
## Trees

<div class="figure" style="text-align: center">
<img src="img/trees.png" alt="Figure 2. Example of a decision tree." width="50%" />
<p class="caption">Figure 2. Example of a decision tree.</p>
</div>

---
## Trees

Consider a continuous variable of interest `\(Y_i \in \mathbb{R}\)` and `\(\mathbf{x} = (x_{i1},\dots, x_{ip})'\)` the set of predictor features, `\(i = 1 \dots n\)`.

- Each estimated rule has the form `\(x_j > x_{j,th}\)`, where `\(x_j\)` is the value of the *j*th feature and `\(x_{j,th}\)` is the decision cut point.
- The model predicts `\(Y\)` with a constant `\(c_m\)` in each split region `\(R_m\)`, usually the mean of `\(y\)`, or
`\begin{equation} \hat f(\mathbf{x_i}) = \sum_{m = 1}^{M} c_m I\{\mathbf{x_i} \in R_m \}, \end{equation}`
where `\(\mathbf{x}\)` represents the set of predictor variables.
- The minimized measure is the residual sum of squares, given by
`\begin{equation} RSS_{tree} = \sum_{j = 1}^{J} \sum_{i \in R_j} (y_i - \hat y_{R_j})^2, \end{equation}`
where `\(\hat y_{R_j}\)` is the mean response in the *j*th region of the predictors' space.

---
## Random Forests

- A random forest is an average of many trees grown on bootstrap samples.
- It is a simple way to reduce the variance of tree models:
  - take many training sets from the population with bootstrap resampling,
  - fit a separate model to each sample, and
  - average their final predictions, resulting in
`\begin{equation} \hat f_{avg}(\mathbf{x}) = \frac{1}{B} \sum_{b=1}^{B} \hat f^{(b)}(\mathbf{x}) \end{equation}`
(Hastie, Tibshirani, and Friedman, 2009)

<b>Variable importance:</b> the improvement in the splitting criterion (RSS) for each variable

- These values are accumulated over all of the trees
- This facilitates feature selection in random forests
- Unwanted behavior in the presence of highly correlated variables: the importance is split between the correlated features (see the sketch on the next slide)

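
---
## Random Forests (R sketch)

A minimal sketch of the importance-splitting behavior, assuming the `randomForest` package; the toy data and variable names below are illustrative, not taken from our study.

```r
library(randomForest)

# Toy regression data: x2 is a nearly exact copy of the true predictor x1
set.seed(2019)
n  <- 200
x1 <- runif(n)
x2 <- x1 + rnorm(n, sd = 0.05)   # highly correlated with x1
x3 <- runif(n)                   # pure noise
y  <- 10 * x1 + rnorm(n)

fit <- randomForest(x = data.frame(x1, x2, x3), y = y)

# IncNodePurity accumulates the RSS reduction over all trees;
# it tends to be shared between the correlated x1 and x2
importance(fit)
```
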

---
class: inverse, middle, center

# 3. Regularization

---
# 3. Regularization

- Regularized regression consists of estimating a penalized function of the form
`\begin{equation} \underset{f \in H}{min} \Big[ \sum_{i = 1}^{N} L(y_i, f(x_i)) + \lambda J(f) \Big ], \end{equation}`
where `\(L(y, f(x))\)` is the chosen loss function, `\(J(f)\)` is a penalty functional, and `\(H\)` is a space of functions on which `\(J(f)\)` is defined (Hastie, Tibshirani, and Friedman, 2009).
- It produces models that are more parsimonious and have a prediction error similar to that of the full model.
- It is usually robust enough not to be influenced by correlated variables.

---
## Regularization in Random Forests

- One option is presented in (Deng and Runger, 2012):
  - The authors **penalise the gain (RSS reduction)** of each variable in each tree when building a random forest.
- The main idea is to weigh the gain of each variable with
`$$\begin{equation} Gain_{R}(X_i, v) = \begin{cases} \lambda_i Gain(X_i, v), i \notin F \text{ and} \\ Gain(X_i, v), i \in F, \end{cases} \end{equation}$$`
where `\(F\)` represents the set of indices used in the previous nodes and `\(\lambda_i \in (0, 1]\)` is the penalization applied to the splitting.
- A new variable only gets picked if its gain is **very** high.

---
## How is `\(\lambda_i\)` chosen?

The Guided Regularized Random Forest (GRRF) proposes the regularization parameter `\(\lambda_i\)` as
`\begin{equation} \lambda_i = (1 - \gamma)\lambda_0 + \gamma Imp'_{i}, \end{equation}`
where `\(\lambda_0\)` is the baseline regularization parameter, `\(\gamma \in [0, 1]\)`, and `\(Imp'_{i}\)` is a standardized importance measure obtained from a standard random forest.

- Larger `\(\gamma\)` = smaller `\(\lambda_i\)` = larger penalty on `\(Gain(X_i, v)\)`
- <b>We generalize the method to</b>
$$ \lambda_i = (1 - \gamma) \lambda_0(v) + \gamma g(X_i), $$
where `\(g(X_i)\)` is some function of the predictors and `\(\lambda_0(v)\)` can depend on some characteristic of the tree.
- This gives us more flexibility regarding the weighting of the gains of each variable.

---
class: inverse, middle, center

# 4. Applying the GRRF

---
## Methods: models

The following results use the GRRF with three main configurations (see the R sketch on the next slide):

1. Fixing `\(\gamma = 0.9\)`, `\(\lambda_0(v) = 1\)` and `\(g(X_i) = Imp_{i}^{'}\)`, or
`$$\lambda_i = (1 - 0.9) 1 + 0.9 Imp_{i}^{'}$$`
2. Fixing `\(\gamma = 1\)`, `\(\lambda_0(v) = 0\)` and `\(g(X_i) = |corr(X_i, y)|\)`, or
`$$\lambda_i = (1 - 1) 0 + 1 |corr(X_i, y)|$$`
3. Fixing `\(\gamma = 0.9\)`, `\(\lambda_0(v) = 0\)` and
`\(\begin{equation} g(X_i) = \begin{cases} |corr(X_i, y)| Imp_{i}^{'} \textbf{, if } |corr(X_i, y)| > 0.5 \text{ and} \\ Imp_{i}^{'} 0.2 \textbf{, if } |corr(X_i, y)| \leq 0.5 \end{cases} \end{equation}\)`, or
`$$\begin{equation} \lambda_i = \begin{cases} (1 - 0.9) 0 + 0.9 Imp_{i}^{'} |corr(X_i, y)| \textbf{, if } |corr(X_i, y)| > 0.5 \\ (1 - 0.9) 0 + 0.9 Imp_{i}^{'} 0.2 \textbf{, if } |corr(X_i, y)| \leq 0.5 \\ \end{cases} \end{equation}$$`

In the third configuration we <b>weigh down</b> the variables that are only weakly correlated with the response, using the Spearman correlation.

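
---
## Methods: models (R sketch)

A minimal sketch of the three penalty configurations, assuming the `RRF` R package and its `coefReg`/`flagReg` arguments; the helper `grrf_coef()` and all object names are illustrative, not necessarily the exact implementation behind our results.

```r
library(RRF)

# Per-variable penalties lambda_i for the three configurations,
# given a numeric predictor matrix X and response y (illustrative names)
grrf_coef <- function(X, y, gamma = 0.9) {
  rf   <- RRF(X, y, flagReg = 0)          # ordinary (unregularized) forest
  imp  <- rf$importance[, "IncNodePurity"]
  imp  <- imp / max(imp)                  # standardized importance Imp'
  corr <- abs(apply(X, 2, cor, y = y, method = "spearman"))

  list(
    m1 = (1 - gamma) * 1 + gamma * imp,               # 1st configuration
    m2 = corr,                                        # 2nd: gamma = 1, lambda_0 = 0
    m3 = gamma * imp * ifelse(corr > 0.5, corr, 0.2)  # 3rd configuration
  )
}

# e.g. fit the third model with the penalized gains
# coefs <- grrf_coef(X, y)
# fit3  <- RRF(X, y, flagReg = 1, coefReg = coefs$m3)
```
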

---
## Methods: simulating data

Using the model equation proposed in (Friedman, 1991), we simulated a response variable `\(Y\)` and its relationship to a matrix of predictors `\(\mathbf{X}\)` as
`\begin{equation} y_i = 10 \sin(\pi x_{i1} x_{i2}) + 20 (x_{i3} - 0.5)^{2} + 10 x_{i4} + 5 x_{i5} + \epsilon_i, \thinspace \epsilon_i \stackrel{iid}\sim N(0, \sigma^2), \end{equation}`
where `\(x_{ij} \in [0, 1]\)`, so the predictors were randomly drawn from a standard Uniform distribution.

- This creates nonlinear relationships and interactions between the response and the predictors.
- To the five true predictors we added:
  - 25 variables correlated with one of the true predictors (randomly selected);
  - 30 variables drawn from a Normal distribution, with a random mean and standard deviation (pure noise).
- **50** different datasets of the same form were simulated and split into train (75%) and test (25%) sets (a data-generation sketch follows on the next slide).

**Why should we care about correlated predictors?**

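
---
## Methods: simulating data (R sketch)

A minimal sketch of one simulated dataset; the sample size, the noise levels of the correlated copies, and the object names are illustrative assumptions rather than the exact values used in our study.

```r
set.seed(2019)
n <- 250

# Five true Friedman (1991) predictors, Uniform(0, 1)
X <- matrix(runif(n * 5), ncol = 5)
y <- 10 * sin(pi * X[, 1] * X[, 2]) + 20 * (X[, 3] - 0.5)^2 +
  10 * X[, 4] + 5 * X[, 5] + rnorm(n)

# 25 noisy copies of randomly chosen true predictors
X_corr <- sapply(1:25, function(i) X[, sample(5, 1)] + rnorm(n, sd = 0.1))

# 30 pure-noise variables, each with a random mean and standard deviation
X_noise <- sapply(1:30, function(i) rnorm(n, mean = runif(1, -5, 5),
                                          sd = runif(1, 0.5, 3)))

dat <- data.frame(y = y, X, X_corr, X_noise)
```
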

---

<div class="figure" style="text-align: center">
<img src="img/rf_comparison.png" alt="Figure 3. Mean variable importance in a Random Forest applied to the 50 datasets, with the correlated variables and without them." width="80%" />
<p class="caption">Figure 3. Mean variable importance in a Random Forest applied to the 50 datasets, with the correlated variables and without them.</p>
</div>

A clear split of the importance between the correlated variables in a standard Random Forest! **Misleading when we need to select variables.**

---
## Applying the models: simulated data

<div class="figure" style="text-align: center">
<img src="img/sim_corr_results.png" alt="Figure 4. Root mean squared errors and counts of final variables in each fit using the simulated data" width="100%" />
<p class="caption">Figure 4. Root mean squared errors and counts of final variables in each fit using the simulated data</p>
</div>

---
## Results: simulated data

- All models selected a small number of variables.
- Fewer variables for the third method: the ideal scenario when we need regularization.

**Given the correlated variables, which were selected by the models?**

<table class="table table-condensed table-hover" style="width: auto !important; margin-left: auto; margin-right: auto;">
<caption>Table 1. Proportion of selected variables for each model that were true predictors or correlated with true predictors.</caption>
<thead>
<tr>
<th style="text-align:left;"> Model </th>
<th style="text-align:left;"> Proportion </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;"> 3rd: Gamma, importance scores and correlation </td>
<td style="text-align:left;width: 4cm; "> <span style="display: inline-block; direction: rtl; border-radius: 4px; padding-right: 2px; background-color: lightgreen; width: 100.00%">0.933</span> </td>
</tr>
<tr>
<td style="text-align:left;"> 2nd: Correlation </td>
<td style="text-align:left;width: 4cm; "> <span style="display: inline-block; direction: rtl; border-radius: 4px; padding-right: 2px; background-color: lightyellow; width: 98.61%">0.920</span> </td>
</tr>
<tr>
<td style="text-align:left;"> 1st: Using lambda and gamma </td>
<td style="text-align:left;width: 4cm; "> <span style="display: inline-block; direction: rtl; border-radius: 4px; padding-right: 2px; background-color: lightyellow; width: 94.96%">0.886</span> </td>
</tr>
</tbody>
</table>

- **We avoided the correlated predictors' issue!**

---
## Applying the models: real data

- Goal: predict the log of the best race distance for bred racehorses

<div class="figure" style="text-align: center">
<img src="img/horse.jpg" alt="Figure 5. Racehorses." width="80%" />
<p class="caption">Figure 5. Racehorses.</p>
</div>

---
## Applying the models: real data

- Predictors: trinary SNP variables and general variables including sex, inbreeding, and region.
- Main issues:
  - Big P, small n!
  - Predictors are **correlated**,
  - Each variable is **very** expensive to obtain.
- Data:
  - Originally 48,910 predictors and 835 observations
  - The dataset was first filtered to the predictors with at least some (Spearman) correlation with the response (> 0.15), as sketched on the next slide
  - This resulted in 3,582 predictors
  - The filtered dataset was split into 50 different train (75%) and test (25%) sets.

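
---
## Applying the models: real data (R sketch)

A minimal sketch of the pre-filtering and splitting steps, assuming the SNP predictors sit in a numeric matrix `X` and the log best race distance in a vector `y`; the object names, and the use of the absolute Spearman correlation for the filter, are assumptions for illustration.

```r
# Keep only predictors with |Spearman correlation| > 0.15 with the response
spearman   <- abs(apply(X, 2, cor, y = y, method = "spearman"))
X_filtered <- X[, spearman > 0.15]

# One of the 50 random train (75%) / test (25%) splits
set.seed(1)
train_id <- sample(nrow(X_filtered), size = floor(0.75 * nrow(X_filtered)))
X_train  <- X_filtered[train_id, ];  y_train <- y[train_id]
X_test   <- X_filtered[-train_id, ]; y_test  <- y[-train_id]
```
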

---
## Results: real data

<div class="figure" style="text-align: center">
<img src="img/real_data_results.png" alt="Figure 6. Root mean squared errors and counts of final variables in each fit using the real data" width="100%" />
<p class="caption">Figure 6. Root mean squared errors and counts of final variables in each fit using the real data</p>
</div>

---
## Results: real data

<table>
<caption>Table 2. Summary of the number of selected variables in each model.</caption>
<thead>
<tr>
<th style="text-align:left;"> Model </th>
<th style="text-align:right;"> Max. </th>
<th style="text-align:right;"> Min. </th>
<th style="text-align:right;"> Mean </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;"> 3rd: Gamma, importance scores and correlation </td>
<td style="text-align:right;width: 4cm; "> 41 </td>
<td style="text-align:right;"> 30 </td>
<td style="text-align:right;"> 34.44 </td>
</tr>
<tr>
<td style="text-align:left;"> 2nd: Correlation </td>
<td style="text-align:right;width: 4cm; "> 1799 </td>
<td style="text-align:right;"> 1785 </td>
<td style="text-align:right;"> 1791.22 </td>
</tr>
<tr>
<td style="text-align:left;"> 1st: Using lambda and gamma </td>
<td style="text-align:right;width: 4cm; "> 118 </td>
<td style="text-align:right;"> 92 </td>
<td style="text-align:right;"> 106.54 </td>
</tr>
</tbody>
</table>

<table>
<caption>Table 3. Summary of root mean squared errors for each model.</caption>
<thead>
<tr>
<th style="text-align:left;"> Model </th>
<th style="text-align:right;"> Max. </th>
<th style="text-align:right;"> Min. </th>
<th style="text-align:right;"> Mean </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;"> 3rd: Gamma, importance scores and correlation </td>
<td style="text-align:right;width: 4cm; "> 0.254 </td>
<td style="text-align:right;"> 0.201 </td>
<td style="text-align:right;"> 0.233 </td>
</tr>
<tr>
<td style="text-align:left;"> 2nd: Correlation </td>
<td style="text-align:right;width: 4cm; "> 0.245 </td>
<td style="text-align:right;"> 0.192 </td>
<td style="text-align:right;"> 0.221 </td>
</tr>
<tr>
<td style="text-align:left;"> 1st: Using lambda and gamma </td>
<td style="text-align:right;width: 4cm; "> 0.251 </td>
<td style="text-align:right;"> 0.191 </td>
<td style="text-align:right;"> 0.226 </td>
</tr>
</tbody>
</table>

---
## What if we have a limited number of variables to use?

<div class="figure" style="text-align: center">
<img src="img/mean_imp.png" alt="Figure 7. Top 15 mean importance for the variables of each model" />
<p class="caption">Figure 7. Top 15 mean importance for the variables of each model</p>
</div>

---
## Using the top 15 variables of each model

- We ran a standard Random Forest model using the 15 most important variables from each model (see the sketch on the next slide)
- Which model gives the best predictions if we are limited in the number of variables we can use?

<table class="table table-condensed table-hover" style="width: auto !important; margin-left: auto; margin-right: auto;">
<caption>Table 4. Summary of root mean squared error in the test set when using the top 15 variables of each model in the 50 datasets.</caption>
<thead>
<tr>
<th style="text-align:left;"> Model </th>
<th style="text-align:right;"> Max. </th>
<th style="text-align:right;"> Min. </th>
<th style="text-align:left;"> Mean </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;"> 3rd: Gamma, importance scores and correlation </td>
<td style="text-align:right;width: 4cm; "> 0.2404 </td>
<td style="text-align:right;"> 0.1845 </td>
<td style="text-align:left;"> <span style="display: inline-block; direction: rtl; border-radius: 4px; padding-right: 2px; background-color: lightgreen; width: 96.55%">0.2157</span> </td>
</tr>
<tr>
<td style="text-align:left;"> 2nd: Correlation </td>
<td style="text-align:right;width: 4cm; "> 0.2527 </td>
<td style="text-align:right;"> 0.1872 </td>
<td style="text-align:left;"> <span style="display: inline-block; direction: rtl; border-radius: 4px; padding-right: 2px; background-color: lightyellow; width: 100.00%">0.2234</span> </td>
</tr>
<tr>
<td style="text-align:left;"> 1st: Using lambda and gamma </td>
<td style="text-align:right;width: 4cm; "> 0.2458 </td>
<td style="text-align:right;"> 0.1913 </td>
<td style="text-align:left;"> <span style="display: inline-block; direction: rtl; border-radius: 4px; padding-right: 2px; background-color: lightyellow; width: 97.76%">0.2184</span> </td>
</tr>
</tbody>
</table>

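
---
## Using the top 15 variables (R sketch)

A minimal sketch of this comparison, assuming the `randomForest` package; `mean_importance`, the train/test matrices, and the response vectors are illustrative names for objects built in the previous steps, not our exact pipeline.

```r
library(randomForest)

# 15 variables with the highest mean importance in a given regularized model
top15 <- names(sort(mean_importance, decreasing = TRUE))[1:15]

# Standard random forest restricted to those variables
rf_top <- randomForest(x = X_train[, top15], y = y_train)

# Test-set root mean squared error
pred <- predict(rf_top, newdata = X_test[, top15])
sqrt(mean((y_test - pred)^2))
```
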

---
class: inverse, middle, center

# 5. Conclusions and Final Remarks

---
# 5. Conclusions and Final Remarks

- Variable selection in Random Forests is still a topic to be explored.
- **The GRRF, as proposed in (Deng and Runger, 2012), does not perform regularization as strictly as desired.**
- The third model performed well on the real dataset:
  - It selected far fewer variables,
  - It kept a prediction power similar to that of the bigger models,
  - The selected variables were not, overall, correlated.
- For categorical predictor variables, correlation might not be the best measure to use.
- Next steps include proposing new ways of regularizing the trees:
  - Potentially considering the depth of each tree
  - Limiting the maximum number of variables to select

`Code: https://github.com/brunaw/regularization-rf`

---
class: center, middle

## Acknowledgements

This work was supported by a Science Foundation Ireland Career Development Award, grant number 17/CDA/4695.

<img src="img/SFI_logo.jpg" width="50%" height="40%" style="display: block; margin: auto;" />

---
# Bibliography

<p><cite>Breiman, L. (2001). “Random Forests”. In: <em>Machine Learning</em> 45.1, pp. 5–32. DOI: <a href="https://doi.org/10.1023/A:1010933404324">10.1023/A:1010933404324</a>.</cite></p>

<p><cite><a id='bib-guided'></a><a href="#cite-guided">Deng, H. and G. C. Runger</a> (2012). “Gene selection with guided regularized random forest”. In: <em>CoRR</em> abs/1209.6425. eprint: 1209.6425. URL: <a href="http://arxiv.org/abs/1209.6425">http://arxiv.org/abs/1209.6425</a>.</cite></p>

<p><cite><a id='bib-Friedman1991'></a><a href="#cite-Friedman1991">Friedman, J. H.</a> (1991). “Rejoinder: Multivariate Adaptive Regression Splines”. In: <em>The Annals of Statistics</em>. ISSN: 0090-5364. DOI: <a href="https://doi.org/10.1214/aos/1176347973">10.1214/aos/1176347973</a>.</cite></p>

<p><cite><a id='bib-HastieTrevor'></a><a href="#cite-HastieTrevor">Hastie, T., R. Tibshirani, and J. Friedman</a> (2009). <em>The Elements of Statistical Learning</em>. Springer. DOI: <a href="https://doi.org/10.1007/b94608">10.1007/b94608</a>. URL: <a href="http://www.springerlink.com/index/10.1007/b94608">http://www.springerlink.com/index/10.1007/b94608</a>.</cite></p>

---
class: center, middle, inverse

# Thanks!

<img src="https://s3.amazonaws.com/kleebtronics-media/img/icons/github-white.png" width="50" height="50" align="middle"> <b>[@brunaw](https://github.com/brunaw)</b>
