layout: true

<div class="my-footer"><span>https://github.com/brunaw/reg-rf-demo</span></div> 

<!-- this adds the link footer to all slides, depends on my-footer class in css-->

---
name: bookdown-title

<br>
<br>

.pull-left[
<div class="column">
<img src="img/MU_logo.png" width="350">
</div>
]

<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>

### .fancy[Feature Selection via Gain Penalization in Random Forests]

.large[Bruna Wundervald | Maynooth University | V SER, June 2021]

---
class: inverse, middle

### .fancy[Summary]

- Introduction
- Random Forests
- Gain Penalization
- Feature Selection
- Implementation: `R` code
  - Data
  - First models
  - Feature selection
  - Final model

---
class: inverse, middle, center

### .fancy[Introduction]

---

# Introduction

- Oftentimes, we want to estimate machine learning models that use just a few of the available predictors

- Random Forests are a form of tree-based ensemble

- They grow many trees on resamples of the data and average the results at the end (a bagged ensemble; Breiman (1996)):

`$$\hat f(\mathbf{x}) = \sum_{n = 1}^{N_{tree}} \frac{1}{N_{tree}} \hat f_n(\mathbf{x}),$$`

where `\(\hat f_n\)` corresponds to the `\(n\)`-th tree

## Issues

- They use all or most of the features that are fed to them
- They struggle to deal with highly correlated features (Louppe (2014))

---

# Gain penalization

- A tree-gain weighting method (Deng and Runger (2012))

- When determining the next child node to be added to a decision tree, the gain (or error reduction) of each feature is multiplied by a penalization parameter:

`$$\begin{equation} \text{Gain}_{R}(\mathbf{X}_{i}, t) = \begin{cases} \lambda \Delta(i, t), \thinspace i \notin \mathbb{U} \text{ and} \\ \Delta(i, t), \thinspace i \in \mathbb{U}, \end{cases} \label{eq:grrf}\end{equation}$$`

where `\(\mathbb{U}\)` is the set of indices of the features previously used in the tree, `\(\mathbf{X}_{i}\)` is the candidate feature, `\(t\)` is the candidate splitting point, and `\(\lambda \in (0, 1]\)` is the penalization coefficient

> A new split will only be made if, after the penalization, the gain of adding this node is still higher than having no new child node in the tree

---

# Generalizing gain penalization

- In Wundervald, Parnell, and Domijan (2020), we proposed a generalization of the way the penalization coefficients are calculated, so that we have full control over them.

- Our `\(\lambda_i\)` is written as

`$$\begin{equation} \lambda_i = (1 - \gamma) \lambda_0 + \gamma g(\mathbf{x}_i), \label{eq:generalization} \end{equation}$$`

where `\(\lambda_0 \in [0, 1)\)` is interpreted as the baseline penalization, `\(g(\mathbf{x}_i)\)` is a function of the `\(i\)`-th feature, and `\(\gamma \in [0, 1)\)` is their mixture parameter, with `\(\lambda_i \in [0, 1)\)`.

- `\(g(\mathbf{x}_i)\)` should represent relevant information about the features, based on some characteristic of interest

- Inspired by the use of priors in Bayesian methods

- Global-local penalization (a toy numerical sketch of the mixture is shown on the next slide)
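---

# Computing the penalization coefficients: a toy sketch

A minimal numerical sketch (not part of the original analysis) of the mixture `\(\lambda_i = (1 - \gamma) \lambda_0 + \gamma g(\mathbf{x}_i)\)`, assuming normalized impurity importances as `\(g(\mathbf{x}_i)\)`; the feature names and values below are hypothetical.

```r
# Hypothetical raw importances for three features
imps <- c(x1 = 8.2, x2 = 3.1, x3 = 0.4)
g_x  <- imps / max(imps)   # g(x_i), normalized to [0, 1]

gamma    <- 0.5            # mixture parameter
lambda_0 <- 0.35           # baseline penalization

# lambda_i = (1 - gamma) * lambda_0 + gamma * g(x_i)
lambda_i <- (1 - gamma) * lambda_0 + gamma * g_x
round(lambda_i, 3)
#>    x1    x2    x3 
#> 0.675 0.364 0.199
```

The larger `\(g(\mathbf{x}_i)\)` is, the closer `\(\lambda_i\)` gets to 1, so the gains of features deemed more relevant are penalized less. This is the same mixture that the `penalization()` function computes later, using importance or mutual information scores.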
---

# Feature Selection

The full feature selection procedure happens in 3 main steps:

1. Run a set of penalized random forest models with different hyperparameters, recording their accuracies and final feature sets

2. For each training set, select the top-n (here we use n = 3) fitted models in terms of accuracy, and run a "new" random forest for each of the feature sets they used. This is done on all of the training sets, so we can evaluate how these features perform in slightly different scenarios

3. Finally, take the top-m models (here m = 30) from these new ones, check which features were used most often among them, and run a final random forest model with this feature set.

- Here we select only the 15 most used features from the top 30 models

---
class: inverse, middle, center

### .fancy[Implementation]

---

# Data

- `gravier` dataset (Gravier, Pierron, Vincent-Salomon, Gruel, Raynal, Savignoni, De Rycke, Pierga, Lucchesi, Reyal, and others (2010))
- Goal: to predict whether the outcome of 168 breast cancer patients was labelled "good" (~66%) or "poor" (~34%), with 2905 predictors available
- `tidyverse` + `tidymodels`

.panelset[
.panel[.panel-name[Data]

```r
library(tidyverse)
library(tidymodels)
library(infotheo) # For the mutual information function
set.seed(2021)

# Loading data and creating a 5-fold CV object
data('gravier', package = 'datamicroarray')
gravier <- data.frame(class = gravier$y, gravier$x)

folds <- rsample::vfold_cv(gravier, v = 5) %>% 
  dplyr::mutate(train = map(splits, training), 
                test = map(splits, testing))
```
]]

---

# Modelling functions

> `C++` implementation at https://github.com/imbs-hl/ranger

.panelset[
.panel[.panel-name[Modelling]

```r
# A function that runs the penalized random forest models
modelling <- function(train, reg_factor = 1, mtry = 1){
  rf_mod <- rand_forest(trees = 500, mtry = (mtry * ncol(train)) - 1) %>%
    set_engine("ranger", importance = "impurity",
               regularization.factor = reg_factor) %>%
    set_mode("classification") %>%
    parsnip::fit(class ~ ., data = train)
  return(rf_mod)
}
```
]

.panel[.panel-name[Penalization]

```r
# A function that receives the mixing parameters
# and calculates lambda_i with the chosen g(x_i)
penalization <- function(gamma, lambda_0, data = NULL, imps = NULL, type = "rf"){
  if(type == "rf"){
    # Calculating the normalized importance values
    imps <- imps/max(imps)
    imp_mixing <- (1 - gamma) * lambda_0 + imps * gamma
    return(imp_mixing)
  } else if(type == "MI"){
    mi <- function(data, var) mutinformation(c(data$class), data %>% pull(var))
    # Calculating the normalized mutual information values
    disc_data <- infotheo::discretize(data)
    disc_data$class <- as.factor(data$class)
    names_data <- names(data)[-1]
    mi_vars <- names_data %>% map_dbl(~{ mi(data = disc_data, var = .x) })
    mi_mixing <- (1 - gamma) * lambda_0 + gamma * (mi_vars/max(mi_vars))
    return(mi_mixing)
  }
}
```
]

.panel[.panel-name[Hyperparameters]

```r
# Setting all hyperparameters ---
mtry <- tibble(mtry = c(0.20, 0.45, 0.85))
gamma_f <- c(0.3, 0.5, 0.8)
lambda_0_f <- c(0.35, 0.75)
parameters <- mtry %>% tidyr::crossing(lambda_0_f, gamma_f)

# Add gamma_f and lambda_0_f and run the functions with them ------
folds_imp <- folds %>% 
  dplyr::mutate(
    # Run the standard random forest model for the 5 folds
    model = purrr::map(train, modelling), 
    importances_std = purrr::map(model, ~{.x$fit$variable.importance})) %>% 
  tidyr::expand_grid(parameters) %>% 
  dplyr::mutate(
    imp_rf = purrr::pmap(
      list(gamma_f, lambda_0_f, train, importances_std), 
      type = "rf", penalization),
    imp_mi = purrr::pmap(
      list(gamma_f, lambda_0_f, train, importances_std), 
      type = "MI", penalization))
```
]
]

---
class: middle

## Looking at the accuracy of the standard random forest

<table class="table table-condensed table-hover" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> id </th>
   <th style="text-align:right;"> n_var </th>
   <th style="text-align:right;"> accuracy_test_std </th>
   <th style="text-align:right;"> accuracy_std </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Fold1 </td>
   <td style="text-align:right;"> 1206 </td>
   <td style="text-align:right;"> 0.735 </td>
   <td style="text-align:right;"> 0.8481872 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Fold2 </td>
   <td style="text-align:right;"> 1380 </td>
   <td style="text-align:right;"> 0.765 </td>
   <td style="text-align:right;"> 0.8235503 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Fold3 </td>
   <td style="text-align:right;"> 1379 </td>
   <td style="text-align:right;"> 0.765 </td>
   <td style="text-align:right;"> 0.8246055 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Fold4 </td>
   <td style="text-align:right;"> 1377 </td>
   <td style="text-align:right;"> 0.788 </td>
   <td style="text-align:right;"> 0.8260160 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Fold5 </td>
   <td style="text-align:right;"> 1198 </td>
   <td style="text-align:right;"> 0.667 </td>
   <td style="text-align:right;"> 0.8548713 </td>
  </tr>
</tbody>
</table>
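---

# Helper functions (sketch)

The code on the next slides calls a few small helpers (`n_vars`, `acc_test`, `get_formula`, `get_vars`, `modelling_reev`) that are defined in the full code repository (linked at the end), not on these slides. The versions below are assumptions: minimal sketches consistent with how the helpers are used here, not the repository's exact code.

```r
# Assumed minimal helpers, consistent with how they are called later

# Number of features actually used by a fitted forest
n_vars <- function(importance) sum(importance > 0)

# Names of the used features
get_vars <- function(importance) names(importance)[importance > 0]

# Formula containing only the used features
get_formula <- function(importance) {
  as.formula(paste("class ~", paste(get_vars(importance), collapse = " + ")))
}

# Test-set accuracy of a fitted parsnip model
acc_test <- function(model, test) {
  mean(predict(model, new_data = test)$.pred_class == test$class)
}

# Standard (non-penalized) random forest for a given formula
modelling_reev <- function(train, form) {
  rand_forest(trees = 500) %>%
    set_engine("ranger", importance = "impurity") %>%
    set_mode("classification") %>%
    parsnip::fit(form, data = train)
}
```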
---

# Running models

.panelset[
.panel[.panel-name[All models]

```r
run_all_models <- folds_imp %>% 
  dplyr::select(id, model, train, test, imp_rf, imp_mi, 
                mtry, lambda_0_f, gamma_f) %>% 
  tidyr::gather(type, importance, -train, -test, -mtry, -id, 
                -model, -lambda_0_f, -gamma_f) %>% 
  dplyr::mutate(fit_penalized_rf = purrr::pmap(
    list(train, importance, mtry), modelling))
```
]

.panel[.panel-name[Metrics]

```r
results <- run_all_models %>% 
  dplyr::mutate(
    model_importance = purrr::map(fit_penalized_rf, ~{.x$fit$variable.importance}),
    n_var = purrr::map_dbl(model_importance, n_vars),
    accuracy = 1 - purrr::map_dbl(fit_penalized_rf, ~{ .x$fit$prediction.error}),
    accuracy_test = purrr::map2_dbl(
      .x = fit_penalized_rf, .y = test, ~{ acc_test(.x, .y)}))
```
]

.panel[.panel-name[First Results]

<table class="table table-condensed table-hover" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> id </th>
   <th style="text-align:right;"> mtry </th>
   <th style="text-align:right;"> lambda_0_f </th>
   <th style="text-align:right;"> gamma_f </th>
   <th style="text-align:left;"> type </th>
   <th style="text-align:right;"> n_var </th>
   <th style="text-align:right;"> accuracy </th>
   <th style="text-align:right;"> accuracy_test </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Fold1 </td>
   <td style="text-align:right;"> 0.85 </td>
   <td style="text-align:right;"> 0.75 </td>
   <td style="text-align:right;"> 0.5 </td>
   <td style="text-align:left;"> imp_rf </td>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:right;"> 0.8979667 </td>
   <td style="text-align:right;"> 0.824 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Fold1 </td>
   <td style="text-align:right;"> 0.20 </td>
   <td style="text-align:right;"> 0.75 </td>
   <td style="text-align:right;"> 0.3 </td>
   <td style="text-align:left;"> imp_rf </td>
   <td style="text-align:right;"> 32 </td>
   <td style="text-align:right;"> 0.8786310 </td>
   <td style="text-align:right;"> 0.794 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Fold1 </td>
   <td style="text-align:right;"> 0.45 </td>
   <td style="text-align:right;"> 0.75 </td>
   <td style="text-align:right;"> 0.8 </td>
   <td style="text-align:left;"> imp_rf </td>
   <td style="text-align:right;"> 12 </td>
   <td style="text-align:right;"> 0.9001406 </td>
   <td style="text-align:right;"> 0.765 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Fold1 </td>
   <td style="text-align:right;"> 0.45 </td>
   <td style="text-align:right;"> 0.35 </td>
   <td style="text-align:right;"> 0.3 </td>
   <td style="text-align:left;"> imp_rf </td>
   <td style="text-align:right;"> 11 </td>
   <td style="text-align:right;"> 0.8998857 </td>
   <td style="text-align:right;"> 0.765 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Fold1 </td>
   <td style="text-align:right;"> 0.45 </td>
   <td style="text-align:right;"> 0.35 </td>
   <td style="text-align:right;"> 0.5 </td>
   <td style="text-align:left;"> imp_rf </td>
   <td style="text-align:right;"> 13 </td>
   <td style="text-align:right;"> 0.8995989 </td>
   <td style="text-align:right;"> 0.765 </td>
  </tr>
</tbody>
</table>
]]
---

<div class="figure" style="text-align: center">
<img src="presentation_files/figure-html/unnamed-chunk-10-1.png" alt="Test accuracies for each combination of mtry, type of `\(g(\mathbf{x}_i)\)`, and `\(\gamma\)`." />
<p class="caption">Test accuracies for each combination of mtry, type of `\(g(\mathbf{x}_i)\)`, and `\(\gamma\)`.</p>
</div>

---

<div class="figure" style="text-align: center">
<img src="presentation_files/figure-html/unnamed-chunk-11-1.png" alt="Final number of variables used for each combination of mtry, type of `\(g(\mathbf{x}_i)\)`, and `\(\gamma\)`" />
<p class="caption">Final number of variables used for each combination of mtry, type of `\(g(\mathbf{x}_i)\)`, and `\(\gamma\)`</p>
</div>

---

# Feature selection

.panelset[
.panel[.panel-name[Selecting best models & running again]

```r
best_models <- results %>% 
  arrange(desc(accuracy_test), desc(accuracy), n_var) %>% 
  group_by(id) %>% 
  slice(1:3) %>% 
  ungroup() %>% 
  mutate(new_formula = map(model_importance, get_formula))

# Re-evaluating selected variables -----------------
reev <- tibble(forms = best_models$new_formula) %>% 
  tidyr::expand_grid(folds) %>% 
  dplyr::mutate(reev_models = purrr::map2(train, forms, modelling_reev))

results_reev <- reev %>% 
  dplyr::mutate(
    feat_importance = purrr::map(reev_models, ~{.x$fit$variable.importance}),
    n_var = purrr::map_dbl(feat_importance, n_vars),
    accuracy = 1 - purrr::map_dbl(reev_models, ~{ .x$fit$prediction.error}),
    accuracy_test = purrr::map2_dbl(.x = reev_models, .y = test, ~{ acc_test(.x, test = .y)}))
```
]

.panel[.panel-name[Results]

<table class="table table-condensed table-hover" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> id </th>
   <th style="text-align:right;"> n_var </th>
   <th style="text-align:right;"> accuracy </th>
   <th style="text-align:right;"> accuracy_test </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Fold1 </td>
   <td style="text-align:right;"> 31 </td>
   <td style="text-align:right;"> 0.8822612 </td>
   <td style="text-align:right;"> 0.824 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Fold1 </td>
   <td style="text-align:right;"> 33 </td>
   <td style="text-align:right;"> 0.8767357 </td>
   <td style="text-align:right;"> 0.824 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Fold1 </td>
   <td style="text-align:right;"> 31 </td>
   <td style="text-align:right;"> 0.8853979 </td>
   <td style="text-align:right;"> 0.794 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Fold1 </td>
   <td style="text-align:right;"> 13 </td>
   <td style="text-align:right;"> 0.8638945 </td>
   <td style="text-align:right;"> 0.794 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Fold1 </td>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:right;"> 0.8918104 </td>
   <td style="text-align:right;"> 0.765 </td>
  </tr>
</tbody>
</table>
]
]

---

# Final models

.panelset[
.panel[.panel-name[Final predictors]

```r
selected_vars <- results_reev %>% 
  arrange(desc(accuracy_test), desc(accuracy), n_var) %>% 
  slice(1:30) %>% 
  mutate(ind = 1:n(), 
         vars = map(feat_importance, get_vars)) %>% 
dplyr::select(ind, vars) %>% unnest() %>% group_by(vars) %>% summarise(count = n()) %>% arrange(desc(count)) # Select the final 15 features final_vars <- selected_vars %>% slice(1:15) %>% pull(vars) # Create the final formula final_form <- paste("class ~ ", paste0(final_vars, collapse = ' + ')) %>% as.formula() ``` ] .panel[.panel-name[Final Model] ```r # Create the 20 new training and test sets set.seed(2021) folds_20 <- rsample::vfold_cv(gravier, v = 20) %>% dplyr::mutate(train = map(splits, training), test = map(splits, testing)) # Run the final model for the new train-test sets final_results <- folds_20$splits %>% map(~{ train <- training(.x) test <- testing(.x) rf <- rand_forest(trees = 500, mtry = 7) %>% set_engine("ranger", importance = "impurity") %>% set_mode("classification") %>% parsnip::fit(final_form, data = train) accuracy_test <- acc_test(rf, test = test) list(accuracy_test = accuracy_test, accuracy = 1 - rf$fit$prediction.error, imp = rf$fit$variable.importance) }) ``` ] .panel[.panel-name[Results] <table class="table table-condensed table-hover" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> type </th> <th style="text-align:right;"> mean </th> <th style="text-align:right;"> median </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> accuracy </td> <td style="text-align:right;"> 0.8974934 </td> <td style="text-align:right;"> 0.8971951 </td> </tr> <tr> <td style="text-align:left;"> accuracy_test </td> <td style="text-align:right;"> 0.8681000 </td> <td style="text-align:right;"> 0.8820000 </td> </tr> </tbody> </table> ]] --- class: middle <div class="figure" style="text-align: center"> <img src="presentation_files/figure-html/unnamed-chunk-17-1.png" alt="Average importance values for the final selected variables" /> <p class="caption">Average importance values for the final selected variables</p> </div> --- class: inverse, middle, center ### .fancy[Conclusions] --- # Conclusions - Our gain penalization method can be very useful to select features for random forests - In the example, the final test accuracy (average) is higher than what was seen in the previous plots, but now using only 15 features - These results are better than what was previously observed in the literature for this data (López, Maldonado, and Carrasco (2018); Takada, Suzuki, and Fujisawa (2018); Huynh, Do, and others (2020)) Links: - [To paper](https://ieeexplore.ieee.org/document/9229097) - [To full code](https://github.com/brunaw/reg-rf-demo/tree/master/code) - [To presentation repository](https://github.com/brunaw/reg-rf-demo/tree/master/presentation) --- # References <p><cite><a id='bib-Breiman1996'></a><a href="#cite-Breiman1996">Breiman, L.</a> (1996). “Bagging Predictors”. In: <em>Machine Learning</em> 24.2, pp. 123–140. ISSN: 1573-0565. DOI: <a href="https://doi.org/10.1023/A:1018054314350">10.1023/A:1018054314350</a>. URL: <a href="https://doi.org/10.1023/A:1018054314350">https://doi.org/10.1023/A:1018054314350</a>.</cite></p> <p><cite><a id='bib-rrf_paper'></a><a href="#cite-rrf_paper">Deng, H. and G. Runger</a> (2012). “Feature selection via regularized trees”. In: <em>The 2012 International Joint Conference on Neural Networks (IJCNN)</em>. IEEE. , pp. 1–8.</cite></p> <p><cite><a id='bib-gravier2010prognostic'></a><a href="#cite-gravier2010prognostic">Gravier, E, G. Pierron, A. Vincent-Salomon, et al.</a> (2010). “A prognostic DNA signature for T1T2 node-negative breast cancer patients”. 
In: <em>Genes, chromosomes and cancer</em> 49.12, pp. 1125–1134.</cite></p> <p><cite><a id='bib-phdthesisRF'></a><a href="#cite-phdthesisRF">Louppe, G.</a> (2014). “Understanding Random Forests: From Theory to Practice”. DOI: <a href="https://doi.org/10.13140/2.1.1570.5928">10.13140/2.1.1570.5928</a>.</cite></p> <p><cite><a id='bib-wundervald2020generalizing'></a><a href="#cite-wundervald2020generalizing">Wundervald, B, A. C. Parnell, and K. Domijan</a> (2020). “Generalizing Gain Penalization for Feature Selection in Tree-Based Models”. In: <em>IEEE Access</em> 8, pp. 190231–190239.</cite></p> --- class: middle, center, inverse <font size="80">Thanks! </font> <b> <font size="80">http://brunaw.com/ </font> <p>