class: middle, center

.pull-left[
# **The `tidyverse` for Machine Learning**
### Bruna Wundervald <br> Maynooth University
#### R-Ladies Helsinki Meetup, June 2020
]

.pull-right[
<div class="row">
<div class="column">
<img src="https://raw.githubusercontent.com/rstudio/hex-stickers/5fc34f4a5775bcefb5c31c33d750c435c4871a84/SVG/ggplot2.svg" width="100">
</div>
<div class="column">
<img src="https://raw.githubusercontent.com/rstudio/hex-stickers/5fc34f4a5775bcefb5c31c33d750c435c4871a84/SVG/dplyr.svg" width="150">
</div>
<div class="column">
<img src="https://raw.githubusercontent.com/rstudio/hex-stickers/5fc34f4a5775bcefb5c31c33d750c435c4871a84/SVG/purrr.svg" width="100">
</div>
<div class="column">
<img src="https://raw.githubusercontent.com/rstudio/hex-stickers/5fc34f4a5775bcefb5c31c33d750c435c4871a84/SVG/tibble.svg" width="100">
</div>
<div class="column">
<img src="https://raw.githubusercontent.com/rstudio/hex-stickers/5fc34f4a5775bcefb5c31c33d750c435c4871a84/SVG/tidymodels.svg" width="100">
</div>
</div>
]

---
class: inverse, middle

.pull-left[
<img style="border-radius: 100%;" src="https://github.com/brunaw.png" width="250px"/>

- Ph.D. Candidate in Statistics at the [Hamilton Institute, Maynooth University](https://www.maynoothuniversity.ie/hamilton)
- Especially interested in tree-based models:
  - Regularization for tree-based models
  - Bayesian Additive Regression Trees (BART)
]

.pull-right[
# Find me

[GitHub: @brunaw](http://github.com/brunaw)

[Site: http://brunaw.com/](http://brunaw.com/)

[Twitter: @bwundervald](http://twitter.com/bwundervald)
]

---
class: middle

# Summary

- Tree-based models
- Gain penalization for tree regularization
- The data
- Modelling:
  - Train and test splits
  - Creating a model list
  - Building a modelling function
  - Training all models
  - Evaluating the models

#### Find this talk at: http://brunaw.com/slides/rladies-helsinki/talk.html
#### GitHub: https://github.com/brunaw/tidyverse-for-ml

---

# Motivation

Using the `tidyverse` for Machine Learning gives us:

- All the tools needed
- Clear & consistent syntax
- Reproducibility advantages: all elements of our models live in a single object
  - Train and test sets, tuning parameters, models used, evaluation metrics, etc.

# Basic ML Steps

- Train and test separation
- Model definition
- Model evaluation

---
class: inverse, middle

# Tree-based models

---

# Tree-based models

.pull-left[
<img src="img/trees.png" width="80%" height="30%" style="display: block; margin: auto;" />
]

.pull-right[
Suppose we have a response variable `\(Y\)` (continuous or categorical) and a set of predictor variables `\(\mathbf{X}\)`.

- Trees stratify the predictors' space into regions
- They use binary splitting rules to find those regions

<img src="img/vars_space2.png" width="80%" height="30%" style="display: block; margin: auto;" />
]

---

# Trees: the algorithm

Recursive binary splitting:

1. Select the predictor `\(X_j\)` and the cutpoint `\(s\)` such that the split `\(\{X | X_j < s\}\)` and `\(\{X | X_j \geq s\}\)` leads to the greatest reduction in the variance of `\(Y\)`.
    - All predictors and all available cutpoints are tested
2. For each region found, predict either the mean of `\(Y\)` in the region (continuous case) or the most common class (classification case).
3. Continue until some stopping criterion is reached
    - Example: continue until no region contains more than 5 observations
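A minimal sketch of step 1 for a single numeric predictor (not code from this analysis; `best_split()`, `x`, and `y` are made up for illustration):

```r
# Hypothetical sketch: exhaustive search for the best cutpoint of one numeric
# predictor x, scoring each candidate split by its reduction in RSS of y
library(tidyverse)

best_split <- function(x, y) {
  rss <- function(v) sum((v - mean(v))^2)
  tibble(cutpoint = sort(unique(x))) %>%
    mutate(gain = map_dbl(cutpoint,
                          ~ rss(y) - rss(y[x < .x]) - rss(y[x >= .x]))) %>%
    slice_max(gain, n = 1)   # the cutpoint with the largest gain
}

# Usage (column name is illustrative only):
# best_split(x = data$some_lagged_predictor, y = data$ridership)
```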
---

# Gain penalization for tree regularization

- First presented in Deng and Runger (2013):
  - The authors **penalize the gain (RSS reduction)** of each variable, in each tree, when building a model

- The main idea is to weight the gain of each variable as

`$$\begin{equation} Gain_{R}(X_i, v) = \begin{cases} \lambda_i Gain(X_i, v), i \notin F \text{ and} \\ Gain(X_i, v), i \in F, \end{cases} \end{equation}$$`

where `\(F\)` is the set of indices of the variables already used in previous nodes and `\(\lambda_i \in (0, 1]\)` is the penalization applied to the split.

- New variables will only get picked if their gain is **very** high.

More details: *Wundervald, B., A. Parnell, and K. Domijan (2020). “Generalizing Gain Penalization for Feature Selection in Tree-based Models”. In: arXiv e-prints, p. arXiv:2006.07515. arXiv: 2006.07515*

---

### Modelling: tree-based methods

- Trees (CART): 1 tree, `\(\texttt{mtry}\)` = # all available variables
- *Bagging*: average of many trees, `\(\texttt{mtry}\)` = # all available variables
- Regularized *Bagging*: same as above, but with variable gains penalized by a factor between 0 and 1
- Regularized *Bagging*, with depth penalization: same as above, but with an "extra" penalization when a new variable is to be picked in a deep node of a tree:

`$$\begin{equation} Gain_{R}(\mathbf{X}_{i}, t, \mathbb{T}) = \begin{cases} \lambda_{i}^{d_{\mathbb{T}}} \Delta(i, t), \thinspace i \notin F \text{ and} \\ \Delta(i, t), \thinspace i \in F, \end{cases} \end{equation}$$`

where `\(d_{\mathbb{T}}\)` is the current depth of tree `\(\mathbb{T}\)`, `\(\mathbb{T} = 1, \dots, \texttt{ntree}\)`, when the `\(i\)`-th feature is evaluated.

- Random Forest: average of many trees, `\(\texttt{mtry} \approx \sqrt{\text{# all available variables}}\)`
- Regularized Random Forests: average of many trees, `\(\texttt{mtry} \approx \text{# all available variables}/2\)`, variable gains penalized by a factor between 0 and 1
- Regularized Random Forests with depth penalization: same as above, but with an "extra" penalization when a new variable is to be picked in a deep node of a tree

---

# The data

- Target variable (**ridership**): daily number of people entering the Clark and Lake train station in Chicago (in thousands)

> Goal: predict this variable and find the best subset of predictors for it

- 50 predictors:
  - Current date
  - The 14-day lagged ridership at this and other stations (units: thousands of rides/day)
  - Weather information
  - Sports team schedules
  - ... and more

---
class: middle

# Loading data and visualizing

.pull-left[
```{r}
library(tidyverse)
library(tidymodels)
library(ranger) # for tree-based models

data <- dials::Chicago

data %>%
  ggplot(aes(x = ridership)) +
  geom_density(fill = "#ff6767", alpha = 1) +
  labs(x = "Target Variable", y = "Density") +
  theme_classic(18)
```
]

.pull-right[
<img src="img/density.png" width="60%" style="display: block; margin: auto;" />
]

---
class: middle

- Interesting distribution!
- Good for tree-based models

<img src="img/density_boxes.png" width="60%" style="display: block; margin: auto;" />

---
class: inverse, middle

# Modelling
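---

# The pattern behind every step: list-columns + `map()`

Before Step 1, a toy sketch (not part of the original analysis; data and model are made up) of the `tidyverse` pattern used throughout: each row of a `tibble` stores a dataset, a model, and its metrics as list-columns.

```r
# Toy illustration only: one dataset per row, one model per dataset,
# one metric per model, all kept in the same tibble
library(tidyverse)

toy <- tibble(index = 1:3) %>%
  mutate(
    data  = map(index, ~ tibble(x = rnorm(100), y = rnorm(100))),  # list-column of data
    model = map(data, ~ lm(y ~ x, data = .x)),                     # list-column of fits
    rsq   = map_dbl(model, ~ summary(.x)$r.squared)                # regular numeric column
  )
```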
---

# Step 1. Train (75%) and test (25%) splits

```{r}
data_tibble <- rep(list(data), 10) %>%
  enframe(name = 'index', value = 'data') %>%
  mutate(train_test = purrr::map(data, initial_split, prop = 3/4))
```

```
# A tibble: 10 x 3
  index data                  train_test         
  <int> <list>                <list>             
1     1 <tibble [5,698 × 50]> <split [4.3K/1.4K]>
2     2 <tibble [5,698 × 50]> <split [4.3K/1.4K]>
3     3 <tibble [5,698 × 50]> <split [4.3K/1.4K]>
# … with 7 more rows
```

- The `train_test` column is a list-column: each element holds the corresponding train and test sets

---

# Step 2. Creating our model list

```{r}
models <- list(
  tree = list(mtry = ncol(data) - 1, trees = 1, reg.factor = 1, depth = FALSE),
  bagging = list(mtry = ncol(data) - 1, trees = 100, reg.factor = 1, depth = FALSE),
  bagging_reg = list(mtry = ncol(data) - 1, trees = 100, reg.factor = 0.7, depth = FALSE),
  bagging_reg_dep = list(mtry = ncol(data) - 1, trees = 100, reg.factor = 0.7, depth = TRUE),
  forest = list(mtry = sqrt(ncol(data) - 1), trees = 100, reg.factor = 1, depth = FALSE),
  reg_forest = list(mtry = (ncol(data) - 1)/2, trees = 100, reg.factor = 0.7, depth = FALSE),
  reg_forest_dep = list(mtry = (ncol(data) - 1)/2, trees = 100, reg.factor = 0.7, depth = TRUE),
  reg_forest2 = list(mtry = (ncol(data) - 1)/2, trees = 100, reg.factor = 0.2, depth = FALSE),
  reg_forest_dep2 = list(mtry = (ncol(data) - 1)/2, trees = 100, reg.factor = 0.2, depth = TRUE)) %>%
  enframe(name = 'model', value = 'parameters')
```

---
class: middle

Adding the models to our main `tibble`:

```{r}
data_tibble <- data_tibble %>%
* crossing(models) %>%
  arrange(model)
```

# What do we have so far?

A tibble with:

- All the train and test data
- Their combinations with all the model configurations

```
# A tibble: 90 x 5
  index data                  train_test          model   parameters      
  <int> <list>                <list>              <chr>   <list>          
1     1 <tibble [5,698 × 50]> <split [4.3K/1.4K]> bagging <named list [4]>
2     2 <tibble [5,698 × 50]> <split [4.3K/1.4K]> bagging <named list [4]>
3     3 <tibble [5,698 × 50]> <split [4.3K/1.4K]> bagging <named list [4]>
# … with 87 more rows
```

---

# Step 3. Building a modelling function

- We will use the `ranger` package through the `parsnip` interface
- There are main arguments and engine-specific arguments to be set

```{r}
modelling <- function(train, mtry = NULL, trees = NULL,
                      reg.factor = 1, depth = FALSE,
                      formula = ridership ~ ., mode = "regression") {

  model_setup <- parsnip::rand_forest(mode = mode, mtry = mtry, trees = trees) %>%
    parsnip::set_engine("ranger",
                        regularization.factor = reg.factor,
                        regularization.usedepth = depth)

  us_hol <- timeDate::listHolidays() %>%
    str_subset("(^US)|(Easter)")

  # Recipe
  rec <- recipe(formula, data = train) %>%
    step_holiday(date, holidays = us_hol) %>% # Include US holidays
    step_date(date) %>%
    step_rm(date)

  # Preparation
  preps <- prep(rec, verbose = FALSE)

  # Final fit!
  fit <- model_setup %>%
    parsnip::fit(formula, data = juice(preps))

  return(fit)
}
```

---

# Step 3. Building a modelling function

## What does our model configuration look like?

```
# Random Forest Model Specification (regression)
#
# Main Arguments:
#   mtry = mtry
#   trees = trees
#
# Engine-Specific Arguments:
#   regularization.factor = reg.factor
#   regularization.usedepth = depth
#
# Computational engine: ranger
```

---

# Step 4. Training all models (90) at once -- might be slow (!)

```{r}
training_models <- data_tibble %>%
  mutate(
    all_parameters = map2(parameters, map(train_test, training),
                          ~ list_modify(.x, train = .y))) %>%
  mutate(model_trained = invoke_map(modelling, all_parameters))
```
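Note: `invoke_map()` has since been retired in `purrr`. A sketch of the same step with base `do.call()` instead, using the columns created above:

```r
# Alternative to invoke_map(): do.call() applies modelling() to each named
# parameter list (which already includes the training data)
training_models <- data_tibble %>%
  mutate(
    all_parameters = map2(parameters, map(train_test, training),
                          ~ list_modify(.x, train = .y)),
    model_trained  = map(all_parameters, ~ do.call(modelling, .x))
  )
```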
---

# Which are the best models?

- Metrics:
  - Root Mean Squared Error
  - Total number of variables used in the model
  - R-squared

```{r}
rmse <- function(model, test, formula = ridership ~ .,
                 us_hol = timeDate::listHolidays() %>% str_subset("(^US)|(Easter)")) {

  rec <- recipe(formula, data = test) %>%
    step_holiday(date, holidays = us_hol) %>% # Include US holidays
    step_date(date) %>%
    step_rm(date)

  preps <- prep(rec, verbose = FALSE)
  pp <- predict(model, juice(preps))
  sqrt(mean((pp$.pred - test$ridership)^2))
}

n_variables <- function(model){
  length(unique((unlist(model$fit$forest$split.varIDs))))
}
```

---

# Step 5. Evaluating models

```r
results <- training_models %>%
  mutate(rmse = map2_dbl(.x = model_trained,
                         .y = map(train_test, testing),
                         rmse),
         n_variables = map_int(model_trained, n_variables),
         rsquared = map_dbl(model_trained, ~{.x$fit$r.squared}))
```

<table class="table table-condensed table-hover" style="margin-left: auto; margin-right: auto;">
<caption>Mean results per model combination</caption>
 <thead>
  <tr>
   <th style="text-align:left;"> model </th>
   <th style="text-align:left;"> rmse </th>
   <th style="text-align:left;"> n_variables </th>
   <th style="text-align:left;"> rsquared </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> bagging </td>
   <td style="text-align:left;width: 2cm; "> <span style="display: inline-block; direction: rtl; border-radius: 4px; padding-right: 2px; background-color: lightgreen; width: 60.47%">1.56</span> </td>
   <td style="text-align:left;width: 2cm; "> 71 </td>
   <td style="text-align:left;"> <span style="display: inline-block; direction: rtl; border-radius: 4px; padding-right: 2px; background-color: lightgreen; width: 100.00%">0.94</span> </td>
  </tr>
  <tr>
   <td style="text-align:left;"> bagging_reg </td>
   <td style="text-align:left;width: 2cm; "> 2.29 </td>
   <td style="text-align:left;width: 2cm; "> 13.5 </td>
   <td style="text-align:left;"> 0.88 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> bagging_reg_dep </td>
   <td style="text-align:left;width: 2cm; "> 2.58 </td>
   <td style="text-align:left;width: 2cm; "> <span style="display: inline-block; direction: rtl; border-radius: 4px; padding-right: 2px; background-color: lightgreen; width: 16.48%">11.7</span> </td>
   <td style="text-align:left;"> 0.83 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> forest </td>
   <td style="text-align:left;width: 2cm; "> 2.12 </td>
   <td style="text-align:left;width: 2cm; "> 71 </td>
   <td style="text-align:left;"> 0.89 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> reg_forest </td>
   <td style="text-align:left;width: 2cm; "> 2.22 </td>
   <td style="text-align:left;width: 2cm; "> 26.8 </td>
   <td style="text-align:left;"> 0.88 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> reg_forest_dep </td>
   <td style="text-align:left;width: 2cm; "> 2.31 </td>
   <td style="text-align:left;width: 2cm; "> 24.4 </td>
   <td style="text-align:left;"> 0.87 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> reg_forest_dep2 </td>
   <td style="text-align:left;width: 2cm; "> 2.37 </td>
   <td style="text-align:left;width: 2cm; "> 25.9 </td>
   <td style="text-align:left;"> 0.86 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> reg_forest2 </td>
   <td style="text-align:left;width: 2cm; "> 2.41 </td>
   <td style="text-align:left;width: 2cm; "> 24.9 </td>
   <td style="text-align:left;"> 0.86 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> tree </td>
   <td style="text-align:left;width: 2cm; "> 2.41 </td>
   <td style="text-align:left;width: 2cm; "> 64.5 </td>
   <td style="text-align:left;"> 0.86 </td>
  </tr>
</tbody>
</table>

---

## A closer look at the RMSE

<img src="img/results.png" width="60%" style="display: block; margin: auto;" />
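A plot like the one above can be drawn straight from `results`; a rough sketch (not necessarily the code behind `img/results.png`), assuming the columns created in Step 5:

```r
# Hypothetical sketch: RMSE per model across the 10 train/test splits
results %>%
  ggplot(aes(x = reorder(model, rmse), y = rmse)) +
  geom_boxplot(fill = "#ff6767", alpha = 0.5) +   # spread over the 10 splits
  geom_jitter(width = 0.1, alpha = 0.6) +          # one point per split
  coord_flip() +
  labs(x = "Model", y = "RMSE (test set)") +
  theme_classic(18)
```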
width="60%" style="display: block; margin: auto;" /> --- # The final object ``` # A tibble: 90 x 10 index data train_test model parameters all_parameters model_trained rmse <int> <lis> <list> <chr> <list> <list> <list> <dbl> 1 1 <tib… <split [4… bagg… <named li… <named list [… <fit[+]> 1.44 2 2 <tib… <split [4… bagg… <named li… <named list [… <fit[+]> 1.44 3 3 <tib… <split [4… bagg… <named li… <named list [… <fit[+]> 1.56 4 4 <tib… <split [4… bagg… <named li… <named list [… <fit[+]> 1.60 5 5 <tib… <split [4… bagg… <named li… <named list [… <fit[+]> 1.50 # … with 85 more rows, and 2 more variables: n_variables <int>, rsquared <dbl> ``` --- # Conclusions .pull-left[ - We can build a unique object to store everything at the same time: data, train, test, seeds, hyperparameters, fitted models, results, evaluation metrics, computational details, etc - Very useful to quickly compare models - Reproducibility (papers, reports) ] .pull-right[ <img src="img/purrrr.jpg" width="70%" style="display: block; margin: auto;" /> ] --- --- # References <p><cite><a id='bib-guided'></a><a href="#cite-guided">Deng, H. and G. Runger</a> (2013). “Gene selection with guided regularized random forest”. In: <em>Pattern Recognition</em> 46.12, pp. 3483–3489.</cite></p> <p><cite>Wundervald, B, A. Parnell, and K. Domijan (2020). “Generalizing Gain Penalization for Feature Selection in Tree-based Models”. In: <em>arXiv e-prints</em>, p. arXiv:2006.07515. arXiv: 2006.07515 [stat.ML].</cite></p> --- class: bottom, center, inverse <font size="30">Thanks! </font> <img src= "https://s3.amazonaws.com/kleebtronics-media/img/icons/github-white.png", width="50", height="50", align="middle"> <b> <color="FFFFFF"> https://github.com/brunaw </color>