class: center, middle, inverse, title-slide

# Introduction to Random Forests in R
## R-Ladies Dublin Meetup
### Bruna Wundervald
### June, 2019

---
class: middle

# Outline

- Introduction
- Trees
  - Model
  - Using in `R`
- Random Forests
  - Model
  - Using in `R`
- Resources

---
class: middle

.pull-left[
<img src="img/bruna.jpg" width="70%" style="display: block; margin: auto;" />
]

.pull-right[
**Bruna Wundervald**

- Ph.D. Candidate in Statistics at Maynooth University.
- Twitter: @bwundervald
- GitHub: @brunaw
]

Link for this presentation: http://brunaw.com/slides/rladies-dublin/RF/intro-to-rf.html

---
# Introduction

- Goal: predict continuous or class variables
- How?
  - Using data split into train and test sets
  - Tree-based models to create predictions based on a set of explanatory/predictor variables

---
# Pre-requisites

.pull-left[
<img src="img/pipe.png" width="50%" height="30%" style="display: block; margin: auto;" />
<img src="img/dplyr.png" width="50%" height="30%" style="display: block; margin: auto;" />
]

.pull-right[
- `%>%` (the pipe): takes the object on its left and applies the operation on its right to it
- `dplyr`: a set of (very) useful functions for data manipulation
- `ggplot2`: graphic tools, following the [grammar of graphics](https://byrneslab.net/classes/biol607/readings/wickham_layered-grammar.pdf)

<br>

<img src="img/ggplot2.png" width="50%" height="30%" style="display: block; margin: auto;" />
]

---
class: inverse, middle, center

## Trees 🌳

---
class: middle

# Trees: the model

<img src="img/trees.png" width="50%" height="30%" style="display: block; margin: auto;" />

---
# Trees: the model

Suppose we have a response variable `\(Y\)` (continuous or class), and a set of predictor variables `\(\mathbf{X}\)`.

- Trees stratify the predictors' space into regions
- They use binary splitting rules to find the regions

.pull-left[
<img src="img/vars_space.png" width="80%" height="30%" style="display: block; margin: auto;" />
]

.pull-right[
<img src="img/vars_space2.png" width="80%" height="30%" style="display: block; margin: auto;" />
]

Breaks in the predictors' space (Hastie, Tibshirani, and Friedman, 2009)

---
# Trees: the algorithm

Recursive binary splitting:

- Select the predictor `\(X_j\)` and the cutpoint `\(s\)` such that the split `\(\{X | X_j < s\}\)` and `\(\{X | X_j \geq s\}\)` leads to the greatest reduction in the variance of `\(Y\)`
  - All predictors and all available cutpoints are tested
- For each region found, predict either the mean of `\(Y\)` in the region (continuous case) or the most common class (classification case)
- Continue until some stopping criterion is reached
  - Example: continue until no region contains more than 5 observations

---
# Trees: the algorithm

- Continuous response: the goal is to find the tree that minimizes the sum of squared errors, or

`$$\sum_{j = 1}^{J} \sum_{i \in R_j} (y_i - \hat y_{R_j})^2,$$`

where `\(R_j\)`, `\(j = 1, \dots, J\)`, are the regions found by the tree.

- Class/factor response: the goal is to find the tree that best separates the classes, minimizing the **Gini index**

`$$G = \sum_{k = 1}^{K} \hat p_{jk} (1 - \hat p_{jk}),$$`

where `\(K\)` is the total number of classes and `\(\hat p_{jk}\)` is the proportion of observations of class `\(k\)` in the `\(j\)`-th region.
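---
# Trees: the split search, by hand

A minimal sketch of the inner loop of recursive binary splitting, for a single continuous predictor. This is purely illustrative: `x` and `y` are hypothetical toy vectors, not part of the data used later in these slides.

```r
# Toy data: one predictor and a continuous response with a jump at x = 0.4
set.seed(1)
x <- runif(100)
y <- ifelse(x < 0.4, 2, 5) + rnorm(100, sd = 0.5)

# Sum of squared errors of a split at cutpoint s:
# each side of the split is predicted by its own mean
sse_split <- function(s) {
  left  <- y[x < s]
  right <- y[x >= s]
  sum((left - mean(left))^2) + sum((right - mean(right))^2)
}

# Try every observed value (except the smallest) as a candidate cutpoint
# and keep the one that minimizes the SSE
cutpoints <- sort(unique(x))[-1]
best <- cutpoints[which.min(sapply(cutpoints, sse_split))]
best
```

A full tree repeats this search over all predictors, then recursively inside each resulting region.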
---
# Trees: advantages

- Very interpretable: close to the human decision process
- Can be displayed graphically
- Handle non-linearities **between** the predictors well

<img src="img/nonlin.png" width="60%" style="display: block; margin: auto;" />

---
# Trees: in `R`

Using the `diamonds` data, from the `ggplot2` package, split into **train** and **test** sets:

```r
set.seed(2019)
data <- diamonds %>% 
  mutate(set = ifelse(runif(nrow(.)) > 0.75, "test", "train"))

train <- data %>% filter(set == "train") %>% select(-set)
test  <- data %>% filter(set == "test") %>% select(-set)

glimpse(data)
```

```
## Observations: 53,940
## Variables: 11
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.…
## $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Goo…
## $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J,…
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1,…
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59…
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, …
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 3…
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.…
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.…
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.…
## $ set     <chr> "test", "train", "train", "train", "train", "train", "te…
```

---
**Goal**: predict the price of the diamonds, in US dollars (\$326--\$18,823)

Variables:

- `carat`: weight of the diamond (0.2--5.01)
- `cut`: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
- `color`: diamond colour, from D (best) to J (worst)
- `clarity`: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
- `x`: length in mm (0--10.74)
- `y`: width in mm (0--58.9)
- `z`: depth in mm (0--31.8)
- `depth`: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79)
- `table`: width of top of diamond relative to widest point (43--95)

---

```r
train %>% 
  ggplot(aes(price)) +
  geom_density(fill = "#ff6767", alpha = 0.5) +
  theme_bw(22)
```

<img src="intro-to-rf_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />

---
# Trees: in `R`

`rpart` + `rpart.plot` packages!
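If the two packages are not installed yet, they can be grabbed from CRAN first (a minimal one-time setup sketch; skip it if they are already available):

```r
# One-time setup: tree fitting (rpart) and tree plotting (rpart.plot)
install.packages(c("rpart", "rpart.plot"))
```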
```r
library(rpart)
library(rpart.plot)

first_model <- rpart(price ~ color, data = train)
first_model
```

```
## n= 40533 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
## 1) root 40533 649205500000 3950.496  
##   2) color=D,E,F,G 28139 389702800000 3556.392 *
##   3) color=H,I,J 12394 245209600000 4845.259 *
```

---
class: middle

# Plotting the tree

```r
rpart.plot(first_model)
```

<img src="intro-to-rf_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />

---

```r
# Adding now a few more predictors
second_model <- rpart(price ~ cut + color + depth + carat, data = train)
second_model
```

```
## n= 40533 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 40533 649205500000  3950.496  
##    2) carat< 0.995 26120  32647670000  1635.811  
##      4) carat< 0.625 18529   4916240000  1051.213 *
##      5) carat>=0.625 7591   5942264000  3062.765 *
##    3) carat>=0.995 14413 222997400000  8145.291  
##      6) carat< 1.495 9747  46499840000  6147.496 *
##      7) carat>=1.495 4666  56331320000 12318.570  
##       14) carat< 1.915 3006  26690610000 10896.170 *
##       15) carat>=1.915 1660  12545830000 14894.300 *
```

---

```r
rpart.plot(second_model)
```

<img src="intro-to-rf_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />

---
## Variable importance

- A measure of "how important" each variable is for the final result
- Represents the total decrease in variance/impurity produced by the splits on that variable in the tree

```r
first_model$variable.importance/max(first_model$variable.importance)
```

```
## color 
##     1
```

```r
second_model$variable.importance/max(second_model$variable.importance)
```

```
##       carat       color       depth         cut 
## 1.000000000 0.030817117 0.004239303 0.001158316
```

(Much easier to read as a normalized measure!)

- Clearly, `carat` dominates the tree

---
# Predictions

```r
pred_first <- predict(first_model, test)
pred_second <- predict(second_model, test)

# Note: this helper takes the square root of the *sum* of squared errors
# (no mean), which is fine for comparing models on the same test set
rmse <- function(x) sqrt(sum((x - test$price)^2))

rmse(pred_first)
```

```
## [1] 452219.3
```

```r
rmse(pred_second)
```

```
## [1] 176960.1
```

Should we disregard the bad tree? **No**!

---
class: center, middle, inverse

## 🌵🌳🌲🌿 Random Forests 🌴🌲🌵🌳

---
# Random Forests

- A random forest is an ensemble (an average) of "tweaked" trees
- We grow **B** trees on bootstrap samples of the original data, and later average their predictions
- The big difference: at each split, only `\(m \approx \sqrt{p}\)` variables are tested
  - Why? It decorrelates the trees, preventing **strong** predictors from always appearing at the top
- Each tree ends up different, and all of them contribute to the final prediction

Algorithm: same as for the tree, but with bootstrapped data

<font color = "#ff6767"> The algorithm is very naive, but how naive? </font>

---
# Trees and Random Forests

Trees:

<img src="img/tree.png" width="25%" style="display: block; margin: auto;" />

Random Forests:

<img src="img/rf.png" width="90%" height="60%" style="display: block; margin: auto;" />
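---
# Bagging by hand

To make the "bootstrap + average" idea concrete, here is a minimal hand-rolled sketch of bagging with `rpart`, reusing the `train`/`test` split and the `rmse()` helper from the earlier slides. It is illustrative only: it considers all predictors at every split, whereas a true random forest would also sample `mtry` predictors at each split.

```r
library(rpart)

set.seed(2019)
n_trees <- 25

# Grow each tree on a bootstrap sample of the training data
# and keep its predictions on the test set (one column per tree)
all_preds <- sapply(1:n_trees, function(b) {
  boot_rows <- sample(nrow(train), replace = TRUE)
  tree <- rpart(price ~ cut + color + depth + carat, data = train[boot_rows, ])
  predict(tree, test)
})

# The ensemble prediction is the average over all trees
pred_bagged <- rowMeans(all_preds)
rmse(pred_bagged)
```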
---

<font size='45'> Imagine now that you have to buy a fruit, trying to pick the one that tastes best: 🍎 🍌 🍊 </font>

- A **Bayesian** person would: pick the fruit that he or she likes the most (using prior information)
- A **frequentist** person would: ask the seller which one other people bought the most

What the **random forest** 'would do': try all the fruits and decide later. Naturally slower!

---
# How to overcome slowness?

- Use fast programming languages (e.g. `C++`)

Packages for random forests in `R` that are actually written in `C` and `C++`:

- `randomForest`: based on the original implementation by Leo Breiman (the author of the method)
- `ranger`: a faster implementation of the RF algorithm

Common arguments in the `ranger()` function:

- `mtry`: number of variables to try at each split (defaults to `m = sqrt(p)`)
- `num.trees`: number of trees to build (defaults to 500)
- `min.node.size`: minimum number of observations in each node (defaults to 5 for regression)

---
# Random forests: in `R`

```r
library(ranger)

# only one tree
first_rf <- ranger(price ~ cut + color + depth + carat, 
                   num.trees = 1, mtry = 4, data = train)
first_rf
```

```
## Ranger result
## 
## Call:
##  ranger(price ~ cut + color + depth + carat, num.trees = 1, mtry = 4, data = train) 
## 
## Type:                             Regression 
## Number of trees:                  1 
## Sample size:                      40533 
## Number of independent variables:  4 
## Mtry:                             4 
## Target node size:                 5 
## Variable importance mode:         none 
## Splitrule:                        variance 
## OOB prediction error (MSE):       2236271 
## R squared (OOB):                  0.8603823
```

The summary is actually very clear!

---
# 'out-of-bag' error

- Each bootstrap sample uses `\(\approx 2/3\)` of the original observations per tree
- The remaining `\(\approx 1/3\)` (the 'out-of-bag' observations) is used to create predictions
- Those predictions are averaged for each `\(i\)`-th observation (in the regression case)
- The resulting OOB error is a valid estimate of the error of the model

---
# Random forests: in `R`

```r
# A forest, adding all the variables as predictors
second_rf <- ranger(price ~ ., num.trees = 50, 
                    data = train, importance = "impurity")
second_rf
```

```
## Ranger result
## 
## Call:
##  ranger(price ~ ., num.trees = 50, data = train, importance = "impurity") 
## 
## Type:                             Regression 
## Number of trees:                  50 
## Sample size:                      40533 
## Number of independent variables:  9 
## Mtry:                             3 
## Target node size:                 5 
## Variable importance mode:         impurity 
## Splitrule:                        variance 
## OOB prediction error (MSE):       330411.3 
## R squared (OOB):                  0.9793714
```

---
# Predictions

```r
pred_rf_first <- predict(first_rf, test)$predictions
pred_rf_second <- predict(second_rf, test)$predictions

rmse(pred_rf_first)
```

```
## [1] 171772.8
```

```r
rmse(pred_rf_second)
```

```
## [1] 64745.46
```

Much better!
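---
# OOB error vs. test error

A quick sanity check (a minimal sketch, reusing `second_rf` and `test` from the previous slides): the OOB error reported by `ranger` is a mean squared error, so its square root can be compared with a per-observation RMSE on the test set. Note this is the usual RMSE (with a mean inside the square root), not the `rmse()` helper used in the other slides, which sums instead of averaging.

```r
# The OOB MSE is stored in the fitted object; take the square
# root to put it on the price (US dollar) scale
sqrt(second_rf$prediction.error)

# Per-observation RMSE on the held-out test set
sqrt(mean((predict(second_rf, test)$predictions - test$price)^2))
```

If the two numbers are close, the OOB error is doing its job as an internal estimate of the out-of-sample error.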
---
## How do the predictions compare to the observed data?

```r
test %>% 
  mutate(predicted = predict(second_rf, test)$predictions) %>% 
  ggplot(aes(predicted, price)) +
  geom_point(colour = "#ff6767", alpha = 0.3) +
  labs(title = "Predicted and observed") +
  theme_bw(18)
```

<img src="intro-to-rf_files/figure-html/unnamed-chunk-23-1.png" style="display: block; margin: auto;" />

---
## Evaluating importances

```r
imps <- data.frame(var = names(train)[-7],
                   imps = second_rf$variable.importance/max(second_rf$variable.importance))

imps %>% 
  ggplot(aes(imps, x = reorder(var, imps))) +
  geom_point(size = 3, colour = "#ff6767") +
  coord_flip() +
  labs(x = "Predictors", y = "Importance scores") +
  theme_bw(18)
```

<img src="intro-to-rf_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" />

---
# Careful!

Random forest importances are **highly sensitive** to correlated predictors: the importance gets split between them

```r
corrplot::corrplot(cor(diamonds %>% select_if(is.numeric), 
                       method = "spearman"))
```

<img src="intro-to-rf_files/figure-html/unnamed-chunk-25-1.png" style="display: block; margin: auto;" />

---
# Variable selection: removing the correlated predictors

```r
third_rf <- ranger(price ~ table + depth + carat + color + clarity + cut, 
                   num.trees = 50, importance = "impurity", data = train)
third_rf
```

```
## Ranger result
## 
## Call:
##  ranger(price ~ table + depth + carat + color + clarity + cut, num.trees = 50, importance = "impurity", data = train) 
## 
## Type:                             Regression 
## Number of trees:                  50 
## Sample size:                      40533 
## Number of independent variables:  6 
## Mtry:                             2 
## Target node size:                 5 
## Variable importance mode:         impurity 
## Splitrule:                        variance 
## OOB prediction error (MSE):       372112.6 
## R squared (OOB):                  0.9767678
```

Very similar to the previous model!

---
# Results

```r
pred_rf_third <- predict(third_rf, test)$predictions

rmse(pred_rf_third)
```

```
## [1] 67809.49
```

---

```r
imps <- data.frame(var = third_rf$forest$independent.variable.names,
                   imps = third_rf$variable.importance/max(third_rf$variable.importance))

imps %>% 
  ggplot(aes(imps, x = reorder(var, imps))) +
  geom_point(size = 3, colour = "#ff6767") +
  coord_flip() +
  labs(x = "Predictors", y = "Importance scores") +
  theme_bw(18)
```

<img src="intro-to-rf_files/figure-html/unnamed-chunk-28-1.png" style="display: block; margin: auto;" />

---

```r
test %>% 
  mutate(predicted = predict(third_rf, test)$predictions) %>% 
  ggplot(aes(predicted, price)) +
  geom_point(colour = "#ff6767", alpha = 0.3) +
  labs(title = "Predicted and observed") +
  theme_bw(18)
```

<img src="intro-to-rf_files/figure-html/unnamed-chunk-29-1.png" style="display: block; margin: auto;" />

---
# Resources

- [Elements of Statistical Learning](https://web.stanford.edu/~hastie/Papers/ESLII.pdf): open book!
- [`ranger` package](https://github.com/imbs-hl/ranger/)
- [+ about trees](http://uc-r.github.io/regression_trees)

<img src="img/esl.png" width="30%" style="display: block; margin: auto;" />

---
# References

<p><cite>Breiman, L. (2001). “Random Forests”. In: <em>Machine Learning</em> 45.1, pp. 5–32. DOI: <a href="https://doi.org/10.1023/A:1010933404324">10.1023/A:1010933404324</a>.</cite></p>

<p><cite><a id='bib-HastieTrevor'></a><a href="#cite-HastieTrevor">Hastie, T., R. Tibshirani, and J. Friedman</a> (2009). <em>The Elements of Statistical Learning: Data Mining, Inference, and Prediction</em>. 2nd ed. New York: Springer.</cite></p>

<p><cite>Murphy, K. P. (2012). <em>Machine learning: a probabilistic perspective</em>. MIT Press.</cite></p>

---
class: bottom, center, inverse

<font size="30">Thanks!</font>

<img src="https://s3.amazonaws.com/kleebtronics-media/img/icons/github-white.png" width="50" height="50" align="middle">

<b><font color="#FFFFFF"> https://github.com/brunaw </font></b>