class: center, middle, inverse, title-slide

# Introduction to Random Forests in R
## R-Ladies Dublin Meetup
### Bruna Wundervald
### June, 2019

---
class: middle

# Outline

- Introduction
- Trees
  - Model
  - Using in `R`
- Random Forests
  - Model
  - Using in `R`
- Resources

---
class: middle

.pull-left[
<img src="img/bruna.jpg" width="70%" style="display: block; margin: auto;" />
]

.pull-right[
**Bruna Wundervald**

- Ph.D. Candidate in Statistics at Maynooth University.
- Twitter: @bwundervald
- GitHub: @brunaw
]

Link for this presentation: http://brunaw.com/slides/rladies-dublin/RF/intro-to-rf.html

---
# Introduction

- Goal: predict continuous or class variables
- How?
  - Using data split into train and test sets
  - Tree-based models to create predictions based on a set of explanatory/predictor variables

---
# Pre-requisites

.pull-left[
<img src="img/pipe.png" width="50%" height="30%" style="display: block; margin: auto;" />
<img src="img/dplyr.png" width="50%" height="30%" style="display: block; margin: auto;" />
]

.pull-right[
- `%>%` (the pipe): takes the object on its left and applies the operation on its right to it
- `dplyr`: a set of (very) useful functions for data manipulation
- `ggplot2`: graphic tools, following the [grammar of graphics](https://byrneslab.net/classes/biol607/readings/wickham_layered-grammar.pdf)

<br>

<img src="img/ggplot2.png" width="50%" height="30%" style="display: block; margin: auto;" />
]

---
class: inverse, middle, center

## Trees 🌳

---
class: middle

# Trees: the model

<img src="img/trees.png" width="50%" height="30%" style="display: block; margin: auto;" />

---
# Trees: the model

Suppose we have a response variable `\(Y\)` (continuous or class), and a set of predictor variables `\(\mathbf{X}\)`.

- Trees stratify the predictors' space into regions
- They use binary splitting rules to find the regions

.pull-left[
<img src="img/vars_space.png" width="80%" height="30%" style="display: block; margin: auto;" />
]

.pull-right[
<img src="img/vars_space2.png" width="80%" height="30%" style="display: block; margin: auto;" />
]

Breaks in the predictors' space (Hastie, Tibshirani, and Friedman, 2009)

---
# Trees: the algorithm

Recursive binary splitting:

- Select the predictor `\(X_j\)` and the cutpoint `\(s\)` such that the split `\(\{X | X_j < s\}\)` and `\(\{X | X_j \geq s\}\)` leads to the greatest reduction in the variance of `\(Y\)`
  - All predictors and all available cutpoints are tested
- For each region found, predict either the mean of `\(Y\)` in the region (continuous case) or the most common class (classification case)
- Continue until some stopping criterion is reached
  - Example: continue until no region contains more than 5 observations

---
# Trees: the algorithm

- Continuous response: the goal is to find the tree that minimizes the sum of squared errors, or

`$$\sum_{j = 1}^{J} \sum_{i \in R_j} (y_i - \hat y_{R_j})^2,$$`

where `\(R_j\)`, `\(j = 1, \dots, J\)`, are the regions found by the tree.

- Class/factor response: the goal is to find the tree that best separates the classes, minimizing the **Gini index**

`$$G = \sum_{k = 1}^{K} \hat p_{jk} (1 - \hat p_{jk}),$$`

where `\(K\)` is the total number of classes and `\(\hat p_{jk}\)` is the proportion of observations of class `\(k\)` in the `\(j\)`-th region.
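---
# Trees: the split search, by hand

A minimal sketch of the inner loop of recursive binary splitting, for a single continuous predictor. This is purely illustrative: `x` and `y` are hypothetical toy vectors, not part of the data used later in these slides.

```r
# Toy data: one predictor and a continuous response with a jump at x = 0.4
set.seed(1)
x <- runif(100)
y <- ifelse(x < 0.4, 2, 5) + rnorm(100, sd = 0.5)

# Sum of squared errors of a split at cutpoint s:
# each side of the split is predicted by its own mean
sse_split <- function(s) {
  left  <- y[x < s]
  right <- y[x >= s]
  sum((left - mean(left))^2) + sum((right - mean(right))^2)
}

# Try every observed value (except the smallest) as a candidate cutpoint
# and keep the one that minimizes the SSE
cutpoints <- sort(unique(x))[-1]
best <- cutpoints[which.min(sapply(cutpoints, sse_split))]
best
```

A full tree repeats this search over all predictors, then recursively inside each resulting region.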
---
# Trees: advantages

- Very interpretable: close to the human decision process
- Can be displayed graphically
- Handle non-linearities **between** the predictors well

<img src="img/nonlin.png" width="60%" style="display: block; margin: auto;" />

---
# Trees: in `R`

Using the `diamonds` data, from the `ggplot2` package, split into **train** and **test** sets:

```r
set.seed(2019)
data <- diamonds %>% 
  mutate(set = ifelse(runif(nrow(.)) > 0.75, "test", "train"))

train <- data %>% filter(set == "train") %>% select(-set)
test  <- data %>% filter(set == "test") %>% select(-set)

glimpse(data)
```

```
## Observations: 53,940
## Variables: 11
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.…
## $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Goo…
## $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J,…
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1,…
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59…
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, …
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 3…
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.…
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.…
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.…
## $ set     <chr> "test", "train", "train", "train", "train", "train", "te…
```

---
**Goal**: predict the price of the diamonds, in US dollars (\$326--\$18,823)

Variables:

- `carat`: weight of the diamond (0.2--5.01)
- `cut`: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
- `color`: diamond colour, from D (best) to J (worst)
- `clarity`: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
- `x`: length in mm (0--10.74)
- `y`: width in mm (0--58.9)
- `z`: depth in mm (0--31.8)
- `depth`: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79)
- `table`: width of top of diamond relative to widest point (43--95)

---

```r
train %>% 
  ggplot(aes(price)) +
  geom_density(fill = "#ff6767", alpha = 0.5) +
  theme_bw(22)
```

<img src="intro-to-rf_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />

---
# Trees: in `R`

`rpart` + `rpart.plot` packages!
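If the two packages are not installed yet, they can be grabbed from CRAN first (a minimal one-time setup sketch; skip it if they are already available):

```r
# One-time setup: tree fitting (rpart) and tree plotting (rpart.plot)
install.packages(c("rpart", "rpart.plot"))
```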
```r
library(rpart)
library(rpart.plot)

first_model <- rpart(price ~ color, data = train)
first_model
```

```
## n= 40533 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
## 1) root 40533 649205500000 3950.496  
##   2) color=D,E,F,G 28139 389702800000 3556.392 *
##   3) color=H,I,J 12394 245209600000 4845.259 *
```

---
class: middle

# Plotting the tree

```r
rpart.plot(first_model)
```

<img src="intro-to-rf_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />

---

```r
# Adding now a few more predictors
second_model <- rpart(price ~ cut + color + depth + carat, data = train)
second_model
```

```
## n= 40533 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 40533 649205500000  3950.496  
##    2) carat< 0.995 26120  32647670000  1635.811  
##      4) carat< 0.625 18529   4916240000  1051.213 *
##      5) carat>=0.625 7591   5942264000  3062.765 *
##    3) carat>=0.995 14413 222997400000  8145.291  
##      6) carat< 1.495 9747  46499840000  6147.496 *
##      7) carat>=1.495 4666  56331320000 12318.570  
##       14) carat< 1.915 3006  26690610000 10896.170 *
##       15) carat>=1.915 1660  12545830000 14894.300 *
```

---

```r
rpart.plot(second_model)
```

<img src="intro-to-rf_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />

---
## Variable importance

- A measure of "how important" each variable is for the final result
- Represents the total decrease in variance/impurity produced by the splits on that variable in the tree

```r
first_model$variable.importance/max(first_model$variable.importance)
```

```
## color 
##     1
```

```r
second_model$variable.importance/max(second_model$variable.importance)
```

```
##       carat       color       depth         cut 
## 1.000000000 0.030817117 0.004239303 0.001158316
```

(Much easier to read as a normalized measure!)

- Clearly, `carat` dominates the tree

---
# Predictions

```r
pred_first <- predict(first_model, test)
pred_second <- predict(second_model, test)

# Note: this helper takes the square root of the *sum* of squared errors
# (no mean), which is fine for comparing models on the same test set
rmse <- function(x) sqrt(sum((x - test$price)^2))

rmse(pred_first)
```

```
## [1] 452219.3
```

```r
rmse(pred_second)
```

```
## [1] 176960.1
```

Should we disregard the bad tree? **No**!

---
class: center, middle, inverse

## 🌵🌳🌲🌿 Random Forests 🌴🌲🌵🌳

---
# Random Forests

- A random forest is an ensemble (an average) of "tweaked" trees
- We grow **B** trees on bootstrap samples of the original data, and later average their predictions
- The big difference: at each split, only `\(m \approx \sqrt{p}\)` variables are tested
  - Why? It decorrelates the trees, preventing **strong** predictors from always appearing at the top
- Each tree ends up different, and all of them contribute to the final prediction

Algorithm: same as for the tree, but with bootstrapped data

<font color = "#ff6767"> The algorithm is very naive, but how naive? </font>

---
# Trees and Random Forests

Trees:

<img src="img/tree.png" width="25%" style="display: block; margin: auto;" />

Random Forests:

<img src="img/rf.png" width="90%" height="60%" style="display: block; margin: auto;" />
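---
# Bagging by hand

To make the "bootstrap + average" idea concrete, here is a minimal hand-rolled sketch of bagging with `rpart`, reusing the `train`/`test` split and the `rmse()` helper from the earlier slides. It is illustrative only: it considers all predictors at every split, whereas a true random forest would also sample `mtry` predictors at each split.

```r
library(rpart)

set.seed(2019)
n_trees <- 25

# Grow each tree on a bootstrap sample of the training data
# and keep its predictions on the test set (one column per tree)
all_preds <- sapply(1:n_trees, function(b) {
  boot_rows <- sample(nrow(train), replace = TRUE)
  tree <- rpart(price ~ cut + color + depth + carat, data = train[boot_rows, ])
  predict(tree, test)
})

# The ensemble prediction is the average over all trees
pred_bagged <- rowMeans(all_preds)
rmse(pred_bagged)
```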
---

<font size='45'> Imagine now that you have to buy a fruit, trying to pick the one that tastes best: 🍎 🍌 🍊 </font>

- A **Bayesian** person would: pick the fruit that he or she likes the most (using prior information)
- A **frequentist** person would: ask the seller which one other people bought the most

What the **random forest** 'would do': try all the fruits and decide later. Naturally slower!

---
# How to overcome slowness?

- Use fast programming languages (e.g. `C++`)

Packages for random forests in `R` that are actually written in `C` and `C++`:

- `randomForest`: based on the original implementation by Leo Breiman (the author of the method)
- `ranger`: a faster implementation of the RF algorithm

Common arguments in the `ranger()` function:

- `mtry`: number of variables to try at each split (defaults to `m = sqrt(p)`)
- `num.trees`: number of trees to build (defaults to 500)
- `min.node.size`: minimum number of observations in each node (defaults to 5 for regression)

---
# Random forests: in `R`

```r
library(ranger)

# only one tree
first_rf <- ranger(price ~ cut + color + depth + carat, 
                   num.trees = 1, mtry = 4, data = train)
first_rf
```

```
## Ranger result
## 
## Call:
##  ranger(price ~ cut + color + depth + carat, num.trees = 1, mtry = 4, data = train) 
## 
## Type:                             Regression 
## Number of trees:                  1 
## Sample size:                      40533 
## Number of independent variables:  4 
## Mtry:                             4 
## Target node size:                 5 
## Variable importance mode:         none 
## Splitrule:                        variance 
## OOB prediction error (MSE):       2236271 
## R squared (OOB):                  0.8603823
```

The summary is actually very clear!

---
# 'out-of-bag' error

- Each bootstrap sample uses `\(\approx 2/3\)` of the original observations per tree
- The remaining `\(\approx 1/3\)` (the 'out-of-bag' observations) is used to create predictions
- Those predictions are averaged for each `\(i\)`-th observation (in the regression case)
- The resulting OOB error is a valid estimate of the error of the model

---
# Random forests: in `R`

```r
# A forest, adding all the variables as predictors
second_rf <- ranger(price ~ ., num.trees = 50, 
                    data = train, importance = "impurity")
second_rf
```

```
## Ranger result
## 
## Call:
##  ranger(price ~ ., num.trees = 50, data = train, importance = "impurity") 
## 
## Type:                             Regression 
## Number of trees:                  50 
## Sample size:                      40533 
## Number of independent variables:  9 
## Mtry:                             3 
## Target node size:                 5 
## Variable importance mode:         impurity 
## Splitrule:                        variance 
## OOB prediction error (MSE):       330411.3 
## R squared (OOB):                  0.9793714
```

---
# Predictions

```r
pred_rf_first <- predict(first_rf, test)$predictions
pred_rf_second <- predict(second_rf, test)$predictions

rmse(pred_rf_first)
```

```
## [1] 171772.8
```

```r
rmse(pred_rf_second)
```

```
## [1] 64745.46
```

Much better!
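---
# OOB error vs. test error

A quick sanity check (a minimal sketch, reusing `second_rf` and `test` from the previous slides): the OOB error reported by `ranger` is a mean squared error, so its square root can be compared with a per-observation RMSE on the test set. Note this is the usual RMSE (with a mean inside the square root), not the `rmse()` helper used in the other slides, which sums instead of averaging.

```r
# The OOB MSE is stored in the fitted object; take the square
# root to put it on the price (US dollar) scale
sqrt(second_rf$prediction.error)

# Per-observation RMSE on the held-out test set
sqrt(mean((predict(second_rf, test)$predictions - test$price)^2))
```

If the two numbers are close, the OOB error is doing its job as an internal estimate of the out-of-sample error.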
---
## How do the predictions compare to the observed data?

```r
test %>% 
  mutate(predicted = predict(second_rf, test)$predictions) %>% 
  ggplot(aes(predicted, price)) +
  geom_point(colour = "#ff6767", alpha = 0.3) +
  labs(title = "Predicted and observed") +
  theme_bw(18)
```

<img src="intro-to-rf_files/figure-html/unnamed-chunk-23-1.png" style="display: block; margin: auto;" />

---
## Evaluating importances

```r
imps <- data.frame(var = names(train)[-7],
                   imps = second_rf$variable.importance/max(second_rf$variable.importance))

imps %>% 
  ggplot(aes(imps, x = reorder(var, imps))) +
  geom_point(size = 3, colour = "#ff6767") +
  coord_flip() +
  labs(x = "Predictors", y = "Importance scores") +
  theme_bw(18)
```

<img src="intro-to-rf_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" />

---
# Careful!

Random forest importances are **highly sensitive** to correlated predictors: the importance gets split between them

```r
corrplot::corrplot(cor(diamonds %>% select_if(is.numeric), 
                       method = "spearman"))
```

<img src="intro-to-rf_files/figure-html/unnamed-chunk-25-1.png" style="display: block; margin: auto;" />

---
# Variable selection: removing the correlated predictors

```r
third_rf <- ranger(price ~ table + depth + carat + color + clarity + cut, 
                   num.trees = 50, importance = "impurity", data = train)
third_rf
```

```
## Ranger result
## 
## Call:
##  ranger(price ~ table + depth + carat + color + clarity + cut, num.trees = 50, importance = "impurity", data = train) 
## 
## Type:                             Regression 
## Number of trees:                  50 
## Sample size:                      40533 
## Number of independent variables:  6 
## Mtry:                             2 
## Target node size:                 5 
## Variable importance mode:         impurity 
## Splitrule:                        variance 
## OOB prediction error (MSE):       372112.6 
## R squared (OOB):                  0.9767678
```

Very similar to the previous model!

---
# Results

```r
pred_rf_third <- predict(third_rf, test)$predictions

rmse(pred_rf_third)
```

```
## [1] 67809.49
```

---

```r
imps <- data.frame(var = third_rf$forest$independent.variable.names,
                   imps = third_rf$variable.importance/max(third_rf$variable.importance))

imps %>% 
  ggplot(aes(imps, x = reorder(var, imps))) +
  geom_point(size = 3, colour = "#ff6767") +
  coord_flip() +
  labs(x = "Predictors", y = "Importance scores") +
  theme_bw(18)
```

<img src="intro-to-rf_files/figure-html/unnamed-chunk-28-1.png" style="display: block; margin: auto;" />

---

```r
test %>% 
  mutate(predicted = predict(third_rf, test)$predictions) %>% 
  ggplot(aes(predicted, price)) +
  geom_point(colour = "#ff6767", alpha = 0.3) +
  labs(title = "Predicted and observed") +
  theme_bw(18)
```

<img src="intro-to-rf_files/figure-html/unnamed-chunk-29-1.png" style="display: block; margin: auto;" />

---
# Resources

- [Elements of Statistical Learning](https://web.stanford.edu/~hastie/Papers/ESLII.pdf): open book!
- [`ranger` package](https://github.com/imbs-hl/ranger/)
- [+ about trees](http://uc-r.github.io/regression_trees)

<img src="img/esl.png" width="30%" style="display: block; margin: auto;" />

---
# References

<p><cite>Breiman, L. (2001). “Random Forests”. In: <em>Machine Learning</em> 45.1, pp. 5–32. DOI: <a href="https://doi.org/10.1023/A:1010933404324">10.1023/A:1010933404324</a>.</cite></p>

<p><cite><a id='bib-HastieTrevor'></a><a href="#cite-HastieTrevor">Hastie, T., R. Tibshirani, and J. Friedman</a> (2009). <em>The Elements of Statistical Learning: Data Mining, Inference, and Prediction</em>. 2nd ed. New York: Springer.</cite></p>

<p><cite>Murphy, K. P. (2012). <em>Machine learning: a probabilistic perspective</em>. MIT Press.</cite></p>

---
class: bottom, center, inverse

<font size="30">Thanks!</font>

<img src="https://s3.amazonaws.com/kleebtronics-media/img/icons/github-white.png" width="50" height="50" align="middle">

<b><font color="#FFFFFF"> https://github.com/brunaw </font></b>