class: title-slide
background-image: url(img/ap.png)
background-size: cover

.footnote[
# The `tidyverse` for Machine Learning

### Bruna Wundervald,
### `satRday` São Paulo,
### November, 2019
]

???

---
name: hello
class: inverse, left, bottom

.pull-left[
[GitHub: @brunaw](http://github.com/brunaw)

[Site: http://brunaw.com/](http://brunaw.com/)

[Twitter: @bwundervald](http://twitter.com/bwundervald)
]

<img style="border-radius: 50%;" src="https://github.com/brunaw.png" width="150px"/>

# Find me

- Ph.D. Candidate in Statistics at the [Hamilton Institute, Maynooth University](https://www.maynoothuniversity.ie/hamilton)

- Especially interested in tree-based models:
  - Regularization in Random Forests
  - Bayesian Additive Regression Trees (BART)

???

Here is my contact information.

---
class: inverse, right, bottom

## Links

`pt-br`: http://brunaw.com/slides/satrday-sp/tidyverse-para-AM.html

`en`: http://brunaw.com/slides/satrday-sp/tidyverse-for-ml.html

GitHub Repository: https://github.com/brunaw/satRday-sp-talk

---
# Introduction

- Thanks to the `tidyverse`, it is now much easier to create nested workflows of data wrangling and analysis in `R`

- However, we can go beyond that and use the `tidyverse` for the whole modeling process as well

- How?

<img src="img/pacotes.png" width="100%" style="display: block; margin: auto;" />

???
---
# Tidy-data

<img src="img/tidy_data.png" width="100%" style="display: block; margin: auto;" />

<img src="img/hadley.jpg" width="20%" style="display: block; margin: auto;" />

---
# Data

- Daily number of people (in thousands) using the Clark and Lake station in Chicago

> Goal: to predict this variable and find the best predictors for it

- Predictors:
  - Date
  - Weather information
  - Sports team schedules
  - +

---
# Loading data and visualizing

```r
library(tidyverse)
library(ranger)
data <- dials::Chicago
dim(data)
```

```
[1] 5698   50
```

```r
data %>%
  ggplot(aes(x = ridership)) +
  geom_density(fill = "#919c4c", alpha = 0.8) +
  labs(x = "Response Variable", y = "Density") +
  theme_classic()
```

---

<img src="tidyverse-for-ml_files/figure-html/unnamed-chunk-5-1.png" width="60%" style="display: block; margin: auto;" />

---

.pull-left[
<img src="tidyverse-for-ml_files/figure-html/unnamed-chunk-6-1.png" width="95%" style="display: block; margin: auto;" />
]

.pull-right[
- Interesting distribution!
- Good for tree-based models

<img src="img/tree.png" width="70%" style="display: block; margin: auto;" />
]

---
## Replicating the same dataset

```r
data_tibble <- rep(list(data), 10) %>%
  enframe(name = 'index', value = 'data')

data_tibble
```

```
# A tibble: 10 x 2
   index data
   <int> <list>
 1     1 <tibble [5,698 × 50]>
 2     2 <tibble [5,698 × 50]>
 3     3 <tibble [5,698 × 50]>
 4     4 <tibble [5,698 × 50]>
 5     5 <tibble [5,698 × 50]>
 6     6 <tibble [5,698 × 50]>
 7     7 <tibble [5,698 × 50]>
 8     8 <tibble [5,698 × 50]>
 9     9 <tibble [5,698 × 50]>
10    10 <tibble [5,698 × 50]>
```

.callout[The `data` column is now a list of tibbles!]
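---

A minimal sketch of how a list-column like `data` can be inspected. It assumes the built-in `mtcars` as a stand-in for the Chicago data, and the names `toy_tibble`, `first_copy` and `toy_summary` are illustrative, not from the talk:

```r
library(tibble)
library(dplyr)
library(purrr)

# Assumption: `mtcars` stands in for the Chicago data so that the
# sketch runs without any extra packages.
toy_tibble <- rep(list(as_tibble(mtcars)), 3) %>%
  enframe(name = "index", value = "data")

# Each cell of a list-column holds a whole tibble; `[[` extracts one.
first_copy <- toy_tibble$data[[1]]

# One summary value per row, computed by mapping over the list-column.
toy_summary <- toy_tibble %>%
  mutate(n_rows = map_int(data, nrow),
         n_cols = map_int(data, ncol))

toy_summary
```

The same pattern (`mutate()` plus a `map_*()` call over a list-column) is what the later slides use to split, fit and evaluate.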
---
## Splitting into train (75%) and test (25%) sets

```r
train_test <- function(data){
  data %>%
    # Randomly label about 25% of the rows as "test"
    mutate(base = ifelse(runif(n()) > 0.75, "test", "train")) %>%
    split(.$base) %>%
    # Drop the helper column from both sets
    purrr::map(~select(.x, -base))
}

data_tibble <- data_tibble %>%
* mutate(train_test = purrr::map(data, train_test))

print(data_tibble, n = 3)
```

```
# A tibble: 10 x 3
  index data                  train_test
  <int> <list>                <list>
1     1 <tibble [5,698 × 50]> <named list [2]>
2     2 <tibble [5,698 × 50]> <named list [2]>
3     3 <tibble [5,698 × 50]> <named list [2]>
# … with 7 more rows
```

.callout[The `train_test` column is a list with two elements: the train and test sets]

---

<img src="img/next.jpeg" width="50%" style="display: block; margin: auto;" />

---
## Modelling: tree-based methods

- Many similar models with different hyperparameter configurations

.pull-left[
<img src="img/vars_space2.png" width="70%" style="display: block; margin: auto;" />
]

.pull-right[
<img src="img/vars_space.png" width="70%" style="display: block; margin: auto;" />
]

---
## Modelling: tree-based methods

- Trees (CART): 1 tree, `\(\texttt{mtry}\)` = # all available variables
- *Bagging*: average of many trees, `\(\texttt{mtry}\)` = # all available variables
- Random Forest: average of many trees, `\(\texttt{mtry} \approx \sqrt{\text{# all available variables}}\)`
- Regularized Random Forests: average of many trees, `\(\texttt{mtry} \approx \text{# all available variables}/2\)`, variable gain penalized by a factor between 0 and 1 to regularize

(More about Regularized Random Forests in: http://brunaw.com/slides/seminar-serie/presentation.html)

---

Creating a function to fit all the models:

```r
modelling <- function(train, mtry = NULL, num.trees = NULL,
                      regularization = 1,
                      formula = ridership ~ .) {
  ranger::ranger(formula, data = train, num.trees = num.trees,
                 mtry = mtry, importance = "impurity",
                 regularization.factor = regularization)
}
```

> Note: this uses version 0.11.8 of the `ranger` package, available at https://github.com/imbs-hl/ranger

---

Nesting models:

```r
models <- list(
  tree = list(mtry = ncol(data) - 1, num.trees = 1, regularization = 1),
  bagging = list(mtry = ncol(data) - 1, num.trees = 100, regularization = 1),
  forest = list(mtry = sqrt(ncol(data) - 1), num.trees = 100, regularization = 1),
  regularized_forest07 = list(mtry = (ncol(data) - 1)/2, num.trees = 100,
                              regularization = 0.7),
  regularized_forest02 = list(mtry = (ncol(data) - 1)/2, num.trees = 100,
                              regularization = 0.2)) %>%
  enframe(name = 'model', value = 'parameters')

models
```

```
# A tibble: 5 x 2
  model                parameters
  <chr>                <list>
1 tree                 <named list [3]>
2 bagging              <named list [3]>
3 forest               <named list [3]>
4 regularized_forest07 <named list [3]>
5 regularized_forest02 <named list [3]>
```

---

Adding the models to our main `tibble`:

```r
data_tibble <- data_tibble %>%
* crossing(models) %>%
  arrange(model)

data_tibble
```

```
# A tibble: 50 x 5
   index data                  train_test       model   parameters
   <int> <list>                <list>           <chr>   <list>
 1     1 <tibble [5,698 × 50]> <named list [2]> bagging <named list [3]>
 2     2 <tibble [5,698 × 50]> <named list [2]> bagging <named list [3]>
 3     3 <tibble [5,698 × 50]> <named list [2]> bagging <named list [3]>
 4     4 <tibble [5,698 × 50]> <named list [2]> bagging <named list [3]>
 5     5 <tibble [5,698 × 50]> <named list [2]> bagging <named list [3]>
 6     6 <tibble [5,698 × 50]> <named list [2]> bagging <named list [3]>
 7     7 <tibble [5,698 × 50]> <named list [2]> bagging <named list [3]>
 8     8 <tibble [5,698 × 50]> <named list [2]> bagging <named list [3]>
 9     9 <tibble [5,698 × 50]> <named list [2]> bagging <named list [3]>
10    10 <tibble [5,698 × 50]> <named list [2]> bagging <named list [3]>
# … with 40 more rows
```

---

Finally training all the models at once!
There are many models to run, so now you go grab a coffee, read a magazine...

```r
training_models <- data_tibble %>%
  mutate(
    full_parameters =
*     map2(parameters, map(train_test, "train"), ~list_modify(.x, train = .y)),
*   train_model = invoke_map(modelling, full_parameters))

print(training_models, n = 5)
```

```
# A tibble: 50 x 7
  index data      train_test  model parameters  full_parameters train_model
  <int> <list>    <list>      <chr> <list>      <list>          <list>
1     1 <tibble … <named lis… bagg… <named lis… <named list [4… <ranger>
2     2 <tibble … <named lis… bagg… <named lis… <named list [4… <ranger>
3     3 <tibble … <named lis… bagg… <named lis… <named list [4… <ranger>
4     4 <tibble … <named lis… bagg… <named lis… <named list [4… <ranger>
5     5 <tibble … <named lis… bagg… <named lis… <named list [4… <ranger>
# … with 45 more rows
```

---

<img src="img/thousand.jpeg" width="70%" style="display: block; margin: auto;" />

---
# Which are the best models?

- Metrics:
  - Root Mean Squared Error
  - Total number of variables used in the model
  - R-squared

```r
rmse <- function(model, test){
  pp <- predict(model, test)
  sqrt(mean((pp$predictions - test$ridership)^2))
}

number_variables <- function(model){
  sum(model$variable.importance > 0)
}
```

---

Results!
```r
results <- training_models %>%
  mutate(
*   rmse = map2_dbl(.x = train_model,
*                   .y = map(train_test, "test"),
*                   ~rmse(model = .x, test = .y)),
*   number_variables = map_int(train_model, number_variables),
*   rsquared = map_dbl(train_model, "r.squared"))
```

<table class="table table-condensed table-hover" style="width: auto !important; margin-left: auto; margin-right: auto;">
<caption>Mean results per model combination</caption>
 <thead>
  <tr>
   <th style="text-align:left;"> model </th>
   <th style="text-align:left;"> rmse </th>
   <th style="text-align:left;"> number_variables </th>
   <th style="text-align:left;"> rsquared </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> bagging </td>
   <td style="text-align:left;width: 3cm; "> 2.738 </td>
   <td style="text-align:left;"> 49 </td>
   <td style="text-align:left;"> 0.827 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> forest </td>
   <td style="text-align:left;width: 3cm; "> <span style="display: inline-block; direction: rtl; border-radius: 4px; padding-right: 2px; background-color: lightgreen; width: 72.23%">2.721</span> </td>
   <td style="text-align:left;"> 49 </td>
   <td style="text-align:left;"> <span style="display: inline-block; direction: rtl; border-radius: 4px; padding-right: 2px; background-color: lightgreen; width: 100.00%">0.830</span> </td>
  </tr>
  <tr>
   <td style="text-align:left;"> regularized_forest02 </td>
   <td style="text-align:left;width: 3cm; "> 2.757 </td>
   <td style="text-align:left;"> <span style="display: inline-block; direction: rtl; border-radius: 4px; padding-right: 2px; background-color: lightgreen; width: 30.41%">14.9</span> </td>
   <td style="text-align:left;"> 0.824 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> regularized_forest07 </td>
   <td style="text-align:left;width: 3cm; "> 2.748 </td>
   <td style="text-align:left;"> 19.5 </td>
   <td style="text-align:left;"> <span style="display: inline-block; direction: rtl; border-radius: 4px; padding-right: 2px; background-color: lightgreen; width: 100.00%">0.830</span> </td>
  </tr>
  <tr>
   <td style="text-align:left;"> tree </td>
   <td style="text-align:left;width: 3cm; "> 3.767 </td>
   <td style="text-align:left;"> 48.2 </td>
   <td style="text-align:left;"> 0.677 </td>
  </tr>
</tbody>
</table>

---

All the elements in a single object!

```r
results
```

```
# A tibble: 50 x 10
   index data  train_test model parameters full_parameters train_model
   <int> <lis> <list>     <chr> <list>     <list>          <list>
 1     1 <tib… <named li… bagg… <named li… <named list [4… <ranger>
 2     2 <tib… <named li… bagg… <named li… <named list [4… <ranger>
 3     3 <tib… <named li… bagg… <named li… <named list [4… <ranger>
 4     4 <tib… <named li… bagg… <named li… <named list [4… <ranger>
 5     5 <tib… <named li… bagg… <named li… <named list [4… <ranger>
 6     6 <tib… <named li… bagg… <named li… <named list [4… <ranger>
 7     7 <tib… <named li… bagg… <named li… <named list [4… <ranger>
 8     8 <tib… <named li… bagg… <named li… <named list [4… <ranger>
 9     9 <tib… <named li… bagg… <named li… <named list [4… <ranger>
10    10 <tib… <named li… bagg… <named li… <named list [4… <ranger>
# … with 40 more rows, and 3 more variables: rmse <dbl>,
#   number_variables <int>, rsquared <dbl>
```

---
# Conclusions

- The `tidyverse` makes the modeling workflow in `R` very clear and compact

- We can build a single object that stores everything at once: data, train, test, seeds, hyperparameters, fitted models, results, evaluation metrics, computational details, etc.

- Very useful to quickly compare models

- Reproducibility (papers, reports)

---
# Conclusions

<img src="img/purrrr.jpg" width="40%" style="display: block; margin: auto;" />

---
class: inverse, center, middle

# Thanks!

<img src="https://s3.amazonaws.com/kleebtronics-media/img/icons/github-white.png" width="50" height="50" align="middle"> <b>[@brunaw](https://github.com/brunaw)</b>
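
---

The slides show the table of mean results per model but not the code behind it. One possible way to build such a summary is sketched below. It assumes a hypothetical `toy_results` tibble with made-up metric values (not results from the talk) standing in for the metric columns of `results`:

```r
library(tibble)
library(dplyr)

# Hypothetical stand-in for `results`: two replicates per model,
# with made-up metric values for illustration only.
toy_results <- tribble(
  ~model,   ~rmse, ~number_variables, ~rsquared,
  "forest", 2.70,  49L,               0.83,
  "forest", 2.74,  49L,               0.83,
  "tree",   3.70,  48L,               0.68,
  "tree",   3.80,  49L,               0.67
)

# Average every metric within each model configuration.
summary_table <- toy_results %>%
  group_by(model) %>%
  summarise(across(c(rmse, number_variables, rsquared), mean),
            .groups = "drop")

summary_table
```

Because the metrics are ordinary columns of the main tibble, the usual `group_by()` and `summarise()` verbs are enough to compare model configurations across replicates.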