class: title-slide, center

.pull-left[

# An introduction to the `tidymodels` package

## Bruna Wundervald, National University of Ireland, Maynooth

### Young-ISA Webinar

### January 28th, 2021

]

.pull-right[

<br>
<br>

<div class="row">
<div class="row">
<img src="https://raw.githubusercontent.com/rstudio/hex-stickers/master/SVG/tidymodels.svg" width="200">
</div>
<div class="column">
<img src="https://raw.githubusercontent.com/rstudio/hex-stickers/master/SVG/parsnip.svg" width="300">
</div>
<div class="column">
<img src="https://raw.githubusercontent.com/rstudio/hex-stickers/master/SVG/tune.svg" width="200">
</div>
<div class="column">
<img src="https://raw.githubusercontent.com/rstudio/hex-stickers/master/SVG/rsample.svg" width="300">
</div>
<div class="column">
<img src="https://raw.githubusercontent.com/rstudio/hex-stickers/master/SVG/recipes.svg" width="300">
</div>
</div>

]
---
layout: true

<a class="footer-link" href="http://brunaw.com/tidymodels-webinar/slides/slides.html">bit.ly/2YioY23</a>

---
name: clouds
class: center, middle
background-image: url(images/sea.jpg)
background-size: cover

<style type="text/css">
.panelset {
  --panel-tab-font-family: Work Sans;
  --panel-tab-background-color-active: #fffbe0;
  --panel-tab-border-color-active: #023d4d;
}

.panelset .panel-tabs .panel-tab > a {
  color: #023d4d;
}
</style>

## .big-text[Hello]

### Bruna Wundervald

<img style="border-radius: 50%;" src="https://avatars.githubusercontent.com/u/18500161?s=460&u=34b7f4888b6fe48b3c208beb51c69c146ae050cf&v=4" width="150px"/>

[GitHub: @brunaw](https://github.com/brunaw)
[Twitter: @bwundervald](https://twitter.com/bwundervald)
[Page: http://brunaw.com/](http://brunaw.com/)

---
class: middle, center

## .big-text[Today]

<div class="flex" style="margin: 0 1em;">
<div class="column">
<img src="https://raw.githubusercontent.com/rstudio/hex-stickers/master/SVG/tidymodels.svg" style="width: 75%;">
</div>

???

--

<div class="column" style="margin: 0 1em;">
<h3> Design matrices </h3>
<img src="https://raw.githubusercontent.com/rstudio/hex-stickers/master/SVG/recipes.svg" style="width: 65%;">
</div>

???

--

<div class="column" style="margin: 0 1em;">
<h3> Resampling </h3>
<img src="https://raw.githubusercontent.com/rstudio/hex-stickers/master/SVG/rsample.svg" style="width: 65%;">
</div>

???

--

<div class="column" style="margin: 0 1em;">
<h3> Model interfaces </h3>
<img src="https://raw.githubusercontent.com/rstudio/hex-stickers/master/SVG/parsnip.svg" style="width: 65%;">
</div>

???

--

<div class="column" style="margin: 0 1em;">
<h3> Tuning hyperparameters </h3>
<img src="https://raw.githubusercontent.com/rstudio/hex-stickers/master/SVG/tune.svg" style="width: 65%;">
</div>

???
---
class: middle

# The `tidymodels` package

- Created by [Max Kuhn](https://github.com/topepo) and the `tidymodels` team
- Aims to be a unified collection of packages for modelling & machine learning in `R`
- Easily integrates with the `tidyverse` packages
- Highly reusable infrastructure & reproducibility

> The packages presented today are only the **main** ones

---
background-color: #fef9c8

## Steps

1. Train and test separation with `rsample`
2. Model specification and fitting with `parsnip`
3. Feature engineering with `recipes`
4. Hyperparameter tuning with `tune`

## Good practice: suffixes

- `_mod` for a `parsnip` model specification
- `_fit` for a fitted model
- `_rec` for a recipe
- `_tune` for a tuning object

---
class: middle

.pull-left[

## Data: Ames Housing

A data set from De Cock (2011) with 82 columns recorded for 2,930 properties in Ames, IA. The processed version from `modeldata` used here has 74 columns.

Target variable: Sale price

Predictors:

- Location (e.g. neighborhood, latitude and longitude)
- House elements (garage, year built, air conditioning, number of bedrooms/baths, etc.)

]

.pull-right[

<img src="images/ames.png" width="760" style="display: block; margin: auto auto auto 0;" />

]

---
class: middle

## Loading the data and the packages

```r
# Loading libraries
library(tidyverse)
library(tidymodels)

# Ames housing data; the sale price is modelled on the log10 scale
data(ames, package = "modeldata")
ames <- ames %>% mutate(Sale_Price = log10(Sale_Price))
```

---
class: middle

## A peek at the distribution of the sale prices

<img src="slides_files/figure-html/unnamed-chunk-4-1.png" width="504" style="display: block; margin: auto;" />

---
class: middle
background-color: #fef9c8

# 1. Train and test separation with `rsample`

.panelset[
.panel[.panel-name[Train and test split]

```r
set.seed(2021)
data_split <- initial_split(ames, strata = "Sale_Price", prop = 0.8)

# Separating train and test
ames_train <- training(data_split)
ames_test  <- testing(data_split)
```

]

.panel[.panel-name[Result]

```r
data_split
## <Analysis/Assess/Total>
## <2346/584/2930>
```

]

.panel[.panel-name[Training set]

```r
ames_train %>% slice(1:3)
## # A tibble: 3 x 74
##   MS_SubClass MS_Zoning Lot_Frontage Lot_Area Street Alley Lot_Shape
##   <fct>       <fct>            <dbl>    <int> <fct>  <fct> <fct>
## 1 One_Story_… Resident…          141    31770 Pave   No_A… Slightly…
## 2 One_Story_… Resident…           80    11622 Pave   No_A… Regular
## 3 One_Story_… Resident…           93    11160 Pave   No_A… Regular
## # … with 67 more variables: Land_Contour <fct>, Utilities <fct>,
## #   Lot_Config <fct>, Land_Slope <fct>, Neighborhood <fct>, Condition_1 <fct>,
## #   Condition_2 <fct>, Bldg_Type <fct>, House_Style <fct>, Overall_Cond <fct>,
## #   Year_Built <int>, Year_Remod_Add <int>, Roof_Style <fct>, Roof_Matl <fct>,
## #   Exterior_1st <fct>, Exterior_2nd <fct>, Mas_Vnr_Type <fct>,
## #   Mas_Vnr_Area <dbl>, Exter_Cond <fct>, Foundation <fct>, Bsmt_Cond <fct>,
## #   Bsmt_Exposure <fct>, BsmtFin_Type_1 <fct>, BsmtFin_SF_1 <dbl>,
## #   BsmtFin_Type_2 <fct>, BsmtFin_SF_2 <dbl>, Bsmt_Unf_SF <dbl>,
## #   Total_Bsmt_SF <dbl>, Heating <fct>, Heating_QC <fct>, Central_Air <fct>,
## #   Electrical <fct>, First_Flr_SF <int>, Second_Flr_SF <int>,
## #   Gr_Liv_Area <int>, Bsmt_Full_Bath <dbl>, Bsmt_Half_Bath <dbl>,
## #   Full_Bath <int>, Half_Bath <int>, Bedroom_AbvGr <int>, Kitchen_AbvGr <int>,
## #   TotRms_AbvGrd <int>, Functional <fct>, Fireplaces <int>, Garage_Type <fct>,
## #   Garage_Finish <fct>, Garage_Cars <dbl>, Garage_Area <dbl>,
## #   Garage_Cond <fct>, Paved_Drive <fct>, Wood_Deck_SF <int>,
## #   Open_Porch_SF <int>, Enclosed_Porch <int>, Three_season_porch <int>,
## #   Screen_Porch <int>, Pool_Area <int>, Pool_QC <fct>, Fence <fct>,
## #   Misc_Feature <fct>, Misc_Val <int>, Mo_Sold <int>, Year_Sold <int>,
## #   Sale_Type <fct>, Sale_Condition <fct>, Sale_Price <dbl>, Longitude <dbl>,
## #   Latitude <dbl>
```

]
]

---
class: middle

# Resampling - more options:

- Bootstrap
- Cross-validation:
  - V-fold
  - Leave-one-out
  - Nested
  - Monte Carlo

---
class: middle
background-color: #fef9c8

# 2. Model specification and fitting with `parsnip`

1. Create a model specification: the type of model you want to run (lm, random forest, ...)
2. Set an engine: the package used to run this model
3. Fit the model

**List of available models:** https://www.tidymodels.org/find/parsnip/

---
class: middle

.panelset[
.panel[.panel-name[Setup and fit]

```r
# Model specification: a random forest for regression with 100 trees
model_setup <- rand_forest(mode = "regression", trees = 100)

# Engine: the package that will actually fit the model
rf_mod <- set_engine(model_setup, "ranger")

rf_fit <- fit(
  rf_mod,
  Sale_Price ~ Longitude + Latitude,
  data = ames_train
)
```

]

.panel[.panel-name[Result]

```
## Ranger result
##
## Call:
##  ranger::ranger(formula = Sale_Price ~ Longitude + Latitude, data = data, num.trees = ~100, regularization.factor = ~0.2, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1))
##
## Type:                             Regression
## Number of trees:                  100
## Sample size:                      2346
## Number of independent variables:  2
## Mtry:                             1
## Target node size:                 5
## Variable importance mode:         none
## Splitrule:                        variance
## OOB prediction error (MSE):       0.009163911
## R squared (OOB):                  0.7066803
```

]
]

---
class: middle

# `parsnip`: switching engines

.panelset[
.panel[.panel-name[Setup and fit]

```r
# Same specification, different engine
rf_mod <- model_setup %>% set_engine("randomForest")

rf_fit <- fit(
  rf_mod,
  Sale_Price ~ Longitude + Latitude,
  data = ames_train
)
```

]

.panel[.panel-name[Result]

```
##
## Call:
##  randomForest(x = as.data.frame(x), y = y, ntree = ~100)
##                Type of random forest: regression
##                      Number of trees: 100
## No. of variables tried at each split: 1
##
##           Mean of squared residuals: 0.009115084
##                     % Var explained: 70.81
```

]
]

---
class: middle

.panelset[
.panel[.panel-name[Making predictions]

```r
test_pred <- rf_fit %>%
  predict(ames_test) %>%
  bind_cols(ames_test)

# rmse() comes from the yardstick package; the result gets its own name
# so that it does not mask the function
test_rmse <- test_pred %>% rmse(Sale_Price, .pred)
```

]

.panel[.panel-name[Result]

<img src="slides_files/figure-html/unnamed-chunk-13-1.png" width="504" style="display: block; margin: auto;" />

]
]

---
class: middle, center

<img src="https://github.com/allisonhorst/stats-illustrations/raw/master/rstats-artwork/parsnip.png" style="display: block; margin: auto auto auto 0;" />

---
class: middle
background-color: #fef9c8

# 3. Feature engineering with `recipes`

1. Create a `recipe()` to define the processing of the data, e.g.:
   - Create new classes, clean missing data, transform variables, etc.
2. Estimate the required quantities on the training set with the `prep()` function
3. Apply the pre-processing with the `bake()` and `juice()` functions
   - `juice()` returns the processed training set; `bake()` is used for 'new data', such as test sets

---
class: middle

.panelset[
.panel[.panel-name[The recipe]

```r
mod_rec <- recipe(
  Sale_Price ~ Longitude + Latitude + Neighborhood + Central_Air + Year_Built,
  data = ames_train
) %>%
  # Pool factor levels that occur in less than 5% of the data into "other"
  step_other(Neighborhood, threshold = 0.05) %>%
  # Create dummy variables for all factor variables
  step_dummy(all_nominal()) %>%
  # Add an interaction term
  step_interact(~ starts_with("Central_Air"):Year_Built)
```

]

.panel[.panel-name[Results]

```
## Data Recipe
##
## Inputs:
##
##       role #variables
##    outcome          1
##  predictor          5
##
## Operations:
##
## Collapsing factor levels for Neighborhood
## Dummy variables from all_nominal()
## Interactions with starts_with("Central_Air"):Year_Built
```

]

.panel[.panel-name[Prepping]

```r
ames_rec <- prep(mod_rec, training = ames_train, verbose = TRUE)
## oper 1 step other [training]
## oper 2 step dummy [training]
## oper 3 step interact [training]
## The retained training set is ~ 0.27 Mb in memory.
```

]

.panel[.panel-name[Fitting]

```r
rf_mod <- rand_forest(
  mode = "regression",
  mtry = 5,
  trees = 500
) %>%
  set_engine("ranger", regularization.factor = 0.5)

# juice() extracts the processed training set from the prepped recipe
rf_fit <- rf_mod %>%
  fit(Sale_Price ~ ., data = juice(ames_rec))
```

]

.panel[.panel-name[New fit]

```
## Ranger result
##
## Call:
##  ranger::ranger(formula = Sale_Price ~ ., data = data, mtry = ~5, num.trees = ~500, regularization.factor = ~0.5, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1))
##
## Type:                             Regression
## Number of trees:                  500
## Sample size:                      2346
## Number of independent variables:  12
## Mtry:                             5
## Target node size:                 5
## Variable importance mode:         none
## Splitrule:                        variance
## OOB prediction error (MSE):       0.007954979
## R squared (OOB):                  0.745376
```

]
]

---
class: middle, center

<img src="https://github.com/allisonhorst/stats-illustrations/raw/master/rstats-artwork/recipes.png" style="display: block; margin: auto auto auto 0;" />

---
class: middle
background-color: #fef9c8

# 4. Hyperparameter tuning with `tune`

1. Choose the hyperparameters to tune
2. Choose the tuning method
3. Run and select the best hyperparameters

- We'll be doing grid search, but there are more options:
  - Bayesian optimization
  - Simulated annealing (`{finetune}` package)
  - Racing methods (`{finetune}` package)

---
class: middle

.panelset[
.panel[.panel-name[Tune setup]

```r
# Keep the out-of-sample predictions of each resample
ctrl <- control_grid(save_pred = TRUE)

# tune() marks the hyperparameters to be tuned
rf_mod <- rand_forest(mtry = tune()) %>%
  set_mode("regression") %>%
  set_engine("ranger", regularization.factor = tune())

rf_param <- parameters(rf_mod)
```

]

.panel[.panel-name[Result]

```
## Collection of 2 parameters for tuning
##
##                     id         parameter type object class
##                   mtry                   mtry    nparam[?]
##  regularization.factor  regularization.factor    nparam[+]
##
## Model parameters needing finalization:
##    # Randomly Selected Predictors ('mtry')
##
## See `?dials::finalize` or `?dials::update.parameters` for more information.
```

]

.panel[.panel-name[Running the tuning]

```r
set.seed(2021)
data_folds <- vfold_cv(data = juice(ames_rec), v = 5)

# grid = 10: evaluate 10 candidate parameter combinations on each fold
ranger_tune <- rf_mod %>%
  tune_grid(
    Sale_Price ~ .,
    resamples = data_folds,
    grid = 10,
    control = ctrl
  )
```

]
]

---
class: middle

## Evaluating performance results

.panelset[
.panel[.panel-name[RMSE plot]

<img src="slides_files/figure-html/unnamed-chunk-24-1.png" width="40%" height="30%" style="display: block; margin: auto;" />

]

.panel[.panel-name[RSQ plot]

<img src="slides_files/figure-html/unnamed-chunk-25-1.png" width="40%" height="30%" style="display: block; margin: auto;" />

]
]

---
class: middle

## Using the best hyperparameters

.panelset[
.panel[.panel-name[Best model]

```r
best_res <- select_best(ranger_tune, metric = "rmse")

# Refit on the whole processed training set with the selected values
final_rf_mod <- rand_forest(mtry = best_res$mtry) %>%
  set_mode("regression") %>%
  set_engine("ranger", regularization.factor = best_res$regularization.factor)

final_rf_fit <- final_rf_mod %>%
  fit(Sale_Price ~ ., data = juice(ames_rec))
```

]

.panel[.panel-name[New predictions]

```r
# Apply the prepped recipe to the test set before predicting
test_bake <- bake(ames_rec, new_data = ames_test)

final_pred <- final_rf_fit %>%
  predict(test_bake) %>%
  bind_cols(ames_test)

final_rmse <- final_pred %>% rmse(Sale_Price, .pred)
```

]

.panel[.panel-name[Final plot]

<img src="slides_files/figure-html/unnamed-chunk-28-1.png" width="40%" height="30%" style="display: block; margin: auto;" />

]
]

---

## Resources

- [Talk GitHub repository](https://github.com/brunaw/tidymodels-webinar)
- http://tidymodels.org/
  - Tutorials and function documentation
- [Book: Tidy Modeling with `R`](https://www.tmwr.org/)
- [Book: Applied Predictive Modeling](http://appliedpredictivemodeling.com/)
- [Code: Applied Predictive Modeling](https://github.com/topepo/tidyAPM)
- Max's workshops:
  - https://github.com/topepo/2020-earl-workshop
  - https://github.com/topepo/RPharma-2019-Workshop
  - https://github.com/topepo/nyr-2020

---
background-image: url(images/sea.jpg)
background-size: cover
class: center, middle, inverse

## .big-text[Questions?]

---
class: bottom, left, inverse

<img style="border-radius: 50%;" src="https://avatars.githubusercontent.com/u/18500161?s=460&u=34b7f4888b6fe48b3c208beb51c69c146ae050cf&v=4" width="150px"/>

## Thank you!

### Find me at...

[GitHub: @brunaw](https://github.com/brunaw)
[Twitter: @bwundervald](https://twitter.com/bwundervald)
[Page: http://brunaw.com/](http://brunaw.com/)

Slides template by [Dr. Alison Hill](http://twitter.com/apreshill) & illustrations by [Allison Horst](https://github.com/allisonhorst/stats-illustrations)
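
---
class: middle

## Appendix: other resampling schemes in `rsample`

The resampling options listed earlier in the talk (bootstrap, V-fold, leave-one-out, nested and Monte Carlo cross-validation) each have a constructor in `rsample`. What follows is only a minimal sketch, not code from the talk: it assumes a 200-row subset of `ames` (called `ames_small` here, a hypothetical name) to keep things quick, and the numbers of folds and repeats are arbitrary choices for illustration.

```r
library(rsample)

data(ames, package = "modeldata")
ames_small <- ames[1:200, ]  # hypothetical subset, only to keep the example fast

set.seed(2021)
boots  <- bootstraps(ames_small, times = 10)         # bootstrap resamples
folds  <- vfold_cv(ames_small, v = 5)                # V-fold cross-validation
loo    <- loo_cv(ames_small)                         # leave-one-out
mc     <- mc_cv(ames_small, prop = 0.8, times = 10)  # Monte Carlo cross-validation
nested <- nested_cv(                                 # nested resampling
  ames_small,
  outside = vfold_cv(v = 5),
  inside  = bootstraps(times = 10)
)

# Each row of these objects holds one split; analysis() and assessment()
# extract the two parts of a split, much like training() and testing()
# do for the initial split
first_fold <- folds$splits[[1]]
dim(analysis(first_fold))
dim(assessment(first_fold))
```

`vfold_cv()` is the scheme used in the tuning step of this talk; the other constructors return objects with the same `splits` column, so they can be explored in the same way.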