class: center, middle, inverse, title-slide # Bayesian Optimization + Regularization Paths in Random Forests ## Bruna Wundervald ### Ph.D. Candidate in Statistics, Maynooth University ### October, 2019 --- # Summary - Parameter Tuning in Machine Learning - Bayesian Function Optimisation - Bayesian Optimization of Regularization Paths in Random Forests --- # Big Picture - Machine learning algorithms often involve the tuning of parameters - Expert experience, rules of thumb or even brute force - Migh take a lot of time: great appeal for automatic techniques **We like (and know how) to model things.** Can we then use a model to select which hyperparameters will be good for an algorithm? <img src="img/func.png" width="50%" style="display: block; margin: auto;" /> ...but that will be a tricky optimization problem! --- # Bayesian Hyperparameter Optimisation - **We are interested in finding the minimum of a function `\(f(x)\)` on some bounded `\(\mathcal{X} \in \mathbb{R}^{D}\)`, or** `$$x^* = \underset{x \in \mathcal{X}}{\text{arg min}} f(x)$$` Basically, <b> we build a probability model of the objective function and use it to select the most promising parameters. </b> $$ P(\text{objective } | \text{ hyperparameters})$$ where the objectives are, e.g., the RMSE , misclassification rate, etc. For doing Bayesian optimisation, we need: `$$\underbrace{\text{Prior over } f(x)}_{\text{Our assumptions about the functions being optmized}} + \underbrace{\text{Acquisition function}}_{\text{Determines the next point to evaluate}}$$` (Snoek, Larochelle, and Adams, 2012) --- ## Building a probabilistic model for the objective function <img src="img/bayes_opt.png" width="90%" style="display: block; margin: auto;" /> Source: `https://github.com/glouppe/talk-bayesian-optimisation` --- ## Prior over `\(f(x)\)`: Gaussian Processes - The classical prior in Bayesian Optimisation - The GPs are defined by the property that any finite set of `\(N\)` points induces a Multivariate Gaussian distribution on `\(\mathbb{R}^{N}\)` - Mean function `\(m: \mathcal{X} \rightarrow \mathbb{R}\)` and covariance function `\(K: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}\)` - Convenient and powerful as a prior: very flexible ## Acquisition Function We can now assume `\(f(\mathbf{x}) \sim GP\)` (prior) and `\(y_n \sim \mathcal{N}(f(\mathbf{x}_n), \nu)\)`, where `\(\nu\)` is the noise introduced into the function observations. - Acquisition Function: `\(a: \mathcal{X} \rightarrow \mathbb{R}^{+}\)`, determines what point in `\(\mathcal{X}\)` should be the next evaluated - Generally depends on the previous observation and the GP hyperparameters: `\(a(\mathbf{x}; {\mathbf{x}_n, y_n}, \theta)\)` Best current value: `$$\mathbf{x}_{best} = \underset{\mathbf{x}_n}{\text{arg min}} f(\mathbf{x}_n)$$` --- ## Acquisition Function: choices Defining the standardised improvement as $$ \gamma(\mathbf{x}) = \frac{f(\mathbf{x}_{best}) - \mu(\mathbf{x}; {\mathbf{x}_n, y_n}, \theta)}{\sigma(\mathbf{x}; {\mathbf{x}_n, y_n}, \theta)} $$ we can choose to maximize one of the following: 1. <b> Probability of improvement:</b> `$$a_{\text{PI}}(\mathbf{x}; {\mathbf{x}_n, y_n}, \theta) = \Phi(\gamma(\mathbf{x}))$$` - Downside: can be <b> greedy </b>, since likely candidates can be close to the current one. 2. <b> Expected improvement:</b> `$$a_{\text{EI}}(\mathbf{x}; {\mathbf{x}_n, y_n}, \theta) = \sigma(\mathbf{x}; {\mathbf{x}_n, y_n}, \theta)[\gamma(\mathbf{x})\Phi(\gamma(\mathbf{x}) + \mathcal{N}(\gamma(\mathbf{x}); 0, 1)]$$` 3. <b> GP Upper Confidence Bound: </b> $$a_{\text{LCB}}(\mathbf{x}; {\mathbf{x}_n, y_n}, \theta) = \mu(\mathbf{x}; {\mathbf{x}_n, y_n}, \theta) - \kappa \sigma(\mathbf{x}; {\mathbf{x}_n, y_n}, \theta),$$ where `\(\kappa\)` is tunable. --- # Algorithm .content-box-grey[ 1. Choose some **prior** over the space of possible objectives `\(f\)` 2. Combine prior and likelihood to get a **posterior** over the objective, given some observations 3. Use the posterior to find the next value to be evaluated, according to the chosen **acquisition function** 4. Augment the data (with the new best value) ] > Iterate between 2 and 4 until you are satisfied --- ## Acquisition Function: in action <img src="img/utility.gif" width="120%" style="display: block; margin: auto;" /> Source: adapted from `https://github.com/glouppe/talk-bayesian-optimisation` --- ## All the functions at the same time <img src="img/bayesian_optimization.gif" width="120%" style="display: block; margin: auto;" /> --- class: inverse, middle, center Bayesian Optimization of Regularization Paths in Random Forests --- # Regularization in Random Forests In (Deng and Runger, 2012), the authors first discuss the idea of regularising Random Forests models by penalizing the gain of each variable, or `$$\begin{equation} Gain_{R}(\mathbf{X}_{j}, t) = \begin{cases} \lambda_{j} Gain(\mathbf{X}_{j}, t), \thinspace j \notin \mathbb{S} \text{ and} \\ Gain(\mathbf{X}_{j}, t), \thinspace j \in \mathbb{S}, \end{cases} \end{equation}$$` where `\(\mathbb{S}\)` is the set of indices of the covariates used previously, `\(\mathbf{X}_{j}, j \in \{1, \dots, p\}\)` is the candidate covariate for splitting and `\(t\)` the respective splitting point. We proposed a extension to their work, with `$$\begin{equation} \lambda_j = (1 - \gamma) \lambda_0 + \gamma g(\mathbf{x}_j), \end{equation}$$` where - `\(\lambda_0 \in [0, 1)\)` can be interpreted as the baseline regularization, - `\(g(X_j)\)` is a function of the respective `\(j\)`-th covariate, - `\(\gamma \in [0, 1)\)` is their mixture parameter, under the resctriction that `\(\lambda_j \in [0, 1)\)`. --- # Parameters and parameters - The regularized Random Forests work well, but it is not clear how to set up their parameters - Main sources of concern: `\(\lambda_0\)`, `\(\gamma\)` and `mtry` values - All have shown to be affecting the regularization results - `\(g(X_j)\)` should not be tuned but instead defined by hand, since it incorporates expert knowledge about the problem - Random forests in high-dimensional situations: take a long time to run - Should be tuned in a smart way! ## Bayesian Optimisation `$$P(\text{Objective }_{RRF}| \text{ mtry}, \gamma, \lambda_0, g(\mathbf{x}_j))$$` --- ## Regularization paths - We wish to produce *smooth* plots of the Number of Selected Variables `\(\times \lambda_j\)` .pull-left[ <img src="presentation_files/figure-html/unnamed-chunk-5-1.png" width="70%" style="display: block; margin: auto;" /> <img src="presentation_files/figure-html/unnamed-chunk-6-1.png" width="70%" style="display: block; margin: auto;" /> ] .pull-right[ - Main idea: be able to select a point in the plot that will gives us a good notion of how many variables will be used given some `\(\lambda_j\)` - The **Objective** function here will be different: - We need to optimize the smoothness of the number of parameters selected within a group of parameters ] --- # Number of Parameter Smoothness **How?** 1. Run the models for different combinations of parameters( `\(\lambda_0\)`, `\(\gamma\)` and `mtry`) 2. Check if within each combination of 2 parameters (for example, `\(\gamma\)` and `mtry`), the distribution of the frequencies of the number of selected variables is uniform **Why is that?** - We make the assumption that, if the number of variables selected is *smooth* within a certain combination of parameters, their frequency distribution should be uniform > Use the statistic of a KS-test for Uniform distributions as objective! --- ## Given that, our next steps are 1. Obtain more conclusions from experiments with BayesOpt 2. Implement the `rrftune` package in `R` and `c++`, which will do the automatic parameter tuning for regularized Random Forests 3. Combine all with the previous results about regularized Random Forests 4. Finish paper on the whole subject **Suggestions?** --- class: center, middle ## Acknowledgments This work was supported by a Science Foundation Ireland Career Development Award grant number: 17/CDA/4695 <img src="img/SFI_logo.jpg" width="50%" height="40%" style="display: block; margin: auto;" /> --- # References <p><cite>Breiman, L. (2001). “Random Forests”. In: <em>Machine Learning</em>. ISSN: 1098-6596. DOI: <a href="https://doi.org/10.1017/CBO9781107415324.004">10.1017/CBO9781107415324.004</a>. eprint: arXiv:1011.1669v3.</cite></p> <p><cite><a id='bib-guided'></a><a href="#cite-guided">Deng, H. and G. C. Runger</a> (2012). “Gene selection with guided regularized random forest”. In: <em>CoRR</em> abs/1209.6425. eprint: 1209.6425. URL: <a href="http://arxiv.org/abs/1209.6425">http://arxiv.org/abs/1209.6425</a>.</cite></p> <p><cite>Friedman, J. H. (1991). “Rejoinder: Multivariate Adaptive Regression Splines”. In: <em>The Annals of Statistics</em>. ISSN: 0090-5364. DOI: <a href="https://doi.org/10.1214/aos/1176347973">10.1214/aos/1176347973</a>. eprint: arXiv:1306.3979v1.</cite></p> <p><cite>Murphy, K. P. (2012). <em>Machine Learning: A Probabilistic Perspective</em>. The MIT Press. ISBN: 0262018020, 9780262018029.</cite></p> <p><cite><a id='bib-bayesopt'></a><a href="#cite-bayesopt">Snoek, J., H. Larochelle, and R. P. Adams</a> (2012). “Practical Bayesian Optimization of Machine Learning Algorithms”. In: <em>Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 2</em>. NIPS'12. Lake Tahoe, Nevada: Curran Associates Inc., pp. 2951–2959. URL: <a href="http://dl.acm.org/citation.cfm?id=2999325.2999464">http://dl.acm.org/citation.cfm?id=2999325.2999464</a>.</cite></p> --- class: center, middle, inverse # Thanks! <img src= "https://s3.amazonaws.com/kleebtronics-media/img/icons/github-white.png", width="50", height="50", align="middle"> <b>[@brunaw](https://github.com/brunaw)