A method for feature selection with Random Forests
Decision tree ensembles are a very popular type of machine learning algorithm, mostly due to their adaptive nature, high predictive power and, to some extent, interpretability. Random Forests are one form of such ensembles: they grow many trees on re-samples of the data and average their results at the end, creating a bagged ensemble (Breiman 1996) described by
\[\begin{equation} \hat f(\mathbf{x}) = \sum_{n = 1}^{N_{tree}} \frac{1}{N_{tree}} \hat f_n(\mathbf{x}), \end{equation}\]
where \(\hat f_n\) corresponds to the \(n\)-th tree. However, even though we can name many very good qualities of Random Forests, we also know that they don’t do feature selection very well. Random Forests usually use all or most of the features that are fed to them, and they struggle to detect highly correlated features (Louppe 2014), which ideally shouldn’t be used in an algorithm more than once. In situations where the features needed for prediction are hard or expensive to obtain (e.g. genetics-related data such as SNPs, peptides or proteins), this becomes a relevant issue that needs to be addressed if we realistically want to use RFs for such prediction tasks.
In this post, I will give a general overview of feature selection in Random Forests using gain penalization. The R code is provided along with the explanation, and I’ll often refer to my own paper on the subject (Wundervald, Parnell, and Domijan 2020). A few auxiliary functions are used throughout the code, and they can be found here.
The idea of doing feature selection via gain penalization was first introduced by Deng and Runger (2012), and it is basically a gain weighting method applied during the greedy estimation of a tree. In other words, when determining the next child node to be added to a decision tree, the gain (or error reduction) of each feature is multiplied by a penalization parameter. With this, a new split will only be made if, after the penalization, the gain of adding this node is still higher than having no new child node in the tree. This penalized gain is written as
\[\begin{equation} \text{Gain}_{R}(\mathbf{X}_{i}, t) = \begin{cases} \lambda \Delta(i, t), \thinspace i \notin \mathbb{U} \text{ and} \\ \Delta(i, t), \thinspace i \in \mathbb{U}, \end{cases} \label{eq:grrf} \end{equation}\]
where \(\mathbb{U}\) is the set of indices of the features previously used in the tree, \(\mathbf{X}_{i}\) is the candidate feature, \(t\) is the candidate splitting point and \(\lambda \in (0, 1]\).
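To make the idea concrete, here is a small standalone toy sketch (not the actual ranger internals) of how the penalized gain could be computed for one candidate split, assuming we already have the impurity reduction of each candidate feature:

# Toy illustration of the penalized gain (not the actual ranger internals):
# 'delta' holds the impurity reduction of each candidate feature at a split,
# 'used' marks the features already present in the tree, and 'lambda' is the
# per-feature penalization coefficient.
penalized_gain <- function(delta, used, lambda) {
  ifelse(used, delta, lambda * delta)
}

delta  <- c(x1 = 0.25, x2 = 0.28, x3 = 0.10)
used   <- c(x1 = TRUE, x2 = FALSE, x3 = FALSE)
lambda <- c(x1 = 0.8,  x2 = 0.8,  x3 = 0.8)

# x2 has the largest raw gain (0.28), but after penalization the
# already-used x1 (0.25) wins the split
penalized_gain(delta, used, lambda)
#>    x1    x2    x3
#> 0.250 0.224 0.080

So a brand-new feature only enters the tree if its penalized gain still beats the gains of the features already in use.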
In our paper (Wundervald, Parnell, and Domijan 2020), we proposed a generalization to the way the penalization coefficients are calculated, such that we can have full control over it. Our \(\lambda_i\) is written as
\[\begin{equation} \lambda_i = (1 - \gamma) \lambda_0 + \gamma g(\mathbf{x}_i), \label{eq:generalization} \end{equation}\]
where \(\lambda_0 \in [0, 1)\) is interpreted as the baseline regularization, \(g(\mathbf{x}_i)\) is a function of the \(i\)-th feature, and \(\gamma \in [0, 1)\) is their mixture parameter, with \(\lambda_i \in [0, 1)\). The idea behind this composition is to create a local-global form of penalization, since the equation mixes how much all features are jointly (globally) penalized and how much of the penalization comes from a local, manually defined \(g(\mathbf{x}_i)\). This \(g(\mathbf{x}_i)\), in turn, should encode relevant information about the features, based on some characteristic of interest (their correlation with the target, for example). The formulation is also inspired by the use of priors in Bayesian methods, since we introduce “prior knowledge” about the importance of each feature into the model. Likewise, the data will tell us how strong our assumptions about the penalization are: even if we try to penalize a truly important feature, its gain will be high enough to overcome the penalization and the feature will still be selected by the algorithm.
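As a quick sanity check on this mixture, the sketch below computes \(\lambda_i\) for a toy \(g(\mathbf{x}_i)\) vector (the values and names are made up purely for illustration):

# lambda_i = (1 - gamma) * lambda_0 + gamma * g(x_i)
# Toy g(x_i) values, already normalized to [0, 1]
g_x      <- c(x1 = 1.00, x2 = 0.40, x3 = 0.05)
lambda_0 <- 0.7   # baseline (global) penalization
gamma    <- 0.5   # how much weight the local g(x_i) gets

# Features with higher g(x_i) get coefficients closer to 1,
# i.e. their gains are penalized less
lambda_i <- (1 - gamma) * lambda_0 + gamma * g_x
lambda_i
#>    x1    x2    x3
#> 0.850 0.550 0.375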
In this blog post, I’ll use two different types of \(g(\mathbf{x}_i)\):

- The mutual information between each feature and the target variable \(y\) (normalized to be between 0 and 1)
- The variable importance values obtained from a previously run standard Random Forest, which is what I call a Boosted \(g(\mathbf{x}_i)\) (also normalized to be between 0 and 1)
For more details on those functions and other options, please see the paper (Wundervald, Parnell, and Domijan 2020).
In general, the penalized random forest model is not the one that will be used for the final predictions. Instead, I prefer to use the method described above as a tool to first select the best possible features, and then fit a final random forest that uses only those features. This full feature selection procedure happens in 3 main steps:

1. Fit penalized random forests over a grid of hyperparameters and \(g(\mathbf{x}_i)\) types, letting the gain penalization select features within each training set;
2. Keep the best penalized models per training set and refit standard random forests using only the features each of them selected;
3. Gather the features most used by those refits and fit one final random forest with this reduced feature set, evaluating it on new training and test splits.
All of this is to make sure that the features used in the final model are, indeed, very good. It might sound a bit exhaustive, but to me it pays off knowing that, out of a few thousand variables, I’ll manage to select only a few and still have a powerful and generalizable model. A few practical notes before we start:
- You can add an “extra penalization” when a new variable is picked at a deep node in a tree (for details, please see the paper); the engine sketch right after this list shows where that option plugs in.
- The mtry hyperparameter requires attention (and even proper tuning), since it is known to affect the prediction power of random forests and, in our case, of the penalized random forests.
- Ideally, the \(\gamma\), \(\lambda_0\) and mtry hyperparameters should be tuned, or set based on the experience of the person running the algorithms, but for the time being we’ll be using a few predefined values (kind of like a grid search).
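The sketch below shows how these knobs map to the ranger engine via parsnip. This is not the post’s own modelling() helper (that comes later), just an illustration; it assumes a recent ranger version (>= 0.12), which exposes the gain penalization arguments used here, and the mtry and lambda_i values are placeholders:

# Sketch: mapping the penalization knobs to the ranger engine (assumption:
# ranger >= 0.12, which exposes these regularization arguments).
library(tidymodels)

# One penalization coefficient per feature; here just a constant placeholder,
# in practice it would come from the lambda_i mixture described above
lambda_i <- rep(0.8, 2905)

rf_penalized_spec <- rand_forest(trees = 500, mtry = 100) %>%  # mtry value is arbitrary here
  set_engine("ranger",
             importance = "impurity",
             regularization.factor   = lambda_i,  # per-feature gain penalization
             regularization.usedepth = TRUE) %>%  # extra penalization at deeper nodes
  set_mode("classification")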
Let us consider the gravier dataset (Gravier et al. 2010), for which the goal is to predict whether 168 breast cancer patients had a diagnosis labelled “poor” (~66%) or “good” (~33%), based on a set of 2905 predictors. In this first part of the code, we’ll just load the data and create our 5-fold cross-validation object, which will be used to create 5 different train and test sets. As usual, there will be lots of tidyverse and tidymodels functions throughout my code:
library(tidyverse)
library(tidymodels)
library(infotheo) # For the mutual information function
set.seed(2021)
# Loading data and creating a 5-fold CV object
data('gravier', package = 'datamicroarray')
gravier <- data.frame(class = gravier$y, gravier$x)
folds <- rsample::vfold_cv(gravier, v = 5) %>%
dplyr::mutate(train = map(splits, training),
test = map(splits, testing))
With this done, we can start the actual modelling steps. I will be using a few auxiliary functions, which are given here, but the two following functions are shown explicitly in this post because they’re very important. The modelling() function will be used to run the random forest algorithms, and it’s written in a way that lets me change both the mtry hyperparameter and the penalization coefficients. At this point we’ll be feeding it all 2905 features and letting the gain penalization perform the feature selection for us. The second function shown below is penalization(), which implements the calculation of two different types of penalization: one that takes \(g(\mathbf{x}_i)\) to be the normalized mutual information between the target and each feature, and one that I call a “Boosted” \(g(\mathbf{x}_i)\), because it depends on the normalized importance values of a previously fitted random forest (for more details, see Wundervald, Parnell, and Domijan (2020)).
# A function that runs the penalized random forest models
modelling <- function(train, reg_factor = 1, mtry = 1){
rf_mod <-
rand_forest(trees = 500, mtry = (mtry * ncol(train)) - 1) %>%
set_engine("ranger", importance = "impurity",
regularization.factor = reg_factor) %>%
set_mode("classification") %>%
parsnip::fit(class ~ ., data = train)
return(rf_mod)
}
# A function that receives the mixing parameters
# and calculates lambda_i with the chosen g(x_i)
penalization <- function(gamma, lambda_0, data = NULL, imps = NULL, type = "rf"){
if(type == "rf"){
# Calculating the normalized importance values
imps <- imps/max(imps)
imp_mixing <- (1 - gamma) * lambda_0 + imps * gamma
return(imp_mixing)
} else if(type == "MI"){
mi <- function(data, var) mutinformation(c(data$class), data %>% pull(var))
# Calculating the normalized mutual information values
disc_data <- infotheo::discretize(data)
disc_data$class <- as.factor(data$class)
names_data <- names(data)[-1]
mi_vars <- names_data %>% map_dbl(~{mi(data = disc_data, var = .x) })
mi_mixing <- (1 - gamma) * lambda_0 + gamma * (mi_vars/max(mi_vars))
return(mi_mixing)
}
}
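The remaining helpers (n_vars(), acc_test(), get_vars(), get_formula() and modelling_reev()) live in the linked auxiliary file and are not shown in the post. To make the rest of the code easier to follow, here is a minimal guess at what they could look like — treat these as assumptions, not the author’s actual implementations:

# Hypothetical reconstructions of the linked auxiliary helpers -- the author's
# actual versions may differ in the details.
n_vars <- function(imp) sum(imp > 0)              # how many features a fit actually used

get_vars <- function(imp) names(imp)[imp > 0]     # names of the used features

get_formula <- function(imp) {                    # formula keeping only the used features
  as.formula(paste("class ~", paste(get_vars(imp), collapse = " + ")))
}

acc_test <- function(model, test) {               # accuracy on a held-out set
  preds <- predict(model, new_data = test)$.pred_class
  mean(preds == test$class)
}

modelling_reev <- function(train, form) {         # plain (non-penalized) RF refit
  rand_forest(trees = 500) %>%
    set_engine("ranger", importance = "impurity") %>%
    set_mode("classification") %>%
    parsnip::fit(form, data = train)
}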
The code below creates all combinations of the hyperparameter values used here for \(\gamma\), \(\lambda_0\) and mtry. After that, we calculate the two \(g(\mathbf{x}_i)\) for each training set and their final penalization coefficients, combining them with \(\gamma\) and \(\lambda_0\) to create the penalization mixture (as described previously), for each of the 5 training sets.
# Setting all parameters ---
mtry <- tibble(mtry = c(0.20, 0.45, 0.85))
gamma_f <- c(0.3, 0.5, 0.8)
lambda_0_f <- c(0.35, 0.75)
parameters <- mtry %>% tidyr::crossing(lambda_0_f, gamma_f)
# Adds gamma_f and lambda_0_f and runs the penalization function with them ------
folds_imp <- folds %>%
dplyr::mutate(
# Run the standard random forest model for the 5 folds
model = purrr::map(train, modelling),
importances_std = purrr::map(model, ~{.x$fit$variable.importance})) %>%
tidyr::expand_grid(parameters) %>%
dplyr::mutate(
    imp_rf = purrr::pmap(
      list(gamma_f, lambda_0_f, train, importances_std),
      penalization, type = "rf"),
    imp_mi = purrr::pmap(
      list(gamma_f, lambda_0_f, train, importances_std),
      penalization, type = "MI"))
# A tibble: 3 x 7
id reg_factor mtry lambda_0_f gamma_f imp_rf imp_mi
<chr> <list> <dbl> <dbl> <dbl> <list> <list>
1 Fold1 <dbl [2,905]> 0.2 0.35 0.3 <dbl [2,90… <dbl [2,90…
2 Fold1 <dbl [2,905]> 0.2 0.35 0.5 <dbl [2,90… <dbl [2,90…
3 Fold1 <dbl [2,905]> 0.2 0.35 0.8 <dbl [2,90… <dbl [2,90…
The folds_imp object has 90 rows, since there are 2 \(\times\) 3 \(\times\) 3 = 18 hyperparameter combinations for each of the 5 training sets (the two different \(g(\mathbf{x}_i)\) are stored as separate columns). Before running our penalized models, we take a look at the results of the standard random forest models (the model column). Below, the accuracy_test_std and accuracy_std columns represent the test and training accuracy of the non-penalized RF, which was run earlier to create the penalization coefficients, so we can now use it for comparison:
folds_imp %>%
dplyr::group_by(id) %>%
dplyr::slice(1) %>%
dplyr::ungroup() %>%
dplyr::select(id, model, train, test) %>%
dplyr::mutate(
model_importance = purrr::map(model, ~{.x$fit$variable.importance}),
n_var = purrr::map_dbl(model_importance, n_vars),
accuracy_test_std = purrr::map2_dbl(
.x = model, .y = test, ~{ acc_test(.x, test = .y)}),
accuracy_std = 1 - purrr::map_dbl(model, ~{ .x$fit$prediction.error})
) %>%
dplyr::select(id, n_var, accuracy_test_std, accuracy_std)
id | n_var | accuracy_test_std | accuracy_std |
---|---|---|---|
Fold1 | 1206 | 0.735 | 0.8481872 |
Fold2 | 1380 | 0.765 | 0.8235503 |
Fold3 | 1379 | 0.765 | 0.8246055 |
Fold4 | 1377 | 0.788 | 0.8260160 |
Fold5 | 1198 | 0.667 | 0.8548713 |
The following code runs all the penalized random forest models and calculates their metrics.
run_all_models <- folds_imp %>%
dplyr::select(id, model, train, test, imp_rf, imp_mi, mtry, lambda_0_f, gamma_f) %>%
tidyr::gather(type, importance, -train, -test, -mtry, -id, -model, -lambda_0_f, -gamma_f) %>%
dplyr::mutate(fit_penalized_rf = purrr::pmap(list(train, importance, mtry), modelling))
And finally, we extract the metrics we’re interested in from each estimated model: the number of features used, the accuracy in the test set, and the accuracy calculated during training:
results <- run_all_models %>%
dplyr::mutate(
model_importance = purrr::map(fit_penalized_rf, ~{.x$fit$variable.importance}),
n_var = purrr::map_dbl(model_importance, n_vars),
accuracy = 1 - purrr::map_dbl(fit_penalized_rf, ~{ .x$fit$prediction.error}),
accuracy_test = purrr::map2_dbl(
.x = fit_penalized_rf, .y = test, ~{ acc_test(.x, .y)}))
Below, we take a look at some of the best rows of the results object (here, the top 5 for the first fold):
results %>%
dplyr::arrange(id, desc(accuracy_test), desc(accuracy), n_var) %>%
dplyr::slice(1:5)
id | mtry | lambda_0_f | gamma_f | type | n_var | accuracy | accuracy_test |
---|---|---|---|---|---|---|---|
Fold1 | 0.85 | 0.75 | 0.5 | imp_rf | 5 | 0.8979667 | 0.824 |
Fold1 | 0.20 | 0.75 | 0.3 | imp_rf | 32 | 0.8786310 | 0.794 |
Fold1 | 0.45 | 0.75 | 0.8 | imp_rf | 12 | 0.9001406 | 0.765 |
Fold1 | 0.45 | 0.35 | 0.3 | imp_rf | 11 | 0.8998857 | 0.765 |
Fold1 | 0.45 | 0.35 | 0.5 | imp_rf | 13 | 0.8995989 | 0.765 |
From the plots of these results, we can get an idea of what the test accuracies (Figure 1) and the final number of variables used (Figure 2) are for each combination of mtry (as a percentage of the variables used) and type of \(g(\mathbf{x}_i)\) (using a mutual information function or boosted by a standard RF), marginalized over \(\lambda_0\); a rough sketch of the plotting code is shown below. Comparing that to the test accuracy (average of 0.744) and number of variables used (average of 1308) of the standard random forest, we can see a clear improvement: most penalized models have a similar accuracy to the full random forest but use many fewer features (from a maximum of around 30 down to a minimum of around 5). Regarding the hyperparameter configurations, using the normalized importance values of a standard random forest as \(g(\mathbf{x}_i)\) leads to the best test accuracy results overall, though with more variation across the different mtry values. As for the number of features used, that same boosted \(g(\mathbf{x}_i)\) also selects just a few variables, again with bigger variation across mtry values. The number of features used in this scenario gets very low, which is very attractive if we’re worried about using as few variables as possible.
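The plotting code itself is not included in the post; a rough ggplot2 sketch that would produce a Figure-1-style plot from the results object (the exact aesthetics here are my guesses, not the original figures) could be:

# Rough sketch of a Figure-1-style plot: test accuracy by mtry fraction and
# type of g(x_i), split by gamma and marginalized over lambda_0
results %>%
  ggplot(aes(x = factor(mtry), y = accuracy_test, fill = type)) +
  geom_boxplot() +
  facet_wrap(~ gamma_f, labeller = label_both) +
  labs(x = "mtry (fraction of features)", y = "Test accuracy",
       fill = expression(g(x[i]))) +
  theme_minimal()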
Now, following what was described before as the ‘full feature selection procedure’, let’s move on to the next step: selecting the best penalized models for each training set and re-evaluating them. In the next code chunks, we get the top-3 models for each training id, arranging first by test accuracy, then training accuracy and number of variables. After that, we create the new model formulas, rerun the random forest algorithm with each selected feature set on each of the 5 training sets, and evaluate the results:
best_models <- results %>%
arrange(desc(accuracy_test), desc(accuracy), n_var) %>%
group_by(id) %>%
slice(1:3) %>%
ungroup() %>%
mutate(new_formula = map(model_importance, get_formula))
# Re-evaluating selected variables -----------------
reev <- tibble(forms = best_models$new_formula) %>%
tidyr::expand_grid(folds) %>%
dplyr::mutate(reev_models = purrr::map2(train, forms, modelling_reev))
results_reev <- reev %>%
dplyr::mutate(feat_importance = purrr::map(reev_models, ~{.x$fit$variable.importance}),
n_var = purrr::map_dbl(feat_importance, n_vars),
accuracy = 1 - purrr::map_dbl(reev_models, ~{ .x$fit$prediction.error}),
accuracy_test = purrr::map2_dbl(.x = reev_models, .y = test, ~{ acc_test(.x, test = .y)}))
id | n_var | accuracy | accuracy_test |
---|---|---|---|
Fold2 | 31 | 0.8638687 | 0.971 |
Fold2 | 31 | 0.8768484 | 0.941 |
Fold2 | 12 | 0.8650016 | 0.941 |
Fold3 | 5 | 0.8607979 | 0.941 |
Fold2 | 5 | 0.8566280 | 0.941 |
The accuracy values are looking very good now, even for the test set. Note that the results_reev object has 75 rows, since we re-ran each of the 15 selected models (3 per fold) on each of the 5 training sets. We still need to reduce this number, so the last step of our method consists of gathering the features most used by these models and creating one final algorithm. This final model will be evaluated on 20 training and test sets, so we can be more confident about its accuracy. In the following code, we select the 30 best fitted models (in terms of accuracies and number of features used), find the 15 features most used by them, and fit a random forest model on 20 new training and test sets using this final 15-feature set.
selected_vars <- results_reev %>%
arrange(desc(accuracy_test), desc(accuracy), n_var) %>%
slice(1:30) %>%
mutate(ind = 1:n(), vars = map(feat_importance, get_vars)) %>%
dplyr::select(ind, vars) %>%
unnest(cols = vars) %>%
group_by(vars) %>%
summarise(count = n()) %>%
arrange(desc(count))
# Select the final 15 features
final_vars <- selected_vars %>% slice(1:15) %>% pull(vars)
# Create the final formula
final_form <- paste("class ~ ", paste0(final_vars, collapse = ' + ')) %>%
as.formula()
# Create the 20 new training and test sets
set.seed(2021)
folds_20 <- rsample::vfold_cv(gravier, v = 20) %>%
dplyr::mutate(train = map(splits, training), test = map(splits, testing))
# Run the final model for the new train-test sets
final_results <- folds_20$splits %>% map(~{
train <- training(.x)
test <- testing(.x)
rf <- rand_forest(trees = 500, mtry = 7) %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("classification") %>%
parsnip::fit(final_form, data = train)
accuracy_test <- acc_test(rf, test = test)
list(accuracy_test = accuracy_test,
accuracy = 1 - rf$fit$prediction.error,
imp = rf$fit$variable.importance)
})
The accuracy averages and medians for this final model are shown below. We can see that the final test accuracy is, on average, higher than what was seen in the previous plots, while now using only 15 features. Finally, we look at the variable importance of the 15 selected features, arranged in importance order (a rough sketch of this plot is given after the accuracy table). This plot tells us which variables helped the predictions the most, and we can see that the most important feature really dominates it.
data.frame(accuracy_test = final_results %>% map_dbl("accuracy_test"),
accuracy = final_results %>% map_dbl("accuracy")) %>%
gather(type, value) %>%
group_by(type) %>%
summarise(mean = mean(value), median = median(value))
type | mean | median |
---|---|---|
accuracy | 0.8974934 | 0.8971951 |
accuracy_test | 0.8681000 | 0.8820000 |
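The code for the importance plot is not shown in the post; a rough sketch of how it could be reproduced from final_results (plot details are guesses) is:

# Sketch of the final variable importance plot: average impurity importance of
# the 15 selected features over the 20 resamples, ordered by importance
imp_avg <- final_results %>%
  purrr::map("imp") %>%
  purrr::reduce(`+`) / length(final_results)

tibble(feature = names(imp_avg), importance = imp_avg) %>%
  ggplot(aes(x = importance, y = reorder(feature, importance))) +
  geom_col() +
  labs(x = "Average importance (impurity)", y = NULL) +
  theme_minimal()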
This post only intends to be a quick demonstration of how feature selection via gain penalization can be used, but our results can also be compared to others obtained for the same dataset in the related literature (e.g. Huynh et al. 2020; López, Maldonado, and Carrasco 2018; Takada, Suzuki, and Fujisawa 2018).
Breiman, Leo. 1996. “Bagging Predictors.” Machine Learning 24 (2): 123–40. https://doi.org/10.1023/A:1018054314350.
Deng, Houtao, and George Runger. 2012. “Feature Selection via Regularized Trees.” In The 2012 International Joint Conference on Neural Networks (IJCNN), 1–8. IEEE.
Gravier, Eléonore, Gaëlle Pierron, Anne Vincent-Salomon, Nadège Gruel, Virginie Raynal, Alexia Savignoni, Yann De Rycke, et al. 2010. “A Prognostic DNA Signature for T1T2 Node-Negative Breast Cancer Patients.” Genes, Chromosomes and Cancer 49 (12): 1125–34.
Huynh, Phuoc-Hai, Thanh-Nghi Do, and others. 2020. “Improvements in the Large P, Small N Classification Issue.” SN Computer Science 1 (4): 1–19.
Louppe, Gilles. 2014. “Understanding Random Forests: From Theory to Practice.” PhD thesis. https://doi.org/10.13140/2.1.1570.5928.
López, Julio, Sebastián Maldonado, and Miguel Carrasco. 2018. “Double Regularization Methods for Robust Feature Selection and SVM Classification via DC Programming.” Information Sciences 429: 377–89.
Takada, Masaaki, Taiji Suzuki, and Hironori Fujisawa. 2018. “Independently Interpretable Lasso: A New Regularizer for Sparse Regression with Uncorrelated Variables.” In International Conference on Artificial Intelligence and Statistics, 454–63. PMLR.
Wundervald, Bruna, Andrew C Parnell, and Katarina Domijan. 2020. “Generalizing Gain Penalization for Feature Selection in Tree-Based Models.” IEEE Access 8: 190231–9.
For attribution, please cite this work as
Wundervald, "Blog Posts: Feature Selection via Gain Penalization in Random Forests", IEEE Access, 2021
BibTeX citation
@article{wundervald2021feature,
  author  = {Wundervald, Bruna},
  title   = {Blog Posts: Feature Selection via Gain Penalization in Random Forests},
  journal = {IEEE Access},
  year    = {2021},
  note    = {https://ieeexplore.ieee.org/document/9229097},
  doi     = {10.1109/ACCESS.2020.3032095}
}