Random Forests by Hand

Bruna Wundervald
brunadaviesw at gmail.com
PhD Candidate in Statistics, Hamilton Institute

Introduction

Statistical machine learning is a field focused on prediction tasks. When the variable of interest is already observed in the data, we are dealing with a supervised problem. The general idea is that we can predict the response variable \(Y\) based on the information brought by a set of covariates \(\mathbf{X}\). This is done by fitting a model that relates these predictors to the variable of interest, which can either be quantitative (continuous or discrete) or a factor variable with two or more classes.

When the response is categorical, the prediction method is called a classification method. Classical statistical learning methods for classification are logistic regression, support vector machines, neural networks, decision trees and random forests (Hastie, Tibshirani, and Friedman 2009). Classification problems emerge in a wide range of situations; examples include predicting whether someone will pay back a bank loan based on their financial situation, predicting which party a person is more likely to vote for based on their political views, etc.

As for the problems involving a quantitative response variable, we call them regression problems. The response can represent many different measures, such as human height, human weight, the count of people in a building each day, how long it takes for a bus to arrive at a bus stop, salaries for technology professions, and so on. Some methods used for classification can also be applied in these cases, such as neural networks, regression trees and random forests. Other methods for regression include least squares linear regression, survival regression, time series models and several more (Hastie, Tibshirani, and Friedman 2009).

Regression Trees

Decision trees are flexible non-parametric methods that can be applied to both regression and classification problems. They are based on the estimation of a series of binary conditional statements (if-else) on the predictor space. As an illustration, let us get back to the classification problem demonstrated in Figure 1. On the left we see the plot of two predictor variables \(x_1\) and \(x_2\), with the color representing the class to which each data point belongs. Suppose the two classes are the names of the colors, “red” and “green”. An optimal way of correctly classifying the points is shown on the right plot. The rule is: if \(x_1 \leq 50\) and \(x_2 \leq 0.5\), or \(x_1 > 50\) and \(x_2 > 0.5\), the point should be classified as “red”; if the point is not in either of these two regions, it should be classified as “green”.

Figure 1: Illustration of a decision space

Figure 1 also demonstrates an important characteristic of trees: their ability to handle non-linear relationships between the predictor variables and the response.

Trees can also be represented as a graph. In Figure 2 we have a simple graph of the rules of our illustration. In this case, the observations in terminal nodes 7 and 4 would be classified as “red”, and those in nodes 5 and 6 as “green”.

Figure 2: The graph of the decision rules.
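The same rules can be translated directly into code. Below is a minimal sketch in R of the decision rules written as nested if-else statements, using the cutpoints 50 and 0.5 from the illustration (the function name classify_point is only illustrative):

# The decision rules of Figures 1 and 2 as nested if-else statements
classify_point <- function(x1, x2) {
  if (x1 <= 50) {
    if (x2 <= 0.5) "red" else "green"   # left branch of the tree
  } else {
    if (x2 > 0.5) "red" else "green"    # right branch of the tree
  }
}

classify_point(30, 0.2)  # "red"
classify_point(70, 0.9)  # "red"
classify_point(30, 0.9)  # "green"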

For a classification task, the prediction in each division is a class, and the partitions of the predictor space are chosen in order to minimize the rate of incorrect classification. In the regression case, on the other hand, the prediction is a constant value, usually taken as the mean of \(y\) in each region. Likewise, the minimized measure is the residual sum of squares, given by

\[ RSS_{tree} = \sum_{j = 1}^{J} \sum_{i \in R_j} (y_i - \hat y_{R_j})^2\]

where \(\hat y_{R_j}\) is the mean response for the observations in the jth region of the predictor space. Since it is computationally too expensive to consider every possible partition of the feature space into \(J\) regions, an approach called recursive binary splitting is taken. The algorithm is described as follows:

  1. Select a predictor \(X_j\),
  2. Select a respective cutpoint \(s\) to split the feature space into regions \(\{X|X_j < s \}\) and \(\{X|X_j \geq s \}\) that leads to the greatest reduction in the \(RSS_{tree}\), described by

\[ \sum_{i: x_i \in R_1 (j, s)} (y_i - \hat y_{R_1})^2 + \sum_{i: x_i \in R_2 (j, s)} (y_i - \hat y_{R_2})^2\]

with \(\hat y_{R_1}\) being the mean response in \(R_1 (j, s)\) and \(\hat y_{R_2}\) the mean response in \(R_2 (j, s)\),

  3. Split the feature space at the chosen point into the two regions and go back to Step 1, taking into consideration the previously created nodes.

The process continues until a stopping criterion is reached. Once the regions are created, we predict the response for the observations as the mean in each of the partitions. In the following we have a very simple implementation of the tree algorithm. Note that here we require each node to contain at least 5% of the dataset, in order to avoid having meaningless nodes. Besides that, we chose to use only \(m \approx \sqrt{p}\) predictors in each tree, \(p\) being the total number of available features, for reasons that will be explained in the next section.
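Before the full implementation, the following minimal base-R sketch shows what evaluating a single candidate cutpoint amounts to; the split_rss helper and the simulated data are only for illustration:

# Two-region residual sum of squares for a candidate cutpoint s on a
# predictor x, following the formula above
split_rss <- function(y, x, s){
  left  <- y[x < s]
  right <- y[x >= s]
  sum((left - mean(left))^2) + sum((right - mean(right))^2)
}

# Simulated data with a true change point around 5
set.seed(2019)
x <- runif(50, 0, 10)
y <- ifelse(x < 5, 2, 8) + rnorm(50, sd = 0.5)

# Trying every observed value of x as a cutpoint and keeping the best one
candidates <- sort(unique(x))
rss <- sapply(candidates, function(s) split_rss(y, x, s))
candidates[which.min(rss)]  # should fall close to 5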

R Code

The following chunk represents one way of implementing regression trees in R. The function receives the formula of the model and the data, and returns a list containing the data with its predictions, the variables where the splits happened and the cutpoints. This is not a perfect implementation and improvements can be made, such as making it generic enough to receive all types of predictors and allowing the splits to reuse a variable.

library(tidyverse)
library(furrr)


# Defining the function to create the minimizing criteria
sum_of_squares <- function(cut_point, variable_index, data_ss, X){
  
  # This function receives a variable and calculates the residual 
  # sum of squares if we partition the data accordingly to a cutpoint
  # in the chosen variable, considering all the pre-existent nodes
  
  cuts <- data_ss %>% 
    mutate(var = !!sym(variable_index)) %>% 
    group_by(node) %>% 
    mutate(criteria = ifelse(var > cut_point, "set1", "set2")) %>%
    mutate(mean_y = mean(!!sym(colnames(data_ss)[1]))) %>% 
    group_by(criteria) %>% 
    mutate(
      mean_new_node = mean(!!sym(colnames(data_ss)[1])), 
      resid = (!!sym(colnames(data_ss)[1]) - mean_new_node)^2)  %>%
    group_by(node) %>% 
    summarise(
      sqr_previous = sum((!!sym(colnames(data_ss)[1]) - mean_y)^2), 
      sqr_new_node = sum(resid)) 
  
  results <- cuts %>% 
    filter(sqr_new_node < sqr_previous) %>% 
    arrange(sqr_new_node) %>%
    slice(1) %>% 
    c()
  
  if(length(results[[1]]) > 0){
    return(results)
  } else { 
    results <- cuts %>% 
      arrange(sqr_previous) %>%
      slice(1) %>%  c()
    
    return(results)
  }
} 
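As a quick illustration of the input that sum_of_squares() expects, it could be called as below; the response must be the first column of data_ss, a node column must mark the current partition, and the X argument is not used inside the function:

# A toy call: everything still sits in the root node, and we compare the
# RSS of the root with the RSS after splitting on wt at the cutpoint 3.2
toy <- mtcars %>%
  select(mpg, wt, hp) %>%
  mutate(node = "root")

sum_of_squares(cut_point = 3.2, variable_index = "wt",
               data_ss = toy, X = NULL)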



my_trees <- function(formula, data, parallel = TRUE,
                     random_vars = FALSE){

  # Taking some action if the data has too many NAs
  if(nrow(na.omit(data)) < 10){
    stop("Your data has too many NAs.")
  }
  # Removing the intercept
  formula <- as.formula(paste(c(formula), "- 1"))
  
  # Extracting the model structure
  m <- model.frame(formula, data = data)
  # Defining the response
  response_name <- all.vars(formula)[1]
  y <- model.response(m)
  
  # Defining the matrix of predictor variables
  X <- model.matrix(formula, m)
  colnames(X) <- colnames(m)[-1]

  # Selecting m predictors randomly 
  if(random_vars){
    n_pred <- round(sqrt(ncol(X)))
    selected_predictors <- sample(1:ncol(X), size = n_pred, replace = FALSE)
  } else {
    n_pred <- ncol(X)
    selected_predictors <- 1:ncol(X)
  }
  node_num <- n_pred + 1
  
  # Creating the matrix to save the criteria for the cuts
  #criteria <- matrix(NA, ncol = 4, nrow = n_pred)
  criteria <- data.frame(i = NA, cut = NA, sqr = NA, node = NA)
  
  # Creating the minimum of observations that should be in each node
  keep_node <- round(nrow(data)*0.05)
  
  # Creating the node for each observation
  m$node<- "root"
  
  # Defining auxiliar variables 
  counter <- 0
  var  <-  vector("numeric", n_pred)
  cut_point <-  vector("numeric", n_pred)
  parents  <-  vector("numeric", n_pred)
  
  final_vars <-  c(0)
  final_cuts <-  c(0)
  final_parents <-  c(0)
  
  temps <- vector("numeric", n_pred)
  #iter = 2
  
  for(j in 2:node_num){

    # Finding the minimun sum of squares by testing all variables and different
    # cutpoints
    for(i in 1:n_pred){
  
      # Selecting a variable
      variable <- colnames(X)[selected_predictors[i]]
      # Defining sequence of cutpoints to be tested
      cut_points <- c(X[, variable])
      
      if(parallel){ plan(multisession) }
      # Mapping the RSS function to the cutpoints
      points <- cut_points %>% furrr::future_map(sum_of_squares, 
                                          variable_index = variable,
                                          data_ss = m, X = X) 
      
      # Saving the results
      sqrs <- points %>% map_dbl("sqr_new_node") 
      node_new <- points %>% map_chr("node") 
      cut <- sqrs %>% which.min()
      
      # Creating the criteria matrix to check for the variable that produced the
      #  smallest rss
      
      criteria[i, ] <- c(i, cut_points[cut], sqrs[cut], node_new[cut])
      
    }  

    # Verifying which variable and cutpoint gave the best results
    # The sqr column of the criteria data frame is stored as text,
    # so it is converted back to numeric before taking the minimum
    min_sqr <- which.min(as.numeric(criteria[, 3]))
    root <- selected_predictors[min_sqr]
    # root indexes the original predictor columns, which matters when
    # only a random subset of predictors is in use
    var[j-1] <- colnames(X)[root]
    cut_point[j-1] <- as.numeric(criteria[min_sqr, 2])
    parents[j-1] <- criteria[min_sqr, 4] 
    
    
    # Checking the dimension of the possible new nodes to see if  
    # they have the minimum of observations
    temps[j] <- m %>% 
      filter(node == parents[j-1]) %>% 
      mutate(groups =  ifelse(!!sym(var[j-1]) > cut_point[j-1], 'set1', 'set2')) %>% 
      count(groups) %>% 
      arrange(n) %>% slice(1) %>% pull()
    
    
    # Creating the very first split 
    if(j == 2 && temps[2] >= keep_node){
      
      m <- m %>% 
        mutate(node = ifelse(!!sym(var[j-1]) > cut_point[j-1],
                             paste(node, var[j-1], "1"), 
                             paste(node, var[j-1], "2")))
      
      counter <- counter + 1
      
      # Feeding the final vectors of variables and cutpoints
      final_vars[counter] <- var[j-1] 
      final_cuts[counter] <- cut_point[j-1] 
      final_parents[counter] <- parents[j-1]
      
    } else {
      # Creating the next splits and changing the auxiliar variables 
      if(temps[j] >= keep_node && temps[j] != temps[j-1]){
        m <- m %>% 
          mutate(node = ifelse(node == parents[j-1],
                               ifelse(!!sym(var[j-1]) > cut_point[j-1],
                                      paste(node, var[j-1], "1"), 
                                      paste(node, var[j-1], "2")), 
                               paste(node)))
        counter = counter + 1
        
        final_vars[counter] <- var[j-1] 
        final_cuts[counter] <- cut_point[j-1] 
        final_parents[counter] <- parents[j-1]
        
      } 
      #else { counter = counter }
    }
    
    # Updating objects 
    selected_predictors <- selected_predictors[-min_sqr]
    n_pred <- n_pred - 1
    criteria <- data.frame(i = NA, cut = NA, sqr = NA, node = NA)
    j <- j + 1
    
  }
  
  # Calculating the mean for each final node  
  m <- m %>% 
    group_by(node) %>% 
    mutate(prediction = round(mean(!!sym(response_name)), 3))
  
  # Calculating the means
  means <- m %>% 
    distinct(prediction) %>% 
    pull(prediction)
  
  nodes <- m %>% distinct(node)
  
  return(list(model_data = m, 
              vars = final_vars, 
              cuts = as.numeric(final_cuts),
              parents = final_parents,
              means = means, 
              nodes = nodes))
  
}

Example: the mtcars data

The mtcars data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). Here we try to predict miles per gallon using as covariates the number of cylinders, displacement, gross horsepower, rear axle ratio, weight, 1/4 mile time, transmission, number of forward gears and number of carburetors for each car model.

## [1] 148.3985
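A sketch of how my_trees() could be called to fit this model and compute such a residual sum of squares from the returned model_data (this is an illustration; the exact chunk behind the value above is not shown):

data(mtcars)

tree <- my_trees(
  mpg ~ cyl + disp + hp + drat + wt + qsec + am + gear + carb,
  data = mtcars, parallel = FALSE)

# Residual sum of squares of the tree, using the per-node predictions
with(tree$model_data, sum((mpg - prediction)^2))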

Random Forests

Decision trees usually have high variance. A natural way to reduce the variance of a model is to take many training sets from the population, build a separate prediction model for each training set and average the resulting predictions, forming an equation like

\[ \hat f_{avg}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat f^b(x),\]

where B is the number of training sets and \(\hat f^b(x)\) are the estimated functions. With that, even if each model has high variance, their average will have a reduced variance. One way of producing this many training sets for trees is the bootstrap (Hesterberg 2011). We construct B regression trees using B bootstrapped sets and average their predictions. This method is called bagging (Hastie, Tibshirani, and Friedman 2009).

Random forests (Breiman 2001) are a large combination of trees that improves on the bagging method by decorrelating the trees. We still build trees on bootstrapped samples, but for each tree only a random sample of the predictors is considered as candidate variables. The size of this sample is usually taken as \(m \approx \sqrt{p}\), with \(p\) being the total number of available features. With this simple change, we prevent strong predictor variables from always appearing at the top of the trees, since sometimes they will not even be considered for the split. The average of the trees will be less variable and hence more reliable (Hastie, Tibshirani, and Friedman 2009).

R Code

As with the trees, a function that produces a random forest can be implemented in R, and it too can be improved in various ways.
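A minimal sketch of such a function, built on top of my_trees() with bootstrap samples and random_vars = TRUE; the name rf and its arguments are assumptions, not necessarily the original implementation:

rf <- function(formula, data, B = 50, parallel = FALSE){
  # B bootstrap samples, each fitted with my_trees() using a random
  # subset of about sqrt(p) predictors
  purrr::map(seq_len(B), function(b){
    boot <- data[sample(nrow(data), replace = TRUE), ]
    my_trees(formula, data = boot, parallel = parallel, random_vars = TRUE)
  })
}

# forest <- rf(mpg ~ cyl + disp + hp + drat + wt + qsec + am + gear + carb,
#              data = mtcars, B = 50)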

Example: the mtcars data

This is the same problem as considered before, but now using the rf function to fit a random forest model.

## [1] 2068.631

We also need a function that predicts a response vector from the random forest model. The new data used is the same as for the tree model.
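A sketch of how this prediction could work: each new observation is walked down the recorded splits (vars, cuts and parents) of a single tree, and the forest prediction is the average over the trees. The names tree_predict and rf_predict, and the assumption of numeric predictors, are illustrative rather than the original implementation:

tree_predict <- function(tree, newdata){
  # Node label -> node mean, as stored by my_trees()
  node_means <- tree$model_data %>% distinct(node, prediction)
  
  apply(newdata, 1, function(row){
    node <- "root"
    # Walk the observation down the recorded splits, in order
    for(k in seq_along(tree$vars)){
      if(node == tree$parents[k]){
        side <- ifelse(row[tree$vars[k]] > tree$cuts[k], "1", "2")
        node <- paste(node, tree$vars[k], side)
      }
    }
    node_means$prediction[node_means$node == node]
  })
}

rf_predict <- function(forest, newdata){
  # The forest prediction is the average of the per-tree predictions
  rowMeans(sapply(forest, tree_predict, newdata = newdata))
}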

By varying the number of trees in the random forest, we can check the residual sum of squares for each value. Figure 3 shows results for different numbers of trees. The plot tells us that a small number of trees produces better results than a large number of trees.

Figure 3: Results of the residual sum of squares for different numbers of trees
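As an illustration of the kind of loop behind this experiment, assuming the rf() and rf_predict() sketches above:

n_trees <- c(1, 5, 10, 25, 50, 100)

rss_by_size <- sapply(n_trees, function(B){
  forest <- rf(mpg ~ cyl + disp + hp + drat + wt + qsec + am + gear + carb,
               data = mtcars, B = B)
  pred <- rf_predict(forest, mtcars)
  sum((mtcars$mpg - pred)^2)
})

data.frame(n_trees, rss = rss_by_size)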

Conclusions and final remarks

For this project, the task was to implement a full random forest model. To do that, the first step was to implement the decision trees, since a random forest is a large combination of them. After that, it is necessary to create a big set of trees and use their average to produce a random forest. The code shown here implements a fit function and a prediction function for both the tree and random forest models.

This code is not final and improvements should be made. First, the decision tree function needs to be faster and should allow the tree to split on the same variable more than once in the model. With that, the random forest model will also be more efficient and correct. The complexity parameter of the tree, which was not discussed here, is a fundamental detail that needs to be included. The prediction functions could also be made more efficient.

References

Breiman, Leo. 2001. “Random Forests.” Machine Learning. https://doi.org/10.1023/A:1010933404324.

Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. https://doi.org/10.1007/978-0-387-84858-7.

Hesterberg, Tim. 2011. “Bootstrap.” Wiley Interdisciplinary Reviews: Computational Statistics. https://doi.org/10.1002/wics.182.