Random Forest and Gradient-Boosted Trees

Boston Housing Data

Author: Byeong-Hak Choe
Published: March 30, 2026
Modified: March 30, 2026

Boston Housing Data

library(tidyverse)   # read_csv(), dplyr verbs, ggplot2
library(rmarkdown)   # paged_table()
library(ranger)      # random forests
library(vip)         # variable importance plots
library(pdp)         # partial dependence plots

boston <- read_csv("https://bcdanl.github.io/data/boston.csv") |>
  janitor::clean_names()

paged_table(boston)

Modeling goal

  • Predict medv, median housing value in thousands of dollars
  • Boston housing is a classic regression example with multiple socioeconomic and neighborhood predictors
  • We will use it to study overfitting, pruning, random forests, and boosting

Create training and test data

set.seed(42120532)
boston <- boston |> 
  mutate(rnd = runif(n()))

dtrain <- boston |> 
  filter(rnd > .2) |> 
  select(-rnd)

dtest <- boston |> 
  filter(rnd <= .2) |> 
  select(-rnd)

x_train <- dtrain |> select(-medv)
y_train <- dtrain$medv
x_test  <- dtest |> select(-medv)
y_test  <- dtest$medv

Random Forest

Why random forests improve on a single tree

  • A single tree is easy to interpret but can be unstable.
  • Random forests average across many trees.
  • Each tree is fit on a bootstrap sample.
  • At each split, only a random subset of predictors is considered.
  • This reduces variance and usually improves prediction.
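The variance-reduction argument can be seen in a toy bagging experiment. This is an illustrative sketch on simulated data using rpart (not the Boston model or ranger itself): each deep tree overfits on its own, but their average does better on held-out data.

```r
library(rpart)  # single regression trees; ships with standard R installs

set.seed(1)
train <- data.frame(x = runif(300, 0, 10))
train$y <- sin(train$x) + rnorm(300, sd = 0.4)
test <- data.frame(x = runif(200, 0, 10))
test$y <- sin(test$x) + rnorm(200, sd = 0.4)

# Grow 50 deep (high-variance) trees, each on its own bootstrap sample,
# and collect their test-set predictions column by column
preds <- sapply(1:50, function(b) {
  boot <- train[sample(nrow(train), replace = TRUE), ]
  tree <- rpart(y ~ x, data = boot, control = rpart.control(cp = 0.001))
  predict(tree, newdata = test)
})

ensemble_mse <- mean((test$y - rowMeans(preds))^2)   # averaged trees
avg_tree_mse <- mean(colMeans((test$y - preds)^2))   # typical single tree
ensemble_mse < avg_tree_mse  # TRUE: averaging reduces squared error
```

The inequality holds by convexity of squared error whenever the trees disagree; random forests add the per-split predictor subsampling on top of this to make the trees disagree even more.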

Important ranger() arguments

  • num.trees: number of trees in the forest
  • mtry: number of predictors considered at each split
  • min.node.size: minimum terminal node size
  • importance: method used to calculate variable importance
  • seed: for reproducibility

What does mtry do?

  • Smaller mtry means more randomness and more diverse trees.
  • Larger mtry means trees look more similar to one another.
  • For regression, a common default is about p / 3, although ranger() defaults to the rounded-down square root of p (which is why the fit below reports Mtry: 3 for 13 predictors).
  • If mtry = p, the method is essentially bagging, since every predictor is a split candidate.

Fit an initial random forest on Boston housing

set.seed(1917)

init_boston_rf <- ranger(
  medv ~ .,
  data = dtrain,
  num.trees = 300,
  importance = "impurity"
)

init_boston_rf
Ranger result

Call:
 ranger(medv ~ ., data = dtrain, num.trees = 300, importance = "impurity") 

Type:                             Regression 
Number of trees:                  300 
Sample size:                      392 
Number of independent variables:  13 
Mtry:                             3 
Target node size:                 5 
Variable importance mode:         impurity 
Splitrule:                        variance 
OOB prediction error (MSE):       11.83664 
R squared (OOB):                  0.8582419 

Interpreting the ranger output

  • num.trees tells us how many trees were grown.
  • mtry tells us how many predictors were considered at each split.
  • prediction.error is the out-of-bag error estimate.
  • Out-of-bag error is a built-in validation measure because each tree leaves some observations out of its bootstrap sample.
  • Lower out-of-bag error indicates better predictive performance.
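A quick base-R simulation shows where out-of-bag observations come from: a bootstrap sample of size n leaves roughly a third of the rows undrawn, so every tree gets a free validation set. The numbers below mirror the fit above, but the simulation itself is illustrative.

```r
set.seed(1)
n <- 392   # training-set size reported by ranger above
B <- 300   # bootstrap samples, matching num.trees

# For each bootstrap sample, what fraction of the n rows was never drawn?
oob_frac <- replicate(B, {
  drawn <- sample(n, n, replace = TRUE)
  1 - length(unique(drawn)) / n
})
mean(oob_frac)  # close to (1 - 1/n)^n, i.e. roughly 0.368
```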

Tune mtry

p <- ncol(x_train)
rf_grid <- tibble(mtry = seq(2, p, by = 2))

rf_results <- tibble(
  mtry = numeric(),
  oob_error = numeric()
)

for (m in rf_grid$mtry) {
  fit <- ranger(
    medv ~ .,
    data = dtrain,
    num.trees = 300,
    mtry = m,
    importance = "impurity"
  )

  rf_results <- bind_rows(
    rf_results,
    tibble(
      mtry = m,
      oob_error = fit$prediction.error
    )
  )
}

rf_results
# A tibble: 6 × 2
   mtry oob_error
  <dbl>     <dbl>
1     2     14.0 
2     4     10.6 
3     6      9.99
4     8     10.5 
5    10     10.0 
6    12     10.1 

Plot the tuning results

rf_results |>
  ggplot(aes(x = mtry, y = oob_error)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Random Forest Tuning Results",
    x = "mtry",
    y = "Out-of-bag prediction error"
  )

Final random forest

best_mtry <- rf_results |>
  slice_min(oob_error, n = 1) |>
  pull(mtry)

final_boston_rf <- ranger(
  medv ~ .,
  data = dtrain,
  num.trees = 500,
  mtry = best_mtry,
  importance = "impurity"
)

best_mtry
[1] 6
final_boston_rf
Ranger result

Call:
 ranger(medv ~ ., data = dtrain, num.trees = 500, mtry = best_mtry,      importance = "impurity") 

Type:                             Regression 
Number of trees:                  500 
Sample size:                      392 
Number of independent variables:  13 
Mtry:                             6 
Target node size:                 5 
Variable importance mode:         impurity 
Splitrule:                        variance 
OOB prediction error (MSE):       10.59967 
R squared (OOB):                  0.8730561 

Test performance of the random forest

rf_test_pred <- predict(final_boston_rf, data = dtest)$predictions

sqrt(mean((y_test - rf_test_pred)^2))
[1] 3.070572

Random forest variable importance

vip(final_boston_rf, geom = "point")

How to interpret vip() for a random forest

  • The plot shows which predictors contributed most across the ensemble of trees.
  • Higher importance means a variable was more useful in improving splits across the forest.
  • In forests, importance is often more stable than in a single tree because it aggregates over many trees.
  • Still, it remains a predictive ranking, not a causal statement.

Partial dependence from the forest

partial(final_boston_rf, pred.var = "lstat", train = x_train) |>
  autoplot() +
  labs(title = "Partial Dependence of MEDV on LSTAT")

Interpreting the random-forest PDP

  • The curve shows the model’s average predicted medv as lstat changes.
  • All other variables are averaged over their observed values.
  • Nonlinearity in the curve suggests that the marginal relationship is not well described by a straight line.
  • Flat regions suggest limited marginal predictive contribution in that range.
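What partial() computes can be reproduced by hand: fix lstat at each grid value for every row, predict, and average. A minimal sketch on simulated data, with lm() as a stand-in for the forest (all names here are hypothetical):

```r
set.seed(1)
# Simulated stand-in data: medv driven down by lstat, up by rm
df <- data.frame(lstat = runif(200, 2, 35), rm = rnorm(200, mean = 6))
df$medv <- 35 - 0.8 * df$lstat + 4 * (df$rm - 6) + rnorm(200, sd = 2)
fit <- lm(medv ~ lstat + rm, data = df)   # stand-in for the forest

grid <- seq(min(df$lstat), max(df$lstat), length.out = 25)
pdp_vals <- sapply(grid, function(v) {
  tmp <- df
  tmp$lstat <- v             # force lstat to v for every observation
  mean(predict(fit, tmp))    # average prediction over the other columns
})
# plotting pdp_vals against grid traces the curve partial() would draw
```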

Gradient-Boosted Trees

From bagging to boosting

  • Bagging / random forest builds many trees in parallel and averages them.
  • Boosting builds trees sequentially.
  • Each new tree focuses on the errors left by earlier trees.
  • Boosting can achieve stronger prediction, but tuning matters more.

Boosting intuition

  1. Start with an initial prediction.
  2. Compute residuals or gradients from the loss function.
  3. Fit a small tree to those residuals.
  4. Add that tree to the current ensemble.
  5. Repeat many times.
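For squared-error loss the negative gradient is just the residual, so the five steps above can be written out directly. A toy sketch on simulated data using rpart stumps (illustrative only, not the Boston model):

```r
library(rpart)

set.seed(1)
df <- data.frame(x = runif(400, 0, 10))
df$y <- sin(df$x) + rnorm(400, sd = 0.3)

eta  <- 0.1                        # learning rate
pred <- rep(mean(df$y), nrow(df))  # step 1: initial prediction
for (b in 1:100) {
  df$res <- df$y - pred            # step 2: residuals (negative gradients
                                   #         of squared-error loss)
  stump <- rpart(res ~ x, data = df,
                 control = rpart.control(maxdepth = 2, cp = 0))  # step 3
  pred <- pred + eta * predict(stump, df)  # step 4: add a damped correction
}                                          # step 5: repeat
sqrt(mean((df$y - pred)^2))  # training RMSE falls toward the noise level
```

XGBoost follows the same loop but computes gradients for arbitrary losses, adds regularization to each tree, and optimizes the split search heavily.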

Key XGBoost tuning arguments

  • nrounds: number of boosting iterations
  • eta: learning rate
  • max_depth: depth of each tree
  • gamma: minimum improvement required for a split
  • subsample: fraction of observations used for each boosting step
  • colsample_bytree: fraction of predictors used for each tree
  • min_child_weight: minimum total weight allowed in a child node

How to think about the main tuning parameters

  • Smaller eta means slower learning and usually requires larger nrounds.
  • Larger max_depth allows more complex interactions.
  • Larger gamma makes splitting more conservative.
  • Smaller subsample and colsample_bytree can reduce overfitting by injecting randomness.
  • nrounds and eta usually need to be tuned together.

Prepare matrix input for XGBoost

x_train_mat <- as.matrix(x_train)
x_test_mat  <- as.matrix(x_test)

Fit an initial XGBoost model

library(xgboost)

set.seed(1937)

xgb_fit_initial <- xgboost(
  data = x_train_mat,
  label = y_train,
  objective = "reg:squarederror",
  nrounds = 100,
  eta = 0.1,
  max_depth = 3,
  verbose = 0
)

xgb_fit_initial
##### xgb.Booster
raw: 114.2 Kb 
call:
  xgb.train(params = params, data = dtrain, nrounds = nrounds, 
    watchlist = watchlist, verbose = verbose, print_every_n = print_every_n, 
    early_stopping_rounds = early_stopping_rounds, maximize = maximize, 
    save_period = save_period, save_name = save_name, xgb_model = xgb_model, 
    callbacks = callbacks, objective = "reg:squarederror", eta = 0.1, 
    max_depth = 3)
params (as set within xgb.train):
  objective = "reg:squarederror", eta = "0.1", max_depth = "3", validate_parameters = "TRUE"
xgb.attributes:
  niter
callbacks:
  cb.evaluation.log()
# of features: 13 
niter: 100
nfeatures : 13 
evaluation_log:
  iter train_rmse
 <num>      <num>
     1  21.502189
     2  19.483237
   ---        ---
    99   1.481362
   100   1.475502

What does the initial model represent?

  • It is a sum of many shallow trees, not one large tree.
  • Each boosting round adds a small correction to the current prediction.
  • The printed summary is much less visually intuitive than a single tree.
  • Interpretation usually relies more on tools such as variable importance and partial dependence.

Simple tuning grid

xgb_grid <- tidyr::crossing(
  nrounds = seq(50, 250, by = 50),
  eta = c(0.025, 0.05, 0.1),
  max_depth = c(1, 2, 3, 4),
  gamma = c(0, 1),
  colsample_bytree = c(0.8, 1),
  min_child_weight = 1,
  subsample = c(0.8, 1)
)

xgb_grid
# A tibble: 480 × 7
   nrounds   eta max_depth gamma colsample_bytree min_child_weight subsample
     <dbl> <dbl>     <dbl> <dbl>            <dbl>            <dbl>     <dbl>
 1      50 0.025         1     0              0.8                1       0.8
 2      50 0.025         1     0              0.8                1       1  
 3      50 0.025         1     0              1                  1       0.8
 4      50 0.025         1     0              1                  1       1  
 5      50 0.025         1     1              0.8                1       0.8
 6      50 0.025         1     1              0.8                1       1  
 7      50 0.025         1     1              1                  1       0.8
 8      50 0.025         1     1              1                  1       1  
 9      50 0.025         2     0              0.8                1       0.8
10      50 0.025         2     0              0.8                1       1  
# ℹ 470 more rows

Cross-validated tuning

xgb_results_lst <- vector("list", nrow(xgb_grid))

for (i in seq_len(nrow(xgb_grid))) {
  nrounds <- xgb_grid$nrounds[i]
  eta <- xgb_grid$eta[i]
  max_depth <- xgb_grid$max_depth[i]
  gamma <- xgb_grid$gamma[i]
  colsample_bytree <- xgb_grid$colsample_bytree[i]
  min_child_weight <- xgb_grid$min_child_weight[i]
  subsample <- xgb_grid$subsample[i]

  cv_fit <- xgb.cv(
    data = x_train_mat,
    label = y_train,
    nrounds = nrounds,
    nfold = 5,
    objective = "reg:squarederror",
    eta = eta,
    max_depth = max_depth,
    gamma = gamma,
    colsample_bytree = colsample_bytree,
    min_child_weight = min_child_weight,
    subsample = subsample,
    verbose = 0
  )

  xgb_results_lst[[i]] <- tibble(
    nrounds = nrounds,
    eta = eta,
    max_depth = max_depth,
    gamma = gamma,
    colsample_bytree = colsample_bytree,
    min_child_weight = min_child_weight,
    subsample = subsample,
    cv_rmse = min(cv_fit$evaluation_log$test_rmse_mean)
  )
}

xgb_results <- bind_rows(xgb_results_lst)

xgb_results |>
  arrange(cv_rmse) |>
  slice_head(n = 10)
# A tibble: 10 × 8
   nrounds   eta max_depth gamma colsample_bytree min_child_weight subsample
     <dbl> <dbl>     <dbl> <dbl>            <dbl>            <dbl>     <dbl>
 1     100 0.1           4     0              1                  1       1  
 2     250 0.05          4     0              0.8                1       0.8
 3     200 0.05          4     0              1                  1       0.8
 4     250 0.025         4     0              0.8                1       0.8
 5     250 0.05          4     1              0.8                1       0.8
 6     200 0.05          3     1              0.8                1       0.8
 7     200 0.1           4     1              1                  1       0.8
 8     200 0.1           3     0              1                  1       0.8
 9     150 0.1           4     0              0.8                1       0.8
10     200 0.1           3     1              0.8                1       1  
# ℹ 1 more variable: cv_rmse <dbl>

Fit the final XGBoost model

# best_xgb <- xgb_results[order(xgb_results$cv_rmse), ][1, ]

best_xgb <- xgb_results |>
  slice_min(cv_rmse, n = 1)

best_xgb
# A tibble: 1 × 8
  nrounds   eta max_depth gamma colsample_bytree min_child_weight subsample
    <dbl> <dbl>     <dbl> <dbl>            <dbl>            <dbl>     <dbl>
1     100   0.1         4     0                1                1         1
# ℹ 1 more variable: cv_rmse <dbl>
xgb_fit_final <- xgboost(
  data = x_train_mat,
  label = y_train,
  objective = "reg:squarederror",
  nrounds = best_xgb$nrounds,
  eta = best_xgb$eta,
  max_depth = best_xgb$max_depth,
  gamma = best_xgb$gamma,
  colsample_bytree = best_xgb$colsample_bytree,
  min_child_weight = best_xgb$min_child_weight,
  subsample = best_xgb$subsample,
  verbose = 0
)

Test performance of XGBoost

xgb_test_pred <- predict(xgb_fit_final, newdata = x_test_mat)

sqrt(mean((y_test - xgb_test_pred)^2))
[1] 2.868832

XGBoost variable importance

vip(xgb_fit_final) +
  labs(title = "XGBoost Variable Importance")

How to interpret vip() for XGBoost

  • XGBoost importance is based on how much each variable contributes across many boosting rounds.
  • Variables near the top are most useful for reducing the loss function.
  • Importance does not reveal the direction of the effect.
  • A variable can be important because it appears in many small improvements, not only because it appears in one dominant split.

Variable Importance: A Free Byproduct of Ensembles

  • Both random forest and gradient boosting measure variable importance automatically

  • The key question: how much does each predictor contribute to reducing prediction error?

  • Random forest: permutation importance

    • For each variable, randomly shuffle its values and measure the drop in OOB accuracy
    • A large drop → the variable was doing real work; a small drop → it barely mattered
    • (The ranger() fits above requested importance = "impurity" instead; set importance = "permutation" to use this variant.)
  • Gradient boosting: gain-based importance

    • Tracks how much each split on a variable reduced the loss function, summed across all trees
    • Variables used in high-gain splits rank highly
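The shuffle-and-measure recipe takes only a few lines. A hypothetical sketch with lm() standing in for the ensemble; the same loop works for any fitted model with a predict() method:

```r
set.seed(1)
df <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
df$y <- 3 * df$x1 + 0.1 * df$x2 + rnorm(300)
fit <- lm(y ~ x1 + x2, data = df)   # stand-in for the fitted ensemble

rmse <- function(pred) sqrt(mean((df$y - pred)^2))
base_rmse <- rmse(predict(fit, df))

perm_imp <- sapply(c("x1", "x2"), function(v) {
  tmp <- df
  tmp[[v]] <- sample(tmp[[v]])      # shuffle v, breaking its link to y
  rmse(predict(fit, tmp)) - base_rmse
})
perm_imp  # x1's shuffle raises the error far more than x2's
```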

Why This Matters

Unlike linear regression, tree ensembles can capture non-linear effects and interactions. Variable importance tells you which variables the model relies on most, even when the relationship between predictor and outcome is complex.

Partial dependence for XGBoost

partial(
  xgb_fit_final,
  pred.var = "lstat",
  train = x_train_mat,
  plot = TRUE,
  plot.engine = "ggplot2",
  type = "regression"
) +
  labs(title = "Partial Dependence of MEDV on LSTAT")

Interpreting the XGBoost PDP

  • The PDP summarizes the average fitted relationship between lstat and predicted housing values.
  • Because XGBoost is highly flexible, the curve can show richer nonlinearities than a single tree.
  • Sharp changes in the curve can indicate thresholds learned by the boosted ensemble.
  • But again, this is an interpretation of the model, not a causal estimate.