library(tidyverse)   # read_csv(), dplyr verbs, ggplot2
library(rmarkdown)   # paged_table()
library(ranger)      # random forests
library(vip)         # variable importance plots
library(pdp)         # partial dependence plots

boston <- read_csv("https://bcdanl.github.io/data/boston.csv") |>
janitor::clean_names()
paged_table(boston)
Random Forest and Gradient-Boosted Trees
Boston Housing Data
Modeling goal
- Predict medv, the median housing value in thousands of dollars
- Boston housing is a classic regression example with multiple socioeconomic and neighborhood predictors
- We will use it to study overfitting, pruning, random forests, and boosting
Create training and test data
set.seed(42120532)
boston <- boston |>
mutate(rnd = runif(n()))
dtrain <- boston |>
filter(rnd > .2) |>
select(-rnd)
dtest <- boston |>
filter(rnd <= .2) |>
select(-rnd)
x_train <- dtrain |> select(-medv)
y_train <- dtrain$medv
x_test <- dtest |> select(-medv)
y_test <- dtest$medv
Random Forest
Why random forests improve on a single tree
- A single tree is easy to interpret but can be unstable.
- Random forests average across many trees.
- Each tree is fit on a bootstrap sample.
- At each split, only a random subset of predictors is considered.
- This reduces variance and usually improves prediction.
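A toy base-R sketch of the variance-reduction idea. This ignores the correlation between real bootstrap trees (which limits the reduction in practice), so the numbers are illustrative only:

```r
set.seed(1)
truth <- 5
# one noisy "tree": a single prediction with error sd = 2
one_tree <- replicate(1000, truth + rnorm(1, sd = 2))
# a "forest": the average of 300 independent noisy predictions
forest <- replicate(1000, mean(truth + rnorm(300, sd = 2)))
var(one_tree)  # close to 4
var(forest)    # far smaller: averaging shrinks variance
```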
Important ranger() arguments
- num.trees: number of trees in the forest
- mtry: number of predictors considered at each split
- min.node.size: minimum terminal node size
- importance: method used to calculate variable importance
- seed: for reproducibility
What does mtry do?
- Smaller mtry means more randomness and more diverse trees.
- Larger mtry means trees look more similar to one another.
- For regression, a common default is about p / 3.
- If mtry = p, the method becomes close to bagging.
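For concreteness, with the p = 13 Boston predictors the two reference points above work out as follows:

```r
p <- 13        # number of predictors in the Boston training data
floor(p / 3)   # the "about p / 3" regression heuristic: 4
p              # mtry = p: every split sees all predictors, i.e. bagging
```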
Fit an initial random forest on Boston housing
set.seed(1917)
init_boston_rf <- ranger(
medv ~ .,
data = dtrain,
num.trees = 300,
importance = "impurity"
)
init_boston_rf
Ranger result
Call:
ranger(medv ~ ., data = dtrain, num.trees = 300, importance = "impurity")
Type: Regression
Number of trees: 300
Sample size: 392
Number of independent variables: 13
Mtry: 3
Target node size: 5
Variable importance mode: impurity
Splitrule: variance
OOB prediction error (MSE): 11.83664
R squared (OOB): 0.8582419
Interpreting the ranger output
- num.trees tells us how many trees were grown.
- mtry tells us how many predictors were considered at each split.
- prediction.error is the out-of-bag error estimate.
- Out-of-bag error is a built-in validation measure because each tree leaves some observations out of its bootstrap sample.
- Lower out-of-bag error indicates better predictive performance.
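The out-of-bag mechanism can be checked directly in base R: each bootstrap sample leaves out roughly a third of the observations, which is what gives every tree a built-in validation set (n = 392 matches the training sample size above):

```r
set.seed(2)
n <- 392
boot <- sample(n, replace = TRUE)   # one bootstrap sample of row indices
1 - length(unique(boot)) / n        # share of rows left out of this sample
(1 - 1 / n)^n                       # expected share; tends to exp(-1), about 0.368
```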
Tune mtry
p <- ncol(x_train)
rf_grid <- tibble(mtry = seq(2, p, by = 2))
rf_results <- tibble(
mtry = numeric(),
oob_error = numeric()
)
for (m in rf_grid$mtry) {
fit <- ranger(
medv ~ .,
data = dtrain,
num.trees = 300,
mtry = m,
importance = "impurity"
)
rf_results <- bind_rows(
rf_results,
tibble(
mtry = m,
oob_error = fit$prediction.error
)
)
}
rf_results
# A tibble: 6 × 2
mtry oob_error
<dbl> <dbl>
1 2 14.0
2 4 10.6
3 6 9.99
4 8 10.5
5 10 10.0
6 12 10.1
Plot the tuning results
rf_results |>
ggplot(aes(x = mtry, y = oob_error)) +
geom_line() +
geom_point() +
labs(
title = "Random Forest Tuning Results",
x = "mtry",
y = "Out-of-bag prediction error"
)
Final random forest
best_mtry <- rf_results |>
slice_min(oob_error, n = 1) |>
pull(mtry)
final_boston_rf <- ranger(
medv ~ .,
data = dtrain,
num.trees = 500,
mtry = best_mtry,
importance = "impurity"
)
best_mtry
[1] 6
final_boston_rf
Ranger result
Call:
ranger(medv ~ ., data = dtrain, num.trees = 500, mtry = best_mtry, importance = "impurity")
Type: Regression
Number of trees: 500
Sample size: 392
Number of independent variables: 13
Mtry: 6
Target node size: 5
Variable importance mode: impurity
Splitrule: variance
OOB prediction error (MSE): 10.59967
R squared (OOB): 0.8730561
Test performance of the random forest
rf_test_pred <- predict(final_boston_rf, data = dtest)$predictions
sqrt(mean((y_test - rf_test_pred)^2))
[1] 3.070572
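Since the same RMSE formula recurs below for XGBoost, a small helper keeps it in one place (the name rmse is ours, not from a package):

```r
# root mean squared error of predictions against actual values
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse(c(3, 5), c(4, 4))  # 1
```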
Random forest variable importance
vip(final_boston_rf, geom = "point")
How to interpret vip() for a random forest
- The plot shows which predictors contributed most across the ensemble of trees.
- Higher importance means a variable was more useful in improving splits across the forest.
- In forests, importance is often more stable than in a single tree because it aggregates over many trees.
- Still, it remains a predictive ranking, not a causal statement.
Partial dependence from the forest
partial(final_boston_rf, pred.var = "lstat", train = x_train) |>
autoplot() +
labs(title = "Partial Dependence of MEDV on LSTAT")
Interpreting the random-forest PDP
- The curve shows the model's average predicted medv as lstat changes.
- All other variables are averaged over their observed values.
- Nonlinearity in the curve suggests that the marginal relationship is not well described by a straight line.
- Flat regions suggest limited marginal predictive contribution in that range.
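What partial() is doing can be sketched by hand on a toy model. An lm stands in for the forest here, the function name pdp_point is illustrative, and pdp::partial adds grid construction and plotting on top of this averaging step:

```r
set.seed(5)
d <- data.frame(a = runif(200), b = runif(200))
d$y <- d$a^2 + d$b + rnorm(200, sd = 0.1)
fit <- lm(y ~ poly(a, 2) + b, data = d)

# PDP value at one point: fix the focal variable for every row,
# then average the predictions over the observed covariates
pdp_point <- function(value) {
  d2 <- d
  d2$a <- value
  mean(predict(fit, newdata = d2))
}
profile <- sapply(seq(0, 1, by = 0.1), pdp_point)
profile  # rises with a, curving upward like the true a^2 effect
```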
Gradient-Boosted Trees
From bagging to boosting
- Bagging / random forest builds many trees in parallel and averages them.
- Boosting builds trees sequentially.
- Each new tree focuses on the errors left by earlier trees.
- Boosting can achieve stronger prediction, but tuning matters more.
Boosting intuition
- Start with an initial prediction.
- Compute residuals or gradients from the loss function.
- Fit a small tree to those residuals.
- Add that tree to the current ensemble.
- Repeat many times.
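The five steps above can be sketched in base R with one-split "stumps" as the small trees. This is illustrative only: real boosting libraries grow deeper trees and add regularization, and fit_stump is a name we made up:

```r
set.seed(7)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)

# one-split "stump": pick the cut point that best fits the current residuals
fit_stump <- function(x, r) {
  best <- NULL
  best_sse <- Inf
  for (cut in quantile(x, seq(0.1, 0.9, 0.1))) {
    left  <- mean(r[x <= cut])
    right <- mean(r[x >  cut])
    sse <- sum((r - ifelse(x <= cut, left, right))^2)
    if (sse < best_sse) {
      best_sse <- sse
      best <- list(cut = cut, left = left, right = right)
    }
  }
  best
}

baseline_mse <- mean((y - mean(y))^2)
pred <- rep(mean(y), length(y))   # step 1: initial prediction
eta  <- 0.1                      # learning rate
for (b in 1:100) {
  r <- y - pred                  # step 2: residuals (negative gradient for squared error)
  s <- fit_stump(x, r)           # step 3: fit a small tree to the residuals
  pred <- pred + eta * ifelse(x <= s$cut, s$left, s$right)  # step 4: add it
}                                # step 5: repeat many times
final_mse <- mean((y - pred)^2)  # shrinks well below the baseline
```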
Key XGBoost tuning arguments
- nrounds: number of boosting iterations
- eta: learning rate
- max_depth: depth of each tree
- gamma: minimum improvement required for a split
- subsample: fraction of observations used for each boosting step
- colsample_bytree: fraction of predictors used for each tree
- min_child_weight: minimum total weight allowed in a child node
How to think about the main tuning parameters
- Smaller eta means slower learning and usually requires larger nrounds.
- Larger max_depth allows more complex interactions.
- Larger gamma makes splitting more conservative.
- Smaller subsample and colsample_bytree can reduce overfitting by injecting randomness.
- nrounds and eta usually need to be tuned together.
Prepare matrix input for XGBoost
x_train_mat <- as.matrix(x_train)
x_test_mat <- as.matrix(x_test)
Fit an initial XGBoost model
library(xgboost)
set.seed(1937)
xgb_fit_initial <- xgboost(
data = x_train_mat,
label = y_train,
objective = "reg:squarederror",
nrounds = 100,
eta = 0.1,
max_depth = 3,
verbose = 0
)
xgb_fit_initial
##### xgb.Booster
raw: 114.2 Kb
call:
xgb.train(params = params, data = dtrain, nrounds = nrounds,
watchlist = watchlist, verbose = verbose, print_every_n = print_every_n,
early_stopping_rounds = early_stopping_rounds, maximize = maximize,
save_period = save_period, save_name = save_name, xgb_model = xgb_model,
callbacks = callbacks, objective = "reg:squarederror", eta = 0.1,
max_depth = 3)
params (as set within xgb.train):
objective = "reg:squarederror", eta = "0.1", max_depth = "3", validate_parameters = "TRUE"
xgb.attributes:
niter
callbacks:
cb.evaluation.log()
# of features: 13
niter: 100
nfeatures : 13
evaluation_log:
iter train_rmse
<num> <num>
1 21.502189
2 19.483237
--- ---
99 1.481362
100 1.475502
What does the initial model represent?
- It is a sum of many shallow trees, not one large tree.
- Each boosting round adds a small correction to the current prediction.
- The printed summary is much less visually intuitive than a single tree.
- Interpretation usually relies more on tools such as variable importance and partial dependence.
Simple tuning grid
xgb_grid <- tidyr::crossing(
nrounds = seq(50, 250, by = 50),
eta = c(0.025, 0.05, 0.1),
max_depth = c(1, 2, 3, 4),
gamma = c(0, 1),
colsample_bytree = c(0.8, 1),
min_child_weight = 1,
subsample = c(0.8, 1)
)
xgb_grid
# A tibble: 480 × 7
nrounds eta max_depth gamma colsample_bytree min_child_weight subsample
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 50 0.025 1 0 0.8 1 0.8
2 50 0.025 1 0 0.8 1 1
3 50 0.025 1 0 1 1 0.8
4 50 0.025 1 0 1 1 1
5 50 0.025 1 1 0.8 1 0.8
6 50 0.025 1 1 0.8 1 1
7 50 0.025 1 1 1 1 0.8
8 50 0.025 1 1 1 1 1
9 50 0.025 2 0 0.8 1 0.8
10 50 0.025 2 0 0.8 1 1
# ℹ 470 more rows
Cross-validated tuning
xgb_results_lst <- vector("list", nrow(xgb_grid))
for (i in seq_len(nrow(xgb_grid))) {
nrounds <- xgb_grid$nrounds[i]
eta <- xgb_grid$eta[i]
max_depth <- xgb_grid$max_depth[i]
gamma <- xgb_grid$gamma[i]
colsample_bytree <- xgb_grid$colsample_bytree[i]
min_child_weight <- xgb_grid$min_child_weight[i]
subsample <- xgb_grid$subsample[i]
cv_fit <- xgb.cv(
data = x_train_mat,
label = y_train,
nrounds = nrounds,
nfold = 5,
objective = "reg:squarederror",
eta = eta,
max_depth = max_depth,
gamma = gamma,
colsample_bytree = colsample_bytree,
min_child_weight = min_child_weight,
subsample = subsample,
verbose = 0
)
xgb_results_lst[[i]] <- tibble(
nrounds = nrounds,
eta = eta,
max_depth = max_depth,
gamma = gamma,
colsample_bytree = colsample_bytree,
min_child_weight = min_child_weight,
subsample = subsample,
cv_rmse = min(cv_fit$evaluation_log$test_rmse_mean)
)
}
xgb_results <- bind_rows(xgb_results_lst)
xgb_results |>
arrange(cv_rmse) |>
slice_head(n = 10)
# A tibble: 10 × 8
nrounds eta max_depth gamma colsample_bytree min_child_weight subsample
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 100 0.1 4 0 1 1 1
2 250 0.05 4 0 0.8 1 0.8
3 200 0.05 4 0 1 1 0.8
4 250 0.025 4 0 0.8 1 0.8
5 250 0.05 4 1 0.8 1 0.8
6 200 0.05 3 1 0.8 1 0.8
7 200 0.1 4 1 1 1 0.8
8 200 0.1 3 0 1 1 0.8
9 150 0.1 4 0 0.8 1 0.8
10 200 0.1 3 1 0.8 1 1
# ℹ 1 more variable: cv_rmse <dbl>
Fit the final XGBoost model
# best_xgb <- xgb_results[order(xgb_results$cv_rmse), ][1, ]
best_xgb <- xgb_results |>
slice_min(cv_rmse, n = 1)
best_xgb
# A tibble: 1 × 8
nrounds eta max_depth gamma colsample_bytree min_child_weight subsample
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 100 0.1 4 0 1 1 1
# ℹ 1 more variable: cv_rmse <dbl>
xgb_fit_final <- xgboost(
data = x_train_mat,
label = y_train,
objective = "reg:squarederror",
nrounds = best_xgb$nrounds,
eta = best_xgb$eta,
max_depth = best_xgb$max_depth,
gamma = best_xgb$gamma,
colsample_bytree = best_xgb$colsample_bytree,
min_child_weight = best_xgb$min_child_weight,
subsample = best_xgb$subsample,
verbose = 0
)
Test performance of XGBoost
xgb_test_pred <- predict(xgb_fit_final, newdata = x_test_mat)
sqrt(mean((y_test - xgb_test_pred)^2))
[1] 2.868832
XGBoost variable importance
vip(xgb_fit_final) +
labs(title = "XGBoost Variable Importance")
How to interpret vip() for XGBoost
- XGBoost importance is based on how much each variable contributes across many boosting rounds.
- Variables near the top are most useful for reducing the loss function.
- Importance does not reveal the direction of the effect.
- A variable can be important because it appears in many small improvements, not only because it appears in one dominant split.
Variable Importance: A Free Byproduct of Ensembles
Both random forest and gradient boosting measure variable importance automatically
The key question: how much does each predictor contribute to reducing prediction error?
Random forest → permutation importance:
- For each variable, randomly shuffle its values and measure the drop in OOB accuracy
- A large drop → the variable was doing real work; a small drop → it barely mattered
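Permutation importance is easy to replicate by hand. This toy uses an lm instead of a forest and in-sample MSE instead of OOB accuracy, but the shuffle-and-compare logic is the same (perm_imp is our name, not a package function):

```r
set.seed(3)
d <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
d$y <- 2 * d$x1 + 0.1 * d$x2 + rnorm(300, sd = 0.5)
fit <- lm(y ~ x1 + x2, data = d)
base_mse <- mean((d$y - predict(fit, newdata = d))^2)

# shuffle one predictor and measure how much the error rises
perm_imp <- function(var) {
  d2 <- d
  d2[[var]] <- sample(d2[[var]])   # break the link between var and y
  mean((d$y - predict(fit, newdata = d2))^2) - base_mse
}
perm_imp("x1")  # large jump in error: x1 was doing real work
perm_imp("x2")  # near zero: x2 barely mattered
```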
Gradient boosting → gain-based importance:
- Tracks how much each split on a variable reduced the loss function, summed across all trees
- Variables used in high-gain splits rank highly
Why This Matters
Unlike linear regression, tree ensembles can capture non-linear effects and interactions. Variable importance tells you which variables the model relies on most, even when the relationship between predictor and outcome is complex.
Partial dependence for XGBoost
partial(
xgb_fit_final,
pred.var = "lstat",
train = x_train_mat,
plot = TRUE,
plot.engine = "ggplot2",
type = "regression"
) +
labs(title = "Partial Dependence of MEDV on LSTAT")
Interpreting the XGBoost PDP
- The PDP summarizes the average fitted relationship between lstat and predicted housing values.
- Because XGBoost is highly flexible, the curve can show richer nonlinearities than a single tree.
- Sharp changes in the curve can indicate thresholds learned by the boosted ensemble.
- But again, this is an interpretation of the model, not a causal estimate.