Decision Trees

NBC Show Data

Author

Byeong-Hak Choe

Published

March 23, 2026

Modified

April 8, 2026

Setup for Decision Trees

library(tidyverse)
library(janitor)
library(ggthemes)
library(rmarkdown)

library(rpart)
library(rpart.plot)
library(vip)
library(pdp)

theme_set(
  theme_bw() +
    theme(
      legend.position = "bottom",
      strip.background = element_rect(fill = "lightgray"),
      axis.title.x = element_text(size = rel(1.1)),
      axis.title.y = element_text(size = rel(1.1))
    )
)

scale_colour_discrete <- function(...) scale_color_colorblind(...)
scale_fill_discrete <- function(...) scale_fill_colorblind(...)

NBC Show Data

nbc <- read_csv("https://bcdanl.github.io/data/nbc_show.csv") |> 
  janitor::clean_names()  # convert column names to lowercase snake_case;
                          # spaces in column names are replaced by _

paged_table(nbc)

GRP: Gross Ratings Points, an estimate of total viewership or broadcast marketability
PE: Projected Engagement, based on viewer recall of order and detail after watching the show

nbc_demog <- read_csv("https://bcdanl.github.io/data/nbc_demog.csv") |> 
  janitor::clean_names()

paged_table(nbc_demog)

Visualize GRP and PE by genre

nbc |>
  ggplot(aes(x = grp, y = pe, color = genre)) +
  geom_point(size = 2, alpha = 0.8) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "GRP and Projected Engagement by Genre",
    x = "Gross Ratings Points (GRP)",
    y = "Projected Engagement (PE)",
    color = "Genre"
  )

Regression Tree

Modeling goal

Predict pe using grp and genre
This is a regression tree because pe is numeric
The tree will allow nonlinear relationships and interactions between audience size and show genre

Fit the initial regression tree using `rpart()`

init_nbc_tree <- rpart(
  pe ~ grp + genre,
  data = nbc,
  method = "anova"
)

Key arguments in `rpart()`

formula: specifies the outcome and predictors
data: the data frame used for estimation
method = "anova": use a regression tree for a numeric outcome
method = "class": use a classification tree for a categorical outcome
control = rpart.control(...): determines how aggressively the tree is allowed to grow
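As a small illustration of how `method` relates to the outcome type, the sketch below fits two trees and reads back the method stored in each fitted object. It uses R's built-in `mtcars` data as a stand-in (not the NBC data), and the `am_label` column is created here only for the example. When `method` is omitted, `rpart()` guesses it from the outcome, so spelling it out mainly guards against surprises.

```r
library(rpart)

# Stand-in data: the built-in mtcars data set, not the NBC data
cars <- mtcars
cars$am_label <- factor(cars$am, labels = c("automatic", "manual"))

# Numeric outcome: rpart() falls back to a regression tree ("anova")
reg_fit <- rpart(mpg ~ wt + hp, data = cars)

# Factor outcome: rpart() falls back to a classification tree ("class")
cls_fit <- rpart(am_label ~ wt + hp, data = cars)

reg_fit$method
cls_fit$method
```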

How to interpret the `rpart` output for a regression tree

init_nbc_tree

n= 40 

node), split, n, deviance, yval
      * denotes terminal node

 1) root 40 5646.4560 72.68308  
   2) grp< 223.05 7 1512.7210 56.63661 *
   3) grp>=223.05 33 1948.9810 76.08687  
     6) genre=Reality 11  823.4392 70.56522 *
     7) genre=Drama/Adventure,Situation Comedy 22  622.4796 78.84770  
      14) grp< 1545.15 15  421.4750 77.61867 *
      15) grp>=1545.15 7  129.7948 81.48133 *

node) = node number in the tree
- The node numbers follow a binary tree indexing convention — the same system used in binary heaps. For any node numbered \(n\):
  - Its left child is \(2n\)
  - Its right child is \(2n + 1\)
split = the rule used to divide the data at that node
n = number of observations in that node
deviance = for a regression tree, the sum of squared differences between each observation's outcome and the node mean (the within-node sum of squared errors)
yval = the predicted value at that node, which is the mean outcome for observations in the node
* denotes a terminal node, also called a leaf node
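The node numbering and the deviance column can be checked with a little arithmetic. The sketch below defines three hypothetical helpers (`children()`, `parent()`, `depth()`; they are not part of `rpart`) and then uses the deviance values copied from the printed output to compute how much of the root's squared error the first split removes.

```r
# Hypothetical helpers for the binary-heap node numbering in rpart's printout
children <- function(node) c(left = 2 * node, right = 2 * node + 1)
parent   <- function(node) node %/% 2

# Depth = number of halvings needed to reach the root (node 1, depth 0)
depth <- function(node) {
  d <- 0
  while (node > 1) {
    node <- node %/% 2
    d <- d + 1
  }
  d
}

children(7)   # nodes 14 and 15 in the printed tree
parent(15)    # node 7
depth(15)     # 3 levels below the root

# Deviance values copied from the printed init_nbc_tree output
dev_root  <- 5646.4560
dev_left  <- 1512.7210   # node 2: grp < 223.05
dev_right <- 1948.9810   # node 3: grp >= 223.05

# Share of the root's squared error removed by the first split
rel_improvement <- (dev_root - dev_left - dev_right) / dev_root
rel_improvement
```

The first split removes roughly 39% of the root deviance, far above the default `cp = 0.01` threshold, which is consistent with the split being kept in the initial tree.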

How to read the plotted tree

rpart.plot(init_nbc_tree)

Start at the root node at the top.
Read the split rule.
Observations satisfying the rule go left, and the rest go right.
Continue until you reach a leaf.
The two numbers in each node are:
- the predicted value (\(\widehat{y}\)) for that node
- the percentage of observations in that node
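Reading the tree from root to leaf amounts to a chain of if/else rules. As a rough sketch, the function below hand-codes the four leaves of `init_nbc_tree` using the split points and leaf means copied from its printed output. `predict_pe()` is a hypothetical helper written for this illustration; in practice `predict()` on the fitted tree does this work.

```r
# Hand-coded version of the printed init_nbc_tree rules
predict_pe <- function(grp, genre) {
  if (grp < 223.05) {
    56.63661                      # node 2: low-GRP shows
  } else if (genre == "Reality") {
    70.56522                      # node 6: Reality with grp >= 223.05
  } else if (grp < 1545.15) {
    77.61867                      # node 14: Drama/Adventure or Situation Comedy
  } else {
    81.48133                      # node 15: same genres, very high GRP
  }
}

predict_pe(grp = 100,  genre = "Reality")
predict_pe(grp = 500,  genre = "Reality")
predict_pe(grp = 900,  genre = "Situation Comedy")
predict_pe(grp = 2000, genre = "Drama/Adventure")
```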

`plotcp()` for an `rpart` tree

plotcp(init_nbc_tree)

plotcp() visualizes the cost-complexity pruning results stored in the fitted rpart object.
The bottom horizontal axis shows the complexity parameter cp, and the top axis shows the corresponding size of the tree.
The vertical axis shows the cross-validated error (xerror in the cp table), measured relative to the error of the root-only tree.
Lower values of xerror indicate better estimated out-of-sample performance.
Each point represents a candidate pruning level, indexed by the complexity parameter cp.
The vertical bars show uncertainty around the cross-validated error using xstd, the estimated standard error.
A very large tree may fit the training data well but still have higher cross-validated error.
A smaller pruned tree is often preferred if it achieves similar or lower cross-validated error.
By default, the size of the tree in plotcp() is the number of terminal nodes (leaves), which equals the number of splits plus one.
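The numbers behind `plotcp()` live in the `cptable` component of the fitted object. The sketch below prints that table for a deliberately large tree grown on the built-in `mtcars` data (a stand-in, not the NBC data) and adds a `leaves` column, which is an illustrative name for the number of splits plus one.

```r
library(rpart)

set.seed(1)  # cross-validation folds are random, so fix the seed

# Stand-in data: built-in mtcars, not the NBC data
fit <- rpart(
  mpg ~ wt + hp + disp,
  data = mtcars,
  method = "anova",
  control = rpart.control(cp = 0, minsplit = 2, xval = 10)
)

# One row per candidate pruning level
cp_tab <- as.data.frame(fit$cptable)
cp_tab$leaves <- cp_tab$nsplit + 1   # number of terminal nodes
cp_tab
```

Each row is one point in the plot: `xerror` gives its height and `xstd` the length of its error bar.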

Why it matters

plotcp() helps us compare model complexity against estimated out-of-sample performance.
It provides a visual guide for deciding whether the full tree is too complex.
It is one of the main tools for deciding how much to prune an rpart tree.
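One common way to turn the plot into a pruning decision is the one-standard-error rule: pick the smallest tree whose cross-validated error is within one `xstd` of the minimum. The text above does not prescribe this rule, so treat the sketch below as one reasonable convention. It uses the built-in `mtcars` data as a stand-in, and `count_leaves()` is a hypothetical helper.

```r
library(rpart)

set.seed(1)  # make the cross-validation folds reproducible

# Stand-in data: built-in mtcars, not the NBC data
full_fit <- rpart(
  mpg ~ wt + hp + disp,
  data = mtcars,
  method = "anova",
  control = rpart.control(cp = 0, minsplit = 2, xval = 10)
)

cp_tab <- full_fit$cptable

# Row with the lowest cross-validated error, and the 1-SE threshold
best      <- which.min(cp_tab[, "xerror"])
threshold <- cp_tab[best, "xerror"] + cp_tab[best, "xstd"]

# Smallest tree (earliest row) whose xerror is at or below the threshold
row_1se <- min(which(cp_tab[, "xerror"] <= threshold))
cp_1se  <- cp_tab[row_1se, "CP"]

pruned_fit <- prune(full_fit, cp = cp_1se)

# Hypothetical helper: count terminal nodes in a fitted tree
count_leaves <- function(tree) sum(tree$frame$var == "<leaf>")
c(full = count_leaves(full_fit), pruned = count_leaves(pruned_fit))
```

Using `which.min()` alone would select the lowest-error tree instead; the 1-SE variant trades a little estimated accuracy for a simpler tree.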

Grow a larger NBC regression tree (for illustration)

full_nbc_tree <- rpart(
  pe ~ grp + genre,
  data = nbc,
  method = "anova",
  control = rpart.control(cp = 0, xval = 10, minsplit = 2)
)

| Parameter | Default | Description |
|---|---|---|
| `cp` | `0.01` | Complexity parameter used as an early stopping rule when growing the tree. A split is kept only if it reduces the overall lack of fit, measured relative to the root node's error, by at least a factor of `cp`. Setting `cp = 0` removes this stopping threshold, so the tree can grow as large as allowed by other controls such as `maxdepth`, `minsplit`, and `minbucket`. |
| `minsplit` | `20` | Minimum number of observations in a node required to attempt a split. Setting `minsplit = 2` allows splits even when only 2 observations are present. |
| `minbucket` | `round(minsplit/3)` | Minimum number of observations allowed in any terminal (leaf) node. Setting `minbucket = 1` permits leaves with a single observation, producing the most complex tree. |
| `maxdepth` | `30` | Maximum depth of any node in the final tree. The root node counts as depth 0. |
| `xval` | `10` | Number of cross-validation folds used to estimate the cross-validated error (`xerror`) and its standard error (`xstd`) in the cost-complexity pruning table (`cptable`). |

Important `rpart.control()` options

cp: complexity parameter. Larger values make splitting harder.
- cp = 0: no complexity threshold, so the tree grows until the other controls stop it (the full tree)
minsplit: minimum number of observations required before a split is attempted.
minbucket: minimum number of observations allowed in a leaf.
maxdepth: maximum depth of the tree (the root node counts as depth 0).
xval: number of cross-validation folds used to construct the cp table.
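To get a feel for these options, the sketch below checks the `minbucket` defaults returned by `rpart.control()` and compares the number of leaves under three control settings. It uses the built-in `mtcars` data as a stand-in (not the NBC data), and `count_leaves()` is a hypothetical helper.

```r
library(rpart)

# minbucket defaults to round(minsplit / 3)
rpart.control()$minbucket              # minsplit = 20
rpart.control(minsplit = 2)$minbucket  # minsplit = 2

# Hypothetical helper: count terminal nodes in a fitted tree
count_leaves <- function(tree) sum(tree$frame$var == "<leaf>")

# Stand-in data: built-in mtcars, not the NBC data
default_fit <- rpart(mpg ~ wt + hp + disp, data = mtcars, method = "anova")

# maxdepth = 1 allows at most one split below the root
stump_fit <- rpart(mpg ~ wt + hp + disp, data = mtcars, method = "anova",
                   control = rpart.control(maxdepth = 1))

# cp = 0 and minsplit = 2 let the tree grow almost without restriction
full_fit <- rpart(mpg ~ wt + hp + disp, data = mtcars, method = "anova",
                  control = rpart.control(cp = 0, minsplit = 2))

c(stump   = count_leaves(stump_fit),
  default = count_leaves(default_fit),
  full    = count_leaves(full_fit))
```

Because `full_nbc_tree` sets `minsplit = 2` without setting `minbucket`, its leaves may contain a single observation, which matches the many one-observation leaves in its printout.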

full_nbc_tree

n= 40 

node), split, n, deviance, yval
      * denotes terminal node

   1) root 40 5.646456e+03 72.68308  
     2) grp< 223.05 7 1.512721e+03 56.63661  
       4) grp< 12.4 1 0.000000e+00 30.00000 *
       5) grp>=12.4 6 6.849601e+02 61.07605  
        10) genre=Reality 5 1.764027e+02 56.95878  
          20) grp>=132.25 2 1.840788e+01 50.96620  
            40) grp>=182.35 1 0.000000e+00 47.93240 *
            41) grp< 182.35 1 0.000000e+00 54.00000 *
          21) grp< 132.25 3 3.829145e+01 60.95383  
            42) grp< 78.8 2 2.158442e+01 59.28515  
              84) grp>=30.7 1 0.000000e+00 56.00000 *
              85) grp< 30.7 1 0.000000e+00 62.57030 *
            43) grp>=78.8 1 0.000000e+00 64.29120 *
        11) genre=Situation Comedy 1 0.000000e+00 81.66240 *
     3) grp>=223.05 33 1.948981e+03 76.08687  
       6) genre=Reality 11 8.234392e+02 70.56522  
        12) grp< 433.85 3 1.450214e+02 63.12957  
          24) grp>=282.1 2 9.417366e+00 58.37555  
            48) grp>=351.7 1 0.000000e+00 56.20560 *
            49) grp< 351.7 1 0.000000e+00 60.54550 *
          25) grp< 282.1 1 0.000000e+00 72.63760 *
        13) grp>=433.85 8 4.503510e+02 73.35359  
          26) grp>=873.45 1 0.000000e+00 67.13380 *
          27) grp< 873.45 7 4.061387e+02 74.24213  
            54) grp< 728.2 5 2.661971e+02 71.53452  
             108) grp>=490.3 4 1.867944e+02 69.54200  
               216) grp< 502.2 1 0.000000e+00 61.24370 *
               217) grp>=502.2 3 9.497869e+01 72.30810  
                 434) grp>=638.85 1 0.000000e+00 64.35750 *
                 435) grp< 638.85 2 1.606311e-01 76.28340  
                   870) grp>=569.35 1 0.000000e+00 76.00000 *
                   871) grp< 569.35 1 0.000000e+00 76.56680 *
             109) grp< 490.3 1 0.000000e+00 79.50460 *
            55) grp>=728.2 2 1.164659e+01 81.01115  
             110) grp< 817.35 1 0.000000e+00 78.59800 *
             111) grp>=817.35 1 0.000000e+00 83.42430 *
       7) genre=Drama/Adventure,Situation Comedy 22 6.224796e+02 78.84770  
        14) genre=Drama/Adventure 19 5.138261e+02 78.00671  
          28) grp< 1545.15 12 2.502221e+02 75.97984  
            56) grp>=362.15 9 2.078974e+02 74.91691  
             112) grp< 390.75 1 0.000000e+00 64.64790 *
             113) grp>=390.75 8 8.926319e+01 76.20054  
               226) grp>=1096.7 4 2.664028e+01 74.25095  
                 452) grp< 1450.4 3 6.694296e+00 72.96170  
                   904) grp< 1252.3 1 0.000000e+00 71.05570 *
                   905) grp>=1252.3 2 1.245042e+00 73.91470  
                    1810) grp>=1379.8 1 0.000000e+00 73.12570 *
                    1811) grp< 1379.8 1 0.000000e+00 74.70370 *
                 453) grp>=1450.4 1 0.000000e+00 78.11870 *
               227) grp< 1096.7 4 3.221578e+01 78.15012  
                 454) grp< 784.35 2 2.319486e+01 76.72520  
                   908) grp>=540.55 1 0.000000e+00 73.31970 *
                   909) grp< 540.55 1 0.000000e+00 80.13070 *
                 455) grp>=784.35 2 8.992746e-01 79.57505  
                   910) grp< 969.2 1 0.000000e+00 78.90450 *
                   911) grp>=969.2 1 0.000000e+00 80.24560 *
            57) grp< 362.15 3 1.651192e+00 79.16863  
             114) grp>=293.85 1 0.000000e+00 78.12080 *
             115) grp< 293.85 2 4.259645e-03 79.69255  
               230) grp< 235.65 1 0.000000e+00 79.64640 *
               231) grp>=235.65 1 0.000000e+00 79.73870 *
          29) grp>=1545.15 7 1.297948e+02 81.48133  
            58) grp>=1665.55 6 5.856939e+01 80.17908  
             116) grp< 1759.75 1 0.000000e+00 75.59160 *
             117) grp>=1759.75 5 3.331539e+01 81.09658  
               234) grp< 2537.05 4 2.259437e+01 80.36442  
                 468) grp>=1798.65 3 6.920865e+00 79.22157  
                   936) grp>=2147.3 1 0.000000e+00 77.54410 *
                   937) grp< 2147.3 2 2.700023e+00 80.06030  
                    1874) grp< 1905.05 1 0.000000e+00 78.89840 *
                    1875) grp>=1905.05 1 0.000000e+00 81.22220 *
                 469) grp< 1798.65 1 0.000000e+00 83.79300 *
               235) grp>=2537.05 1 0.000000e+00 84.02520 *
            59) grp< 1665.55 1 0.000000e+00 89.29480 *
        15) genre=Situation Comedy 3 1.010800e+01 84.17397  
          30) grp< 876.5 2 2.335178e-01 82.89110  
            60) grp>=714.2 1 0.000000e+00 82.54940 *
            61) grp< 714.2 1 0.000000e+00 83.23280 *
          31) grp>=876.5 1 0.000000e+00 86.73970 *

rpart.plot(full_nbc_tree)

plotcp(full_nbc_tree)

Variable Importance Plot

vip(full_nbc_tree, geom = "point")

How to interpret `vip()` for a tree

The plot ranks predictors by how much they contributed to improving the splits.
Higher values mean the variable played a larger role in reducing node impurity.
In a regression tree, this usually reflects reductions in squared-error loss.
Variables near the top are more influential for prediction in this fitted model.
Variable importance does not tell us the sign of the relationship.

What `vip()` does not tell us

It does not say whether higher values of a predictor increase or decrease the outcome.
It does not tell us whether the effect is linear, nonlinear, or highly interactive.
It does not tell us whether the variable matters everywhere or only in a small part of the feature space.
It should be treated as a ranking summary, not as a full interpretation on its own.
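For an `rpart` tree, the raw scores are stored in the `variable.importance` component of the fitted object. The sketch below assumes that `vip()` draws on this component for `rpart` models, and it uses the built-in `mtcars` data as a stand-in (not the NBC data). Note that `rpart` also credits a variable for splits where it acts as a surrogate, so a variable can score above zero without appearing in the plotted tree.

```r
library(rpart)

# Stand-in data: built-in mtcars, not the NBC data
fit <- rpart(mpg ~ wt + hp + disp, data = mtcars, method = "anova")

# Raw importance scores stored by rpart, largest first
imp <- sort(fit$variable.importance, decreasing = TRUE)

# Rescale to percentage shares so the ranking is easier to read
imp_share <- round(100 * imp / sum(imp), 1)
imp_share
```

The shares make the ranking explicit, but like the plot they say nothing about the direction or shape of each relationship.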

Partial Dependence Plot

# Partial Dependence for `grp`
partial(full_nbc_tree, pred.var = "grp") |>
  autoplot() +
  labs(title = "Partial Dependence of PE on GRP")

How to interpret `pdp::partial()`

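A partial dependence plot sets the focal predictor to a given value for every observation, averages the model's predictions over the observed values of all other predictors, and repeats this across a grid of focal values.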
The horizontal axis shows the focal predictor.
The vertical axis shows the model’s average predicted outcome as that focal predictor changes.
For trees, the plot often has a step-like appearance because tree predictions are piecewise constant.
The PDP helps us see the direction and shape of the model-implied relationship.
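The averaging behind a partial dependence curve can be written out by hand. The sketch below is a minimal manual version, not the internals of `pdp::partial()`. It uses the built-in `mtcars` data as a stand-in (not the NBC data), with `wt` as the focal predictor and a 20-point grid chosen arbitrarily.

```r
library(rpart)

# Stand-in data: built-in mtcars, not the NBC data
fit <- rpart(mpg ~ wt + hp + disp, data = mtcars, method = "anova")

# Grid of values for the focal predictor
wt_grid <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 20)

# For each grid value: overwrite wt in every row, predict, then average
pd <- sapply(wt_grid, function(w) {
  data_w <- mtcars
  data_w$wt <- w
  mean(predict(fit, newdata = data_w))
})

pd_table <- data.frame(wt = wt_grid, yhat = pd)
head(pd_table)
```

Plotting `yhat` against `wt` shows the step-like shape, because each averaged prediction changes only when the grid value crosses a split point.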

Classification Tree

Classification goal

Now predict genre from audience demographics
This is a classification tree because the outcome is categorical
The tree will try to create nodes that are relatively pure with respect to genre

Prepare data for the classification tree

nbc_genre <- nbc |> 
  select(genre)
nbc_demog_only <- nbc_demog |> 
  select(-show)

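# bind_cols() matches rows by position, not by show name, so this assumes
# nbc and nbc_demog list the shows in the same order
nbc_class_data <- bind_cols(
  nbc_genre,
  nbc_demog_only
)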

paged_table(nbc_class_data)

Fit a larger classification tree

nbc_genre_tree_full <- rpart(
  genre ~ .,
  data = nbc_class_data,
  method = "class",
  control = rpart.control(cp = 0, minsplit = 2, minbucket = 1, xval = 10)
)

nbc_genre_tree_full

n= 40 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 40 21 Drama/Adventure (0.47500000 0.42500000 0.10000000)  
   2) wired_cable_w_o_pay< 28.66505 22  6 Drama/Adventure (0.72727273 0.09090909 0.18181818)  
     4) vcr_owner>=83.749 17  1 Drama/Adventure (0.94117647 0.00000000 0.05882353)  
       8) territory_east_central< 16.45555 16  0 Drama/Adventure (1.00000000 0.00000000 0.00000000) *
       9) territory_east_central>=16.45555 1  0 Situation Comedy (0.00000000 0.00000000 1.00000000) *
     5) vcr_owner< 83.749 5  2 Situation Comedy (0.00000000 0.40000000 0.60000000)  
      10) territory_pacific>=18.87055 2  0 Reality (0.00000000 1.00000000 0.00000000) *
      11) territory_pacific< 18.87055 3  0 Situation Comedy (0.00000000 0.00000000 1.00000000) *
   3) wired_cable_w_o_pay>=28.66505 18  3 Reality (0.16666667 0.83333333 0.00000000)  
     6) black>=17.2017 3  0 Drama/Adventure (1.00000000 0.00000000 0.00000000) *
     7) black< 17.2017 15  0 Reality (0.00000000 1.00000000 0.00000000) *

n gives the number of observations in the node.
loss is the number of observations that would be misclassified if we predicted the majority class for everyone in that node.
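yval is the predicted class for the node, which is the majority class in that node.
yprob gives the estimated class probabilities (the class proportions in the node) for every class, listed in the order of the factor levels: Drama/Adventure, Reality, Situation Comedy.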
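The root line of the printout can be reproduced from class counts alone. In the sketch below the counts (19, 17, 4) are implied by the printed proportions and n = 40, and the variable names are illustrative.

```r
# Class counts at the root, in factor-level order
root_counts <- c("Drama/Adventure" = 19, "Reality" = 17, "Situation Comedy" = 4)

n     <- sum(root_counts)                  # observations in the node
yprob <- root_counts / n                   # class proportions
yval  <- names(which.max(root_counts))     # majority class
loss  <- n - max(root_counts)              # misclassified if we predict yval

n
yval
loss
round(yprob, 3)
```

The same arithmetic applied to node 2 (counts 16, 2, 4) gives n = 22 and a loss of 6, matching the printed line.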

How to interpret the fuller classification tree

rpart.plot(
  nbc_genre_tree_full,
  tweak = 0.8   # values greater than 1 make labels and boxes appear larger
                # values less than 1 make them smaller
)

Each split is chosen to make the child nodes more homogeneous by class.
The label at the top of a node is the predicted class for that node.
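"More homogeneous" can be made concrete with the Gini impurity, which is the default splitting criterion in `rpart` for classification trees (this assumes no `parms` were set, as in the fit above). The sketch below compares the impurity at the root with the size-weighted impurity of its two children, using class counts implied by the printed output. `gini()` is a hypothetical helper.

```r
# Gini impurity: 1 minus the sum of squared class proportions
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

# Counts in the order Drama/Adventure, Reality, Situation Comedy
root  <- c(19, 17, 4)   # node 1
left  <- c(16, 2, 4)    # node 2: wired_cable_w_o_pay <  28.66505
right <- c(3, 15, 0)    # node 3: wired_cable_w_o_pay >= 28.66505

gini_root <- gini(root)

# Average child impurity, weighted by the number of observations in each child
gini_split <- (sum(left) * gini(left) + sum(right) * gini(right)) / sum(root)

c(root = gini_root, after_split = gini_split)
```

The weighted impurity falls from about 0.58 to about 0.36, which is the kind of improvement the tree looks for when choosing a split.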
