Lecture 7

Unsupervised Learning

Byeong-Hak Choe

SUNY Geneseo

April 13, 2026

🤖 Unsupervised Learning

  • Unsupervised learning methods discover unknown relationships in data.
  • With unsupervised methods, there’s no outcome that we’re trying to predict.
    • Instead, we want to discover patterns in the data that perhaps we hadn’t previously suspected.
      • Unsupervised learning: ML + Data Visualization
  • We look at three classes of unsupervised methods:
    • Cluster analysis finds groups with similar characteristics.
    • Principal component analysis (PCA) turns many variables into a few new ones, called principal components, that still capture most of the variation in our data. (PCA is a popular dimensionality reduction method!)
    • Association rule learning discovers relationships between variables — for example, which items tend to be purchased together.

๐Ÿ—‚๏ธ Clustering

  • In cluster analysis, the goal is to group the observations in our data into clusters such that every observation in a cluster is more similar to other observations in the same cluster than observations in other clusters.
    • E.g., a company that offers guided tours might want to cluster its clients by behavior and tastes:
      • Which countries they like to visit
      • Whether they prefer adventure tours, luxury tours, or educational tours
      • What kinds of activities they participate in
      • What sorts of sites they like to visit

🥩 Protein Consumption Data

library(tidyverse)
library(factoextra)
library(cluster)
library(fpc)

protein <- read_csv("https://bcdanl.github.io/data/protein.csv")
  • To demonstrate clustering, we will use a small 1973 dataset on protein consumption from 9 different food sources across 25 European countries.
  • The goal is to group the countries based on patterns in their protein consumption.

โš–๏ธ Units and Scaling

  • The units we choose to measure our data can significantly influence the clusters that an algorithm uncovers.
  • One way to try to make the units of each variable more compatible is to standardize all the variables to have a mean value of 0 and a standard deviation of 1.
# Uses all columns except the first (Country)
vars_to_use <- colnames(protein)[-1]

# Scale to mean 0, sd 1; stores center and scale attributes
pmatrix <- scale(protein[, vars_to_use])
pcenter <- attr(pmatrix, "scaled:center")
pscale  <- attr(pmatrix, "scaled:scale")

# Convenience function to strip scale attributes
rm_scales <- function(scaled_matrix) {
  attr(scaled_matrix, "scaled:center") <- NULL
  attr(scaled_matrix, "scaled:scale")  <- NULL
  scaled_matrix
}

pmatrix <- rm_scales(pmatrix)
  • We can scale numeric data in R using the function scale().
  • The scale() function annotates its output with two attributes:
    • scaled:center returns the mean values of all the columns;
    • scaled:scale returns the standard deviations.

📊 Density Plots — Raw vs. Scaled

df_raw    <- protein |> select(FrVeg, RedMeat) |> mutate(type = "Original")
df_scaled <- as.data.frame(pmatrix) |>
               select(FrVeg, RedMeat) |>
               mutate(type = "Scaled")

bind_rows(df_raw, df_scaled) |>
  pivot_longer(c(FrVeg, RedMeat), names_to = "variable", values_to = "value") |>
  ggplot(aes(x = value, color = variable)) +
  geom_density(linewidth = 1) +
  facet_wrap(~type, scales = "free") +
  labs(title = "Raw vs. Scaled Distributions", color = NULL)

🌳 Hierarchical Clustering — Dendrogram

Hierarchical clustering builds a tree of nested groupings, called a dendrogram.

rownames(pmatrix) <- protein$Country
distmat <- dist(pmatrix, method = "euclidean")
pfit    <- hclust(distmat, method = "ward.D")

p <- fviz_dend(pfit, k = 5,
               rect        = TRUE,
               rect_fill   = TRUE,
               rect_border = "gray40",
               main        = "Ward Dendrogram — Protein Data",
               xlab        = "", ylab = "Height",
               cex         = 0.75) +
  theme(legend.position = "none")

# Thin the branch segments by overriding linewidth in every segment layer
for (i in seq_along(p$layers)) {
  if (inherits(p$layers[[i]]$geom, "GeomSegment")) {
    p$layers[[i]]$aes_params$linewidth <- 0.5
    p$layers[[i]]$aes_params$size      <- 0.5
  }
}
p
  • hclust() uses one of a variety of clustering methods to produce a tree that records the nested cluster structure.
    • Here we choose Ward’s method.
    • hclust() takes a distance matrix from dist() as input.

๐Ÿ“ Wardโ€™s Method and Within Sum of Squares (WSS)

  • Wardโ€™s method starts out with each data point as an individual cluster and merges clusters iteratively so as to minimize the total within sum of squares (WSS) of the clustering
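A minimal sketch of what WSS measures, using the pmatrix and pfit objects from the slides above (the helper functions wss_cluster() and total_wss() are ours, not from any package):

```r
# WSS of one cluster: sum of squared distances to its centroid
wss_cluster <- function(x) {
  centroid <- colMeans(x)
  sum(sweep(x, 2, centroid)^2)
}

# Total WSS of a clustering: sum over the clusters
total_wss <- function(x, labels) {
  sum(sapply(unique(labels),
             function(g) wss_cluster(x[labels == g, , drop = FALSE])))
}

labels <- cutree(pfit, k = 5)   # a 5-cluster cut of the Ward tree
total_wss(pmatrix, labels)      # the quantity Ward's merges try to keep small
```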

🔪 Hierarchical Clustering — Cutting the Tree

cutree() extracts the members of each cluster from the hclust object.

groups_hc <- cutree(pfit, k = 5)

# Convenience function: print selected columns by cluster
print_clusters <- function(data, groups, columns) {
  grouped <- split(data, groups)
  lapply(grouped, function(df) df[, columns])
}

cols_to_print <- c("Country", "RedMeat", "Fish", "FrVeg")
print_clusters(protein, groups_hc, cols_to_print)

🔭 Principal Component Analysis (PCA)

  • We can visualize the clustering by projecting the data onto the first two principal components of the data.
  • With three variables, the data cloud can be described by an ellipsoid whose axes are the three principal components.
  • The ellipsoid roughly bounds the data.
  • The first two principal components, \(\text{PC}_{1}\) & \(\text{PC}_{2}\), describe the best 2-D projection of the data.
    • Notice that the principal components are orthogonal.
    • \(Cor(\text{PC}_{1}, \text{PC}_{2}) = 0\)!
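We can check the zero-correlation claim numerically with prcomp() (introduced on the following slides), applied to the scaled protein matrix:

```r
princ  <- prcomp(pmatrix)   # PCA on the scaled protein data
scores <- princ$x           # one column of scores per principal component

round(cor(scores[, "PC1"], scores[, "PC2"]), 12)  # 0, up to rounding error
```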

🔄 PCA — Rotation and Interpretation

  • Principal components are not observed directly; they are constructed from the variables in the data.
    • A principal component can be interpreted in terms of how much each variable in the data is associated with it.
  • Principal components describe the data by projecting it into the space of the (independent) principal components; this projection is called a rotation.
  • The maximum number of principal components of the data is the number of numeric variables in the data.

โš™๏ธ PCA โ€” Decomposition

The prcomp() function does the principal components decomposition.

princ   <- prcomp(pmatrix)
nComp   <- 2
project <- predict(princ, pmatrix)[, 1:nComp]

project_plus <- cbind(
  as.data.frame(project),
  cluster = as.factor(groups_hc),
  country = protein$Country
)
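As a quick sanity check on the rotation idea from the previous slide: the PC scores are exactly the scaled data multiplied by the loading (rotation) matrix, and the number of PCs equals the number of numeric variables:

```r
ncol(princ$rotation)        # 9: one PC per numeric variable

# "Rotation": scores = (centered, scaled) data %*% loading matrix
max(abs(pmatrix %*% princ$rotation - princ$x))   # ~0 (floating-point error)
```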

🧩 PC Interpretation — Protein Data

# First 2 loading vectors (transposed for readability)
t( round(princ$rotation, 2) )[1:2, ]

  • \(PC_{1}\) is high nut/grain and low meat/dairy:
    • \(PC_{1}\) increases by 0.42 with 1 S.D. increase in nuts.
    • \(PC_{1}\) decreases by 0.30 with 1 S.D. increase in red meat.
  • \(PC_{2}\) is Iberian:
    • \(PC_{2}\) increases by 0.65 with 1 S.D. extra fish.

📖 How to Read Loadings — A Simple Rule

The loading table (princ$rotation) shows how much each variable contributes to each PC.

  • Large positive loading → variable pulls the PC up (positively associated).
  • Large negative loading → variable pulls the PC down (negatively associated).
  • Near zero → variable barely contributes to that PC.
  • The sign alone doesn’t matter — what matters is which variables have the largest absolute loadings and share the same sign.
  • Name the PC by describing what is high vs. low:
    • PC1: high Nuts/FrVeg/Cereals vs. low RedMeat/Milk → “Plant-heavy diet”
    • PC2: high Fish/Starch vs. low WhiteMeat/FrVeg → “Iberian/coastal diet”

๐Ÿ—บ๏ธ PCA Projection by Ward Clusters

ggplot(project_plus, aes(x = PC1, y = PC2)) +
  geom_point(data = as.data.frame(project),
             color = "darkgrey", 
             alpha = 0.4) +
  geom_point(aes(color = cluster)) +
  geom_text(aes(label = country), 
            hjust = 0, vjust = 1, 
            size = 3) +
  facet_wrap(~ cluster, 
             ncol = 3, labeller = label_both) +
  labs(title = "PCA Projection by Ward Clusters") +
  theme(legend.position = "none")

🎯 K-means Algorithm

K-Means partitions data into \(k\) groups by assigning each point to the nearest cluster center and iteratively updating the centers to minimize within-cluster variance.
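A minimal sketch of those two alternating steps (Lloyd’s algorithm), assuming the scaled pmatrix from earlier; the helper lloyd_sketch() is ours and, unlike kmeans() below, does not handle empty clusters or multiple restarts:

```r
lloyd_sketch <- function(x, k, iters = 25) {
  centers <- x[sample(nrow(x), k), , drop = FALSE]  # random initial centers
  for (i in seq_len(iters)) {
    # Assignment step: each point goes to its nearest center
    d      <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k]
    labels <- max.col(-d)
    # Update step: each center becomes the mean of its cluster
    for (j in seq_len(k))
      centers[j, ] <- colMeans(x[labels == j, , drop = FALSE])
  }
  labels
}

table(lloyd_sketch(pmatrix, k = 5))   # cluster sizes from one run
```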

💻 K-means — R Implementation

The kmeans() function implements the k-means algorithm.

kbest_p  <- 5
pclusters <- kmeans(pmatrix, kbest_p, nstart = 100, iter.max = 100)

pclusters$centers  # cluster centroids (in scaled units)
pclusters$size     # number of observations per cluster

groups_km <- pclusters$cluster
print_clusters(protein, groups_km, cols_to_print)

๐Ÿ† Best Number of Clusters โ€” Indices

  • Calinski-Harabasz Index: the adjusted ratio of between-sum-of-square (BSS) to within-sum-of-square (WSS).
    • \(\text{BSS} = \text{TSS} - \text{WSS}\), where \(\text{TSS}\) is total-sum-of-square.
    • Good clustering: a small average WSS & a large average BSS.
  • Average Silhouette Width (ASW) Index: the mean of individual silhouette coefficients
    • Silhouette coefficient quantifies how well a point sits in its cluster by comparing cohesion (avg distance to points in same cluster) and separation (avg distance to points in nearest other cluster).
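The CH index can be computed by hand from a kmeans() fit; this sketch plugs n, k, BSS, and WSS into the ratio directly (kmeansruns() on the next slide automates the search over k):

```r
km <- kmeans(pmatrix, centers = 5, nstart = 100)
n  <- nrow(pmatrix)
k  <- 5

BSS <- km$betweenss      # between-cluster sum of squares (TSS - WSS)
WSS <- km$tot.withinss   # total within-cluster sum of squares

CH <- (BSS / (k - 1)) / (WSS / (n - k))   # larger is better
CH
```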

🔢 Best Number of Clusters — kmeansruns()

The fpc package provides kmeansruns(), which calls kmeans() over a range of \(k\) and selects the best \(k\) automatically.

clustering_ch  <- kmeansruns(pmatrix, krange = 1:10, criterion = "ch")
clustering_asw <- kmeansruns(pmatrix, krange = 1:10, criterion = "asw")

clustering_ch$bestk   # best k by Calinski-Harabasz
clustering_asw$bestk  # best k by Average Silhouette Width

clustering_ch$crit    # criterion scores for each k (CH)
clustering_asw$crit   # criterion scores for each k (ASW)
  • The \(k^{*} = 2\) clustering corresponds to the first split of the protein data dendrogram.
    • Clustering with \(k^{*} = 2\) might not be so informative.

📈 Visualizing Cluster Quality — Elbow Plot

The elbow plot shows WSS for each \(k\). Look for the “elbow” where adding more clusters stops improving fit much.

fviz_nbclust(pmatrix,
             FUNcluster = kmeans,
             method     = "wss",     # within-cluster sum of squares
             k.max      = 10) +
  labs(title = "Elbow Method — Optimal k",
       x     = "Number of Clusters k",
       y     = "Total Within-Cluster SS")
  • The x-axis is the number of clusters \(k\).
  • The y-axis is total WSS.
  • Choose \(k\) at the “elbow” — where the curve bends sharply and flattens out.
  • There is no single correct answer; look for the point of diminishing returns.

๐Ÿ—บ๏ธ Visualizing K-means Clusters โ€” fviz_cluster()

fviz_cluster() from factoextra projects clusters onto the first two PCs and draws confidence ellipses automatically.

fviz_cluster(pclusters,          # kmeans result
             data         = pmatrix,
             geom         = c("point", "text"),
             ellipse.type = "convex",
             repel        = TRUE,  # avoids label overlap
             ggtheme      = theme_minimal(),
             main         = "K-means Clusters (k = 5) — Protein Data")
  • Points colored by cluster; ellipses show the convex hull of each group.
  • Axes are PC1 and PC2 — they explain the largest fraction of variance.
  • Use ellipse.type = "norm" for elliptical confidence regions instead.

📊 Visualizing K-means Clusters — fviz_nbclust() ASW

fviz_nbclust() can also plot the average silhouette width across values of \(k\).

fviz_nbclust(pmatrix,
             FUNcluster = kmeans,
             method     = "silhouette",
             k.max      = 10) +
  labs(title = "Average Silhouette Width — Optimal k",
       x     = "Number of Clusters k",
       y     = "Average Silhouette Width")
  • The peak of this curve is the recommended \(k\).
  • Compare with the elbow plot and dendrogram to arrive at a final choice.

📺 NBC Show Data

  • Data from NBC on response to TV pilots
    • Gross Ratings Points (GRP): estimated total viewership, which measures broadcast marketability.
    • Projected Engagement (PE): a more subtle measure of audience engagement.
      • After watching a show, the viewer is quizzed on order and detail.
      • This measures their engagement with the show (and ads!).
url_shows <- "https://bcdanl.github.io/data/nbc_show.csv"
shows <- read_csv(url_shows) |>
  mutate(Genre = as.factor(Genre))

ggplot(shows, aes(x = GRP, y = PE, color = Genre)) +
  geom_point(size = 2.5, alpha = 0.8) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 0.8) +
  labs(title = "Gross Rating Points vs. Projected Engagement")

📋 NBC Show Survey

  • After watching a show, the viewer is quizzed on order and detail.
    • The survey data include 6241 views and 20 questions for 40 shows.
  • There are two types of questions in the survey.
    • For Q1, the statement takes the form “This show makes me feel …”
    • For Q2, the statement takes the form “I find this show …”
  • To relate survey results to show performance, we first calculate the average survey response by show.
url_survey <- "https://bcdanl.github.io/data/nbc_survey.csv"
survey <- read_csv(url_survey)

# Average each question by show, then align row order to shows
pilot_avg <- survey |>
  select(-Viewer) |>
  group_by(Show) |>
  summarise(across(everything(), mean)) |>
  filter(Show %in% shows$Show) |>
  arrange(match(Show, shows$Show)) |>
  column_to_rownames("Show")

📉 Scree Plot

The scree plot shows, for each principal component \(j\) (on the \(x\)-axis), the fraction of the total variance in our pilot-survey data that the component captures (on the \(y\)-axis).

X     <- scale(pilot_avg)
princ_nbc <- prcomp(X, scale. = FALSE)

# Scree plot using factoextra
fviz_eig(princ_nbc, addlabels = TRUE, ncp = 10,
         main = "Pilot-Survey PCs") 

\[ Var(PC_{1}) \geq Var(PC_{2}) \geq Var(PC_{3}) \geq \cdots \]
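This ordering can be confirmed from the prcomp output, whose sdev component stores the PCs’ standard deviations in decreasing order:

```r
pc_var <- princ_nbc$sdev^2           # variance of each PC
all(diff(pc_var) <= 0)               # TRUE: variances never increase

round(pc_var / sum(pc_var), 3)[1:5]  # fraction of total variance, PC1-PC5
```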

📖 How to Read a Scree Plot

  • The y-axis shows percent of variance explained by each PC.
  • The x-axis is the index of each principal component.
  • Key reading rules:

Look for the “elbow”:

  • The variance drops steeply at first, then levels off.
  • Keep PCs before the elbow — they capture most of the signal.
  • PCs after the elbow mostly capture noise.

Cumulative variance rule:

  • A common criterion is to keep enough PCs to explain ≥ 70–80% of total variance.
  • Use summary(princ_nbc) to see cumulative proportions.
# Cumulative variance summary
summary(princ_nbc)$importance[, 1:5]
  • In the NBC survey, PC1 + PC2 together explain the bulk of variation → we use 2 PCs.

๐Ÿ” PC Interpretation โ€” NBC Survey

# First 3 loading vectors (transposed)
t( round(princ_nbc$rotation, 1) )[1:3, ]
  • \(\text{PC}_{1}\) (โ€œOverall Engagementโ€)
  • \(\text{PC}_{2}\) (โ€œPassive/Comfort vs. Active Dramaโ€)

📖 Reading NBC Loadings Step by Step

Use the same three-step process for any PCA loading table:

  1. Find the largest absolute values in each row (PC).
  2. Note the sign — positive means the variable raises that PC, negative means it lowers it.
  3. Name the PC by describing the contrast (what is high vs. low).
  • PC1 — large positive loadings on all Q1/Q2 items → represents overall engagement level.
    • Shows with high PC1 score = viewers felt high engagement across all questions.
  • PC2 — positive loadings on comfort/passive feelings, negative on drama/tension items.
    • Shows with high PC2 = comfort/relaxing content; low PC2 = intense/dramatic content.

๐ŸŒ PCA Projection of NBC Shows

Z <- predict(princ_nbc, X)

zpilot_df <- as.data.frame(Z[, 1:2]) |>
  mutate(Show  = shows$Show,
         Genre = shows$Genre,
         PE    = shows$PE,
         GRP   = shows$GRP,
         PE_norm = (PE - min(PE)) / (max(PE) - min(PE)))

ggplot(zpilot_df, aes(x = PC1, y = PC2, color = Genre, size = PE_norm)) +
  geom_point(alpha = 0.5) +
  geom_text(aes(label = Show), size = 2.5, hjust = 0, vjust = 1,
            show.legend = FALSE) +
  scale_size_continuous(range = c(2, 10), name = "PE (normalized)") +
  labs(title = "NBC Pilots in Survey PC Space")

📖 Principal Component Regression (PCR)

  • PCR uses a lower-dimension set of principal components as predictors.
    • PC analysis can reduce dimension, which is usually good.
    • The PCs are independent (\(Cov(PC_{i}, PC_{j}) = 0\)).
  • PC analysis will be driven by the dominant sources of variation in \(x\).
    • If the outcome is connected to these dominant sources of variation, PCR works well.
  • The AIC (Akaike information criterion) rewards goodness of fit (the likelihood of the data under the model) and penalizes the number of parameters.
    • The lower the AIC is, the better the model is.

💡 PCR — Model Selection by AIC

max_K    <- min(20, ncol(Z))
aic_pe   <- numeric(max_K)
aic_grp  <- numeric(max_K)

for (k in seq_len(max_K)) {
  # Regress each outcome on the first k PCs; lm() adds the intercept itself
  aic_pe[k]   <- AIC(lm(shows$PE  ~ Z[, 1:k]))
  aic_grp[k]  <- AIC(lm(shows$GRP ~ Z[, 1:k]))
}

cat("Lowest AIC for PE  achieved with K =", which.min(aic_pe),  "PCs\n")
cat("Lowest AIC for GRP achieved with K =", which.min(aic_grp), "PCs\n")

📈 PCR — AIC Plot

tibble(K = 1:max_K, PE = aic_pe, GRP = aic_grp) |>
  pivot_longer(c(PE, GRP), names_to = "Outcome", values_to = "AIC") |>
  ggplot(aes(x = K, y = AIC, color = Outcome)) +
  geom_line() + geom_point() +
  facet_wrap(~Outcome, ncol = 1, scales = "free") +
  scale_x_continuous(breaks = 1:max_K) +
  labs(title = "PCR Model Comparison: AIC vs. Number of Principal Components",
       x = "Number of Principal Components (K)", y = "AIC")

๐Ÿ—บ๏ธ Full Workflow Summary

๐Ÿ—บ๏ธ Full Clustering + PCA Workflow

Every clustering + PCA analysis follows the same pipeline. Follow this order:

# โ”€โ”€ STEP 1: Load data โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
df <- read_csv("YOUR_DATA.csv")

# โ”€โ”€ STEP 2: Scale numeric variables โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
vars    <- colnames(df)[-1]          # drop ID/label column
mat     <- scale(df[, vars])         # mean=0, sd=1
rownames(mat) <- df$ID_column        # label rows for dendrograms

# โ”€โ”€ STEP 3a: Hierarchical clustering โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
distmat <- dist(mat, method = "euclidean")
hfit    <- hclust(distmat, method = "ward.D")
fviz_dend(hfit, k = 5, rect = TRUE)  # choose k visually

# โ”€โ”€ STEP 3b: Cut the tree to get cluster labels โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
groups_hc <- cutree(hfit, k = 5)

# โ”€โ”€ STEP 4: Choose k for k-means (elbow + silhouette) โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
fviz_nbclust(mat, kmeans, method = "wss")
fviz_nbclust(mat, kmeans, method = "silhouette")
kmeansruns(mat, krange = 1:10, criterion = "ch")$bestk
kmeansruns(mat, krange = 1:10, criterion = "asw")$bestk

# โ”€โ”€ STEP 5: K-means โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
km  <- kmeans(mat, centers = 5, nstart = 100, iter.max = 100)
fviz_cluster(km, data = mat, geom = c("point","text"), repel = TRUE)

# โ”€โ”€ STEP 6: PCA โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
pc  <- prcomp(mat)
fviz_eig(pc, addlabels = TRUE)       # scree plot
t(round(pc$rotation, 2))[1:2, ]     # loadings

# โ”€โ”€ STEP 7: PCA projection coloured by cluster โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
proj <- as.data.frame(predict(pc, mat)[, 1:2])
proj$cluster <- as.factor(groups_hc)
ggplot(proj, aes(x = PC1, y = PC2, color = cluster)) + geom_point()

✅ Step-by-Step Checklist

Use this checklist every time you run a clustering + PCA analysis:

Step  Action                       Key function
1     Load data                    read_csv()
2     Scale numeric variables      scale()
3     Build dendrogram             hclust() + fviz_dend()
4     Cut tree                     cutree(fit, k = K)
5     Choose k (elbow)             fviz_nbclust(..., method="wss")
6     Choose k (silhouette)        fviz_nbclust(..., method="silhouette")
7     Run k-means                  kmeans(mat, K, nstart=100)
8     Visualize clusters           fviz_cluster()
9     Run PCA                      prcomp(mat)
10    Scree plot                   fviz_eig()
11    Read loadings                t(round(pc$rotation, 2))
12    Project + color by cluster   predict(pc, mat) + ggplot()

🧠 Common Questions on Interpretation

  • Q: How do I name a principal component?
    • Look at the two or three variables with the largest absolute loading values.
    • Describe what is high (large positive) versus low (large negative) on that PC.
    • Example: PC1 has a large positive loading on Fish and a large negative loading on RedMeat → “Fish-heavy vs. meat-heavy diet.”
  • Q: How many PCs should I keep?
    • Use the scree plot elbow: keep PCs before the curve flattens.
    • Or use the cumulative variance rule: keep enough PCs to explain ≥ 70–80% of variance.
    • summary(prcomp_object)$importance shows cumulative proportions.
  • Q: Which k to choose for k-means?
    • Look at the elbow plot (WSS), the silhouette plot (ASW), and the dendrogram.
    • If they agree → use that \(k\). If they disagree → use subject-matter knowledge or interpretability.
  • Q: What is fviz_cluster() actually plotting?
    • It projects the data onto PC1 and PC2 and colors points by cluster label. The axis labels tell you how much variance each PC explains.

Association Rule

🛒 Association Rule — Overview

  • Association rule mining is used to find objects or attributes that frequently occur together, such as:
    • Products that are often bought together during a shopping session;
    • Queries that tend to occur together during a session on a website’s search engine.
  • Such information can be used to recommend products to shoppers, to place frequently bundled items together on store shelves, or to redesign websites for easier navigation.

🧺 Association Rule — Transactions & Itemsets

  • The unit of “togetherness” when mining association rules is called a transaction.
    • A single shopping basket;
    • A single user session on a website;
    • Or even a single customer.
  • The objects that comprise a transaction are referred to as items in an itemset:
    • The products in the shopping basket;
    • The pages visited during a website session, the actions of a customer.
  • Sometimes transactions are referred to as baskets, from the shopping basket analogy.
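As a small illustration with made-up baskets (the items below are hypothetical, not from our data), the arules package can build a transactions object directly from a list of itemsets:

```r
library(arules)

baskets <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("milk", "eggs")
)

trans <- as(baskets, "transactions")
inspect(trans)         # one row per transaction
itemFrequency(trans)   # support of each individual item
```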

📚 Association Rule — Library Example

Example

  • Suppose you work in a library.
  • You want to know which books tend to be checked out together, to help you make predictions about book availability.
  • When a library patron checks out a set of books, that’s a transaction.
  • The books that the patron checks out are the itemset that comprises the transaction.

🔎 Association Rule — Mining Steps

Transaction   Items checked out
T1            H, PB, Dune
T2            H, PB, Foundation
T3            H, PB, 1984
T4            H, Dune
T5            PB, Foundation
  • Mining for association rules occurs in two steps:
    1. Look for all the itemsets (subsets of transactions) that occur in at least a minimum fraction of the transactions.
    2. Turn those itemsets into rules.
  • Let’s look at the transactions that involve The Hobbit (H) and The Princess Bride (PB).

๐Ÿ“ Association Rule โ€” Support & Confidence

  • The Hobbit is in 5/10, or 50% of all transactions.
  • The Princess Bride is in 4/10, or 40% of all transactions.
  • Both books are checked out together in 3/10, or 30% of all transactions.
    • Weโ€™d say the support of the itemset {The Hobbit, The Princess Bride} is 30%.
    • Of the five transactions that include The Hobbit, three (3/5 = 60%) also include The Princess Bride.
    • Of the four transactions that include The Princess Bride, three (3/4 = 75%) also include The Hobbit.

💬 Association Rule — Making Rules

Example

  • We can make a rule: “People who check out The Hobbit also check out The Princess Bride.”
    • This rule should be correct 60% of the time.
    • We’d say that the confidence of the rule is 60%.
  • We can make another rule: “People who check out The Princess Bride also check out The Hobbit.”
    • This rule should be correct 75% of the time.
    • We’d say that the confidence of this rule is 75%.

🧮 Rules, Support, and Confidence

  • The rule “if X, then Y”: every time you see the itemset X in a transaction, you expect to also see Y (with a given confidence).

  • support(X): the number of transactions that contain X divided by the total number of transactions in database T.

  • The confidence of the rule “if X, then Y”: \[ \text{Confidence} = \frac{\texttt{support}(\{X, Y\})}{\texttt{support}(X)} \]

  • The goal in association rule mining is to find all the interesting rules with at least a given minimum support (say, 10%) and a given minimum confidence (say, 60%).
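Plugging the library numbers from the previous slides into this formula:

```r
support_H    <- 0.5   # The Hobbit: 5/10 transactions
support_PB   <- 0.4   # The Princess Bride: 4/10
support_both <- 0.3   # both together: 3/10

support_both / support_H    # confidence of "if H then PB" = 0.6
support_both / support_PB   # confidence of "if PB then H" = 0.75
```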

๐Ÿช Bookstore Example

Bookstore Example

  • Suppose you work for a bookstore.
  • You want to recommend books that a customer might be interested in, based on all of their previous purchases and book interests.
  • You also want to use historical book interest information to develop some recommendation rules.

📥 Reading in the Book Data

library(arules)

tmp <- tempfile(fileext = ".tsv.gz")
download.file("https://bcdanl.github.io/data/bookdata.tsv.gz",
              destfile = tmp, mode = "wb", quiet = TRUE)

bookbaskets <- read.transactions(
  tmp,
  format        = "single",
  header        = TRUE,
  sep           = "\t",
  cols          = c("userid", "title"),
  rm.duplicates = TRUE
)
  • read.transactions() reads data in two formats:
    1. format = "single": every row corresponds to a single item (like bookdata.tsv.gz);
    2. format = "basket": each row corresponds to a single transaction, possibly with a transaction ID.
  • rm.duplicates = TRUE eliminates duplicates due to different editions of a book.

🔬 Examining the Transaction Data

class(bookbaskets)
dim(bookbaskets)

colnames(bookbaskets)[1:5]
rownames(bookbaskets)[1:5]
  • Transactions are represented as a special object called transactions.
  • You can think of a transactions object as a 0/1 matrix, with one row for every transaction (a customer) and one column for every possible item (a book).
  • The matrix entry \((i, j)\) is 1 if the \(i\)-th transaction contains item \(j\).
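You can peek at this 0/1 structure directly by coercing a small slice of bookbaskets to an ordinary matrix (arules stores the data sparsely; the coercion below yields a dense logical incidence matrix):

```r
# First 3 transactions x first 5 items, as a logical incidence matrix
as(bookbaskets[1:3, 1:5], "matrix")
```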

📦 Examining the Size Distribution

basketSizes <- size(bookbaskets)
summary(basketSizes)

quantile(basketSizes, probs = seq(0, 1, 0.1))

ggplot(data.frame(count = basketSizes)) +
  geom_density(aes(x = count)) +
  scale_x_log10() +
  labs(title = "Distribution of Basket Sizes (log scale)", x = "Basket size", y = "Density")
  • 55% of customers expressed interest in only one book.
  • A few people have expressed interest in several hundred, or even several thousand books.

📚 Counting How Often Each Book Occurs

bookCount    <- itemFrequency(bookbaskets, "absolute")
orderedBooks <- sort(bookCount, decreasing = TRUE)

head(orderedBooks, 10)

# Fraction of transactions containing the most popular book
orderedBooks[1] / nrow(bookbaskets)
  • itemFrequency() tells you how often each book shows up in the transaction data.

โ›๏ธ Finding the Association Rules

# Restrict to customers who expressed interest in at least 2 books
bookbaskets_use <- bookbaskets[size(bookbaskets) > 1]
dim(bookbaskets_use)

rules <- apriori(
  bookbaskets_use,
  parameter = list(support = 0.002, confidence = 0.75)
)
  • You may want to restrict the dataset to customers who have expressed interest in at least two books.
  • To mine rules, you need to decide on a minimum support level and a minimum confidence level.
  • Use apriori() to find the association rules.

๐Ÿช Rule Quality โ€” Lift

summary(rules)
  • The quality measures on the rules include a ruleโ€™s support, confidence, the support count, and a quantity called lift.

  • Lift compares the frequency of an observed pattern with how often youโ€™d expect to see that pattern just by chance: \[ \text{lift} = \frac{\texttt{support}(\{X, Y\})}{\texttt{support}(X) \times \texttt{support}(Y)} \]

  • If the lift is near 1, the pattern is likely occurring just by chance. The larger the lift, the more likely the pattern is real.
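Using the library example again, support(H) = 0.5, support(PB) = 0.4, and support({H, PB}) = 0.3, so:

```r
lift <- 0.3 / (0.5 * 0.4)
lift   # 1.5: the pair occurs 1.5 times more often than chance would predict
```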

🧪 Scoring Rules

measures <- interestMeasure(
  rules,
  measure      = c("coverage", "fishersExactTest"),
  transactions = bookbaskets_use
)
summary(measures)
  • Coverage: the support of the left side of the rule (X); tells you how often the rule would be applied in the dataset.
  • Fisher’s exact test: a significance test for whether an observed pattern is real or chance.
    • Returns the p-value — the probability that you would see the observed pattern by chance; you want the p-value to be small.

🥇 Top 5 Most Confident Rules

rules |>
  sort(by = "confidence") |>
  head(n = 5) |>
  inspect()

🔵 Visualizing Rules — Scatter Plot

library(arulesViz)

# Scatter plot: support vs. confidence, shaded by lift
plot(rules,
     method  = "scatterplot",
     measure = c("support", "confidence"),
     shading = "lift",
     engine  = "ggplot2")
  • Each point is one rule.
  • x-axis: support — how often the rule applies.
  • y-axis: confidence — how often the rule is correct.
  • Color (lift): darker = stronger association beyond chance.
  • Rules in the top-right corner are both frequent and reliable.

๐Ÿ•ธ๏ธ Visualizing Rules โ€” Graph

# Graph plot: items as nodes, rules as edges
# Subset to top 20 rules by lift for readability
rules |>
  sort(by = "lift") |>
  head(n = 20) |>
  plot(method  = "graph",
       engine  = "ggplot2")
  • Nodes represent individual books (items).
  • Edges represent rules connecting left-hand side items to right-hand side items.
  • Node size reflects support; edge shade reflects lift.
  • Tightly connected clusters reveal books that frequently co-occur.

🔲 Visualizing Rules — Grouped Matrix

# Grouped matrix: LHS groups vs. RHS items
plot(rules,
     method = "grouped matrix",
     engine = "ggplot2")
  • Rows: right-hand side items (consequents).

  • Columns: left-hand side item groups (antecedents), clustered by similarity.

  • Bubble size: support.

  • Bubble color: lift — useful for spotting which LHS groups most strongly predict each RHS item.

  • inspect() pretty-prints the rules.

  • The most confident rules typically concern series books — readers who pick up one book in a series are very likely to pick up another.

๐ŸŽ›๏ธ Finding Rules with Restrictions

brules <- apriori(
  bookbaskets_use,
  parameter  = list(support = 0.001, 
                    confidence = 0.6),
  appearance = list(rhs = c("The Lovely Bones: A Novel"),
                    default = "lhs")
)
summary(brules)
brules |>
  sort(by = "confidence") |>
  lhs() |>
  head(n = 5) |>
  inspect()
  • You can restrict which items appear on the left side or right side of a rule.
  • Here we find books that tend to co-occur with The Lovely Bones by restricting which books appear on the right side of the rule using the appearance parameter.
  • By default, all the books can go into the left side of the rules.