Unsupervised Learning
April 13, 2026
# Uses all columns except the first (Country)
vars_to_use <- colnames(protein)[-1]
# Scale to mean 0, sd 1; stores center and scale attributes
pmatrix <- scale(protein[, vars_to_use])
pcenter <- attr(pmatrix, "scaled:center")
pscale <- attr(pmatrix, "scaled:scale")
# Convenience function to strip scale attributes
rm_scales <- function(scaled_matrix) {
  attr(scaled_matrix, "scaled:center") <- NULL
  attr(scaled_matrix, "scaled:scale") <- NULL
  scaled_matrix
}
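A quick sketch (my addition, not from the source) of why pcenter and pscale were saved: they let you undo the scaling and recover the original units.

```r
# Multiply each column by its saved sd, then add back its mean
unscaled <- sweep(sweep(pmatrix, 2, pscale, `*`), 2, pcenter, `+`)
# Should recover protein[, vars_to_use] up to floating-point error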
pmatrix <- rm_scales(pmatrix)

The scale() function annotates its output with two attributes:
scaled:center returns the mean values of all the columns; scaled:scale returns the standard deviations.

df_raw <- protein |> select(FrVeg, RedMeat) |> mutate(type = "Original")
df_scaled <- as.data.frame(pmatrix) |>
select(FrVeg, RedMeat) |>
mutate(type = "Scaled")
bind_rows(df_raw, df_scaled) |>
pivot_longer(c(FrVeg, RedMeat), names_to = "variable", values_to = "value") |>
ggplot(aes(x = value, color = variable)) +
geom_density(linewidth = 1) +
facet_wrap(~type, scales = "free") +
labs(title = "Raw vs. Scaled Distributions", color = NULL)

Hierarchical clustering builds a tree of nested groupings, called a dendrogram.
rownames(pmatrix) <- protein$Country
distmat <- dist(pmatrix, method = "euclidean")
pfit <- hclust(distmat, method = "ward.D")
p <- fviz_dend(pfit, k = 5,
rect = TRUE,
rect_fill = TRUE,
rect_border = "gray40",
main = "Ward Dendrogram - Protein Data",
xlab = "", ylab = "Height",
cex = 0.75) +
theme(legend.position = "none")
# Thin the branch segments by overriding linewidth in every segment layer
for (i in seq_along(p$layers)) {
  if (inherits(p$layers[[i]]$geom, "GeomSegment")) {
    p$layers[[i]]$aes_params$linewidth <- 0.5
    p$layers[[i]]$aes_params$size <- 0.5
  }
}
p

hclust() uses one of a variety of clustering methods to produce a tree that records the nested cluster structure.
hclust() takes a distance matrix from dist() as input.
cutree() extracts the members of each cluster from the hclust object.
groups_hc <- cutree(pfit, k = 5)
# Convenience function: print selected columns by cluster
print_clusters <- function(data, groups, columns) {
  grouped <- split(data, groups)
  lapply(grouped, function(df) df[, columns])
}
cols_to_print <- c("Country", "RedMeat", "Fish", "FrVeg")
print_clusters(protein, groups_hc, cols_to_print)
The prcomp() function does the principal components decomposition.
The loading table (princ$rotation) shows how much each variable contributes to each PC.
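The plot below uses objects not defined in this excerpt. One way they could be built, consistent with the column names the plot expects (the construction here is an assumption):

```r
# PCA on the scaled protein matrix
princ <- prcomp(pmatrix)
# Scores on the first two principal components
project <- predict(princ, pmatrix)[, 1:2]
# Add the Ward cluster labels (from cutree) and country names
project_plus <- cbind(
  as.data.frame(project),
  cluster = as.factor(groups_hc),
  country = protein$Country
)
```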
ggplot(project_plus, aes(x = PC1, y = PC2)) +
geom_point(data = as.data.frame(project),
color = "darkgrey",
alpha = 0.4) +
geom_point(aes(color = cluster)) +
geom_text(aes(label = country),
hjust = 0, vjust = 1,
size = 3) +
facet_wrap(~ cluster,
ncol = 3, labeller = label_both) +
labs(title = "PCA Projection by Ward Clusters") +
theme(legend.position = "none")

K-Means partitions data into \(k\) groups by assigning each point to the nearest cluster center and iteratively updating the centers to minimize within-cluster variance.
The kmeans() function implements the k-means algorithm.
The fpc package provides kmeansruns(), which calls kmeans() over a range of \(k\) and selects the best \(k\) automatically.
clustering_ch <- kmeansruns(pmatrix, krange = 1:10, criterion = "ch")
clustering_asw <- kmeansruns(pmatrix, krange = 1:10, criterion = "asw")
clustering_ch$bestk # best k by Calinski-Harabasz
clustering_asw$bestk # best k by Average Silhouette Width
clustering_ch$crit # criterion scores for each k (CH)
clustering_asw$crit # criterion scores for each k (ASW)

The elbow plot shows WSS for each \(k\). Look for the "elbow" where adding more clusters stops improving fit much.
fviz_cluster() from factoextra projects clusters onto the first two PCs and draws confidence ellipses automatically.
Set ellipse.type = "norm" for elliptical confidence regions instead.
fviz_nbclust() can also plot the average silhouette width across values of \(k\).
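For instance, a sketch of both calls on the scaled protein matrix (the k-means fit and centers = 5 are illustrative, not from the source):

```r
library(factoextra)
km <- kmeans(pmatrix, centers = 5, nstart = 100)   # illustrative fit
fviz_cluster(km, data = pmatrix,
             ellipse.type = "norm", repel = TRUE)  # elliptical regions
fviz_nbclust(pmatrix, kmeans, method = "silhouette")  # avg silhouette vs. k
```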
url_shows <- "https://bcdanl.github.io/data/nbc_show.csv"
shows <- read_csv(url_shows) |>
mutate(Genre = as.factor(Genre))
ggplot(shows, aes(x = GRP, y = PE, color = Genre)) +
geom_point(size = 2.5, alpha = 0.8) +
geom_smooth(method = "lm", se = FALSE, linewidth = 0.8) +
labs(title = "Gross Rating Points vs. Projected Engagement")

url_survey <- "https://bcdanl.github.io/data/nbc_survey.csv"
survey <- read_csv(url_survey)
# Average each question by show, then align row order to shows
pilot_avg <- survey |>
select(-Viewer) |>
group_by(Show) |>
summarise(across(everything(), mean)) |>
filter(Show %in% shows$Show) |>
arrange(match(Show, shows$Show)) |>
column_to_rownames("Show")

The scree plot shows, for each principal component \(j\) (on the \(x\)-axis), the fraction of the total variance in the pilot-survey data that the component captures (on the \(y\)-axis).
\[ Var(PC_{1}) \geq Var(PC_{2}) \geq Var(PC_{3}) \geq \cdots \]
Look for the "elbow": the point where the curve flattens and additional components add little extra variance.
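A minimal sketch of producing that scree plot, assuming the PCA fit is named princ_nbc (the name used in the projection code below) and is fit on the show-level averages:

```r
library(factoextra)
# Assumption: PCA on the scaled pilot-survey averages
princ_nbc <- prcomp(pilot_avg, scale. = TRUE)
fviz_eig(princ_nbc, addlabels = TRUE)  # % of variance per component
```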
Use the same three-step process for any PCA loading table: find the variables with the largest absolute loadings on each component, note which are positive and which are negative, and name the contrast they describe.
# Project onto the PCs (princ_nbc is the prcomp fit; X is the matrix it was fit on)
Z <- predict(princ_nbc, X)
zpilot_df <- as.data.frame(Z[, 1:2]) |>
mutate(Show = shows$Show,
Genre = shows$Genre,
PE = shows$PE,
GRP = shows$GRP,
PE_norm = (PE - min(PE)) / (max(PE) - min(PE)))
ggplot(zpilot_df, aes(x = PC1, y = PC2, color = Genre, size = PE_norm)) +
geom_point(alpha = 0.5) +
geom_text(aes(label = Show), size = 2.5, hjust = 0, vjust = 1,
show.legend = FALSE) +
scale_size_continuous(range = c(2, 10), name = "PE (normalized)") +
labs(title = "NBC Pilots in Survey PC Space")

max_K <- min(20, ncol(Z))
aic_pe <- numeric(max_K)
aic_grp <- numeric(max_K)
for (k in seq_len(max_K)) {
  # lm() adds the intercept automatically; regress on the first k PC scores
  aic_pe[k] <- AIC(lm(shows$PE ~ Z[, 1:k]))
  aic_grp[k] <- AIC(lm(shows$GRP ~ Z[, 1:k]))
}
cat("Lowest AIC for PE achieved with K =", which.min(aic_pe), "PCs\n")
cat("Lowest AIC for GRP achieved with K =", which.min(aic_grp), "PCs\n")

tibble(K = 1:max_K, PE = aic_pe, GRP = aic_grp) |>
pivot_longer(c(PE, GRP), names_to = "Outcome", values_to = "AIC") |>
ggplot(aes(x = K, y = AIC, color = Outcome)) +
geom_line() + geom_point() +
facet_wrap(~Outcome, ncol = 1, scales = "free") +
scale_x_continuous(breaks = 1:max_K) +
labs(title = "PCR Model Comparison: AIC vs. Number of Principal Components",
x = "Number of Principal Components (K)", y = "AIC")

Every clustering + PCA analysis follows the same pipeline. Follow this order:
# ── STEP 1: Load data ──────────────────────────────────────────────────────
df <- read_csv("YOUR_DATA.csv")
# ── STEP 2: Scale numeric variables ────────────────────────────────────────
vars <- colnames(df)[-1] # drop ID/label column
mat <- scale(df[, vars]) # mean=0, sd=1
rownames(mat) <- df$ID_column # label rows for dendrograms
# ── STEP 3a: Hierarchical clustering ───────────────────────────────────────
distmat <- dist(mat, method = "euclidean")
hfit <- hclust(distmat, method = "ward.D")
fviz_dend(hfit, k = 5, rect = TRUE) # choose k visually
# ── STEP 3b: Cut the tree to get cluster labels ────────────────────────────
groups_hc <- cutree(hfit, k = 5)
# ── STEP 4: Choose k for k-means (elbow + silhouette) ──────────────────────
fviz_nbclust(mat, kmeans, method = "wss")
fviz_nbclust(mat, kmeans, method = "silhouette")
kmeansruns(mat, krange = 1:10, criterion = "ch")$bestk
kmeansruns(mat, krange = 1:10, criterion = "asw")$bestk
# ── STEP 5: K-means ────────────────────────────────────────────────────────
km <- kmeans(mat, centers = 5, nstart = 100, iter.max = 100)
fviz_cluster(km, data = mat, geom = c("point", "text"), repel = TRUE)
# ── STEP 6: PCA ────────────────────────────────────────────────────────────
pc <- prcomp(mat)
fviz_eig(pc, addlabels = TRUE) # scree plot
t(round(pc$rotation, 2))[1:2, ] # loadings
# ── STEP 7: PCA projection coloured by cluster ─────────────────────────────
proj <- as.data.frame(predict(pc, mat)[, 1:2])
proj$cluster <- as.factor(groups_hc)
ggplot(proj, aes(x = PC1, y = PC2, color = cluster)) + geom_point()

Use this checklist every time you run a clustering + PCA analysis:
| Step | Action | Key function |
|---|---|---|
| 1 | Load data | read_csv() |
| 2 | Scale numeric variables | scale() |
| 3 | Build dendrogram | hclust() + fviz_dend() |
| 4 | Cut tree | cutree(fit, k = K) |
| 5 | Choose k (elbow) | fviz_nbclust(..., method="wss") |
| 6 | Choose k (silhouette) | fviz_nbclust(..., method="silhouette") |
| 7 | Run k-means | kmeans(mat, K, nstart=100) |
| 8 | Visualize clusters | fviz_cluster() |
| 9 | Run PCA | prcomp(mat) |
| 10 | Scree plot | fviz_eig() |
| 11 | Read loadings | t(round(pc$rotation, 2)) |
| 12 | Project + color by cluster | predict(pc, mat) + ggplot() |
A large positive loading on Fish and a large negative loading on RedMeat reads as a "fish-heavy vs. meat-heavy diet" contrast.
summary(prcomp_object)$importance shows cumulative proportions of variance explained.
What is fviz_cluster() actually plotting? The data projected onto the first two principal components, colored by cluster.
| Transaction | Items checked out |
|---|---|
| T1 | H, PB, Dune |
| T2 | H, PB, Foundation |
| T3 | H, PB, 1984 |
| T4 | H, Dune |
| T5 | PB, Foundation |
The rule "if X, then Y": every time you see the itemset X in a transaction, you expect to also see Y (with a given confidence).
support(X): the number of transactions that contain X divided by the total number of transactions in database T.
The confidence of the rule "if X, then Y": \[
\text{Confidence} = \frac{\texttt{support}(\{X, Y\})}{\texttt{support}(X)}
\]
The goal in association rule mining is to find all the interesting rules with at least a given minimum support (say, 10%) and a minimum given confidence (say, 60%).
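Using the five-transaction table above, support and confidence can be computed by hand; a small self-contained sketch:

```r
# The five transactions from the table above
trans <- list(
  T1 = c("H", "PB", "Dune"),
  T2 = c("H", "PB", "Foundation"),
  T3 = c("H", "PB", "1984"),
  T4 = c("H", "Dune"),
  T5 = c("PB", "Foundation")
)
# Fraction of transactions containing all items in the itemset
support <- function(items) {
  mean(sapply(trans, function(t) all(items %in% t)))
}
support(c("H", "PB"))                 # 3 of 5 transactions contain both: 0.6
support(c("H", "PB")) / support("H")  # confidence of "if H, then PB": 0.6 / 0.8 = 0.75
```

Both numbers clear the example thresholds above (support at least 10%, confidence at least 60%), so "if H, then PB" would count as an interesting rule.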
read.transactions() reads data in two formats:
format = "single": every row corresponds to a single item (like bookdata.tsv.gz).
format = "basket": each row corresponds to a single transaction, possibly with a transaction ID.
rm.duplicates = TRUE eliminates duplicates due to different editions of a book.
The result is a transactions object. Think of a transactions object as a 0/1 matrix, with one row for every transaction (a customer) and one column for every possible item (a book); entry \((i, j)\) is 1 if the \(i\)-th transaction contains item \(j\).
itemFrequency() tells you how often each book shows up in the transaction data.
Use apriori() to find the association rules. The quality measures on the rules include a rule's support, confidence, the support count, and a quantity called lift.
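A hedged sketch of this workflow with the arules package; the separator, column names, and thresholds here are assumptions, not from the source:

```r
library(arules)
# format = "single": one (transaction id, item) pair per row
bookbaskets <- read.transactions("bookdata.tsv.gz",
  format = "single", sep = "\t", header = TRUE,
  cols = c("userid", "title"),   # assumed column names
  rm.duplicates = TRUE)
itemFrequency(bookbaskets)[1:3]  # how often each book shows up
rules <- apriori(bookbaskets,
  parameter = list(support = 0.002, confidence = 0.6))  # assumed thresholds
```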
Lift compares the frequency of an observed pattern with how often you'd expect to see that pattern just by chance: \[ \text{lift} = \frac{\texttt{support}(\{X, Y\})}{\texttt{support}(X) \times \texttt{support}(Y)} \]
If the lift is near 1, the pattern is likely occurring just by chance. The larger the lift, the more likely the pattern is real.
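Applied to the five-transaction table above (a self-contained sketch):

```r
# The five transactions from the table above
trans <- list(
  c("H", "PB", "Dune"), c("H", "PB", "Foundation"),
  c("H", "PB", "1984"), c("H", "Dune"), c("PB", "Foundation")
)
support <- function(items) mean(sapply(trans, function(t) all(items %in% t)))
# lift("if H, then PB") = support({H, PB}) / (support(H) * support(PB))
support(c("H", "PB")) / (support("H") * support("PB"))  # 0.6 / (0.8 * 0.8) = 0.9375
```

A lift just below 1 says H and PB co-occur about as often as chance predicts in this tiny example, even though the rule's confidence is high.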
The coverage of a rule is the support of its left side, support(X); it tells you how often the rule would be applied in the dataset.
In a grouped matrix plot of the rules:
Rows: right-hand side items (consequents).
Columns: left-hand side item groups (antecedents), clustered by similarity.
Bubble size: support.
Bubble color: lift, useful for spotting which LHS groups most strongly predict each RHS item.
inspect() pretty-prints the rules.
The most confident rules typically concern series books: readers who pick up one book in a series are very likely to pick up another.
You can restrict which items may appear on each side of a rule with the appearance parameter. By default, all the books can go into the left side of the rules.
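A sketch of restricting the right-hand side via appearance; the book title is a placeholder, and bookbaskets is assumed to be the transactions object from the earlier mining step:

```r
rules_one <- apriori(bookbaskets,
  parameter = list(support = 0.001, confidence = 0.6),   # assumed thresholds
  appearance = list(rhs = "Some Book Title",  # placeholder consequent
                    default = "lhs"))         # all other books may appear on the left
inspect(head(sort(rules_one, by = "confidence"), 5))
```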