Lecture 8

Unsupervised Learning: Association Rules

Byeong-Hak Choe

bchoe@geneseo.edu

SUNY Geneseo

April 27, 2026

🛒 Association Rule

🛒 Association Rule — Overview

Association rule mining is used to find objects or attributes that frequently occur together.
- Products that are often bought together during a shopping session;
- Queries that tend to occur together during a session on a website’s search engine.
Such information can be used to recommend products to shoppers, to place frequently bundled items together on store shelves, or to redesign websites for easier navigation.

🧺 Association Rule — Transactions & Itemsets

The unit of “togetherness” when mining association rules is called a transaction.
- A single shopping basket;
- A single user session on a website;
- Or even a single customer.
The objects that comprise a transaction are referred to as items in an itemset:
- The products in the shopping basket;
- The pages visited during a website session;
- The actions of a customer.
Sometimes transactions are referred to as baskets, from the shopping basket analogy.

📚 Association Rule — Library Example

Example

Suppose you work in a library.
You want to know which books tend to be checked out together, to help you make predictions about book availability.
When a library patron checks out a set of books, that’s a transaction.
The books that the patron checks out make up the itemset that comprises the transaction.

🔎 Association Rule — Mining Steps

Transaction ID	Books checked out
1	The Hobbit, The Princess Bride
2	The Princess Bride, The Last Unicorn
3	The Hobbit
4	The Neverending Story
5	The Last Unicorn
6	The Hobbit, The Princess Bride, The Fellowship of the Ring
7	The Hobbit, The Fellowship of the Ring, The Two Towers, The Return of the King
8	The Fellowship of the Ring, The Two Towers, The Return of the King
9	The Hobbit, The Princess Bride, The Last Unicorn
10	The Last Unicorn, The Neverending Story

Mining for association rules occurs in two steps:
1. Look for all the itemsets (subsets of transactions) that occur in at least a minimum fraction of the transactions.
2. Turn those itemsets into rules.
Let’s look at the transactions that involve The Hobbit (H) and The Princess Bride (PB).

📏 Association Rule — Support

The Hobbit is in 5/10, or 50% of all transactions.
The Princess Bride is in 4/10, or 40% of all transactions.
Both books are checked out together in 3/10, or 30% of all transactions.
- We’d say the support of the itemset {The Hobbit, The Princess Bride} is 30%.
- Of the five transactions that include The Hobbit, three (3/5 = 60%) also include The Princess Bride.
- Of the four transactions that include The Princess Bride, three (3/4 = 75%) also include The Hobbit.
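As a rough check of the fractions above, the ten transactions from the table can be typed into base R and counted directly. This is only a sketch: `transactions` and the helper `support_of()` are names made up here for illustration and do not come from any package.

```r
# The ten library transactions from the table, one character vector per checkout
transactions <- list(
  c("The Hobbit", "The Princess Bride"),
  c("The Princess Bride", "The Last Unicorn"),
  c("The Hobbit"),
  c("The Neverending Story"),
  c("The Last Unicorn"),
  c("The Hobbit", "The Princess Bride", "The Fellowship of the Ring"),
  c("The Hobbit", "The Fellowship of the Ring", "The Two Towers", "The Return of the King"),
  c("The Fellowship of the Ring", "The Two Towers", "The Return of the King"),
  c("The Hobbit", "The Princess Bride", "The Last Unicorn"),
  c("The Last Unicorn", "The Neverending Story")
)

# Hypothetical helper: fraction of baskets that contain every item in `items`
support_of <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

support_h    <- support_of("The Hobbit", transactions)
support_pb   <- support_of("The Princess Bride", transactions)
support_both <- support_of(c("The Hobbit", "The Princess Bride"), transactions)

c(H = support_h, PB = support_pb, both = support_both)

# Share of Hobbit baskets that also contain The Princess Bride, and vice versa
support_both / support_h
support_both / support_pb
```

The last two ratios reproduce the 60% and 75% figures from the list above.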

💬 Association Rule — Making Rules

Example

We can make a rule: “People who check out The Hobbit also check out The Princess Bride.”
- This rule should be correct 60% of the time.
- We’d say that the confidence of the rule is 60%.
We can make another rule: “People who check out The Princess Bride also check out The Hobbit.”
- This rule should be correct 75% of the time.
- We’d say that the confidence of this rule is 75%.

🧮 Rules, Support, and Confidence

The rule “if X, then Y”: every time you see the item X in a transaction, you expect to also see the item Y (with a given confidence).
support(X): the number of transactions that contain X divided by the total number of transactions in database T.
The confidence of the rule “if X, then Y”: \[ \text{Confidence}(X \Rightarrow Y) = \frac{\texttt{support}(\{X, Y\})}{\texttt{support}(X)} \]
The goal in association rule mining is to find all the interesting rules with at least a given minimum support (say, 10%) and a given minimum confidence (say, 60%).
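To make the two thresholds concrete, here is a brute-force sketch in base R that applies them to the ten library transactions. It is not the Apriori algorithm, and it only enumerates pairs of books; `count_of()`, `books`, `pairs`, and `rules` are illustrative names chosen here.

```r
transactions <- list(
  c("The Hobbit", "The Princess Bride"),
  c("The Princess Bride", "The Last Unicorn"),
  c("The Hobbit"),
  c("The Neverending Story"),
  c("The Last Unicorn"),
  c("The Hobbit", "The Princess Bride", "The Fellowship of the Ring"),
  c("The Hobbit", "The Fellowship of the Ring", "The Two Towers", "The Return of the King"),
  c("The Fellowship of the Ring", "The Two Towers", "The Return of the King"),
  c("The Hobbit", "The Princess Bride", "The Last Unicorn"),
  c("The Last Unicorn", "The Neverending Story")
)
n <- length(transactions)

# Hypothetical helper: number of baskets that contain every item in `items`
count_of <- function(items, baskets) {
  sum(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

min_support    <- 0.10  # thresholds quoted in the text
min_confidence <- 0.60

# Step 1: frequent itemsets (only pairs are enumerated, to keep the sketch short)
books       <- sort(unique(unlist(transactions)))
pairs       <- combn(books, 2, simplify = FALSE)
pair_counts <- vapply(pairs, count_of, integer(1), baskets = transactions)
frequent    <- pair_counts / n >= min_support

# Step 2: each frequent pair {A, B} gives two candidate rules, A => B and B => A
rules <- do.call(rbind, Map(function(p, k) {
  data.frame(lhs = p, rhs = rev(p), support = k / n,
             confidence = k / c(count_of(p[1], transactions),
                                count_of(p[2], transactions)))
}, pairs[frequent], pair_counts[frequent]))

# Keep only the rules that clear the confidence threshold
rules <- rules[rules$confidence >= min_confidence, ]
rules[order(-rules$confidence), ]
```

Enumerating every itemset this way becomes infeasible as the number of items grows, which is why dedicated algorithms are used on real data.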

🏪 Bookstore Example

Suppose you work for a bookstore.
You want to recommend books that a customer might be interested in, based on all of their previous purchases and book interests.
You also want to use historical book interest information to develop some recommendation rules.

📥 Reading in the Book Data

library(arules)

tmp <- tempfile(fileext = ".tsv.gz")
download.file("https://bcdanl.github.io/data/bookdata.tsv.gz",
              destfile = tmp, mode = "wb", quiet = TRUE)

bookbaskets <- read.transactions(
  tmp,
  format        = "single",
  header        = TRUE,
  sep           = "\t",
  cols          = c("userid", "title"),
  rm.duplicates = TRUE
)

read.transactions() reads transaction data in two common formats:
1. format = "single": each row corresponds to one item in one transaction, usually with a transaction ID and an item name.
2. format = "basket": each row corresponds to one transaction, with multiple items listed in that row.
rm.duplicates = TRUE removes duplicate items within the same transaction, so each item appears at most once in a given basket.
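The difference between the two layouts can be illustrated without any package. The sketch below uses a small made-up data frame (`single_df`, with hypothetical user IDs) in the "single" layout, groups it into baskets, and then writes the same baskets as "basket"-style lines. It mimics the idea behind the two formats and is not how read.transactions() is implemented.

```r
# Hypothetical "single"-layout data: one row per (transaction, item) pair
single_df <- data.frame(
  userid = c("u1", "u1", "u2", "u3", "u3", "u3"),
  title  = c("The Hobbit", "The Princess Bride", "The Hobbit",
             "The Hobbit", "The Hobbit", "The Last Unicorn")
)

# Group titles by transaction ID; unique() plays the role of rm.duplicates = TRUE
baskets <- lapply(split(single_df$title, single_df$userid), unique)
baskets

# The same data in "basket" layout: one comma-separated line per transaction
basket_lines <- vapply(baskets, paste, character(1), collapse = ",")
basket_lines
```

Note that user u3 listed The Hobbit twice, but it appears only once in the resulting basket.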

🔬 Examining the Transaction Data

class(bookbaskets)
dim(bookbaskets)

colnames(bookbaskets)[1:5]
rownames(bookbaskets)[1:5]

Transactions are represented as a special object called transactions.
You can think of a transactions object as a 0/1 matrix, with one row for every transaction (a customer) and one column for every possible item (a book).
The matrix entry \((i, j)\) is 1 if the \(i\)-th transaction contains item \(j\).
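The 0/1 matrix view can be sketched by hand in base R for three small baskets. The names `baskets`, `items`, and `incidence` are illustrative; a transactions object holds this information more compactly, so treat this as a mental model, not a description of its internals.

```r
# Three small baskets, named by a hypothetical transaction ID
baskets <- list(
  t1 = c("The Hobbit", "The Princess Bride"),
  t2 = c("The Princess Bride", "The Last Unicorn"),
  t3 = c("The Hobbit")
)
items <- sort(unique(unlist(baskets)))

# One row per transaction, one column per item;
# entry (i, j) is 1 when transaction i contains item j
incidence <- t(vapply(baskets,
                      function(b) as.integer(items %in% b),
                      integer(length(items))))
colnames(incidence) <- items
incidence

rowSums(incidence)  # basket sizes
colSums(incidence)  # how many baskets contain each item
```

Row sums and column sums of this matrix correspond to basket sizes and item counts, which is the kind of information size() and itemFrequency() report for a transactions object.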

📦 Examining the Size Distribution

library(ggplot2)

basketSizes <- size(bookbaskets)
summary(basketSizes)

quantile(basketSizes, probs = seq(0, 1, 0.1))

basketSizes_df <- data.frame(count = basketSizes)

# Histogram of basket sizes; the x-axis is log-scaled because a few baskets are very large
ggplot(basketSizes_df) +
  # geom_density(aes(x = count)) +  # alternative: density curve (relabel y as "Density")
  geom_histogram(aes(x = count)) +
  scale_x_log10() +
  labs(title = "Distribution of Basket Sizes (log scale)", x = "Basket size", y = "Number of customers")

55% of customers expressed interest in only one book.
A few people have expressed interest in several hundred, or even several thousand books.

📚 Counting How Often Each Book Occurs

library(dplyr)

# Number of transactions in which each book appears
bookCount    <- itemFrequency(bookbaskets, type = "absolute")

bookCount_df <- data.frame(
  item = names(bookCount),
  n = bookCount,
  row.names = NULL
)

bookCount_df <- bookCount_df |> 
  arrange(desc(n)) |> # most popular books first
  # Fraction of transactions that contain each book
  mutate(pct = n / nrow(bookbaskets))

itemFrequency() tells you how often each book shows up in the transaction data.

⛏️ Finding the Association Rules

# Restrict to customers who expressed interest in at least 2 books
bookbaskets_use <- bookbaskets[size(bookbaskets) > 1]
dim(bookbaskets_use)

rules <- apriori(
  bookbaskets_use,
  parameter = list(support = 0.002, confidence = 0.75)
)

You may want to restrict the dataset to customers who have expressed interest in at least two books.
To mine rules, you need to decide on a minimum support level and a minimum confidence level.
Use apriori() to find the association rules.

🪁 Rule Quality — Lift

summary(rules)

The quality measures on the rules include a rule’s support, confidence, the support count, and a quantity called lift.
Lift compares the frequency of an observed pattern with how often you’d expect to see that pattern just by chance: \[ \text{lift} = \frac{\texttt{support}(\{X, Y\})}{\texttt{support}(X) \times \texttt{support}(Y)} \]
Lift less than 1: X and Y occur together less often than expected.
Lift close to 1: X and Y occur together about as often as expected by chance.
Lift greater than 1: X and Y occur together more often than expected.
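The larger the lift, the less likely it is that the pattern arose by chance, provided the rule also has enough support; a rule built on only a handful of transactions can show a large lift by accident.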
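As a worked check, take the library example from earlier in the lecture, where The Hobbit (H) has support 50%, The Princess Bride (PB) has support 40%, and the two together have support 30%:

\[ \text{lift}(H \Rightarrow PB) = \frac{\texttt{support}(\{H, PB\})}{\texttt{support}(H) \times \texttt{support}(PB)} = \frac{0.30}{0.50 \times 0.40} = 1.5 \]

The two books are checked out together about 1.5 times as often as they would be if the checkouts were independent. Because the formula is symmetric in X and Y, the rule PB ⇒ H has the same lift, even though its confidence differs.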

🧪 Scoring Rules

measures <- interestMeasure(
  rules,
  measure      = c("coverage", "fishersExactTest"),
  transactions = bookbaskets_use
)
summary(measures)

Coverage: the support of the left side of the rule (X); tells you how often the rule would be applied in the dataset.
Fisher’s exact test: a significance test for whether X and Y co-occur more often than chance alone would explain.
- Returns the p-value: the probability of seeing an association at least this strong if X and Y were actually independent; you want the p-value to be small.
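For intuition, the same kind of test can be run by hand with base R's fisher.test() on the earlier library example (The Hobbit and The Princess Bride across ten checkouts). The 2x2 counts below are read off that table. The one-sided alternative is an assumption made for this sketch; whether interestMeasure() uses exactly this configuration is not verified here.

```r
# 2x2 contingency table of checkouts from the ten library transactions:
# rows = The Hobbit present / absent, columns = The Princess Bride present / absent
tab <- matrix(c(3, 1,   # first column:  PB present (with H, without H)
                2, 4),  # second column: PB absent  (with H, without H)
              nrow = 2,
              dimnames = list(Hobbit = c("yes", "no"),
                              PrincessBride = c("yes", "no")))
tab

# One-sided test (assumption): do the two books co-occur MORE often than chance?
result <- fisher.test(tab, alternative = "greater")
result$p.value
```

With so few transactions the p-value is far from small, so this tiny dataset gives no real evidence that the pattern is more than chance, even though the rule's lift is above 1.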

🥇 Top 5 Most Confident Rules

rules |>
  sort(by = "confidence") |>
  head(n = 5) |>
  inspect()

inspect() pretty-prints the rules.

rules |>
  sort(by = "confidence") |>
  head(n = 5) |>
  as("data.frame")

as("data.frame") converts the rules into a data frame.

The most confident rules typically concern series books — readers who pick up one book in a series are very likely to pick up another.

🔵 Visualizing Rules — Scatter Plot

library(arulesViz)

# Scatter plot: support vs. confidence, shaded by lift
plot(rules,
     method  = "scatterplot",
     measure = c("support", "confidence"),
     shading = "lift",
     engine  = "ggplot2")

Each point is one rule.
x-axis: support — how often the rule applies.
y-axis: confidence — how often the rule is correct.
Color (lift): darker = stronger association beyond chance.
Rules in the top-right corner are both frequent and reliable.

🕸️ Visualizing Rules — Graph

# Graph plot: items and rules as nodes, joined by arrows
# Subset to top 20 rules by lift for readability
rules |>
  sort(by = "lift") |>
  head(n = 20) |>
  plot(method = "graph",
       engine = "ggplot2")

Labeled nodes represent individual books (items); each rule is drawn as a separate circle.
Arrows run from the left-hand side items into a rule's circle and from the circle to its right-hand side item.
With the default settings, circle size reflects support and circle shade reflects lift.
Tightly connected clusters reveal books that frequently co-occur.

🔲 Visualizing Rules — Grouped Matrix

# Grouped matrix: LHS groups vs. RHS items
plot(rules,
     method  = "grouped matrix",
     engine  = "ggplot2") +
  coord_flip()

Rows: left-hand side (LHS) item groups (antecedents), clustered by similarity.
Columns: right-hand side (RHS) items (consequents).
Bubble size: support.
Bubble color: lift — useful for spotting which LHS groups most strongly predict each RHS item.

🎛️ Finding Rules with Restrictions

brules <- apriori(
  bookbaskets_use,
  parameter  = list(support = 0.001, 
                    confidence = 0.6),
  appearance = list(rhs = c("The Lovely Bones: A Novel"),
                    default = "lhs")
)

summary(brules)

brules |>
  sort(by = "confidence") |>
  lhs() |>
  head(n = 5) |>
  inspect()

You can restrict which items appear on the left side or right side of a rule.
Here we find books that tend to co-occur with The Lovely Bones by using the appearance parameter to allow only that title on the right side of the rule.
Setting default = "lhs" lets every other book appear on the left side of the rules.