Lecture 8

Unsupervised Learning: Association Rules

Byeong-Hak Choe

SUNY Geneseo

April 27, 2026

🛒 Association Rule

🛒 Association Rule — Overview

  • Association rule mining finds objects or attributes that frequently occur together, for example:
    • Products that are often bought together during a shopping session;
    • Queries that tend to occur together during a session on a website’s search engine.
  • Such information can be used to recommend products to shoppers, to place frequently bundled items together on store shelves, or to redesign websites for easier navigation.

🧺 Association Rule — Transactions & Itemsets

  • The unit of “togetherness” when mining association rules is called a transaction.
    • A single shopping basket;
    • A single user session on a website;
    • Or even a single customer.
  • The objects that make up a transaction are referred to as items, and together they form an itemset:
    • The products in the shopping basket;
    • The pages visited during a website session;
    • The actions of a customer.
  • Sometimes transactions are referred to as baskets, from the shopping basket analogy.

📚 Association Rule — Library Example

Example

  • Suppose you work in a library.
  • You want to know which books tend to be checked out together, to help you make predictions about book availability.
  • When a library patron checks out a set of books, that’s a transaction.
  • The books that the patron checks out make up the itemset that comprises the transaction.

🔎 Association Rule — Mining Steps

Transaction ID   Books checked out
1                The Hobbit, The Princess Bride
2                The Princess Bride, The Last Unicorn
3                The Hobbit
4                The Neverending Story
5                The Last Unicorn
6                The Hobbit, The Princess Bride, The Fellowship of the Ring
7                The Hobbit, The Fellowship of the Ring, The Two Towers, The Return of the King
8                The Fellowship of the Ring, The Two Towers, The Return of the King
9                The Hobbit, The Princess Bride, The Last Unicorn
10               The Last Unicorn, The Neverending Story

  • Mining for association rules occurs in two steps:
    1. Find all the itemsets (subsets of transactions) that occur in at least a minimum fraction of the transactions.
    2. Turn those itemsets into rules.
  • Let’s look at the transactions that involve The Hobbit (H) and The Princess Bride (PB).

📏 Association Rule — Support

  • The Hobbit is in 5/10, or 50% of all transactions.
  • The Princess Bride is in 4/10, or 40% of all transactions.
  • Both books are checked out together in 3/10, or 30% of all transactions.
    • We’d say the support of the itemset {The Hobbit, The Princess Bride} is 30%.
    • Of the five transactions that include The Hobbit, three (3/5 = 60%) also include The Princess Bride.
    • Of the four transactions that include The Princess Bride, three (3/4 = 75%) also include The Hobbit.
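
The support figures above, and step 1 of the mining process, can be reproduced with a few lines of base R. This is only a sketch: the short book codes, the `support()` helper, and the 25% threshold are assumptions made for illustration and are not part of the lecture's dataset or of any package.

```r
# Library transactions from the table, using short codes (labels assumed here):
# H = The Hobbit, PB = The Princess Bride, LU = The Last Unicorn,
# NS = The Neverending Story, FR = The Fellowship of the Ring,
# TT = The Two Towers, RK = The Return of the King
txns <- list(
  c("H", "PB"), c("PB", "LU"), "H", "NS", "LU",
  c("H", "PB", "FR"), c("H", "FR", "TT", "RK"), c("FR", "TT", "RK"),
  c("H", "PB", "LU"), c("LU", "NS")
)

# Hypothetical helper: fraction of transactions that contain every item in `items`
support <- function(items, txns) {
  mean(vapply(txns, function(basket) all(items %in% basket), logical(1)))
}

support("H", txns)           # The Hobbit
support("PB", txns)          # The Princess Bride
support(c("H", "PB"), txns)  # both books together

# Step 1 of mining: keep itemsets whose support reaches a minimum threshold
min_support <- 0.25  # illustrative value only

all_items      <- sort(unique(unlist(txns)))
item_support   <- vapply(all_items, support, numeric(1), txns = txns)
frequent_items <- names(item_support)[item_support >= min_support]

# A pair can only be frequent if both of its items are frequent,
# so candidate pairs are built from the frequent single items
pairs          <- combn(frequent_items, 2, simplify = FALSE)
pair_support   <- vapply(pairs, support, numeric(1), txns = txns)
frequent_pairs <- pairs[pair_support >= min_support]
```

With this threshold, the only pair that survives is {H, PB}, which is why the lecture focuses on those two books.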

💬 Association Rule — Making Rules

Example

  • We can make a rule: “People who check out The Hobbit also check out The Princess Bride.”
    • This rule should be correct 60% of the time.
    • We’d say that the confidence of the rule is 60%.
  • We can make another rule: “People who check out The Princess Bride also check out The Hobbit.”
    • This rule should be correct 75% of the time.
    • We’d say that the confidence of this rule is 75%.
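
The two confidence values can be checked directly from the transaction table. The base R sketch below assumes the same short book codes as labels (H for The Hobbit, PB for The Princess Bride, and so on); the variable names are illustrative.

```r
# Library transactions, using short codes (labels assumed here):
# H = The Hobbit, PB = The Princess Bride, LU = The Last Unicorn,
# NS = The Neverending Story, FR = The Fellowship of the Ring,
# TT = The Two Towers, RK = The Return of the King
txns <- list(
  c("H", "PB"), c("PB", "LU"), "H", "NS", "LU",
  c("H", "PB", "FR"), c("H", "FR", "TT", "RK"), c("FR", "TT", "RK"),
  c("H", "PB", "LU"), c("LU", "NS")
)

# Logical vectors marking which transactions contain each book
has_H  <- vapply(txns, function(basket) "H"  %in% basket, logical(1))
has_PB <- vapply(txns, function(basket) "PB" %in% basket, logical(1))

n_both <- sum(has_H & has_PB)  # transactions containing both books

# Confidence of "if H, then PB": share of H baskets that also hold PB
conf_H_to_PB <- n_both / sum(has_H)

# Confidence of "if PB, then H": share of PB baskets that also hold H
conf_PB_to_H <- n_both / sum(has_PB)

c(H_to_PB = conf_H_to_PB, PB_to_H = conf_PB_to_H)
```

The same itemset yields two rules with different confidence because the denominators differ: five baskets contain The Hobbit, but only four contain The Princess Bride.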

🧮 Rules, Support, and Confidence

  • The rule “if X, then Y” means: every time you see the item X in a transaction, you expect to also see the item Y (with a given confidence).

  • support(X): the number of transactions that contain X divided by the total number of transactions in database T.

  • The confidence of the rule “if X, then Y”: \[ \text{Confidence}(X \Rightarrow Y) = \frac{\texttt{support}(\{X, Y\})}{\texttt{support}(X)} \]

  • The goal in association rule mining is to find all the interesting rules that have at least a given minimum support (say, 10%) and at least a given minimum confidence (say, 60%).

🏪 Bookstore Example

  • Suppose you work for a bookstore.
  • You want to recommend books that a customer might be interested in, based on all of their previous purchases and book interests.
  • You also want to use historical book interest information to develop some recommendation rules.

📥 Reading in the Book Data

library(arules)

tmp <- tempfile(fileext = ".tsv.gz")
download.file("https://bcdanl.github.io/data/bookdata.tsv.gz",
              destfile = tmp, mode = "wb", quiet = TRUE)

bookbaskets <- read.transactions(
  tmp,
  format        = "single",
  header        = TRUE,
  sep           = "\t",
  cols          = c("userid", "title"),
  rm.duplicates = TRUE
)
  • read.transactions() reads transaction data in two common formats:
    1. format = "single": each row corresponds to one item in one transaction, usually with a transaction ID and an item name.
    2. format = "basket": each row corresponds to one transaction, with multiple items listed in that row.
  • rm.duplicates = TRUE removes duplicate items within the same transaction, so each item appears at most once in a given basket.
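
To make the two formats concrete, the sketch below writes the same three baskets to two temporary files, one per format, and reads each back with read.transactions(). The file contents, the column names `id` and `item`, and the short item codes are made up for illustration.

```r
library(arules)

# "single" format: one (transaction ID, item) pair per row, with a header line
single_file <- tempfile(fileext = ".csv")
writeLines(c("id,item", "1,H", "1,PB", "2,PB", "2,LU", "3,H"), single_file)

tr_single <- read.transactions(
  single_file,
  format = "single",
  header = TRUE,
  sep    = ",",
  cols   = c("id", "item")  # transaction ID column, then item column
)

# "basket" format: one transaction per row, items separated by commas
basket_file <- tempfile(fileext = ".csv")
writeLines(c("H,PB", "PB,LU", "H"), basket_file)

tr_basket <- read.transactions(basket_file, format = "basket", sep = ",")

# Both objects should describe the same three baskets
size(tr_single)
size(tr_basket)
```

The book data in this lecture comes as one row per (user, title) pair, which is why the call above uses format = "single".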

🔬 Examining the Transaction Data

class(bookbaskets)
dim(bookbaskets)

colnames(bookbaskets)[1:5]
rownames(bookbaskets)[1:5]
  • Transactions are represented as a special object called transactions.
  • You can think of a transactions object as a 0/1 matrix, with one row for every transaction (a customer) and one column for every possible item (a book).
  • The matrix entry \((i, j)\) is 1 if the \(i\)-th transaction contains item \(j\).
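
The 0/1 matrix view is easiest to see on a small example. The sketch below builds a transactions object from the ten library checkouts used earlier (with assumed short codes as item labels) and coerces it to an ordinary matrix; the real book data is far too large to print this way.

```r
library(arules)

# Library transactions, using short codes (labels assumed here):
# H = The Hobbit, PB = The Princess Bride, LU = The Last Unicorn,
# NS = The Neverending Story, FR = The Fellowship of the Ring,
# TT = The Two Towers, RK = The Return of the King
txns <- list(
  c("H", "PB"), c("PB", "LU"), "H", "NS", "LU",
  c("H", "PB", "FR"), c("H", "FR", "TT", "RK"), c("FR", "TT", "RK"),
  c("H", "PB", "LU"), c("LU", "NS")
)
trans <- as(txns, "transactions")

# Coerce to a logical matrix, then multiply by 1 to show it as 0/1
# (rows = transactions, columns = items)
m <- as(trans, "matrix") * 1

dim(m)                # 10 transactions by 7 distinct books
m[1:3, c("H", "PB")]  # first three checkouts, two of the books
colSums(m)            # how many transactions contain each book
```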

📦 Examining the Size Distribution

basketSizes <- size(bookbaskets)
summary(basketSizes)

quantile(basketSizes, probs = seq(0, 1, 0.1))

library(ggplot2)

basketSizes_df <- data.frame(count = basketSizes)

# Histogram of basket sizes; the x-axis is log-scaled because sizes are heavily right-skewed
ggplot(basketSizes_df) +
  geom_histogram(aes(x = count)) +
  scale_x_log10() +
  labs(title = "Distribution of Basket Sizes (log scale)", x = "Basket size", y = "Number of customers")
  • 55% of customers expressed interest in only one book.
  • A few people have expressed interest in several hundred, or even several thousand books.

📚 Counting How Often Each Book Occurs

library(dplyr)

# Number of transactions (customers) that contain each book
bookCount <- itemFrequency(bookbaskets, type = "absolute")

bookCount_df <- data.frame(
  item = names(bookCount),
  n = bookCount,
  row.names = NULL
)

bookCount_df <- bookCount_df |>
  arrange(desc(n)) |>  # most popular books first
  # Fraction of all transactions that contain each book
  mutate(pct = n / nrow(bookbaskets))
  • itemFrequency() tells you how often each book shows up in the transaction data.

⛏️ Finding the Association Rules

# Restrict to customers who expressed interest in at least 2 books
bookbaskets_use <- bookbaskets[size(bookbaskets) > 1]
dim(bookbaskets_use)

rules <- apriori(
  bookbaskets_use,
  parameter = list(support = 0.002, confidence = 0.75)
)
  • You may want to restrict the dataset to customers who have expressed interest in at least two books.
  • To mine rules, you need to decide on a minimum support level and a minimum confidence level.
  • Use apriori() to find the association rules.
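
Running apriori() on a dataset small enough to check by hand can help build intuition for the two thresholds. The sketch below uses the ten library checkouts from earlier with assumed short codes as item labels. The thresholds are illustrative, and `minlen = 2` is an addition (not used in the lecture's call) that drops rules with an empty left-hand side.

```r
library(arules)

# Library transactions, using short codes (labels assumed here):
# H = The Hobbit, PB = The Princess Bride, LU = The Last Unicorn,
# NS = The Neverending Story, FR = The Fellowship of the Ring,
# TT = The Two Towers, RK = The Return of the King
txns <- list(
  c("H", "PB"), c("PB", "LU"), "H", "NS", "LU",
  c("H", "PB", "FR"), c("H", "FR", "TT", "RK"), c("FR", "TT", "RK"),
  c("H", "PB", "LU"), c("LU", "NS")
)
trans <- as(txns, "transactions")

toy_rules <- apriori(
  trans,
  parameter = list(support = 0.25, confidence = 0.5, minlen = 2),
  control   = list(verbose = FALSE)  # suppress the progress log
)

inspect(toy_rules)
```

Only {H, PB} appears in at least 25% of the checkouts, so the result should contain just the two rules discussed earlier, with confidence 0.6 and 0.75.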

🪁 Rule Quality — Lift

summary(rules)
  • The quality measures on the rules include a rule’s support, confidence, the support count, and a quantity called lift.

  • Lift compares the frequency of an observed pattern with how often you’d expect to see that pattern just by chance: \[ \text{lift} = \frac{\texttt{support}(\{X, Y\})}{\texttt{support}(X) \times \texttt{support}(Y)} \]

  • Lift less than 1: X and Y occur together less often than expected.

  • Lift close to 1: X and Y occur together about as often as expected by chance.

  • Lift greater than 1: X and Y occur together more often than expected.

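  • The larger the lift, the stronger the evidence that the pattern reflects a real association rather than chance, although lift can be inflated for rules with very low support.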
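
As a worked example, the library data from earlier gives support 0.30 for {The Hobbit, The Princess Bride}, 0.50 for The Hobbit (H), and 0.40 for The Princess Bride (PB), so: \[ \text{lift}(H \Rightarrow PB) = \frac{\texttt{support}(\{H, PB\})}{\texttt{support}(H) \times \texttt{support}(PB)} = \frac{0.30}{0.50 \times 0.40} = 1.5 \]

The two books are checked out together 1.5 times as often as would be expected if the checkouts were independent. Because the formula is symmetric in X and Y, the rule PB ⇒ H has the same lift.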

🧪 Scoring Rules

measures <- interestMeasure(
  rules,
  measure      = c("coverage", "fishersExactTest"),
  transactions = bookbaskets_use
)
summary(measures)
  • Coverage: the support of the left side of the rule (X); tells you how often the rule would be applied in the dataset.
  • Fisher’s exact test: a significance test for whether an observed pattern is real or due to chance.
    • Returns the p-value: the probability of seeing a pattern at least this strong if X and Y were actually independent; you want the p-value to be small.
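
The idea behind the test can be illustrated with base R's fisher.test() on the library example, by cross-tabulating which checkouts contain The Hobbit and which contain The Princess Bride. This is a sketch under assumptions: the short codes are made-up labels, and the one-sided alternative is chosen here to test for a positive association, which is assumed (not confirmed by the lecture) to be comparable to what interestMeasure() reports.

```r
# Library transactions, using short codes (labels assumed here):
# H = The Hobbit, PB = The Princess Bride, LU = The Last Unicorn,
# NS = The Neverending Story, FR = The Fellowship of the Ring,
# TT = The Two Towers, RK = The Return of the King
txns <- list(
  c("H", "PB"), c("PB", "LU"), "H", "NS", "LU",
  c("H", "PB", "FR"), c("H", "FR", "TT", "RK"), c("FR", "TT", "RK"),
  c("H", "PB", "LU"), c("LU", "NS")
)

has_H  <- vapply(txns, function(basket) "H"  %in% basket, logical(1))
has_PB <- vapply(txns, function(basket) "PB" %in% basket, logical(1))

# 2 x 2 contingency table: H present/absent by PB present/absent
tab <- table(H = has_H, PB = has_PB)
tab

# One-sided test: are H and PB checked out together more often than chance?
p_value <- fisher.test(tab, alternative = "greater")$p.value
p_value  # about 0.26
```

Even though the lift is above 1, ten transactions are too few for the pattern to be statistically convincing, which is one reason to look at a significance measure alongside lift.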

🥇 Top 5 Most Confident Rules

rules |>
  sort(by = "confidence") |>
  head(n = 5) |>
  inspect()
  • inspect() pretty-prints the rules.
rules |>
  sort(by = "confidence") |>
  head(n = 5) |>
  as("data.frame")
  • as("data.frame") converts the rules into a data frame.
  • The most confident rules typically concern series books — readers who pick up one book in a series are very likely to pick up another.

🔵 Visualizing Rules — Scatter Plot

library(arulesViz)

# Scatter plot: support vs. confidence, shaded by lift
plot(rules,
     method  = "scatterplot",
     measure  = c("support", "confidence"),
     shading  = "lift",
     engine   = "ggplot2")
  • Each point is one rule.
  • x-axis: support — how often the rule applies.
  • y-axis: confidence — how often the rule is correct.
  • Color (lift): darker = stronger association beyond chance.
  • Rules in the top-right corner are both frequent and reliable.

🕸️ Visualizing Rules — Graph

# Graph plot: items as labeled nodes, rules as circles linked to their items by arrows
# Subset to top 20 rules by lift for readability
rules |>
  sort(by = "lift") |>
  head(n = 20) |>
  plot(method  = "graph",
       engine  = "ggplot2")
  • Labeled nodes represent individual books (items).
  • Each rule is drawn as a circle, with arrows running from its left-hand side items into the circle and from the circle to its right-hand side item.
  • Circle size reflects support; circle color reflects lift.
  • Tightly connected clusters reveal books that frequently co-occur.

🔲 Visualizing Rules — Grouped Matrix

# Grouped matrix: LHS groups vs. RHS items
plot(rules,
     method  = "grouped matrix",
     engine  = "ggplot2") +
  ggplot2::coord_flip()  # swap the axes so that LHS groups become rows
  • Rows: left-hand side (LHS) item groups (antecedents), clustered by similarity.
  • Columns: right-hand side (RHS) items (consequents).
  • Bubble size: support.
  • Bubble color: lift — useful for spotting which LHS groups most strongly predict each RHS item.

🎛️ Finding Rules with Restrictions

brules <- apriori(
  bookbaskets_use,
  parameter  = list(support = 0.001, 
                    confidence = 0.6),
  appearance = list(rhs = c("The Lovely Bones: A Novel"),
                    default = "lhs")
)

summary(brules)

brules |>
  sort(by = "confidence") |>
  lhs() |>
  head(n = 5) |>
  inspect()
  • You can restrict which items appear on the left side or right side of a rule.
  • Here we find books that tend to co-occur with The Lovely Bones by restricting which books appear on the right side of the rule using the appearance parameter.
  • Setting default = "lhs" allows all the other books to appear on the left side of the rules.
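
A restriction can also be applied after mining, by filtering an existing rule set with subset(). The sketch below compares the two approaches on the ten library checkouts from earlier. The short item codes, the thresholds, and `minlen = 2` are illustrative assumptions rather than settings from the lecture.

```r
library(arules)

# Library transactions, using short codes (labels assumed here):
# H = The Hobbit, PB = The Princess Bride, LU = The Last Unicorn,
# NS = The Neverending Story, FR = The Fellowship of the Ring,
# TT = The Two Towers, RK = The Return of the King
txns <- list(
  c("H", "PB"), c("PB", "LU"), "H", "NS", "LU",
  c("H", "PB", "FR"), c("H", "FR", "TT", "RK"), c("FR", "TT", "RK"),
  c("H", "PB", "LU"), c("LU", "NS")
)
trans  <- as(txns, "transactions")
params <- list(support = 0.25, confidence = 0.5, minlen = 2)

# Option 1: restrict the right-hand side to H while mining
rules_rhs_H <- apriori(
  trans,
  parameter  = params,
  appearance = list(rhs = "H", default = "lhs"),
  control    = list(verbose = FALSE)
)

# Option 2: mine all rules first, then keep those with H on the right-hand side
rules_all      <- apriori(trans, parameter = params,
                          control = list(verbose = FALSE))
rules_filtered <- subset(rules_all, subset = rhs %in% "H")

inspect(rules_rhs_H)
inspect(rules_filtered)
```

On this toy data both options should leave the same single rule. On a large dataset, restricting with appearance is usually the cheaper choice because rules with other right-hand sides are never generated.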