Classwork 10

K-means Clustering and PCA

Author

Byeong-Hak Choe

Published

April 15, 2026

Modified

April 15, 2026

Data

library(tidyverse)  # loads readr, which provides read_csv()
wine <- read_csv("https://bcdanl.github.io/data/wine_data.csv")

Variable Description

Variable          Description
acidity_fixed     Fixed acidity in the wine.
acidity_volatile  Volatile acidity in the wine.
acidity_citric    Amount of citric acid in the wine.
residual_sugar    Residual sugar remaining after fermentation.
chlorides         Amount of chlorides in the wine.
so2_free          Free sulfur dioxide concentration.
so2_tot           Total sulfur dioxide concentration.
density           Density of the wine.
pH                pH level of the wine.
so4_2             Sulfate concentration.
alcohol           Alcohol content of the wine.
quality           Wine quality score.
color             Wine type: red or white.

Part 1. Quick Data Check

Question 1. Inspect the data

Use a tidyverse-friendly workflow, including skimr::skim(), to answer the following questions.

  1. What are the dimensions of wine?
  2. What are the variable types?
  3. Are there any missing values?
  4. How many red wines and white wines are in the data?
  5. Based on the variables, which ones should not be used directly for clustering?
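
As a rough illustration of the kinds of checks involved, here is a minimal base-R sketch on R's built-in iris data, not the wine data. The object names n_missing and species_counts are illustrative only; skimr::skim() reports most of this information in a single call.

```r
# Illustrative sketch on the built-in iris data (not the wine data).
dim(iris)                              # number of rows and columns
sapply(iris, class)                    # type of each variable

n_missing <- colSums(is.na(iris))      # missing values per column
n_missing

species_counts <- table(iris$Species)  # counts for a categorical variable
species_counts
```

The same pattern should carry over to wine by swapping in the data frame and using color in place of Species.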
# Your code here


Part 2. Data Cleaning and Scaling

Question 2. Create a cleaned analysis data set

For clustering, we should use only the numeric chemical variables.

Create a new data frame called wine_clust that:

  • keeps only the variables used for clustering,
  • excludes quality and color, and
  • contains only numeric variables.

Then create a scaled version called wine_scaled.

After that, explain why scaling is especially important for k-means clustering.
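
A minimal sketch of the select-then-scale pattern, again on the built-in iris data rather than the wine data (iris_num and iris_scaled are illustrative names). It assumes the usual motivation for scaling: k-means relies on Euclidean distance, so columns with larger variances carry more weight unless the columns are standardized.

```r
# Illustrative sketch on the built-in iris data (not the wine data).
iris_num <- iris[, sapply(iris, is.numeric)]  # keep numeric columns only

# Raw variances differ across columns, so unscaled distances would be
# driven mostly by the columns with the largest spread.
raw_var <- sapply(iris_num, var)
round(raw_var, 2)

# scale() centers each column at 0 and rescales it to standard deviation 1.
iris_scaled <- as.data.frame(scale(iris_num))
scaled_sd <- sapply(iris_scaled, sd)
round(scaled_sd, 2)
```

After scaling, every column contributes on a comparable footing to the distance calculation.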

# Your code here


Part 3. K-means Clustering

Question 3. Fit a k-means model with 2 clusters

Run k-means clustering with centers = 2 and a reasonably large number of random starts, such as nstart = 25.

Save the result as km2.

Then report:

  1. the cluster sizes, and
  2. the cluster centers.

Based on the cluster centers, what are two or three variables that seem most helpful for distinguishing the two clusters?
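
The call pattern might look like the following sketch, shown on scaled iris data rather than the wine data. The name km_iris and the seed value are illustrative assumptions, not part of the assignment.

```r
# Illustrative sketch on the built-in iris data (not the wine data).
set.seed(1)  # k-means starts from random centers; a seed makes runs repeatable

iris_scaled <- scale(iris[, 1:4])  # the four numeric columns, standardized

# nstart = 25 tries 25 random initializations and keeps the best solution.
km_iris <- kmeans(iris_scaled, centers = 2, nstart = 25)

km_iris$size     # number of observations in each cluster
km_iris$centers  # cluster means, in standardized units
```

Because the centers are in standardized units, variables whose centers differ most between the two rows are natural candidates for the ones that separate the clusters.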

# Your code here


Question 4. Add the cluster labels back to the data and profile the clusters

Create a new data frame called wine_km by adding the cluster assignment from km2 to the original wine data.

Then create a summary table that compares the clusters on:

  • alcohol,
  • acidity_volatile,
  • residual_sugar,
  • so2_tot, and
  • quality.

Write 2 to 4 sentences describing how the two clusters differ.
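
One possible shape for this step, sketched in base R on the iris data rather than the wine data (iris_km and cluster_profile are illustrative names). The cluster labels are attached to the original, unscaled data so that the summary is in the variables' natural units.

```r
# Illustrative sketch on the built-in iris data (not the wine data).
set.seed(1)
km_iris <- kmeans(scale(iris[, 1:4]), centers = 2, nstart = 25)

# Attach the cluster labels to the original (unscaled) data.
iris_km <- iris
iris_km$cluster <- factor(km_iris$cluster)

# Mean of selected variables within each cluster.
cluster_profile <- aggregate(
  cbind(Sepal.Length, Petal.Length) ~ cluster,
  data = iris_km,
  FUN = mean
)
cluster_profile
```

A dplyr version using group_by() and summarize() would produce an equivalent table.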

# Your code here


Question 5. Compare cluster membership with wine color

Use a simple table to examine how cluster membership relates to color.

Then answer these questions:

  1. Does cluster membership line up closely with wine color?
  2. Is the clustering recovering something that looks like red versus white wine?
  3. Why is this result interesting in an unsupervised learning setting?
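
A cross-tabulation of this kind can be sketched as follows, using iris species as a stand-in for wine color (the names tab and col_share are illustrative). The species labels play no role in fitting the clusters; they are only used afterwards for comparison.

```r
# Illustrative sketch on the built-in iris data (not the wine data).
set.seed(1)
km_iris <- kmeans(scale(iris[, 1:4]), centers = 2, nstart = 25)

# Rows: cluster label; columns: known group label (not used in clustering).
tab <- table(cluster = km_iris$cluster, species = iris$Species)
tab

# Within each species, the share of observations falling in each cluster.
col_share <- prop.table(tab, margin = 2)
round(col_share, 2)
```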
# Your code here


Part 4. Simple Visualization

Question 6. Make one simple scatter plot

Choose two numeric variables and create one scatter plot colored by cluster.

Then answer these questions:

  1. Do the two clusters show visible separation on this plot?
  2. Is there still overlap?
  3. Why might overlap still appear even when k-means used all variables together?
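
A minimal base-graphics sketch of a cluster-colored scatter plot, on the iris data rather than the wine data. The choice of the two plotted variables is arbitrary here, and ggplot2 with geom_point() would work equally well.

```r
# Illustrative sketch on the built-in iris data (not the wine data).
set.seed(1)
km_iris <- kmeans(scale(iris[, 1:4]), centers = 2, nstart = 25)

# Only two of the four clustering variables appear on the axes, so points
# that are far apart in the full space can still look close in this view.
plot(iris$Petal.Length, iris$Sepal.Width,
     col = km_iris$cluster, pch = 19,
     xlab = "Petal.Length", ylab = "Sepal.Width",
     main = "k-means clusters (k = 2)")
legend("topright", legend = paste("Cluster", 1:2), col = 1:2, pch = 19)
```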
# Your code here


Part 5. Choosing the Number of Clusters

Question 7. Try several values of \(k\)

Fit k-means models for \(k = 2, 3, 4, 5, 6\) and compare their total within-cluster sum of squares.

Then answer these questions:

  1. How does the total within-cluster sum of squares change as \(k\) increases?
  2. Where does the elbow seem to appear?
  3. Based on both the elbow plot and the earlier interpretation, what value of \(k\) seems most reasonable here?
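
The comparison across values of \(k\) can be sketched as a loop over kmeans() fits, shown here on scaled iris data rather than the wine data (ks, wss, and elbow are illustrative names).

```r
# Illustrative sketch on the built-in iris data (not the wine data).
set.seed(1)
iris_scaled <- scale(iris[, 1:4])

ks <- 2:6
# Fit one model per k and keep its total within-cluster sum of squares.
wss <- sapply(ks, function(k) {
  kmeans(iris_scaled, centers = k, nstart = 25)$tot.withinss
})

elbow <- data.frame(k = ks, tot_withinss = wss)
elbow

# Elbow plot: look for the k after which the curve flattens out.
plot(elbow$k, elbow$tot_withinss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The elbow is a visual heuristic, so it is usually weighed together with how interpretable the resulting clusters are.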
# Your code here


Part 6. Principal Component Analysis

Use the scaled data to run PCA.

Question 8.

  1. How much variation is explained by the first two principal components?
  2. What does the scree plot suggest about how quickly the explained variation drops across components?
  3. Based on the scree plot, do the first few principal components seem to capture a substantial share of the information in the data?
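
A minimal sketch of how the explained variation can be computed from a prcomp() fit, on scaled iris data rather than the wine data (pca_iris, var_explained, and cum_var are illustrative names).

```r
# Illustrative sketch on the built-in iris data (not the wine data).
iris_scaled <- scale(iris[, 1:4])
pca_iris <- prcomp(iris_scaled)

# Each component's variance is its squared standard deviation.
var_explained <- pca_iris$sdev^2 / sum(pca_iris$sdev^2)
cum_var <- cumsum(var_explained)

round(var_explained, 3)  # share of variance per component
round(cum_var, 3)        # cumulative share; the second entry covers PC1 + PC2

# Scree plot of the component variances.
screeplot(pca_iris, type = "lines", main = "Scree plot")
```

summary(pca_iris) reports the same proportions in a single table.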
# Your code here


Question 9.

  1. Look at the loading matrix for PC1 and PC2. Which variables have the largest positive and negative loadings on each component?
  2. Based on those loadings, what do PC1 and PC2 seem to represent substantively? In other words, what is each component “about”?
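
The loading matrix is stored in the rotation element of a prcomp() fit. The sketch below, on scaled iris data rather than the wine data, pulls out the PC1 and PC2 columns and sorts the variables by their PC1 loading (loadings and pc1_sorted are illustrative names). Note that the sign of a principal component is arbitrary, so only the relative signs within a component are meaningful.

```r
# Illustrative sketch on the built-in iris data (not the wine data).
pca_iris <- prcomp(scale(iris[, 1:4]))

# Loadings: one row per variable, one column per component.
loadings <- pca_iris$rotation[, c("PC1", "PC2")]
round(loadings, 2)

# Variables ordered from most negative to most positive PC1 loading.
pc1_sorted <- sort(loadings[, "PC1"])
round(pc1_sorted, 2)
```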
# Your code here


Question 10.

  1. Make one PCA scatter plot colored by wine color.
  2. Does the PCA plot suggest that red and white wines differ in multivariate chemical composition?
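
A PCA scatter plot uses the component scores, which prcomp() stores in the x element. The base-graphics sketch below uses iris species as a stand-in for wine color (scores is an illustrative name); ggplot2 would work equally well.

```r
# Illustrative sketch on the built-in iris data (not the wine data).
pca_iris <- prcomp(scale(iris[, 1:4]))

# Scores: each observation's coordinates on the principal components.
scores <- as.data.frame(pca_iris$x)

groups <- levels(iris$Species)
plot(scores$PC1, scores$PC2,
     col = as.integer(iris$Species), pch = 19,
     xlab = "PC1", ylab = "PC2",
     main = "PCA scores colored by group")
legend("topright", legend = groups, col = seq_along(groups), pch = 19)
```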
# Your code here



Discussion

Welcome to our Classwork 10 Discussion Board! 👋

This space is designed for you to engage with your classmates about the material covered in Classwork 10.

Whether you want to delve deeper into the material, share insights, or ask questions, this is the perfect place for you.

If you have any specific questions for Byeong-Hak (@bcdanl) regarding the Classwork 10 materials or need clarification on any points, don’t hesitate to ask here.

All comments will be stored here.

Let’s collaborate and learn from each other!
