Classwork 10
K-means Clustering and PCA
Data
library(tidyverse)  # provides read_csv()
wine <- read_csv("https://bcdanl.github.io/data/wine_data.csv")
Variable Description
| Variable | Description |
|---|---|
| `acidity_fixed` | Fixed acidity in the wine. |
| `acidity_volatile` | Volatile acidity in the wine. |
| `acidity_citric` | Amount of citric acid in the wine. |
| `residual_sugar` | Residual sugar remaining after fermentation. |
| `chlorides` | Amount of chlorides in the wine. |
| `so2_free` | Free sulfur dioxide concentration. |
| `so2_tot` | Total sulfur dioxide concentration. |
| `density` | Density of the wine. |
| `pH` | pH level of the wine. |
| `so4_2` | Sulfate concentration. |
| `alcohol` | Alcohol content of the wine. |
| `quality` | Wine quality score. |
| `color` | Wine type: red or white. |
Part 1. Quick Data Check
Question 1. Inspect the data
Use a tidyverse-friendly workflow to answer the following using skimr::skim().
- What are the dimensions of `wine`?
- What are the variable types?
- Are there any missing values?
- How many red wines and white wines are in the data?
- Based on the variables, which ones should not be used directly for clustering?
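As a cross-check alongside `skimr::skim()`, the same questions can be answered with a few base R calls. This is only a sketch: it uses R's built-in `iris` data as a stand-in for `wine` (an assumption, so that it runs without a download), with `Species` playing the role of `color`.

```r
# Stand-in data: iris has four numeric columns and one label column (Species).
dim(iris)                # number of rows and columns
sapply(iris, class)      # type of each variable
colSums(is.na(iris))     # count of missing values per column
table(iris$Species)      # how many observations carry each label
```

Replacing `iris` with `wine` and `Species` with `color` should give the corresponding answers for the wine data.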
# Your code here
Part 2. Data Cleaning and Scaling
Question 2. Create a cleaned analysis data set
For clustering, we should use only the numeric chemical variables.
Create a new data frame called wine_clust that:
- keeps only the variables used for clustering,
- excludes `quality` and `color`, and
- contains only numeric variables.
Then create a scaled version called wine_scaled.
After that, explain why scaling is especially important for k-means clustering.
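The select-then-scale step might look roughly like the sketch below. It runs on the built-in `iris` data rather than `wine` (an assumption made to keep it self-contained), and the names `iris_clust` and `iris_scaled` are illustrative only.

```r
# Keep only the numeric columns (this drops the Species label).
iris_clust <- iris[, sapply(iris, is.numeric)]

# scale() centers each column at 0 and rescales it to standard deviation 1.
iris_scaled <- scale(iris_clust)

# Before scaling: the columns have different spreads, so the high-variance
# ones would dominate the Euclidean distances that k-means relies on.
round(apply(iris_clust, 2, sd), 2)

# After scaling: every column has mean 0 and standard deviation 1.
round(colMeans(iris_scaled), 10)
apply(iris_scaled, 2, sd)
```

Note that in `wine` the `quality` column is numeric too, so filtering on `is.numeric` alone would not drop it; it has to be excluded by name, for example with `dplyr::select(-quality, -color)`.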
# Your code here
Part 3. K-means Clustering
Question 3. Fit a k-means model with 2 clusters
Run k-means clustering with centers = 2 and a reasonably large nstart = 25.
Save the result as km2.
Then report:
- the cluster sizes, and
- the cluster centers.
Based on the cluster centers, what are two or three variables that seem most helpful for distinguishing the two clusters?
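A hedged sketch of the fitting step, again with scaled `iris` standing in for `wine_scaled` (an assumption). The final line is one possible way to rank variables by how far apart the two centers are; it is not the only reasonable criterion.

```r
set.seed(1)  # k-means starts from random centers; a seed makes the run reproducible

x <- scale(iris[, 1:4])  # stand-in for wine_scaled

# centers = number of clusters; nstart = number of random starts (the best is kept)
km <- kmeans(x, centers = 2, nstart = 25)

km$size     # number of observations in each cluster
km$centers  # cluster means in standardized units, one row per cluster

# Variables whose standardized centers differ most between the two clusters
sort(abs(km$centers[1, ] - km$centers[2, ]), decreasing = TRUE)
```

Because the centers are in standardized units, a gap of 1 means the cluster means are one standard deviation apart on that variable.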
# Your code here
Question 4. Add the cluster labels back to the data and profile the clusters
Create a new data frame called wine_km by adding the cluster assignment from km2 to the original wine data.
Then create a summary table that compares the clusters on:
`alcohol`, `acidity_volatile`, `residual_sugar`, `so2_tot`, and `quality`.
Write 2 to 4 sentences describing how the two clusters differ.
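One way to attach labels and build a per-cluster summary is sketched here in base R, with `iris` as a stand-in for `wine` and `iris_km` as an illustrative name (both assumptions). A `dplyr` pipeline with `group_by()` and `summarise()` would be an equally valid route.

```r
set.seed(1)
km <- kmeans(scale(iris[, 1:4]), centers = 2, nstart = 25)

# Attach the labels to the ORIGINAL (unscaled) data so that the
# summaries are reported in the variables' natural units.
iris_km <- iris
iris_km$cluster <- factor(km$cluster)

# Mean of selected variables within each cluster
profile <- aggregate(cbind(Sepal.Length, Petal.Length, Petal.Width) ~ cluster,
                     data = iris_km, FUN = mean)
profile
```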
# Your code here
Question 5. Compare cluster membership with wine color
Use a simple table to examine how cluster membership relates to color.
Then answer these questions:
- Does cluster membership line up closely with wine color?
- Is the clustering recovering something that looks like red versus white wine?
- Why is this result interesting in an unsupervised learning setting?
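The comparison amounts to a cross-tabulation of cluster labels against a label the algorithm never saw. In this sketch `Species` in `iris` stands in for `color` in `wine` (an assumption).

```r
set.seed(1)
km <- kmeans(scale(iris[, 1:4]), centers = 2, nstart = 25)

# Rows: cluster assigned by k-means; columns: the known label
tab <- table(cluster = km$cluster, label = iris$Species)
tab

# Row proportions: the share of each cluster that comes from each label
round(prop.table(tab, margin = 1), 2)
```

Cluster numbers are arbitrary, so what matters is whether each row is concentrated in one column, not which cluster is called 1 or 2.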
# Your code here
Part 4. Simple Visualization
Question 6. Make one simple scatter plot
Choose two numeric variables and create one scatter plot colored by cluster.
Then answer these questions:
- Do the two clusters show visible separation on this plot?
- Is there still overlap?
- Why might overlap still appear even when k-means used all variables together?
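A minimal base-graphics sketch of such a plot follows; `ggplot2` would work just as well. The data (`iris`) and the two plotted variables are arbitrary stand-ins chosen for illustration, not a recommendation for the wine data.

```r
set.seed(1)
km <- kmeans(scale(iris[, 1:4]), centers = 2, nstart = 25)

# The clusters were found using all four variables, but a scatter plot
# can show only two of them, so some apparent overlap is expected.
plot(iris$Petal.Length, iris$Sepal.Width,
     col = km$cluster, pch = 19,
     xlab = "Petal.Length", ylab = "Sepal.Width",
     main = "k-means clusters (k = 2) shown on two of four variables")
legend("topright", legend = paste("Cluster", 1:2), col = 1:2, pch = 19)
```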
# Your code here
Part 5. Choosing the Number of Clusters
Question 7. Try several values of \(k\)
Fit k-means models for \(k = 2, 3, 4, 5, 6\) and compare their total within-cluster sum of squares.
Then answer these questions:
- How does the total within-cluster sum of squares change as \(k\) increases?
- Where does the elbow seem to appear?
- Based on both the elbow plot and the earlier interpretation, what value of \(k\) seems most reasonable here?
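The comparison across \(k\) can be sketched as a loop that stores `tot.withinss` from each fit. As before, scaled `iris` stands in for the scaled wine data (an assumption).

```r
set.seed(1)
x <- scale(iris[, 1:4])

ks <- 2:6
# tot.withinss is the total within-cluster sum of squares of a fitted model
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)

data.frame(k = ks, tot_withinss = wss)

# Elbow plot: look for the k after which the curve flattens
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

Adding clusters can only lower the best achievable within-cluster sum of squares, so the question is where the improvement stops being worth the extra cluster.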
# Your code here
Part 6. Principal Component Analysis
Use the scaled data to run PCA.
Question 8.
- How much variation is explained by the first two principal components?
- What does the scree plot suggest about how quickly the explained variation drops across components?
- Based on the scree plot, do the first few principal components seem to capture a substantial share of the information in the data?
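The proportion of variance explained can be computed from the component standard deviations returned by `prcomp()`. This sketch uses scaled `iris` in place of `wine_scaled` (an assumption), and `pve` is an illustrative name.

```r
x <- scale(iris[, 1:4])
pca <- prcomp(x)  # the data are already standardized, so no extra scaling here

# Proportion of variance explained (PVE) by each component
pve <- pca$sdev^2 / sum(pca$sdev^2)
round(pve, 3)
round(cumsum(pve), 3)  # cumulative share; the second entry covers PC1 and PC2

# Scree plot
plot(pve, type = "b",
     xlab = "Principal component",
     ylab = "Proportion of variance explained")
```

`summary(pca)` reports the same proportions in tabular form.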
# Your code here
Question 9.
- Look at the loading matrix for PC1 and PC2. Which variables have the largest positive and negative loadings on each component?
- Based on those loadings, what do PC1 and PC2 seem to represent substantively? In other words, what is each component “about”?
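In `prcomp()` output the loading matrix is stored in the `rotation` element. A short sketch, again on scaled `iris` as a stand-in (an assumption):

```r
pca <- prcomp(scale(iris[, 1:4]))

# Loadings of each variable on the first two components
loadings <- pca$rotation[, 1:2]
round(loadings, 2)

# Variable with the largest absolute loading on each component
apply(abs(loadings), 2, function(v) names(which.max(v)))
```

The sign of a whole component is arbitrary (flipping every loading on PC1 gives an equivalent solution), so interpretation should rest on which variables load together and which load in opposite directions.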
# Your code here
Question 10.
- Make one PCA scatter plot colored by wine color.
- Does the PCA plot suggest that red and white wines differ in multivariate chemical composition?
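The component scores live in the `x` element of the `prcomp()` result. The sketch below plots the first two scores colored by a known label, with `Species` in `iris` standing in for `color` in `wine` (an assumption) and base graphics used in place of `ggplot2`.

```r
pca <- prcomp(scale(iris[, 1:4]))

# Scores on the first two components, with the known label attached
scores <- as.data.frame(pca$x[, 1:2])
scores$label <- iris$Species

plot(scores$PC1, scores$PC2,
     col = as.integer(scores$label), pch = 19,
     xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(scores$label),
       col = seq_along(levels(scores$label)), pch = 19)
```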
# Your code here
Discussion
Welcome to our Classwork 10 Discussion Board!
This space is designed for you to engage with your classmates about the material covered in Classwork 10.
Whether you want to delve deeper into the content, share insights, or ask questions, this is the perfect place for you.
If you have any specific questions for Byeong-Hak (@bcdanl) regarding the Classwork 10 materials or need clarification on any points, don’t hesitate to ask here.
All comments will be stored here.
Let’s collaborate and learn from each other!