Exam

DANL 310-01: Data Visualization and Presentation

Author

Byeong-Hak Choe

Published

May 5, 2025

Modified

May 5, 2025

Descriptive Statistics

The distribution of scores for the Midterm Exam is shown below:

The following provides the descriptive statistics for each part of the Midterm Exam:

Below are the R packages for this exam:

library(tidyverse)
library(ggthemes)  
library(hrbrthemes) 
library(lubridate)
library(socviz)

Question 1

The walmart data.frame is for Question 1:

walmart <- read_csv("https://bcdanl.github.io/data/walmart_albers.csv")

Variable Description

opendate: Opening date of original store
st.address: Address
city: City
state: State (abbreviated)
type: Store type
- Wal-MartStore: The traditional Walmart retail format, typically smaller than a SuperCenter. It focuses on a broad mix of general merchandise with a limited grocery selection. It’s designed for convenience, serving local communities with everyday products.
- SuperCenter: A large, full-service retail store that combines a comprehensive grocery supermarket with a wide range of general merchandise, pharmacy services, and additional offerings such as a garden center or auto care. It serves as a one-stop shop for a variety of daily needs.
- DistributionCenter: Unlike the retail locations, a DistributionCenter is a logistics and warehousing facility. It doesn’t sell directly to consumers; instead, it stores, organizes, and distributes products to Walmart stores, ensuring efficient supply chain management.
x_albers: Longitude (x coordinate in the Albers projection)
y_albers: Latitude (y coordinate in the Albers projection)

Additionally, the data.frame below is for Q1a and Q1b:

county_data <- socviz::county_data

Q1a (Points: 12.5)

Provide code and a brief explanation to explore the relationship between the following two state-level variables:
- n: Number of Walmart stores in a state
- pop: State population size
What kind of pattern do you observe, and why might this pattern exist?

Example Answer:

Data preparation

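# Aggregate the county-level data to the state level
state_data <- county_data |> 
  group_by(state) |> 
  summarise(pop = sum(pop, na.rm = TRUE),
            hh_income = mean(hh_income, na.rm = TRUE))

# Count Walmart locations per state and attach the state-level variables
state_walmart <- walmart |> 
  group_by(state) |> 
  summarise(n = n()) |> 
  left_join(state_data, by = "state")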

Visualization 1

state_walmart |> 
  ggplot(aes(x = pop, y = n)) +
  geom_point() +
  geom_smooth(method = lm)

Visualization 2

state_walmart |> 
  ggplot(aes(x = log(pop), y = log(n))) +
  geom_point() +
  geom_smooth(method = lm)

Brief Explanation:
- State population size is positively associated with the number of Walmart stores in a state.
- Walmart may prefer new store locations in areas with large population sizes.
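
The visual impression can also be backed up with numbers. The following is a minimal sketch, assuming a small made-up tibble (toy_states and all of its values are hypothetical, not taken from walmart or county_data). It computes the correlation on the log scale and a log-log regression, whose slope can be read as the approximate percentage change in store count for a 1% change in population.

```r
library(tibble)

# Hypothetical toy data: NOT the real state-level counts
toy_states <- tibble(
  state = c("A", "B", "C", "D", "E"),
  pop   = c(1e6, 2e6, 4e6, 8e6, 16e6),
  n     = c(10, 18, 41, 79, 165)
)

# Pearson correlation between log(pop) and log(n)
r_log <- cor(log(toy_states$pop), log(toy_states$n))

# Log-log regression: the slope is an elasticity of n with respect to pop
fit   <- lm(log(n) ~ log(pop), data = toy_states)
slope <- unname(coef(fit)["log(pop)"])

r_log
slope
```

With the real data, the same two lines could be applied to state_walmart after dropping rows with missing pop.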

Q1b (Points: 15)

Provide code and a brief explanation to explore how the relationship between the following two state-level variables varies by Walmart type:
- n: Number of Walmart stores in a state
- hh_income: Average of county-level median household income in the state
What kind of pattern do you observe, and why might this pattern exist?

Example Answer:

Data preparation

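# Count Walmart locations per state and type, then attach the state-level variables
state_walmart <- walmart |> 
  group_by(state, type) |> 
  summarise(n = n(), .groups = "drop") |> 
  left_join(state_data, by = "state")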

Visualization 1

state_walmart |> 
  ggplot(aes(x = hh_income, y = n)) +
  geom_point() +
  geom_smooth(method = lm) +
  facet_wrap(~type)

Visualization 2

state_walmart |> 
  ggplot(aes(x = log(hh_income), y = log(n))) +
  geom_point() +
  geom_smooth(method = lm) +
  facet_wrap(~type)

Brief Explanation:
- The number of SuperCenters in a state is negatively associated with household income in the state.
- The number of Wal-MartStores in a state is nearly independent of household income.
- The number of DistributionCenters in a state is independent of household income.
- This suggests that Walmart may favor lower-income locations over higher-income locations for its SuperCenters.
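
To compare the facets numerically, one option is to fit a separate log-log regression for each store type and compare the slopes. The sketch below assumes a made-up tibble (toy and its values are hypothetical) with the same column names as state_walmart; it is an illustration, not the result for the real data.

```r
library(tibble)

# Hypothetical toy data: NOT the real state-by-type counts
toy <- tibble(
  type      = rep(c("SuperCenter", "Wal-MartStore"), each = 4),
  hh_income = rep(c(40000, 50000, 60000, 70000), times = 2),
  n         = c(80, 60, 45, 30, 20, 22, 19, 21)
)

# Fit log(n) ~ log(hh_income) separately for each type and keep the slope
slopes <- sapply(
  split(toy, toy$type),
  function(d) coef(lm(log(n) ~ log(hh_income), data = d))[["log(hh_income)"]]
)

slopes
```

A clearly negative slope for one type and a slope near zero for another would correspond to the pattern described in the explanation above.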

Q1c (Points: 20)

Replicate the above ggplot using the walmart data.frame.
- cumsum() can be useful.
  - cumsum() computes a cumulative sum. The example below shows how it works:

# Create a sample tibble with numbers
df <- tibble(
  day = 1:10,
  value = c(5, 3, 6, 2, 4, 7, 1, 8, 3, 5)
)

# Calculate the cumulative sum of 'value'
df <- df |>
  mutate(cumulative_value = cumsum(value))

# Create a sample tibble with a grouping variable
df <- tibble(
  group = rep(c("A", "B"), each = 5),
  day = rep(1:5, times = 2),
  value = c(1, 2, 3, 4, 5, 10, 20, 30, 40, 50)
)

# Calculate the cumulative sum of 'value' for each group separately
df_grouped <- df |>
  group_by(group) |>
  mutate(cumulative_value = cumsum(value))

Note that each number within a bar segment represents the cumulative number of Walmart stores of that particular Type through a given year, while the x-axis is on a base-10 logarithmic scale.
The given figure uses color-blind-friendly colors from scale_*_tableau().
The labels are given below:

txt_y = "Log10(Cumulative Total)"
txt_title = "Number of Walmart U.S. stores in US"
txt_subtitle = "From 1970 to 2006. By Type"

Example Answer:

Data preparation 1

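# Cumulative number of stores by type; cumsum() runs before filter(),
# so stores opened before 1970 are still counted in the totals
q1c_mid <- walmart |> 
  mutate(year = year(opendate)) |> 
  count(year, type) |> 
  group_by(type) |> 
  mutate(tot = cumsum(n)) |>
  ungroup() |> 
  filter(type != 'DistributionCenter', year >= 1970) 

# Wal-MartStore totals, relabeled so they can position the SuperCenter labels
q1c_mid_loc <- q1c_mid |> 
  filter(type == "Wal-MartStore") |> 
  mutate(type = "SuperCenter") |> 
  select(-n) |> 
  rename(tot_loc = tot)

q1c_mid <- q1c_mid |> 
  left_join(q1c_mid_loc, by = c("year", "type"))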
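
The order of cumsum() and filter() in the data preparation above matters. The sketch below, which assumes a made-up tibble of yearly openings (toy_open and its values are hypothetical), shows how the running total differs depending on whether the year filter is applied after or before the cumulative sum.

```r
library(dplyr)
library(tibble)

# Hypothetical toy counts of store openings per year
toy_open <- tibble(
  year = c(1965, 1968, 1970, 1971),
  n    = c(2, 3, 4, 1)
)

# cumsum() first, then filter(): pre-1970 openings stay in the running total
cum_then_filter <- toy_open |>
  mutate(tot = cumsum(n)) |>
  filter(year >= 1970)

# filter() first, then cumsum(): pre-1970 openings are dropped from the total
filter_then_cum <- toy_open |>
  filter(year >= 1970) |>
  mutate(tot = cumsum(n))

cum_then_filter$tot  # 9 10
filter_then_cum$tot  # 4 5
```

Because the figure reports cumulative totals, computing cumsum() before filtering keeps the earlier stores in the count.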

Visualization 1

p <- ggplot(q1c_mid, aes(y = factor(year))) +
  geom_col(aes(x = log10(tot),
               fill = type),
           width = 1, color = "white") +
  geom_text(data = q1c_mid |> filter(type == "Wal-MartStore"),
            aes(x = .95*log10(tot), 
                label = tot), 
            size = rel(4),
            hjust = .75,
            fontface = 'bold') +
  geom_text(data = q1c_mid |> filter(type == "SuperCenter"),
            aes(x = log10(tot_loc), label = tot), 
            size = rel(4),
            hjust = -.5,
            fontface = 'bold') +
  scale_fill_tableau() +
  scale_x_continuous(expand = c(0.01,0)) +
  guides(fill = guide_legend(reverse = TRUE,
                             title.position = "left",
                             label.position = "bottom",
                             keywidth = 10,
                             nrow = 1)) +
  labs(y = "", x = "Log10(Cumulative Total)", fill = "Type",
       title = "Number of Walmart U.S. stores in US",
       subtitle = "From 1970 to 2006. By Type") +
  theme_ipsum() +
  theme(
    legend.position = c(0.75, 0.075),
    legend.background = element_rect(color = "black", fill = NA),
    legend.text = element_text(size = rel(2),
                               face = 'italic'),   
    legend.title = element_text(size = rel(2),
                               face = 'bold'),  
    legend.key.size = unit(2, "lines"), 
    axis.text.y = element_text(size = rel(2)),
    axis.text.x = element_text(size = rel(2)),
    axis.title.x = element_text(size = rel(2)),
    plot.title = element_text(size = rel(3),
                              hjust = .5,
                              face = 'bold',
                              color = 'blue'),
    plot.subtitle = element_text(size = rel(2.5),
                              hjust = .5,
                              face = 'italic')
  )

p

cf. Data preparation 2

q1c2 <- walmart |> 
  mutate(year = year(opendate)) |> 
  count(year, type) |> 
  group_by(type) |> 
  mutate(tot = cumsum(n))

cf. Visualization 2

# Plot with position_stack

q1c2 |> 
  filter(type != 'DistributionCenter',
         year >= 1970) |> 
  ggplot(aes(x = year, y = log10(tot),
             fill = type)) +
  geom_col(width = 1,
           color = 'white') +
  geom_text(aes(label = tot),
            position = position_stack(vjust = .95),
            size = rel(4)) +
  scale_fill_tableau() +
  scale_x_continuous(breaks = seq(1970,2006,1)) +
  coord_flip() +
  labs(x = "", y = "Log10(Cumulative Total)", fill = "Type",
       title = "Number of Walmart U.S. stores in US",
       subtitle = "From 1970 to 2006. By Type") +
  theme_ipsum() +
  theme(
    legend.position = c(0.75, 0.075),
    legend.background = element_rect(color = "black", fill = NA),
    legend.text = element_text(size = rel(2),
                               face = 'italic'),   
    legend.title = element_text(size = rel(2),
                               face = 'bold'),  
    legend.key.size = unit(2, "lines"), 
    axis.text.y = element_text(size = rel(2)),
    axis.text.x = element_text(size = rel(2)),
    axis.title.x = element_text(size = rel(2)),
    plot.title = element_text(size = rel(3),
                              hjust = .5,
                              face = 'bold',
                              color = 'blue'),
    plot.subtitle = element_text(size = rel(2.5),
                              hjust = .5,
                              face = 'italic')
  )

Q1d (Points: 25)

Replicate the above ggplot map using the walmart data.frame and the following two data.frames:

county_map <- socviz::county_map

fips_code <- read_csv("https://bcdanl.github.io/data/county_fips.csv")

Each panel displays the locations of Walmart stores on the last day of the corresponding year.
The colors are given below:

c('#fa9523', '#376c3d', '#ff1770')

The dot size is 1 for Wal-MartStore and SuperCenter, and 5 for DistributionCenter.
The dot shape is 21 for Wal-MartStore and SuperCenter, and 24 for DistributionCenter.

Example Answer:

Data preparation

walmart <- walmart |> 
               mutate(size = case_when(
                type == 'DistributionCenter' ~ 5,
                type == 'SuperCenter' ~ 2,
                type == 'Wal-MartStore' ~ 2
               ))


walmart1975 <- walmart |> 
  filter(year(opendate) <= 1975) |> 
  mutate(years_5 = 1975)
walmart1980 <- walmart |> 
  filter(year(opendate) <= 1980) |> 
  mutate(years_5 = 1980)
walmart1985 <- walmart |> 
  filter(year(opendate) <= 1985) |> 
  mutate(years_5 = 1985)
walmart1990 <- walmart |> 
  filter(year(opendate) <= 1990) |> 
  mutate(years_5 = 1990)
walmart1995 <- walmart |> 
  filter(year(opendate) <= 1995) |> 
  mutate(years_5 = 1995)
walmart2000 <- walmart |> 
  filter(year(opendate) <= 2000) |> 
  mutate(years_5 = 2000)
walmart2005 <- walmart |> 
  filter(year(opendate) <= 2005) |> 
  mutate(years_5 = 2005)

walmart_facet <- 
  rbind(walmart1975,
        walmart1980,
        walmart1985,
        walmart1990,
        walmart1995,
        walmart2000,
        walmart2005
        )

walmart_facet <- walmart_facet |> 
               mutate(type = factor(type, 
                                    levels = c("Wal-MartStore", 
                                               "SuperCenter", 
                                               "DistributionCenter")))
map_data <- county_map |> 
                 mutate(fips = as.integer(id)) |> 
                 left_join(fips_code) |> 
                 filter(!(state_abbr %in% c("AK", "HI")))
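
The seven year-by-year blocks above follow one pattern: keep every store opened up to a snapshot year and tag the rows with that year. A more compact version can be sketched with purrr::map() and bind_rows(). The example below assumes a made-up stand-in for walmart (toy_walmart, its store column, and its dates are hypothetical) and only three snapshot years.

```r
library(dplyr)
library(purrr)
library(lubridate)
library(tibble)

# Hypothetical toy stand-in for the walmart data.frame
toy_walmart <- tibble(
  store    = c("s1", "s2", "s3", "s4"),
  opendate = as.Date(c("1972-03-01", "1979-07-15", "1980-12-31", "1984-01-10"))
)

snapshot_years <- seq(1975, 1985, by = 5)

# One cumulative snapshot per year, stacked into a single data.frame
toy_facet <- snapshot_years |>
  map(\(y) toy_walmart |>
        filter(year(opendate) <= y) |>
        mutate(years_5 = y)) |>
  bind_rows()

count(toy_facet, years_5)
```

With the real data, snapshot_years would be seq(1975, 2005, by = 5) and toy_walmart would be replaced by walmart.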

Visualization

pal <- c('#fa9523',
         '#376c3d', 
         '#ff1770')
s <- c(2, 
       2, 
       5)

# Use shape 21 which supports fill; you can also choose other shapes (22, 23, or 24)
sh <- c(21, 
        21, 
        24)

p <- ggplot() +
  geom_polygon(data = map_data,
               aes(x = long, y = lat, group = group),
               fill = NA, 
               color = 'grey80', 
               linewidth = .15) +
  geom_point(data = walmart_facet,
             aes(x = x_albers, 
                 y = y_albers,
                 color = type,
                 fill = type,
                 size = type,
                 shape = type),
             alpha = .175) +
  
  facet_wrap(~years_5, ncol = 2) +
  
  scale_color_manual(values = pal) +
  scale_fill_manual(values = pal) +
  scale_shape_manual(values = sh) +
  scale_size_manual(values = s) +
  
  guides(color = guide_legend(override.aes = list(fill = pal,
                                                  shape = sh,
                                                  size = s,
                                                  alpha = 1),
                              nrow = 1,
                              title.position = 'left',
                              label.position = 'bottom'),
         shape = 'none',
         size = 'none',
         fill = 'none'
         ) +
  
  labs(title = 'Diffusion of Wal-Mart Stores and General Distribution Centers',
       color = 'Type') +
  
  theme_map() +
  theme(plot.title = element_text(size = rel(2.25),
                                  face = 'bold',
                                  hjust = .5,
                                  margin = margin(15,0,15,0)),
        plot.background = element_rect(fill = 'white'),
        legend.title = element_text(face = 'bold',
                                    hjust = .5,
                                    size = rel(1.75)),
        legend.text = element_text(face = 'bold.italic',
                                   size = rel(1.25)),
        # legend.position = "bottom",
        legend.background = element_rect(color = "black", fill = 'white'),
        legend.position = c(.75,.125),
        legend.justification = "center",
        legend.box.just = "bottom",
        strip.background = element_rect(fill = NA, color = NA),
        strip.text = element_text(size = rel(1.25),
                                  margin = margin(5,0,0,0),
                                  face = 'bold.italic'))

p

Q1e (Points: 7.5)

Provide a brief explanation describing the geographic pattern of Walmart store rollout over the years as illustrated in Q1d and in the video below:

Also, provide an explanation for why Walmart chose to expand its stores in this pattern.

Example Answer:

Walmart began in a relatively central region of the country (near Bentonville, Arkansas) and then expanded outward. Rather than jumping to distant markets and filling in the gaps later, Walmart opened new locations close to its existing clusters.
Walmart chose this expansion pattern primarily to optimize logistics, reduce operating costs, and strengthen its competitive position.
- By placing new stores close to existing locations, Walmart could significantly lower transportation and inventory costs, making its supply chain more efficient.
- This strategy also enabled Walmart to rapidly build brand presence, create loyal customer bases, and establish regional market dominance before gradually expanding outward.

Question 2 (Points: 20)

The following data.frame is for Question 2:

eia <- read_csv("https://bcdanl.github.io/data/eia_2025_02.csv")

Variable Description

mon_yr: Date
retail_price: Retail price (dollars per gallon)
refining: Proportion of the retail price attributed to the refining component
dist_mkt: Proportion of the retail price attributed to the distribution and marketing component
taxes: Proportion of the retail price attributed to taxes
crude_oil: Proportion of the retail price attributed to crude oil
Note: For each observation, the sum of refining, dist_mkt, taxes, and crude_oil equals 1.
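
The note above can be verified in code, and the shares can be turned into dollar amounts by multiplying by retail_price. The following is a minimal sketch, assuming a made-up tibble with the same column names as eia (toy_eia and all of its values are hypothetical). The tolerance of 1e-8 is also an assumption; rounded real data may need a looser one.

```r
library(dplyr)
library(tibble)

# Hypothetical toy rows with the same columns as eia (values are made up)
toy_eia <- tibble(
  mon_yr       = as.Date(c("2024-01-01", "2024-02-01")),
  retail_price = c(3.00, 3.20),
  refining     = c(0.10, 0.15),
  dist_mkt     = c(0.20, 0.15),
  taxes        = c(0.15, 0.15),
  crude_oil    = c(0.55, 0.55)
)

toy_check <- toy_eia |>
  mutate(
    share_total   = refining + dist_mkt + taxes + crude_oil,
    shares_ok     = abs(share_total - 1) < 1e-8,  # allow floating-point error
    crude_dollars = retail_price * crude_oil      # dollars per gallon from crude oil
  )

toy_check |>
  select(mon_yr, share_total, shares_ok, crude_dollars)
```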

Replicate the above ggplot.
The labels are given below:

txt_y = 'Gasoline Pump Components'
txt_title = 'WHAT DO WE PAY FOR IN A GALLON\n OF REGULAR GASOLINE?'
txt_subtitle = 'Gasoline Pump Components History'
txt_caption = 'Source: https://www.eia.gov/petroleum/gasdiesel/gaspump_hist.php'

The colors are given below:

c("#000000", "#E69F00", "#009E73", "#56B4E9")

cf. If you want to extract a single cell value from a variable, consider the following example with pull():

# Create a sample tibble
df <- tibble(
  name = c("Alice", "Bob", "Carol"),
  age = c(30, 40, 35)
)

# Extract Bob's age using filter() and pull()
bob_age <- df |>
  filter(name == "Bob") |>
  pull(age)

bob_age  # 40

While the x-axis ranges from 2012-01-01 to 2031-01-01, the mon_yr variable ranges from 2012-01-01 to 2025-02-01.

Example Answer:

Data preparation

q2 <- eia |> 
  pivot_longer(
    cols = refining:crude_oil,
    names_to = "component",
    values_to = "pct"
  )

# Determine the max date and corresponding percentages
max_date <- max(q2$mon_yr)

tax <- q2 |> 
  filter(component == "taxes", 
         mon_yr == max_date) |> 
  pull(pct)

ref <- q2 |> 
  filter(component == "refining", 
         mon_yr == max_date) |> 
  pull(pct)

dist <- q2 |> 
  filter(component == "dist_mkt", 
         mon_yr == max_date) |> 
  pull(pct)

# Common x limits and x position for text annotations
x_min  <- as.Date("2025-01-31")
x_max  <- as.Date("2031-01-01")
x_text <- as.Date("2028-01-01")
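
The band boundaries and label positions used for the annotations can also be derived in one step with cumsum(). The sketch below assumes a made-up tibble of latest-month shares listed from the bottom band to the top band (toy_latest and its values are hypothetical). Each band runs from ymin to ymax, and its label sits at the midpoint ymid.

```r
library(dplyr)
library(tibble)

# Hypothetical shares for the latest month, ordered bottom-to-top
toy_latest <- tibble(
  component = c("taxes", "refining", "dist_mkt", "crude_oil"),
  pct       = c(0.15, 0.10, 0.20, 0.55)
)

toy_bands <- toy_latest |>
  mutate(
    ymax = cumsum(pct),        # top edge of each band
    ymin = ymax - pct,         # bottom edge of each band
    ymid = (ymin + ymax) / 2   # vertical position for the text label
  )

toy_bands
```

These midpoints agree with the hand-written expressions: for example, the second band's midpoint is the first share plus half of the second share.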

Visualization

ggplot(q2, aes(x = mon_yr, y = pct, fill = component)) +
  geom_area(position = "fill") +
  
  # Annotate Taxes
  annotate("rect", fill = "#56B4E9", alpha = 1,
           xmin = x_min, xmax = x_max,
           ymin = 0, 
           ymax = tax) +
  annotate("text", label = "Taxes", fontface = "italic",
           x = x_text, 
           y = tax / 2) +
  
  # Annotate Refining
  annotate("rect", fill = "#009E73", alpha = 1,
           xmin = x_min, xmax = x_max,
           ymin = tax, 
           ymax = tax + ref) +
  annotate("text", label = "Refining", fontface = "italic",
           x = x_text, 
           y = tax + ref / 2) +
  
  # Annotate Distribution & Marketing
  annotate("rect", fill = "#E69F00", alpha = 1,
           xmin = x_min, xmax = x_max,
           ymin = tax + ref, 
           ymax = tax + ref + dist) +
  annotate("text", label = "Distribution & Marketing", fontface = "italic",
           x = x_text, 
           y = tax + ref + dist / 2) +
  
  # Annotate Crude Oil
  annotate("rect", fill = "#000000", alpha = 1,
           xmin = x_min, xmax = x_max,
           ymin = tax + ref + dist, ymax = 1) +
  annotate("text", label = "Crude Oil", color = "white", fontface = "italic",
           x = x_text, y = (tax + ref + dist + 1) / 2) +
  
  # Scales and labels
  scale_y_percent(breaks = seq(0.1, 0.9, by = 0.1)) +
  scale_x_date(
    breaks = as.Date(paste0("20", seq(13, 25, 2), "-01-01")),
    labels = paste0("20", seq(13, 25, 2)),
    expand = c(0, 0)
  ) +
  scale_fill_manual(values = rev(c("#56B4E9", "#009E73", "#E69F00", "#000000"))) +
  labs(
    x = "",
    y = "Gasoline Pump Components",
    title = "WHAT DO WE PAY FOR IN A GALLON\nOF REGULAR GASOLINE?",
    subtitle = "Gasoline Pump Components History",
    caption = "Source: https://www.eia.gov/petroleum/gasdiesel/gaspump_hist.php"
  ) +
  
  guides(fill = "none") +
  
  theme_ipsum() +
  theme(
    panel.grid = element_blank(),
    plot.title    = element_text(face = "bold", size = rel(1.75)),
    plot.subtitle = element_text(face = "italic", size = rel(1.25)),
    plot.caption  = element_text(face = "italic", margin = margin(0, 0, 0, 0)),
    axis.text.x   = element_text(size = rel(0.75)),
    axis.text.y   = element_text(size = rel(0.75))
  )