Lecture 15

Data Visualization - Bar Charts

Byeong-Hak Choe

SUNY Geneseo

March 26, 2024

ggplot Basics

Learning Objectives

  • Log Functions

  • Bar Charts

  • Position Adjustments

  • Statistical Transformation

A Little Bit of Math for log()

  • The logarithm function, \(y = \log_{b}\,(\,x\,)\), looks like ….

  • \(\log_{10}\,(\,100\,)\): the base \(10\) logarithm of \(100\) is \(2\), because \(10^{2} = 100\)

  • \(\log_{e}\,(\,x\,)\): the base \(e\) logarithm is called the natural log, where \(e = 2.718\cdots\) is the mathematical constant known as Euler’s number.

  • \(\log\,(\,x\,)\) or \(\ln\,(\,x\,)\): the natural log of \(x\).

  • \(\log_{e}\,(\,7.389\cdots\,)\): the natural log of \(7.389\cdots\) is \(2\), because \(e^{2} = 7.389\cdots\).
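The identities above can be checked directly in base R. This is only a small sketch; the variable names are illustrative and not part of the lecture.

```r
# Base-10 log: 10^2 = 100
log10_of_100 <- log10(100)

# Natural log: log() uses base e by default in R
ln_of_e_squared <- log(exp(2))

# Any other base can be supplied through the `base` argument: 2^3 = 8
log2_of_8 <- log(8, base = 2)

print(c(log10_of_100, ln_of_e_squared, log2_of_8))
```

Note that log() in R means the natural log, matching the \(\log\,(\,x\,)\) notation used here.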

The use of log()

NYC Housing Sales

library(tidyverse)
sale_df <- read_csv(
  "https://bcdanl.github.io/data/home_sales_nyc.csv")
  • sale_df contains data for residential property sales in NYC between September 2017 and August 2018.
    • Let’s focus on sale.price, a property’s sales price.

The use of log()

Percentage Change

1. We should consider using a logarithmic scale when percent change, or change in orders of magnitude, is more important than changes in absolute units.

  • For small changes in variable \(x\) from \(x_{0}\) to \(x_{1}\), the following can be shown:

\[\Delta \log(x) \,= \, \log(x_{1}) \,-\, \log(x_{0}) \approx\, \frac{x_{1} \,-\, x_{0}}{x_{0}} \,=\, \frac{\Delta\, x}{x_{0}}.\]

  • For example, a difference in sale.price of $10,000 means something very different across people with different income/wealth levels.
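The approximation above can be checked numerically. The prices below are hypothetical values chosen only for illustration; they are not taken from sale_df.

```r
# Hypothetical sale prices: a 1% increase from x0 to x1
x0 <- 500000
x1 <- 505000

log_diff   <- log(x1) - log(x0)   # change in logs
pct_change <- (x1 - x0) / x0      # exact proportional change

print(c(log_diff = log_diff, pct_change = pct_change))

# The approximation gets worse for large changes, e.g. a doubling of the price
log_diff_big   <- log(2 * x0) - log(x0)
pct_change_big <- (2 * x0 - x0) / x0

print(c(log_diff_big = log_diff_big, pct_change_big = pct_change_big))
```

For the small change the two numbers nearly coincide; for the doubling they differ noticeably, which is why the statement is restricted to small changes in \(x\).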

The use of log()

Wide Range of Skewed Data

2. We should consider using a log scale when a variable is heavily skewed.

  • It can help visualize both small and large values effectively.

# bins is an argument of geom_histogram(), not of ggplot()
ggplot(sale_df, 
       aes(x = sale.price)) +
  geom_histogram(bins = 500)
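To see what a log scale does to a skewed variable without downloading the data, here is a minimal sketch on simulated values. The data frame fake_sales and its log-normal parameters are assumptions made for illustration; they stand in for sale_df and do not reproduce it.

```r
library(ggplot2)

# Simulated, hypothetical stand-in for a right-skewed price variable
set.seed(1)
fake_sales <- data.frame(
  sale.price = rlnorm(1000, meanlog = 13, sdlog = 1))

# Raw scale: most observations pile up near the left edge
p_raw <- ggplot(fake_sales, aes(x = sale.price)) +
  geom_histogram(bins = 50)

# Log scale: the same observations are spread out far more evenly
p_log <- ggplot(fake_sales, aes(x = log(sale.price))) +
  geom_histogram(bins = 50)

p_raw
p_log
```

With the real data, the same idea would be to map x to log(sale.price) in the histogram above.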

Statistical Transformation

  • Many graphs, including bar charts, calculate new values to plot:

    • geom_bar(), geom_histogram(), and geom_freqpoly() bin our data and then plot bin counts, the number of observations that fall in each bin.

    • geom_boxplot() computes a summary of the distribution and then displays a specially formatted box.

    • geom_smooth() fits a model to our data and then plots predictions from the model.

Statistical Transformation

  • Bar charts seem simple, but they are interesting because they reveal something subtle about plots.

  • Consider a basic bar chart, as drawn with geom_bar().

  • The following bar chart displays the total number of diamonds in the ggplot2::diamonds data.frame, grouped by cut.

  • The diamonds data.frame comes with ggplot2 and contains information about ~54,000 diamonds, including the price, carat, color, clarity, and cut of each diamond.

Statistical Transformation

  • geom_bar() bins our data and then plots bin counts, the number of observations that fall in each bin.
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut))

Statistical Transformation

  • The algorithm used to calculate new values for a graph is called a stat, short for statistical transformation.

  • The figure below describes how this process works with geom_bar().
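The counting step that the stat performs can be reproduced by hand. The sketch below uses dplyr::count() as a stand-in for what the "count" stat computes; the object names are illustrative.

```r
library(ggplot2)
library(dplyr)

# What the "count" stat computes: one row per cut with the number of diamonds
cut_counts <- count(diamonds, cut)
print(cut_counts)

# geom_bar() uses stat_count() by default, so these two layers
# should produce the same chart
p_geom <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
p_stat <- ggplot(data = diamonds) +
  stat_count(mapping = aes(x = cut))
```

The column n in cut_counts holds the bar heights that the chart displays.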

Bar Charts Aesthetics

color and fill aesthetic

  • We can color a bar chart using either the color aesthetic, or, more usefully, fill.
ggplot(data = diamonds) + 
  geom_bar(mapping = 
             aes(x = cut, 
                 color = cut))

Bar Charts Aesthetics

color and fill aesthetic

ggplot(data = diamonds) + 
  geom_bar(mapping = 
             aes(x = cut, 
                 fill = cut))

Statistical Transformation

Observed Value vs. Number of Observations

  • There are three reasons we might need to use a stat explicitly:
    1. We might want to override the default stat.
    2. We might want to override the default mapping from transformed variables to aesthetics.
    3. We might want to use stat_summary().

Statistical Transformation

Observed Value vs. Number of Observations

  • 1. We might want to override the default stat.
# to make a simple data.frame
demo <- tribble(         
  ~cut,         ~freq,
  "Fair",       1610,
  "Good",       4906,
  "Very Good",  12082,
  "Premium",    13791,
  "Ideal",      21551 )

ggplot(data = demo) +
  geom_bar(mapping = 
             aes(x = cut, 
                 y = freq), 
           stat = "identity")
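As a variation on the chart above, geom_col() is a shortcut for geom_bar() with stat = "identity". The sketch below also converts cut to a factor, because a character variable is otherwise drawn in alphabetical order; the name demo_ordered is illustrative.

```r
library(tidyverse)

demo <- tribble(
  ~cut,         ~freq,
  "Fair",       1610,
  "Good",       4906,
  "Very Good",  12082,
  "Premium",    13791,
  "Ideal",      21551 )

# Keep the cuts in the order they appear in the table
# rather than in alphabetical order
demo_ordered <- mutate(demo, cut = fct_inorder(cut))

# geom_col() plots the y values as they are, without counting rows
p_col <- ggplot(data = demo_ordered) +
  geom_col(mapping = aes(x = cut, y = freq))
p_col
```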

Statistical Transformation

Count vs. Proportion

  • 2. We might want to override the default mapping from transformed variables to aesthetics, for example to display proportions rather than counts.
ggplot(data = diamonds) + 
  geom_bar(mapping = 
             aes(x = cut, 
                 y = after_stat(prop),  # stat() is deprecated as of ggplot2 3.4.0
                 group = 1))
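The proportions in this chart can be computed by hand, which also shows why group = 1 is needed. This is a sketch with dplyr; cut_props is an illustrative name.

```r
library(ggplot2)
library(dplyr)

# Count diamonds per cut, then divide by the overall total
cut_props <- mutate(count(diamonds, cut), prop = n / sum(n))
print(cut_props)

# group = 1 puts all bars in one group, so each proportion is taken
# relative to the whole data set. Without it, every cut is its own
# group and each bar would have a proportion of 1.
```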

Statistical Transformation

Stat summary

  • 3. We might want to use stat_summary().
ggplot(data = diamonds) + 
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.min = min,
    fun.max = max,
    fun = median
  )
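The summary that stat_summary() draws can be reproduced explicitly. The sketch below computes the same three statistics with dplyr and plots them with geom_pointrange(), the default geom of stat_summary(); the object names are illustrative.

```r
library(ggplot2)
library(dplyr)

# Minimum, median, and maximum depth for each cut
depth_summary <- diamonds %>%
  group_by(cut) %>%
  summarise(ymin = min(depth),
            y    = median(depth),
            ymax = max(depth))
print(depth_summary)

# Draw the precomputed summary: a point at the median,
# with a line from the minimum to the maximum
p_range <- ggplot(data = depth_summary) +
  geom_pointrange(mapping = aes(x = cut, y = y,
                                ymin = ymin, ymax = ymax))
p_range
```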

Position Adjustment

Stacked bar charts with fill aesthetic

  • Note that the bars are automatically stacked if we map the fill aesthetic to another variable.
ggplot(data = diamonds) + 
  geom_bar(mapping = 
             aes(x = cut, 
                 fill = clarity) )

Position Adjustment

Stacked bar charts with fill aesthetic

  • The stacking is performed automatically by the position adjustment specified by the position argument.
ggplot(data = diamonds) + 
  geom_bar(mapping = 
             aes(x = cut, 
                 fill = clarity),
           position = "stack")

Position Adjustment

  • If we don’t want a stacked bar chart with counts, we can use one of two other position options: fill or dodge.

  • position = "fill" works like stacking, but makes each set of stacked bars the same height.

    • This makes it easier to compare proportions across groups.
  • position = "dodge" places overlapping objects directly beside one another.

Position Adjustment

position = "fill"

ggplot(data = diamonds) + 
  geom_bar(mapping = 
             aes(x = cut, 
                 fill = clarity), 
           position = "fill")
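What position = "fill" displays can be worked out by hand: within each cut, the counts by clarity are rescaled so that they sum to 1. The sketch below uses dplyr, and the object names are illustrative.

```r
library(ggplot2)
library(dplyr)

# Count diamonds for every cut-clarity pair, then convert the counts
# to shares within each cut
clarity_shares <- diamonds %>%
  count(cut, clarity) %>%
  group_by(cut) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()
print(clarity_shares)

# Every stack has the same total height, which is
# why the bars are directly comparable as proportions
share_totals <- clarity_shares %>%
  group_by(cut) %>%
  summarise(total = sum(share))
print(share_totals)
```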

Position Adjustment

position = "dodge"

ggplot(data = diamonds) + 
  geom_bar(mapping = 
             aes(x = cut, 
                 fill = clarity), 
           position = "dodge")

The Layered Grammar of Graphics (ggplot)

The Layered Grammar of Graphics

  • The template below summarizes the layered grammar of graphics: faceting, stats, position adjustments, and coordinate systems.
ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(
     mapping = aes(<MAPPINGS>),
     stat = <STAT>, 
     position = <POSITION>) +
  <FACET_FUNCTION>
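One possible way to fill in the template with the diamonds data is sketched below. The particular choices (fill by clarity, the "count" stat, the "fill" position, and faceting by color) are assumptions picked for illustration.

```r
library(ggplot2)

p_template <- ggplot(data = diamonds) +      # <DATA>
  geom_bar(                                  # <GEOM_FUNCTION>
    mapping = aes(x = cut, fill = clarity),  # <MAPPINGS>
    stat = "count",                          # <STAT>
    position = "fill") +                     # <POSITION>
  facet_wrap(~ color)                        # <FACET_FUNCTION>
p_template
```

Here stat = "count" is already the default for geom_bar(); it is spelled out only to show where the stat fits in the template.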