Distribution Plots

Classwork 12

Author

Byeong-Hak Choe

Published

November 11, 2024

Modified

November 17, 2024

For this classwork, consider the flights data.frame from the nycflights13 package.

library(tidyverse)
library(ggthemes)
library(nycflights13)

flights <- flights

Description of Variables in flights

Use ?flights to open the help page that describes the variables in the flights data.frame.


Q1a

  • Provide ggplot() code to describe the distribution of air_time (amount of time spent in the air, in minutes).
ggplot(data = __BLANK 1__,
       mapping = aes(__BLANK 2__)) +
  __BLANK 3__()

Answer:

  • Because a larger air_time means a longer flight, this whole-number variable is best treated as numeric.
    • geom_histogram() is therefore the safer choice.
ggplot(data = flights,
       mapping = aes(x = air_time)) +
  geom_histogram()

  • In ggplot, geom_bar() and geom_histogram(binwidth = 1) give very similar pictures of the distribution of an integer-valued variable.
ggplot(data = flights,
       mapping = aes(x = air_time)) +
  geom_histogram(binwidth = 1)

ggplot(data = flights,
       mapping = aes(x = air_time)) +
  geom_bar()

  • In Python or other visualization tools, a bar chart and a histogram of the same variable can look quite different.
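
The two charts above look alike only because each histogram bin is one minute wide. As a rough sketch (the names air_time_counts and p_wide, and the 10-minute bin width, are illustrative assumptions), the counts behind geom_bar() can be tabulated directly, and a wider bin shows how the histogram starts pooling neighboring values:

```r
library(tidyverse)
library(nycflights13)

# geom_bar() draws one bar per distinct value; these are the bar heights
air_time_counts <- flights |>
  filter(!is.na(air_time)) |>
  count(air_time)

# With 10-minute bins the histogram pools neighboring air_time values,
# so it no longer matches the bar chart
p_wide <- ggplot(data = flights,
                 mapping = aes(x = air_time)) +
  geom_histogram(binwidth = 10)
p_wide
```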


Q1b

  • Provide ggplot() code to describe how the distribution of air_time varies by origin.
ggplot(data = __BLANK 1__,
       mapping = aes(__BLANK 2__)) +
  __BLANK 3__() +
  facet_wrap(__BLANK 4__)

Answer:

ggplot(data = flights,
       mapping = aes(x = air_time)) +
  geom_histogram(bins = 60) +
  facet_wrap(~origin)


Q1c

  • Create the data.frame top5_n, which includes two variables and 5 observations:
    • carrier: the top 5 carriers in terms of the number of flights.
    • n: the number of flights operated by each of the top 5 carriers.
top5_n <- flights |> 
  __BLANK__ |> 
  arrange(-n) |> 
  head(5)  # returns the first 5 observations of the new data.frame

Answer:

top5_n <- flights |> 
  count(carrier) |> 
  arrange(-n) |> 
  head(5)
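
The same table can probably be built a little more compactly. The sketch below shows two possible variants; the names top5_n_sorted and top5_n_slice are illustrative and not part of the classwork:

```r
library(tidyverse)
library(nycflights13)

# sort = TRUE orders the counts from largest to smallest
top5_n_sorted <- flights |>
  count(carrier, sort = TRUE) |>
  head(5)

# slice_max() keeps the rows with the largest n;
# with_ties = FALSE caps the result at exactly 5 rows
top5_n_slice <- flights |>
  count(carrier) |>
  slice_max(n, n = 5, with_ties = FALSE)
```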


Q1d

  • Create the data.frame, top5_carriers, which includes all the flights operated by only the top 5 carriers found in Q1c.
top5_carriers <- flights |> 
  filter(carrier == "__BLANK 1__" |
         carrier == "__BLANK 2__" |
         carrier == "__BLANK 3__" |
         carrier == "__BLANK 4__" |
         carrier == "__BLANK 5__" ) 

Answer:

  • We can read the top 5 carriers off top5_n:
top5_carriers <- flights |> 
  filter(carrier == "UA" |
         carrier == "B6" |
         carrier == "EV" |
         carrier == "DL" |
         carrier == "AA" ) 
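
Typing out five comparisons works, but it has to be edited by hand if the ranking changes. One possible alternative, sketched here under the assumption that top5_n was built as in Q1c, is to match against the carrier column with %in%:

```r
library(tidyverse)
library(nycflights13)

# Rebuild top5_n as in Q1c so this sketch runs on its own
top5_n <- flights |>
  count(carrier) |>
  arrange(-n) |>
  head(5)

# Keep only the flights whose carrier appears in top5_n
top5_carriers <- flights |>
  filter(carrier %in% top5_n$carrier)
```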


Q1e

  • Provide ggplot() code to describe the distribution of carrier using the top5_carriers data.frame.
ggplot(data = __BLANK 1__,
       mapping = aes(__BLANK 2__)) +
  __BLANK 3__()

Answer:

  • Here we are using a horizontal bar chart:
ggplot(data = top5_carriers,
       mapping = aes(y = carrier)) +
  geom_bar()


Q1f

  • Provide ggplot() code to describe how the distribution of carrier varies by origin using the top5_carriers data.frame.

Stacked Bar Chart

ggplot(data = __BLANK 1__,
       mapping = aes(y = __BLANK 2__,
                     __BLANK 3__)) +
  geom_bar()

Answer:

ggplot(data = top5_carriers,
       mapping = aes(y = origin,
                     fill = carrier)) +
  geom_bar()

100% Stacked Bar Chart

ggplot(data = __BLANK 1__,
       mapping = aes(y = __BLANK 2__,
                     __BLANK 3__)) +
  geom_bar(position = __BLANK 4__) +
  labs(x = "Proportion")

Answer:

ggplot(data = top5_carriers,
       mapping = aes(y = origin,
                     fill = carrier)) +
  geom_bar(position = "fill") +
  labs(x = "Proportion")
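
To see the numbers behind a 100% stacked bar chart, the share of each carrier within each origin can be computed by hand. This is only a sketch; carrier_share and prop are illustrative names, and the carrier codes are the five from Q1d:

```r
library(tidyverse)
library(nycflights13)

top5_carriers <- flights |>
  filter(carrier %in% c("UA", "B6", "EV", "DL", "AA"))

# Count flights per origin and carrier, then divide by the origin total
carrier_share <- top5_carriers |>
  count(origin, carrier) |>
  group_by(origin) |>
  mutate(prop = n / sum(n)) |>
  ungroup()

carrier_share
```

Within each origin the prop values sum to 1, which is why every bar in the 100% stacked chart has the same length.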

Clustered Bar Chart

ggplot(data = __BLANK 1__,
       mapping = aes(y = __BLANK 2__,
                     __BLANK 3__)) +
  geom_bar(position = __BLANK 4__)

Answer:

ggplot(data = top5_carriers,
       mapping = aes(y = origin,
                     fill = carrier)) +
  geom_bar(position = "dodge")

  • For comparison, position = "dodge2" adds a small gap between the bars:
ggplot(data = top5_carriers,
       mapping = aes(y = origin,
                     fill = carrier)) +
  geom_bar(position = "dodge2")

Faceted Bar Chart

ggplot(data = __BLANK 1__,
       mapping = aes(y = __BLANK 2__,
                     __BLANK 3__)) +
  geom_bar(show.legend = FALSE) +
  facet_wrap(__BLANK 4__)

Answer:

ggplot(data = top5_carriers,
       mapping = aes(y = carrier,
                     fill = carrier)) +
  geom_bar(show.legend = FALSE) +
  facet_wrap(~origin)


Q1g

  • Provide ggplot() code to describe the distribution of carrier using the top5_n data.frame.
ggplot(data = top5_n,
       mapping = aes(x = __BLANK 1__,
                     y = __BLANK 2__)) +
  __BLANK 3__()

Answer:

ggplot(data = top5_n,
       mapping = aes(x = n,
                     y = carrier)) +
  geom_col()


Q1h

  • Provide ggplot() code to describe the sorted distribution of carrier using the top5_n data.frame.
ggplot(data = top5_n,
       mapping = aes(x = __BLANK 1__,
                     y = __BLANK 2__)) +
  __BLANK 3__()

Answer:

ggplot(data = top5_n,
       mapping = aes(x = n,
                     y = fct_reorder(carrier, n))) +
  geom_col()

  • labs() can set the y-axis title, replacing the default label fct_reorder(carrier, n).
ggplot(data = top5_n,
       mapping = aes(x = n,
                     y = fct_reorder(carrier, n))) +
  geom_col() +
  labs(y = "Carrier")
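
A small toy example may help show what fct_reorder() does to the factor levels. The data frame toy below is hypothetical and not part of the flights data:

```r
library(tidyverse)

# Hypothetical data: three carriers with made-up counts
toy <- tibble(carrier = c("A", "B", "C"),
              n = c(3, 1, 2))

levels(factor(toy$carrier))              # alphabetical: "A" "B" "C"
levels(fct_reorder(toy$carrier, toy$n))  # ordered by n:  "B" "C" "A"
```

On a horizontal bar chart the first level is drawn at the bottom, so the carrier with the largest n ends up at the top.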


Q1i

  • Create a data frame named carrier_per_origin with the following three variables:
    • origin: the origin airport
    • carrier: the airline carrier
    • n: the number of flights operated by each carrier from each origin airport
  • The carrier_per_origin data frame should have one row for each combination of origin airport and carrier.
carrier_per_origin <- flights |> 
  __BLANK__ |> 
  arrange(origin, -n)

Answer:

carrier_per_origin <- flights |> 
  count(origin, carrier) |> 
  arrange(origin, -n)
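
count(origin, carrier) behaves like a grouped summary. As a sketch, the same table can be written out with group_by() and summarise(); the name carrier_per_origin_alt is illustrative:

```r
library(tidyverse)
library(nycflights13)

# Group by both variables, count the rows in each group with n(),
# and drop the grouping afterwards
carrier_per_origin_alt <- flights |>
  group_by(origin, carrier) |>
  summarise(n = n(), .groups = "drop") |>
  arrange(origin, -n)
```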

