Distribution Plots

Classwork 12

Author

Byeong-Hak Choe

Published

November 11, 2024

Modified

November 17, 2024

For this classwork, consider the flights data.frame from the nycflights13 package.

library(tidyverse)
library(ggthemes)
library(nycflights13)

flights <- flights

Description of Variables in `flights`

Use ?flights to see the description of variables in the flights data.frame.
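
Alongside the help page, glimpse() gives a compact overview of each variable and its type. The following is a minimal sketch, assuming dplyr and nycflights13 are installed; the names air and whole_minutes are only illustrative.

```r
library(dplyr)
library(nycflights13)

# One line per variable: name, type, and the first few values
glimpse(flights)

# air_time is numeric; check whether every non-missing value is a whole minute
air <- flights$air_time
whole_minutes <- all(air == round(air), na.rm = TRUE)
```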

Q1a

Provide ggplot() code to describe the distribution of air_time (amount of time spent in the air, in minutes).

ggplot(data = __BLANK 1__,
       mapping = aes(__BLANK 2__)) +
  __BLANK 3__()

Answer:

Because a larger air_time means a longer flight, this integer-valued variable can be treated as numeric rather than categorical.
- geom_histogram() is therefore the safer choice.

ggplot(data = flights,
       mapping = aes(x = air_time)) +
  geom_histogram()

In ggplot, the distribution of an integer variable can appear quite similar when using geom_bar() and geom_histogram().

ggplot(data = flights,
       mapping = aes(x = air_time)) +
  geom_histogram(binwidth = 1)

ggplot(data = flights,
       mapping = aes(x = air_time)) +
  geom_bar()

In Python or other visualization tools, they can be quite different.
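
One way to check the similarity numerically is to pull the computed counts out of each layer with layer_data(). This is a sketch under the assumption that missing air_time values are dropped first; the object names (air, p_hist, p_bar, and so on) are hypothetical.

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

# Drop missing air_time values up front so both layers see the same rows
air <- flights |>
  filter(!is.na(air_time))

p_hist <- ggplot(data = air, mapping = aes(x = air_time)) +
  geom_histogram(binwidth = 1)

p_bar <- ggplot(data = air, mapping = aes(x = air_time)) +
  geom_bar()

# layer_data() returns the values ggplot2 computed for a layer
hist_counts <- layer_data(p_hist)$count
bar_counts  <- layer_data(p_bar)$count

# Both layers account for every non-missing flight
c(histogram = sum(hist_counts), bar = sum(bar_counts), rows = nrow(air))
```

The histogram keeps a bin for every 1-minute interval in the range, including empty ones, while the bar chart only draws a bar for each air_time value that actually occurs.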

Q1b

Provide ggplot() code to describe how the distribution of air_time varies by origin.

ggplot(data = __BLANK 1__,
       mapping = aes(__BLANK 2__)) +
  __BLANK 3__() +
  facet_wrap(__BLANK 4__)

Answer:

ggplot(data = flights,
       mapping = aes(x = air_time)) +
  geom_histogram(bins = 60) +
  facet_wrap(~origin)
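
Two variations may make the comparison across origins easier to read; they are offered as a sketch, not as the expected answer. The first stacks the panels in one column so they share an x-axis, and the second overlays the three distributions with geom_freqpoly(). The names p_stacked and p_overlay are illustrative.

```r
library(ggplot2)
library(nycflights13)

# Stack the panels vertically so the histograms line up on the same x-axis
p_stacked <- ggplot(data = flights,
                    mapping = aes(x = air_time)) +
  geom_histogram(bins = 60) +
  facet_wrap(~origin, ncol = 1)

# Overlay the three origins as frequency polygons in a single panel
p_overlay <- ggplot(data = flights,
                    mapping = aes(x = air_time,
                                  color = origin)) +
  geom_freqpoly(bins = 60)
```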

Q1c

Create the data.frame, top5_n, which includes the following two variables and 5 observations:
- carrier: the value of the top 5 carriers in terms of the number of flights.
- n: the number of flights operated by each of the top 5 carriers.

top5_n <- flights |> 
  __BLANK__ |> 
  arrange(-n) |> 
  head(5)  # returns the first 5 observations of the new data.frame

Answer:

top5_n <- flights |> 
  count(carrier) |> 
  arrange(-n) |> 
  head(5)
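
As an alternative sketch, count() can sort its own output with sort = TRUE, and slice_head() can stand in for head(). The name top5_n_alt is hypothetical and is used only to avoid overwriting top5_n.

```r
library(dplyr)
library(nycflights13)

top5_n_alt <- flights |>
  count(carrier, sort = TRUE) |>  # sort = TRUE orders rows by n, largest first
  slice_head(n = 5)               # keep the first 5 rows
```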

Q1d

Create the data.frame, top5_carriers, which includes all the flights operated by only the top 5 carriers found in Q1c.

top5_carriers <- flights |> 
  filter(carrier == "__BLANK 1__" |
         carrier == "__BLANK 2__" |
         carrier == "__BLANK 3__" |
         carrier == "__BLANK 4__" |
         carrier == "__BLANK 5__" )

Answer:

The top 5 carriers can be read off top5_n:

top5_carriers <- flights |> 
  filter(carrier == "UA" |
         carrier == "B6" |
         carrier == "EV" |
         carrier == "DL" |
         carrier == "AA" )
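
The chain of == comparisons can also be written with %in%, which avoids typing the carrier codes by hand. This is a sketch that rebuilds top5_n as in Q1c; top5_carriers_alt is a hypothetical name.

```r
library(dplyr)
library(nycflights13)

# Rebuild top5_n as in Q1c
top5_n <- flights |>
  count(carrier) |>
  arrange(-n) |>
  head(5)

# Keep only flights whose carrier appears in top5_n$carrier
top5_carriers_alt <- flights |>
  filter(carrier %in% top5_n$carrier)
```

If the data change, this version follows the updated top 5 automatically, whereas the hard-coded version would need editing.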

Q1e

Provide ggplot() code to describe the distribution of carrier using the top5_carriers data.frame.

ggplot(data = __BLANK 1__,
       mapping = aes(__BLANK 2__)) +
  __BLANK 3__()

Answer:

Here we are using a horizontal bar chart:

ggplot(data = top5_carriers,
       mapping = aes(y = carrier)) +
  geom_bar()
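
If the bars should appear in order of frequency, one possible approach on the raw rows is fct_infreq() from forcats. The sketch below assumes the five carrier codes from Q1d; p_sorted is an illustrative name.

```r
library(dplyr)
library(ggplot2)
library(forcats)
library(nycflights13)

top5_carriers <- flights |>
  filter(carrier %in% c("UA", "B6", "EV", "DL", "AA"))

# fct_infreq() orders the levels by frequency, most common first;
# fct_rev() reverses them so the most common carrier is drawn at the top
p_sorted <- ggplot(data = top5_carriers,
                   mapping = aes(y = fct_rev(fct_infreq(carrier)))) +
  geom_bar() +
  labs(y = "Carrier")
```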

Q1f

Provide ggplot() code to describe how the distribution of carrier varies by origin using the top5_carriers data.frame.

Stacked Bar Chart

ggplot(data = __BLANK 1__,
       mapping = aes(y = __BLANK 2__,
                     __BLANK 3__)) +
  geom_bar()

Answer:

ggplot(data = top5_carriers,
       mapping = aes(y = origin,
                     fill = carrier)) +
  geom_bar()

100% Stacked Bar Chart

ggplot(data = __BLANK 1__,
       mapping = aes(y = __BLANK 2__,
                     __BLANK 3__)) +
  geom_bar(position = __BLANK 4__) +
  labs(x = "Proportion")

Answer:

ggplot(data = top5_carriers,
       mapping = aes(y = origin,
                     fill = carrier)) +
  geom_bar(position = "fill") +
  labs(x = "Proportion")
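
To see the numbers behind a 100% stacked bar, the within-origin proportions can be computed by hand. This is a minimal sketch, assuming the five carrier codes from Q1d; carrier_share and prop are hypothetical names.

```r
library(dplyr)
library(nycflights13)

top5_carriers <- flights |>
  filter(carrier %in% c("UA", "B6", "EV", "DL", "AA"))

carrier_share <- top5_carriers |>
  count(origin, carrier) |>     # flights per origin-carrier pair
  group_by(origin) |>
  mutate(prop = n / sum(n)) |>  # share of each carrier within its origin
  ungroup()
```

Within each origin the prop values add up to 1, which is why every bar in the 100% stacked chart has the same length.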

Clustered Bar Chart

ggplot(data = __BLANK 1__,
       mapping = aes(y = __BLANK 2__,
                     __BLANK 3__)) +
  geom_bar(position = __BLANK 4__)

Answer:

ggplot(data = top5_carriers,
       mapping = aes(y = origin,
                     fill = carrier)) +
  geom_bar(position = "dodge")

cf. position = "dodge2", which adds a gap between the bars:

ggplot(data = top5_carriers,
       mapping = aes(y = origin,
                     fill = carrier)) +
  geom_bar(position = "dodge2")

Facetted Bar Chart

ggplot(data = __BLANK 1__,
       mapping = aes(y = __BLANK 2__,
                     __BLANK 3__)) +
  geom_bar(show.legend = FALSE) +
  facet_wrap(__BLANK 4__)

Answer:

ggplot(data = top5_carriers,
       mapping = aes(y = carrier,
                     fill = carrier)) +
  geom_bar(show.legend = FALSE) +
  facet_wrap(~origin)

Q1g

Provide ggplot() code to describe the distribution of carrier using the top5_n data.frame.

ggplot(data = top5_n,
       mapping = aes(x = __BLANK 1__,
                     y = __BLANK 2__)) +
  __BLANK 3__()

Answer:

ggplot(data = top5_n,
       mapping = aes(x = n,
                     y = carrier)) +
  geom_col()

Q1h

Provide ggplot() code to describe the sorted distribution of carrier using the top5_n data.frame.

ggplot(data = top5_n,
       mapping = aes(x = __BLANK 1__,
                     y = __BLANK 2__)) +
  __BLANK 3__()

Answer:

ggplot(data = top5_n,
       mapping = aes(x = n,
                     y = fct_reorder(carrier, n))) +
  geom_col()

labs() can set the y-axis title, replacing the default fct_reorder(carrier, n) label.

ggplot(data = top5_n,
       mapping = aes(x = n,
                     y = fct_reorder(carrier, n))) +
  geom_col() +
  labs(y = "Carrier")
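
To see why fct_reorder() sorts the bars, it can help to inspect the factor levels it produces. The following is a sketch that rebuilds top5_n as in Q1c; reordered is an illustrative name.

```r
library(dplyr)
library(forcats)
library(nycflights13)

top5_n <- flights |>
  count(carrier) |>
  arrange(-n) |>
  head(5)

# fct_reorder() orders the levels of carrier by n, smallest first
reordered <- fct_reorder(top5_n$carrier, top5_n$n)
levels(reordered)
```

ggplot2 draws the first level at the bottom of a discrete y-axis, so the carrier with the largest n ends up at the top of the chart.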

Q1i

Create a data frame named carrier_per_origin with the following three variables:
- origin: the origin airport
- carrier: the airline carrier
- n: the number of flights operated by each carrier from each origin airport
Each row should give the count of flights for one carrier departing from one origin airport.

carrier_per_origin <- flights |> 
  __BLANK__ |> 
  arrange(origin, -n)

Answer:

carrier_per_origin <- flights |> 
  count(origin, carrier) |> 
  arrange(origin, -n)
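
Because carrier_per_origin already holds the counts, it can be plotted with geom_col() in the same way as top5_n in Q1g. This is a sketch of one possible faceted layout, not part of the question itself; p_per_origin is a hypothetical name.

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

carrier_per_origin <- flights |>
  count(origin, carrier) |>
  arrange(origin, -n)

# geom_col() uses the precomputed n as the bar length; one panel per origin
p_per_origin <- ggplot(data = carrier_per_origin,
                       mapping = aes(x = n,
                                     y = carrier)) +
  geom_col() +
  facet_wrap(~origin)
```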
