library(tidyverse)
library(ggthemes)
library(nycflights13)
flights <- flights
Distribution Plots
Classwork 12
For this classwork, consider the flights data.frame.
Description of Variables in flights
Use ??flights to see the description of variables in the flights data.frame.
Q1a
- Provide ggplot() code to describe the distribution of air_time (amount of time spent in the air, in minutes).
ggplot(data = __BLANK 1__,
       mapping = aes(__BLANK 2__)) +
  __BLANK 3__()
Answer:
- Since a larger air_time means a longer flight, this integer-valued variable can be treated as numeric, so geom_histogram() is the safer choice.
ggplot(data = flights,
       mapping = aes(x = air_time)) +
  geom_histogram()
- In ggplot, the distribution of an integer variable can look quite similar under geom_bar() and geom_histogram() when the histogram uses binwidth = 1.
ggplot(data = flights,
       mapping = aes(x = air_time)) +
  geom_histogram(binwidth = 1)
ggplot(data = flights,
       mapping = aes(x = air_time)) +
  geom_bar()
- In Python or other visualization tools, they can be quite different.
Q1b
- Provide ggplot() code to describe how the distribution of air_time varies by origin.
ggplot(data = __BLANK 1__,
       mapping = aes(__BLANK 2__)) +
  __BLANK 3__() +
  facet_wrap(__BLANK 4__)
Answer:
ggplot(data = flights,
       mapping = aes(x = air_time)) +
  geom_histogram(bins = 60) +
  facet_wrap(~origin)
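Faceting puts each origin in its own panel. As an alternative sketch (not part of the classwork answer), overlaying frequency polygons with geom_freqpoly() and mapping origin to color places the three distributions on one set of axes, which can make direct comparison easier. The object name p_freqpoly and the 10-minute binwidth are arbitrary choices made here for illustration.

```r
library(ggplot2)
library(nycflights13)

# Overlay one frequency polygon per origin instead of faceting.
# binwidth = 10 (minutes) is an arbitrary illustrative choice.
p_freqpoly <- ggplot(data = flights,
                     mapping = aes(x = air_time, color = origin)) +
  geom_freqpoly(binwidth = 10)

p_freqpoly
```

Rows with a missing air_time are dropped with a warning, just as they are in the histogram version.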
Q1c
- Create the data.frame, top5_n, which includes two variables and 5 observations:
  - carrier: the top 5 carriers in terms of the number of flights.
  - n: the number of flights operated by each of the top 5 carriers.
top5_n <- flights |>
  __BLANK__ |>
  arrange(-n) |>
  head(5) # returns the first 5 observations of the new data.frame
Answer:
top5_n <- flights |>
  count(carrier) |>
  arrange(-n) |>
  head(5)
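count() can also sort its own output, so the arrange() step may be folded into it. A minimal sketch on the same flights data; top5_n_alt is a hypothetical name used here to avoid overwriting top5_n.

```r
library(dplyr)
library(nycflights13)

# sort = TRUE orders the counts from largest to smallest,
# which replaces the separate arrange(-n) step.
top5_n_alt <- flights |>
  count(carrier, sort = TRUE) |>
  head(5)

top5_n_alt
```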
Q1d
- Create the data.frame, top5_carriers, which includes all the flights operated by only the top 5 carriers found in Q1c.
top5_carriers <- flights |>
  filter(carrier == "__BLANK 1__" |
         carrier == "__BLANK 2__" |
         carrier == "__BLANK 3__" |
         carrier == "__BLANK 4__" |
         carrier == "__BLANK 5__")
Answer:
- We can see which carriers are the top 5 from top5_n:
top5_carriers <- flights |>
  filter(carrier == "UA" |
         carrier == "B6" |
         carrier == "EV" |
         carrier == "DL" |
         carrier == "AA")
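Chaining five == comparisons with | works, but it is easy to mistype. A sketch of an equivalent filter using %in%, which tests membership in a vector. top5_carriers_alt is a hypothetical name, and the block recomputes top5_n so that it runs on its own.

```r
library(dplyr)
library(nycflights13)

# Recompute the top 5 carriers so this block is self-contained.
top5_n <- flights |>
  count(carrier) |>
  arrange(-n) |>
  head(5)

# Keep only flights whose carrier appears among the top 5 carriers.
top5_carriers_alt <- flights |>
  filter(carrier %in% top5_n$carrier)
```

Because the carrier codes come from top5_n rather than being typed by hand, this version would keep working if the underlying data changed.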
Q1e
- Provide ggplot() code to describe the distribution of carrier using the top5_carriers data.frame.
ggplot(data = __BLANK 1__,
       mapping = aes(__BLANK 2__)) +
  __BLANK 3__()
Answer:
- Here we are using a horizontal bar chart:
ggplot(data = top5_carriers,
       mapping = aes(y = carrier)) +
  geom_bar()
Q1f
- Provide ggplot() code to describe how the distribution of carrier varies by origin using the top5_carriers data.frame.
Stacked Bar Chart
ggplot(data = __BLANK 1__,
       mapping = aes(y = __BLANK 2__,
                     __BLANK 3__)) +
  geom_bar()
Answer:
ggplot(data = top5_carriers,
       mapping = aes(y = origin,
                     fill = carrier)) +
  geom_bar()
100% Stacked Bar Chart
ggplot(data = __BLANK 1__,
       mapping = aes(y = __BLANK 2__,
                     __BLANK 3__)) +
  geom_bar(position = __BLANK 4__) +
  labs(x = "Proportion")
Answer:
ggplot(data = top5_carriers,
       mapping = aes(y = origin,
                     fill = carrier)) +
  geom_bar(position = "fill") +
  labs(x = "Proportion")
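position = "fill" rescales each bar so that the carrier shares within an origin sum to 1. To see the numbers behind the chart, the same proportions can be computed by hand. A sketch, with carrier_share and prop as hypothetical names. It rebuilds top5_carriers with %in% so that the block runs on its own.

```r
library(dplyr)
library(nycflights13)

# Rebuild top5_carriers so this block is self-contained.
top5_carriers <- flights |>
  filter(carrier %in% c("UA", "B6", "EV", "DL", "AA"))

# Count flights per origin-carrier pair, then divide by the origin total.
carrier_share <- top5_carriers |>
  count(origin, carrier) |>
  group_by(origin) |>
  mutate(prop = n / sum(n)) |>
  ungroup()

carrier_share
```

Each prop value corresponds to the width of one colored segment in the 100% stacked bar for that origin.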
Clustered Bar Chart
ggplot(data = __BLANK 1__,
       mapping = aes(y = __BLANK 2__,
                     __BLANK 3__)) +
  geom_bar(position = __BLANK 4__)
Answer:
ggplot(data = top5_carriers,
       mapping = aes(y = origin,
                     fill = carrier)) +
  geom_bar(position = "dodge")
- Cf. position = "dodge2", which adds a gap between bars:
ggplot(data = top5_carriers,
       mapping = aes(y = origin,
                     fill = carrier)) +
  geom_bar(position = "dodge2")
Facetted Bar Chart
ggplot(data = __BLANK 1__,
       mapping = aes(y = __BLANK 2__,
                     __BLANK 3__)) +
  geom_bar(show.legend = FALSE) +
  facet_wrap(__BLANK 4__)
Answer:
ggplot(data = top5_carriers,
       mapping = aes(y = carrier,
                     fill = carrier)) +
  geom_bar(show.legend = FALSE) +
  facet_wrap(~origin)
Q1g
- Provide ggplot() code to describe the distribution of carrier using the top5_n data.frame.
ggplot(data = top5_n,
       mapping = aes(x = __BLANK 1__,
                     y = __BLANK 2__)) +
  __BLANK 3__()
Answer:
ggplot(data = top5_n,
       mapping = aes(x = n,
                     y = carrier)) +
  geom_col()
Q1h
- Provide ggplot() code to describe the sorted distribution of carrier using the top5_n data.frame.
ggplot(data = top5_n,
       mapping = aes(x = __BLANK 1__,
                     y = __BLANK 2__)) +
  __BLANK 3__()
Answer:
ggplot(data = top5_n,
       mapping = aes(x = n,
                     y = fct_reorder(carrier, n))) +
  geom_col()
- labs() can set the y-axis title, which otherwise shows the fct_reorder() expression:
ggplot(data = top5_n,
       mapping = aes(x = n,
                     y = fct_reorder(carrier, n))) +
  geom_col() +
  labs(y = "Carrier")
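The sorting in the chart comes from fct_reorder(), which reorders the levels of a factor by the values of a second variable; ggplot then draws the bars in level order. A small sketch of that effect, using made-up counts (carrier_toy, n_toy, and carrier_sorted are hypothetical names, and the numbers are not taken from flights).

```r
library(forcats)

# Hypothetical toy values, for illustration only.
carrier_toy <- c("UA", "B6", "EV")
n_toy <- c(3, 1, 2)

# Levels are reordered by n_toy, from smallest to largest.
carrier_sorted <- fct_reorder(carrier_toy, n_toy)

levels(carrier_sorted)
```

On a horizontal bar chart the first level is drawn at the bottom, so the largest value ends up at the top.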
Q1i
- Create the following data frame named carrier_per_origin with the following three variables:
  - origin: the origin airport
  - carrier: the airline carrier
  - n: the number of flights operated by each carrier from each origin airport
- The carrier_per_origin data frame should contain the count of flights operated by each carrier departing from each origin airport.
carrier_per_origin <- flights |>
  __BLANK__ |>
  arrange(origin, -n)
Answer:
carrier_per_origin <- flights |>
  count(origin, carrier) |>
  arrange(origin, -n)
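A quick way to check the result: every flight falls into exactly one origin-carrier group, so the counts should add back up to the number of rows in flights. A sketch of that check; the block repeats the answer so that it runs on its own.

```r
library(dplyr)
library(nycflights13)

carrier_per_origin <- flights |>
  count(origin, carrier) |>
  arrange(origin, -n)

# The per-group counts should sum to the total number of flights.
sum(carrier_per_origin$n) == nrow(flights)
```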