library(tidyverse)
library(nycflights13)
<- flights
flights
::datatable(flights |> head(100)) DT
Classwork 7
Data Transformation Practice
Consider the flights
data.frame from the nycflights13
package.
Question 1
- Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay?
- Assume that a flight is cancelled if either
dep_delay
,arr_delay
, or both are missing.
- Assume that a flight is cancelled if either
Answer:
Question 2
- Which
carrier
has the worst arrival delays? - Calculate the proportion of flights with an arrival delay greater than 15 minutes for each carrier and for each origin.
- Can you disentangle the effects of bad airports vs. bad carriers? Why/why not?
Answer:
Question 3
- Find all destinations that are flown by at least two carriers. Use that information to rank the carriers.
Answer:
Question 4
- The following is the data.frame for Question 4.
<- read_csv("https://bcdanl.github.io/data/holiday_movies.csv") holiday_movies
The data.frame holiday_movies comes from the Internet Movie Database (IMDb).
The following is the data.frame, holiday_movies.
Variable description
tconst: alphanumeric unique identifier of the title
title_type: the type/format of the title
- (movie, video, or tvMovie)
primary_title: the more popular title / the title used by the filmmakers on promotional materials at the point of release
simple_title: the title in lowercase, with punctuation removed, for easier filtering and grouping
year: the release year of a title
runtime_minutes: primary runtime of the title, in minutes
average_rating: weighted average of all the individual user ratings on IMDb
num_votes: number of votes the title has received on IMDb (titles with fewer than 10 votes were not included in this dataset)
- The following is another data.frame holiday_movie_genres that is related with the data.frame holiday_movies:
<- read_csv("https://bcdanl.github.io/data/holiday_movie_genres.csv") holiday_movie_genres
- The data.frame holiday_movie_genres include up to three genres associated with the titles that appear in the data.frame.
Variable description
- tconst: alphanumeric unique identifier of the title
- genres: genres associated with the title, one row per genre
Q4a.
- Provide the R code to generate the data.frame, holiday_movie_with_genres, which combines the two data.frames, holiday_movies and holiday_movie_genres:
- The following shows the first four variables in holiday_movie_with_genres:
Q4b.
Provide the R code using
skimr::skim()
to see how the summary statistics—mean, median, standard deviation, minimum, maximum, first and third quartiles—of average_rating and num_votes varies by popular genres and title_type.- Consider only the five popular genres, which are selected in terms of the number of titles for each genre.
- Removes the video type of the titles when calculating the summary statistics.
Q4c.
- Provide R code to recreate the ggplot figure illustrating how the relationship between log10(num_votes) and average_rating varies by the popular genres and title_type.
- The five popular genres are selected in terms of the number of titles for each genre.
- The video type of the titles are removed in the ggplot figure.
Q4d.
- Provide a comment to illustrate how the relationship between log10(num_votes) and average_rating varies by the popular genres and title_type.
Q4e.
Provide R code to recreate the ggplot figure illustrating the annual trend of the share of number of movies by popular genre from year 1975 to 2022.
- For genres that are not popular, categorize them as “Other”.
- Consider reordering the categories in genres in descending order based on their share in the year 2022.
- Use “Set2” color palette from the
RColorBrewer
package.
- For genres that are not popular, categorize them as “Other”.
- c.f.) The following uses
geom_area()
instead:
Q4f.
Provide a comment to illustrate the annual trend of (1) the share of number of movies by popular genre from year 1975 to 2022.
- Which genre has become more popular since 2010?
Q4g.
Add the following two variables—christmas and holiday—to the data.frame holiday_movies_with_genres:
christmas:
- TRUE if the simple_title includes “christmas”, “xmas”, “x mas”
- FALSE otherwise
holiday:
- TRUE if the simple_title includes “holiday”
- FALSE otherwise
Q4h.
- Provide R code to recreate the ggplot figure illustrating the annual trend of (1) the number of movie titles with “holiday” varies by christmas.
Q4i.
- Provide R code to recreate the ggplot figure illustrating how the mean value of num_votes varies by the popular genres for the titles with “christmas”.
Question 5
- The following is the data.frame for Question 5.
<- read_csv("https://bcdanl.github.io/data/tripadvisor_cleaned.csv") tripadvisor
TripAdvisor is an online travel research company that empowers people around the world to plan and enjoy the ideal trip.
TripAdvisor wanted to know whether promoting membership on their platform could drive engagement and bookings.
To do so, TripAdvisor had just run an experiment to explore user retention by offering a random subset of customers an easier sign-up process for membership.
The following is the data.frame, tripadvisor.
Variable description
id: a unique identifier for a user.
time:
- PRE if time is before the experiment;
- POST if time is in the 28 days after the experiment.
- For each id value, there are two observations—one with time == “PRE” and the other with time == “POST”.
days_visited: Number of days a user visited the TripAdvisor website.
easier_signup:
- TRUE if a user was exposed to the easier signup process (e.g., one-click signup) during the experiment;
- FALSE otherwise.
became_member:
- TRUE if a user became a member during the experiment period;
- FALSE otherwise.
locale_en_US:
- TRUE if a user accessed the website from the US;
- FALSE otherwise.
os_type: Windows, Mac, or Others
revenue_pre: Amount of dollars a user spent on the website before the experiment
Q5a.
Using the given data.frame, tripadvisor, create the data.frame, tripadvisor, for which
- time is a factor-type variable of time with the first level, “PRE”.
Q5b.
Provide R code to recreate the ggplot figure illustrating how the relationship between time and days_visited varies by easier_signup and became_member.
- Row-wise split is based on easier_signup.
- Column-wise split is based on became_member.
Q5c.
- Provide a comment to illustrate how the relationship between time and days_visited varies by easier_signup and became_member.
Q5d.
- Provide a R code to create the data.frame Q5d that includes the variable diff, the difference between (1) the value of days_visited for time == PRE and (2) the value of days_visited for time == POST for each id.
- The resulting data.frame should look as follows:
- Include only the variables as shown above.
- Adjust the order of the variables as shown above.
Q5e.
- Provide an R code to calculate how the difference in the mean value of diff varies by easier_signup and became_member using the data.frame Q5d.
- The resulting data.frame should look as follows:
Q5f.
- Using the resulting data.frame in Q5e, discuss the following question:
- What is the effect of easier signup process on the number of days a user visited the TripAdvisor website?
- How does this effect varies by the status of became_member?
Question 6
In September 2019, YouGov survey asked 1,639 GB adults the following question:
In hindsight, do you think Britain was right/wrong to vote to leave EU?
- Right to leave
- Wrong to leave
- Don’t know
The data from the survey is in brexit.csv
.
<- read_csv('https://bcdanl.github.io/data/brexit.csv') brexit
datatable(brexit)
Q6a
- Replicate the following visualization
Q6b
- Replicate the following visualization
- How is the story this visualization telling different than the story the plot in Q6a?
Q6c
- First, calculate the proportion of wrong, right, and don’t know answers in each region and then plot these proportions (rather than the counts) and then improve axis labeling.
- Replicate the following visualization
- How is the story this visualization telling different than the story the plot in Q4b?
ggplot(q6, aes(y = opinion, x = prop,
fill = opinion)) +
geom_col() +
facet_wrap(~region,
nrow = 1, labeller = label_wrap_gen(width = 12),
# ___
+
) guides(fill = "none") +
labs(
title = "Was Britain right/wrong to vote to leave EU?",
subtitle = "YouGov Survey Results, 2-3 September 2019",
caption = "Source: bit.ly/2lCJZVg",
x = 'Percent', y = NULL
+
) scale_fill_manual(values = c(
"Wrong" = "#ef8a62",
"Right" = "#67a9cf",
"Don't know" = "gray"
+
)) scale_x_continuous(labels = scales::percent) +
theme_minimal()
Q6d.
Recreate the same visualization from the previous exercise, this time dodging the bars for opinion proportions for each region, rather than faceting by region and then improve the legend.
- How is the story this visualization telling different than the story the previous plot tells?
Discussion
Welcome to our Classwork 7 Discussion Board! 👋
This space is designed for you to engage with your classmates about the material covered in Classwork 7.
Whether you are looking to delve deeper into the content, share insights, or have questions about the content, this is the perfect place for you.
If you have any specific questions for Byeong-Hak (@bcdanl) regarding the Classwork 7 materials or need clarification on any points, don’t hesitate to ask here.
All comments will be stored here.
Let’s collaborate and learn from each other!