Classwork 7

Data Transformation Practice

Author

Byeong-Hak Choe

Published

March 10, 2025

Modified

April 2, 2025

Consider the flights data.frame from the nycflights13 package.

library(tidyverse)
library(nycflights13)

flights <- flights

DT::datatable(flights |> head(100))

Question 1

  • Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay?
    • Assume that a flight is cancelled if either dep_delay, arr_delay, or both are missing.

Answer:


Question 2

  • Which carrier has the worst arrival delays?
  • Calculate the proportion of flights with an arrival delay greater than 15 minutes for each carrier and for each origin.
  • Can you disentangle the effects of bad airports vs. bad carriers? Why/why not?

Answer:


Question 3

  • Find all destinations that are flown by at least two carriers. Use that information to rank the carriers.

Answer:


Question 4

  • The following is the data.frame for Question 4.
holiday_movies <- read_csv("https://bcdanl.github.io/data/holiday_movies.csv")
  • The data.frame holiday_movies comes from the Internet Movie Database (IMDb).

  • The following is the data.frame, holiday_movies.

Variable description

  • tconst: alphanumeric unique identifier of the title

  • title_type: the type/format of the title

    • (movie, video, or tvMovie)
  • primary_title: the more popular title / the title used by the filmmakers on promotional materials at the point of release

  • simple_title: the title in lowercase, with punctuation removed, for easier filtering and grouping

  • year: the release year of a title

  • runtime_minutes: primary runtime of the title, in minutes

  • average_rating: weighted average of all the individual user ratings on IMDb

  • num_votes: number of votes the title has received on IMDb (titles with fewer than 10 votes were not included in this dataset)



  • The following is another data.frame holiday_movie_genres that is related with the data.frame holiday_movies:
holiday_movie_genres <- read_csv("https://bcdanl.github.io/data/holiday_movie_genres.csv")
  • The data.frame holiday_movie_genres include up to three genres associated with the titles that appear in the data.frame.

Variable description

  • tconst: alphanumeric unique identifier of the title
  • genres: genres associated with the title, one row per genre



Q4a.

  • Provide the R code to generate the data.frame, holiday_movie_with_genres, which combines the two data.frames, holiday_movies and holiday_movie_genres:
  • The following shows the first four variables in holiday_movie_with_genres:



Q4b.

  • Provide the R code using skimr::skim() to see how the summary statistics—mean, median, standard deviation, minimum, maximum, first and third quartiles—of average_rating and num_votes varies by popular genres and title_type.

    • Consider only the five popular genres, which are selected in terms of the number of titles for each genre.
    • Removes the video type of the titles when calculating the summary statistics.



Q4c.

  • Provide R code to recreate the ggplot figure illustrating how the relationship between log10(num_votes) and average_rating varies by the popular genres and title_type.
    • The five popular genres are selected in terms of the number of titles for each genre.
    • The video type of the titles are removed in the ggplot figure.



Q4d.

  • Provide a comment to illustrate how the relationship between log10(num_votes) and average_rating varies by the popular genres and title_type.



Q4e.

  • Provide R code to recreate the ggplot figure illustrating the annual trend of the share of number of movies by popular genre from year 1975 to 2022.

    • For genres that are not popular, categorize them as “Other”.
      • Consider reordering the categories in genres in descending order based on their share in the year 2022.
      • Use “Set2” color palette from the RColorBrewer package.

  • c.f.) The following uses geom_area() instead:


Q4f.

  • Provide a comment to illustrate the annual trend of (1) the share of number of movies by popular genre from year 1975 to 2022.

    • Which genre has become more popular since 2010?



Q4g.

  • Add the following two variables—christmas and holiday—to the data.frame holiday_movies_with_genres:

  • christmas:

    • TRUE if the simple_title includes “christmas”, “xmas”, “x mas”
    • FALSE otherwise
  • holiday:

    • TRUE if the simple_title includes “holiday”
    • FALSE otherwise



Q4h.

  • Provide R code to recreate the ggplot figure illustrating the annual trend of (1) the number of movie titles with “holiday” varies by christmas.



Q4i.

  • Provide R code to recreate the ggplot figure illustrating how the mean value of num_votes varies by the popular genres for the titles with “christmas”.


Question 5

  • The following is the data.frame for Question 5.
tripadvisor <- read_csv("https://bcdanl.github.io/data/tripadvisor_cleaned.csv")
  • TripAdvisor is an online travel research company that empowers people around the world to plan and enjoy the ideal trip.

  • TripAdvisor wanted to know whether promoting membership on their platform could drive engagement and bookings.

  • To do so, TripAdvisor had just run an experiment to explore user retention by offering a random subset of customers an easier sign-up process for membership.

  • The following is the data.frame, tripadvisor.

Variable description

  • id: a unique identifier for a user.

  • time:

    • PRE if time is before the experiment;
    • POST if time is in the 28 days after the experiment.
    • For each id value, there are two observations—one with time == “PRE” and the other with time == “POST”.
  • days_visited: Number of days a user visited the TripAdvisor website.

  • easier_signup:

    • TRUE if a user was exposed to the easier signup process (e.g., one-click signup) during the experiment;
    • FALSE otherwise.
  • became_member:

    • TRUE if a user became a member during the experiment period;
    • FALSE otherwise.
  • locale_en_US:

    • TRUE if a user accessed the website from the US;
    • FALSE otherwise.
  • os_type: Windows, Mac, or Others

  • revenue_pre: Amount of dollars a user spent on the website before the experiment


Q5a.

  • Using the given data.frame, tripadvisor, create the data.frame, tripadvisor, for which

    • time is a factor-type variable of time with the first level, “PRE”.



Q5b.

  • Provide R code to recreate the ggplot figure illustrating how the relationship between time and days_visited varies by easier_signup and became_member.

    • Row-wise split is based on easier_signup.
    • Column-wise split is based on became_member.



Q5c.

  • Provide a comment to illustrate how the relationship between time and days_visited varies by easier_signup and became_member.



Q5d.

  • Provide a R code to create the data.frame Q5d that includes the variable diff, the difference between (1) the value of days_visited for time == PRE and (2) the value of days_visited for time == POST for each id.
  • The resulting data.frame should look as follows:
  • Include only the variables as shown above.
  • Adjust the order of the variables as shown above.



Q5e.

  • Provide an R code to calculate how the difference in the mean value of diff varies by easier_signup and became_member using the data.frame Q5d.
  • The resulting data.frame should look as follows:



Q5f.

  • Using the resulting data.frame in Q5e, discuss the following question:
    • What is the effect of easier signup process on the number of days a user visited the TripAdvisor website?
    • How does this effect varies by the status of became_member?



Question 6

In September 2019, YouGov survey asked 1,639 GB adults the following question:

In hindsight, do you think Britain was right/wrong to vote to leave EU?

  • Right to leave
  • Wrong to leave
  • Don’t know

The data from the survey is in brexit.csv.

brexit <- read_csv('https://bcdanl.github.io/data/brexit.csv')
datatable(brexit)


Q6a

  • Replicate the following visualization



Q6b

  • Replicate the following visualization
    • How is the story this visualization telling different than the story the plot in Q6a?



Q6c

  • First, calculate the proportion of wrong, right, and don’t know answers in each region and then plot these proportions (rather than the counts) and then improve axis labeling.
  • Replicate the following visualization
    • How is the story this visualization telling different than the story the plot in Q4b?
ggplot(q6, aes(y = opinion, x = prop,
               fill = opinion)) +
  geom_col() +
  facet_wrap(~region,
    nrow = 1, labeller = label_wrap_gen(width = 12),
    # ___
  ) +
  guides(fill = "none") +
  labs(
    title = "Was Britain right/wrong to vote to leave EU?",
    subtitle = "YouGov Survey Results, 2-3 September 2019",
    caption = "Source: bit.ly/2lCJZVg",
    x = 'Percent', y = NULL
  ) +
  scale_fill_manual(values = c(
    "Wrong" = "#ef8a62",
    "Right" = "#67a9cf",
    "Don't know" = "gray"
  )) +
  scale_x_continuous(labels = scales::percent) +
  theme_minimal()



Q6d.

  • Recreate the same visualization from the previous exercise, this time dodging the bars for opinion proportions for each region, rather than faceting by region and then improve the legend.

    • How is the story this visualization telling different than the story the previous plot tells?

Discussion

Welcome to our Classwork 7 Discussion Board! 👋

This space is designed for you to engage with your classmates about the material covered in Classwork 7.

Whether you are looking to delve deeper into the content, share insights, or have questions about the content, this is the perfect place for you.

If you have any specific questions for Byeong-Hak (@bcdanl) regarding the Classwork 7 materials or need clarification on any points, don’t hesitate to ask here.

All comments will be stored here.

Let’s collaborate and learn from each other!

Back to top