Midterm Exam 1

Classwork 8

Author

Byeong-Hak Choe

Published

October 24, 2024

Modified

October 24, 2024

Summary for Midterm Exam 1 Performance

The following provides the descriptive statistics for each part of the Midterm Exam 1 questions:


The following describes the distribution of Midterm Exam 1 score:

Section 1. True or False

Question 1

Structured data has a predefined format and fits into traditional databases.

  1. True
  2. False

Explanation:
Structured data refers to data that is organized in a fixed format, typically in rows and columns, making it easily stored and queried in traditional relational databases. Its predefined schema ensures consistency and facilitates efficient data management.

Question 2

Dynamic pricing in sports ticket sales is influenced only by fixed factors, such as seating location and time of purchase, and is not heavily affected by real-time factors like demand, weather, or team performance.

  1. True
  2. False

Explanation:
Dynamic pricing in sports ticket sales is influenced by both fixed factors (e.g., seating location, time of purchase) and real-time factors (e.g., current demand, weather conditions, team performance). Real-time data allows for adjustments in pricing to maximize revenue and respond to changing circumstances.


Section 2. Multiple Choice

Question 3

In retail analytics, market basket analysis is used to:

  1. Determine the optimal store location.
  2. Understand which products to combine in a bundle offer.
  3. Predict customer churn rates.
  4. Analyze video footage for customer demographics.

Explanation:
Market basket analysis identifies patterns of items that frequently co-occur in transactions. This insight helps retailers create effective bundle offers, enhance cross-selling strategies, and optimize product placements to increase sales.

Question 4

Which of the following best describes a CSV file?

  1. A binary file format for images.
  2. A text file where values are separated by commas.
  3. A proprietary spreadsheet format.
  4. A database file format.
  5. Privacy concerns

Explanation:
CSV (Comma-Separated Values) files are plain text files that store tabular data, where each line represents a data record and each record consists of fields separated by commas. They are widely used for data exchange between different applications.

Question 5

Which of the following is a challenge associated with the Load stage of the ETL process?

  1. Data validation errors due to missing values
  2. Dealing with heterogeneous data formats
  3. Loading large data volumes can take days
  4. Converting data into a standardized format

Explanation:
The Load stage involves transferring transformed data into the data warehouse. Handling large volumes of data can be time-consuming, potentially taking days to complete, which poses a significant challenge in maintaining timely data availability.

Question 6-8

For Questions 6-8, consider the following data.frame, twitter_data, displayed below:

UserID Age Gender AccountType Country FollowersCount LastLoginHour
1 22 Female Standard USA 1500 22.5
2 27 Male Premium Canada 2300 14.2
3 34 Female Standard USA 800 9.8
4 19 Male Premium UK 5000 18.3
5 45 Female Standard Australia 300 2.7
6 31 Male Standard USA 1200 12.1
7 28 Female Premium India 4500 16.4
8 23 Male Standard Canada 600 20.0
9 37 Male Premium USA 3500 7.5
10 29 Female Premium UK 900 23.0
AccountAgeDays SatisfactionLevel PostsPerWeek GroupsJoined IsVerified
365 Very Satisfied 5 10 Yes
730 Satisfied 12 5 No
180 Neutral 3 12 No
1095 Very Satisfied 20 7 Yes
60 Dissatisfied 1 3 No
540 Satisfied 8 8 No
850 Very Satisfied 15 15 Yes
275 Neutral 4 4 No
400 Satisfied 18 9 Yes
660 Very Satisfied 6 6 No

Description of Variables in twitter_data:

  1. UserID: Identifier for each user
  2. Age: Age of the user in years
  3. Gender: Gender of the user
  4. AccountType: Type of social media account
  5. Country: Country of residence
  6. FollowersCount: Number of followers
  7. LastLoginHour: Time of last login in hours since midnight
  8. AccountAgeDays: Age of the account in days
  9. SatisfactionLevel: User satisfaction level
  10. PostsPerWeek: Number of posts per week
  11. GroupsJoined: Number of groups joined
  12. IsVerified: Whether the user account is verified

Question 6

What type of variable is Country in the dataset?

  1. Nominal
  2. Ordinal
  3. Interval
  4. Ratio

Explanation:
The Country variable categorizes data based on different countries without any inherent order or ranking, making it a nominal variable.

Question 7

What type of variable is LastLoginHour in the dataset?

  1. Nominal
  2. Ordinal
  3. Interval
  4. Ratio

Explanation:
LastLoginHour represents the time of the last login measured in hours since midnight. While it has meaningful differences between values, it does not have a true zero point in this context, categorizing it as an interval variable.

Question 8

What type of variable is SatisfactionLevel in the dataset?

  1. Nominal
  2. Ordinal
  3. Interval
  4. Ratio

Explanation:
SatisfactionLevel is an ordered categorical variable with levels such as “Very Dissatisfied,” “Dissatisfied,” “Neutral,” “Satisfied,” and “Very Satisfied,” indicating a ranked relationship among categories.


Section 3. Filling-in-the-Blanks

Question 9

The Royal Swedish Academy of Sciences has decided to award the 2024 Nobel Prize in Physics to U.S. scientist John J. Hopfield and British-Canadian Geoffrey E. Hinton for discoveries and inventions in machine learning, a field that enables computers to learn from and make predictions or decisions based on data, which paved the way for the artificial intelligence boom.

Explanation:
Machine learning is a subset of artificial intelligence that focuses on the development of algorithms and statistical models that enable computers to perform tasks without explicit instructions. Instead, these systems learn and make decisions based on data input. John J. Hopfield and Geoffrey E. Hinton are renowned for their contributions to this field, particularly in neural networks and deep learning, which have significantly advanced the capabilities of artificial intelligence.

Question 10

In sport analytics, we discussed a machine learning model called decision tree that makes decisions by splitting data into branches based on input variables.

Explanation:
A decision tree is a popular machine learning model used for both classification and regression tasks. In sports analytics, decision trees can help in making strategic decisions, such as predicting player performance, determining optimal game strategies, or identifying key factors that influence game outcomes.

Question 11

The five V’s of big data are Volume, Velocity, Value, Veracity, and Variety.

Explanation:
The five V’s of big data—Volume, Velocity, Value, Veracity, and Variety—describe the challenges and characteristics associated with big data.
- Volume: Refers to the vast amounts of data generated every second.
- Velocity: The speed at which new data is generated and processed.
- Value: The importance of extracting meaningful insights from the data.
- Veracity: The trustworthiness and accuracy of the data.
- Variety: The different types of data (structured, unstructured, semi-structured) and sources from which data is collected.
Understanding these dimensions is crucial for effectively managing and leveraging big data in various applications, including data analytics and business intelligence.

Question 12

The process of inserting transformed data into the data warehouse is part of the load stage in ETL.

Explanation:
ETL stands for Extract, Transform, Load, which are the three fundamental steps in data warehousing and integration.
- Extract: Data is collected from various source systems.
- Transform: The extracted data is cleaned, formatted, and transformed into a suitable structure for analysis.
- Load: The transformed data is inserted into the data warehouse or destination database for storage and future analysis.
The Load stage is critical as it ensures that the prepared data is available in the data warehouse for querying, reporting, and business intelligence purposes.

Question 13

Generative AI refers to a category of artificial intelligence capable of generating new content, such as text, images, videos, music, and code.

Explanation:
Generative AI encompasses a range of artificial intelligence technologies designed to create new content. Unlike traditional AI models that primarily analyze or classify existing data, generative models can produce novel text, images, videos, music, and even code based on the patterns and structures learned from training data. The most popular example of generative AI is large language models like GPT-4. This technology has applications in creative industries, content creation, design, and more, enabling innovative solutions and automations.


Section 4. Data Analysis with R

Question 14

Which of the following R code correctly assigns the data.frame nycflights13::airlines to the variable airlines_df? (Note that airlines_df is simply the name of the R object and can be any valid name in R.)

  1. nycflights13::airlines <- airlines_df
  2. airlines_df <- nycflights13::airlines
  3. nycflights13::airlines >= airlines_df
  4. airlines_df == nycflights13::airlines
  5. All of the above

Answer: b

Explanation:
Option b correctly assigns the airlines data.frame from the nycflights13 package to the variable airlines_df using the assignment operator <-. The other options either reverse the assignment or use incorrect operators.

Question 15

Write the R code to create a new variable called total and assign to it the sum of 8 and 12 in R.

Answer: total <- 8 + 12

Question 16

Given the data.frame df with variables height and name, which of the following expressions returns a vector containing the values in the height variable?

  1. df:height
  2. df::height
  3. df$height
  4. Both b and c

Answer: c

Explanation: The $ operator is used to extract a specific variable from a data.frame. Option b uses the :: operator incorrectly, which is meant for accessing functions from packages.

Question 17

The expression as.numeric("456") will return the numeric value 456.

  1. True
  2. False

Explanation: The as.numeric() function converts the string “456” to the numeric value 456.

Question 18

What is the result of the expression (1 + 2 * 3) ^ 2 in R?

  1. 36
  2. 49
  3. 81

Answer: b

Question 19

Given vectors a <- c(2, 4, 6) and b <- c(1, 3, 5), what is the result of a + b?

  1. c(3, 7, 11)
  2. c(2, 4, 6, 1, 3, 5)
  3. c(1, 2, 3, 4, 5, 6)
  4. Error

Answer: a

Explanation: Element-wise addition of vectors a and b results in: - 2 + 1 = 3 - 4 + 3 = 7 - 6 + 5 = 11

Question 20

To use the function read_csv() from the readr package, one of the packages in tidyverse, you first need to load the package using the R code ________.

  1. library(readr)
  2. library(skimr)
  3. library(tidyverse)
  4. All of the above
  5. Both a and c
  6. Both b and c
  7. Both a and c

Explanation: You can load the readr package specifically using library(readr), or load the entire tidyverse suite (which includes readr) using library(tidyverse).

Question 21

Consider the following data.frame df0:

x y
Na 7
2 NA
3 9

What is the result of mean(df0$y)?

  1. 7
  2. NA
  3. 8
  4. 9

Answer: b

Explanation: By default, the mean() function in R returns NA if there are any missing values (NA) in the data (unless the option, na.rm = TRUE, is specified).

Questions 22-23

Consider the following data.frame df for Questions 22-23:

id name age score
1 Anna 22 90
2 Ben 28 85
3 Carl NA 95
4 Dana 35 NA
5 Ella 40 80

Question 22

Which of the following code snippets filters observations where score is strictly between 85 and 95 (i.e., excluding 85 and 95)?

  1. df |> filter(score >= 85 | score <= 95)
  2. df |> filter(score > 85 | score < 95)
  3. df |> filter(score > 85 & score < 95)
  4. df |> filter(score >= 85 & score <= 95)

Answer: c

Explanation: This code correctly filters rows where score is greater than 85 and less than 95, excluding the boundary values.

Question 23

Which of the following expressions correctly keeps observations from df where the age variable does not have any missing values?

  1. df |> filter(is.na(age))
  2. df |> filter(!is.na(age))
  3. df |> filter(age == NA)
  4. df |> filter(age != NA)
  5. Both a and c
  6. Both b and d

Answer: b

Explanation: The expression !is.na(age) filters out rows where age is NA, keeping only those with non-missing values.

Question 24

Consider the following data.frame df3:

id value
1 15
1 15
2 25
3 35
3 35
4 45
5 55

Which of the following code snippets returns a data.frame of unique id values from df3?

  1. df3 |> select(id) |> distinct()
  2. df3 |> distinct(value)
  3. df3 |> distinct(id)
  4. Both a and c

Explanation:

  • Option a: df3 |> select(id) |> distinct() This code first selects the id variable from df3 and then applies distinct() to remove duplicate entries, resulting in a data.frame of unique id values.
  • Option c: df3 |> distinct(id) This code directly applies distinct() to the id variable, achieving the same result as option a by returning unique id values.
  • Option d: Both a and c correctly return a data.frame of unique id values.

Question 25

Which of the following code snippets correctly renames the variable name to first_name in df?

  1. df |> rename(first_name = name)
  2. df |> rename(name = first_name)
  3. df |> rename("name" = "first_name")
  4. df |> rename_variable(name = first_name)

Answer: a

Explanation: - Option a: df |> rename(first_name = name) This syntax correctly renames the existing column name to first_name using the rename() function from the dplyr package. The format is new_name = old_name.

Question 26

Which of the following code snippets correctly removes the score variable from df?

  1. df |> select(-score)
  2. df |> select(-"score")
  3. df |> select(!score)
  4. df |> select(, -score)
  5. df |> select(desc(score))

Answer: either a or b

Explanation: - Option a: df |> select(-score) This is the correct and most straightforward way to remove the score variable from df using the select() function with the minus (-) sign to indicate exclusion. - Option b: df |> select(-"score") While this can work in some cases, it’s less conventional and may lead to issues depending on the version of dplyr or specific usage contexts.

Question 27

Which of the following code snippets filters observations where age is not NA, then arranges them in ascending order of age, and then selects the name and age variables?

  1. df |> filter(!is.na(age)) |> arrange(age) |> select(name, age)
  2. df |> select(name, age) |> arrange(age) |> filter(!is.na(age))
  3. df |> arrange(age) |> filter(!is.na(age)) |> select(name, age)
  4. df |> filter(is.na(age)) |> arrange(desc(age)) |> select(name, age)
  5. All of the above

Answer: a

Explanation: - Option a: - filter(!is.na(age)): Removes rows where age is NA. - arrange(age): Sorts the remaining data in ascending order based on age. - select(name, age): Selects only the name and age variables. - This sequence correctly performs all required operations in the specified order.

Question 28

Consider the two related data.frames, students and majors:

  • students
student_id name age
1 Brad 20
2 Jason 22
4 Marcie 21
  • majors
student_id major
1 Business Administration
2 Economics
3 Data Analytics

Which of the following R code correctly joins the two related data.frames, students and majors, to produce the resulting data.frame shown below?

student_id major name age
1 Business Administration Brad 20
2 Economics Jason 22
3 Data Analytics NA NA
  1. students |> left_join(majors)
  2. majors |> left_join(students)
  3. Both a and b

Answer: b

Explanation:

  • Option b: majors |> left_join(students)
  • This correctly sets majors as the left data.frame, ensuring all records from majors are retained. For the value of student_id 3, which does not exist in students, the name and age variables are filled with NA, matching the provided in the resulting data.frame.


Section 4. Short Essay

Question 29

In R, what does the function sd(x) compute, and why can it be more useful than var(x)?

Answer:

The function sd(x) in R computes the standard deviation of the numeric vector x. Standard deviation measures the amount of variation or dispersion in a set of values reference to its mean. It is more useful than var(x)—which calculates the variance—because standard deviation is expressed in the same units as the original data x, making it more interpretable. For example, if the data represents heights in centimeters, the standard deviation will also be in centimeters, allowing for direct understanding of variability. In contrast, variance is in squared units, which can be less intuitive.

Question 30

List at least four applications of data analytics in sports analytics mentioned in the lecture, and briefly describe each one.

Answer:

  1. Player Performance Analysis:
  • Data analytics is used to evaluate individual player statistics, such as scoring efficiency, defensive actions, and endurance. This helps in identifying strengths and areas for improvement, informing coaching decisions and player development programs.
  1. Injury Prediction and Prevention:
  • By analyzing data on player movements, workloads, and physiological metrics, teams can predict potential injuries before they occur. This proactive approach allows for tailored training regimens and rest periods to minimize injury risks.
  1. Strategic Decision Making:
  • Coaches and managers use data analytics to inform game strategies, such as optimal player lineups, tactical adjustments, and in-game decision-making. Analyzing opponents’ data also aids in developing effective game plans.
  1. Fan Engagement and Marketing:
  • Teams leverage data analytics to understand fan behavior and preferences, enabling personalized marketing campaigns, enhanced in-stadium experiences, and targeted promotions. This fosters stronger fan loyalty and increases revenue through merchandise and ticket sales.
  1. Scouting and Recruitment:
  • Data-driven scouting identifies talent by analyzing player statistics, performance metrics, and potential. This objective approach enhances the recruitment process, ensuring that teams acquire players who fit their strategic needs and have the potential for future success.
  1. Revenue Optimization:
  • Teams use analytics to optimize ticket pricing, merchandise sales, and concession offerings. By understanding demand patterns and consumer behavior, they can implement dynamic pricing strategies and tailor products to maximize revenue.
Back to top