Midterm Exam 1
Classwork 8
Summary of Midterm Exam 1 Performance
[Table of descriptive statistics for each part of the Midterm Exam 1 questions omitted.]
[Plot of the distribution of Midterm Exam 1 scores omitted.]
Section 1. True or False
Question 1
Structured data has a predefined format and fits into traditional databases.
- True
- False
Answer: True
Explanation:
Structured data refers to data that is organized in a fixed format, typically in rows and columns, making it easily stored and queried in traditional relational databases. Its predefined schema ensures consistency and facilitates efficient data management.
Question 2
Dynamic pricing in sports ticket sales is influenced only by fixed factors, such as seating location and time of purchase, and is not heavily affected by real-time factors like demand, weather, or team performance.
- True
- False
Answer: False
Explanation:
Dynamic pricing in sports ticket sales is influenced by both fixed factors (e.g., seating location, time of purchase) and real-time factors (e.g., current demand, weather conditions, team performance). Real-time data allows for adjustments in pricing to maximize revenue and respond to changing circumstances.
Section 2. Multiple Choice
Question 3
In retail analytics, market basket analysis is used to:
- a. Determine the optimal store location.
- b. Understand which products to combine in a bundle offer.
- c. Predict customer churn rates.
- d. Analyze video footage for customer demographics.
Answer: b
Explanation:
Market basket analysis identifies patterns of items that frequently co-occur in transactions. This insight helps retailers create effective bundle offers, enhance cross-selling strategies, and optimize product placements to increase sales.
Question 4
Which of the following best describes a CSV file?
- a. A binary file format for images.
- b. A text file where values are separated by commas.
- c. A proprietary spreadsheet format.
- d. A database file format.
Answer: b
Explanation:
CSV (Comma-Separated Values) files are plain text files that store tabular data, where each line represents a data record and each record consists of fields separated by commas. They are widely used for data exchange between different applications.
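As an illustration, CSV text can be parsed directly in base R with `read.csv()`. The sketch below uses a made-up two-row table; the `text =` argument parses an in-memory string instead of a file.

```r
# Minimal sketch: parsing CSV text with base R's read.csv().
# The data here is invented purely for illustration.
csv_text <- "name,age\nAnna,22\nBen,28"
df <- read.csv(text = csv_text)
df  # a 2-row data.frame with columns 'name' and 'age'
```

The same idea applies to files on disk: each line becomes one row, and the commas delimit the fields.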
Question 5
Which of the following is a challenge associated with the Load stage of the ETL process?
- a. Data validation errors due to missing values
- b. Dealing with heterogeneous data formats
- c. Loading large data volumes can take days
- d. Converting data into a standardized format
Answer: c
Explanation:
The Load stage involves transferring transformed data into the data warehouse. Handling large volumes of data can be time-consuming, potentially taking days to complete, which poses a significant challenge in maintaining timely data availability.
Questions 6-8
For Questions 6-8, consider the following data.frame, `twitter_data`, displayed below:
UserID | Age | Gender | AccountType | Country | FollowersCount | LastLoginHour | AccountAgeDays | SatisfactionLevel | PostsPerWeek | GroupsJoined | IsVerified |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 22 | Female | Standard | USA | 1500 | 22.5 | 365 | Very Satisfied | 5 | 10 | Yes |
2 | 27 | Male | Premium | Canada | 2300 | 14.2 | 730 | Satisfied | 12 | 5 | No |
3 | 34 | Female | Standard | USA | 800 | 9.8 | 180 | Neutral | 3 | 12 | No |
4 | 19 | Male | Premium | UK | 5000 | 18.3 | 1095 | Very Satisfied | 20 | 7 | Yes |
5 | 45 | Female | Standard | Australia | 300 | 2.7 | 60 | Dissatisfied | 1 | 3 | No |
6 | 31 | Male | Standard | USA | 1200 | 12.1 | 540 | Satisfied | 8 | 8 | No |
7 | 28 | Female | Premium | India | 4500 | 16.4 | 850 | Very Satisfied | 15 | 15 | Yes |
8 | 23 | Male | Standard | Canada | 600 | 20.0 | 275 | Neutral | 4 | 4 | No |
9 | 37 | Male | Premium | USA | 3500 | 7.5 | 400 | Satisfied | 18 | 9 | Yes |
10 | 29 | Female | Premium | UK | 900 | 23.0 | 660 | Very Satisfied | 6 | 6 | No |
Description of Variables in `twitter_data`:
- `UserID`: Identifier for each user
- `Age`: Age of the user in years
- `Gender`: Gender of the user
- `AccountType`: Type of social media account
- `Country`: Country of residence
- `FollowersCount`: Number of followers
- `LastLoginHour`: Time of last login in hours since midnight
- `AccountAgeDays`: Age of the account in days
- `SatisfactionLevel`: User satisfaction level
- `PostsPerWeek`: Number of posts per week
- `GroupsJoined`: Number of groups joined
- `IsVerified`: Whether the user account is verified
Question 6
What type of variable is `Country` in the dataset?
- a. Nominal
- b. Ordinal
- c. Interval
- d. Ratio
Answer: a
Explanation:
The `Country` variable categorizes data based on different countries without any inherent order or ranking, making it a nominal variable.
Question 7
What type of variable is `LastLoginHour` in the dataset?
- a. Nominal
- b. Ordinal
- c. Interval
- d. Ratio
Answer: c
Explanation:
`LastLoginHour` represents the time of the last login measured in hours since midnight. While it has meaningful differences between values, it does not have a true zero point in this context, making it an interval variable.
Question 8
What type of variable is `SatisfactionLevel` in the dataset?
- a. Nominal
- b. Ordinal
- c. Interval
- d. Ratio
Answer: b
Explanation:
`SatisfactionLevel` is an ordered categorical variable with levels such as "Very Dissatisfied," "Dissatisfied," "Neutral," "Satisfied," and "Very Satisfied," indicating a ranked relationship among categories.
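For illustration, ordinal variables like `SatisfactionLevel` can be represented in R as ordered factors, which preserve the ranking among categories. The sketch below uses a made-up vector drawn from the levels above.

```r
# Encoding an ordinal variable as an ordered factor preserves its ranking.
satisfaction <- factor(
  c("Very Satisfied", "Satisfied", "Neutral", "Dissatisfied"),
  levels = c("Dissatisfied", "Neutral", "Satisfied", "Very Satisfied"),
  ordered = TRUE
)
# Ordered factors support rank comparisons:
satisfaction[1] > satisfaction[3]  # TRUE: "Very Satisfied" ranks above "Neutral"
```

A plain (unordered) factor would instead model a nominal variable like `Country`, where no such comparison is meaningful.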
Section 3. Filling-in-the-Blanks
Question 9
The Royal Swedish Academy of Sciences has decided to award the 2024 Nobel Prize in Physics to U.S. scientist John J. Hopfield and British-Canadian Geoffrey E. Hinton for discoveries and inventions in machine learning, a field that enables computers to learn from and make predictions or decisions based on data, which paved the way for the artificial intelligence boom.
Explanation:
Machine learning is a subset of artificial intelligence that focuses on the development of algorithms and statistical models that enable computers to perform tasks without explicit instructions. Instead, these systems learn and make decisions based on data input. John J. Hopfield and Geoffrey E. Hinton are renowned for their contributions to this field, particularly in neural networks and deep learning, which have significantly advanced the capabilities of artificial intelligence.
Question 10
In sport analytics, we discussed a machine learning model called decision tree that makes decisions by splitting data into branches based on input variables.
Explanation:
A decision tree is a popular machine learning model used for both classification and regression tasks. In sports analytics, decision trees can help in making strategic decisions, such as predicting player performance, determining optimal game strategies, or identifying key factors that influence game outcomes.
Question 11
The five V’s of big data are Volume, Velocity, Value, Veracity, and Variety.
Explanation:
The five V’s of big data—Volume, Velocity, Value, Veracity, and Variety—describe the challenges and characteristics associated with big data.
- Volume: Refers to the vast amounts of data generated every second.
- Velocity: The speed at which new data is generated and processed.
- Value: The importance of extracting meaningful insights from the data.
- Veracity: The trustworthiness and accuracy of the data.
- Variety: The different types of data (structured, unstructured, semi-structured) and sources from which data is collected.
Understanding these dimensions is crucial for effectively managing and leveraging big data in various applications, including data analytics and business intelligence.
Question 12
The process of inserting transformed data into the data warehouse is part of the load stage in ETL.
Explanation:
ETL stands for Extract, Transform, Load, which are the three fundamental steps in data warehousing and integration.
- Extract: Data is collected from various source systems.
- Transform: The extracted data is cleaned, formatted, and transformed into a suitable structure for analysis.
- Load: The transformed data is inserted into the data warehouse or destination database for storage and future analysis.
The Load stage is critical as it ensures that the prepared data is available in the data warehouse for querying, reporting, and business intelligence purposes.
Question 13
Generative AI refers to a category of artificial intelligence capable of generating new content, such as text, images, videos, music, and code.
Explanation:
Generative AI encompasses a range of artificial intelligence technologies designed to create new content. Unlike traditional AI models that primarily analyze or classify existing data, generative models can produce novel text, images, videos, music, and even code based on the patterns and structures learned from training data. The most popular example of generative AI is large language models like GPT-4. This technology has applications in creative industries, content creation, design, and more, enabling innovative solutions and automations.
Section 4. Data Analysis with R
Question 14
Which of the following R code snippets correctly assigns the data.frame `nycflights13::airlines` to the variable `airlines_df`? (Note that `airlines_df` is simply the name of the R object and can be any valid name in R.)
- a. `nycflights13::airlines <- airlines_df`
- b. `airlines_df <- nycflights13::airlines`
- c. `nycflights13::airlines >= airlines_df`
- d. `airlines_df == nycflights13::airlines`
- e. All of the above
Answer: b
Explanation:
Option b correctly assigns the `airlines` data.frame from the `nycflights13` package to the variable `airlines_df` using the assignment operator `<-`. The other options either reverse the assignment or use comparison operators, which do not perform assignment.
Question 15
Write the R code to create a new variable called `total` and assign to it the sum of 8 and 12 in R.
Answer: `total <- 8 + 12`
Question 16
Given the data.frame `df` with variables `height` and `name`, which of the following expressions returns a vector containing the values in the `height` variable?
- a. `df:height`
- b. `df::height`
- c. `df$height`
- d. Both b and c
Answer: c
Explanation: The `$` operator extracts a specific variable from a data.frame. Option a uses `:`, the sequence operator, and option b uses the `::` operator incorrectly, as it is meant for accessing functions from packages.
Question 17
The expression `as.numeric("456")` will return the numeric value 456.
- True
- False
Answer: True
Explanation: The `as.numeric()` function converts the string "456" to the numeric value 456.
Question 18
What is the result of the expression `(1 + 2 * 3) ^ 2` in R?
- a. 36
- b. 49
- c. 81
Answer: b
Explanation: Multiplication has higher precedence than addition, so the expression evaluates as `(1 + 6) ^ 2 = 7 ^ 2 = 49`.
Question 19
Given vectors `a <- c(2, 4, 6)` and `b <- c(1, 3, 5)`, what is the result of `a + b`?
- a. `c(3, 7, 11)`
- b. `c(2, 4, 6, 1, 3, 5)`
- c. `c(1, 2, 3, 4, 5, 6)`
- d. Error
Answer: a
Explanation: Element-wise addition of vectors `a` and `b` gives 2 + 1 = 3, 4 + 3 = 7, and 6 + 5 = 11.
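The element-wise behavior can be verified directly at the R console:

```r
# Vectors of equal length are added position by position.
a <- c(2, 4, 6)
b <- c(1, 3, 5)
a + b  # c(3, 7, 11)
```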
Question 20
To use the function `read_csv()` from the `readr` package, one of the packages in `tidyverse`, you first need to load the package using the R code ________.
- a. `library(readr)`
- b. `library(skimr)`
- c. `library(tidyverse)`
- d. All of the above
- e. Both a and c
- f. Both b and c
Answer: e (Both a and c)
Explanation: You can load the readr package specifically using `library(readr)`, or load the entire `tidyverse` suite (which includes `readr`) using `library(tidyverse)`.
Question 21
Consider the following data.frame `df0`:
x | y |
---|---|
NA | 7 |
2 | NA |
3 | 9 |
What is the result of `mean(df0$y)`?
- a. 7
- b. NA
- c. 8
- d. 9
Answer: b
Explanation: By default, the `mean()` function in R returns `NA` if there are any missing values (`NA`) in the data (unless the option `na.rm = TRUE` is specified).
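This default can be checked with a small vector mirroring the `y` column:

```r
y <- c(7, NA, 9)
mean(y)                # NA: any missing value propagates by default
mean(y, na.rm = TRUE)  # 8: missing values are dropped before averaging
```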
Questions 22-23
Consider the following data.frame `df` for Questions 22-23:
id | name | age | score |
---|---|---|---|
1 | Anna | 22 | 90 |
2 | Ben | 28 | 85 |
3 | Carl | NA | 95 |
4 | Dana | 35 | NA |
5 | Ella | 40 | 80 |
Question 22
Which of the following code snippets filters observations where `score` is strictly between 85 and 95 (i.e., excluding 85 and 95)?
- a. `df |> filter(score >= 85 | score <= 95)`
- b. `df |> filter(score > 85 | score < 95)`
- c. `df |> filter(score > 85 & score < 95)`
- d. `df |> filter(score >= 85 & score <= 95)`
Answer: c
Explanation: Option c keeps rows where `score` is greater than 85 and less than 95, excluding the boundary values.
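A quick check of option c, assuming the dplyr package is installed and rebuilding `df` from the table above:

```r
library(dplyr)  # assumed to be installed

df <- data.frame(
  id = 1:5,
  name = c("Anna", "Ben", "Carl", "Dana", "Ella"),
  age = c(22, 28, NA, 35, 40),
  score = c(90, 85, 95, NA, 80)
)
df |> filter(score > 85 & score < 95)  # keeps only Anna (score 90)
```

Note that the row with a missing `score` is also dropped, because a comparison with `NA` evaluates to `NA` and `filter()` keeps only rows where the condition is `TRUE`.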
Question 23
Which of the following expressions correctly keeps observations from `df` where the `age` variable does not have any missing values?
- a. `df |> filter(is.na(age))`
- b. `df |> filter(!is.na(age))`
- c. `df |> filter(age == NA)`
- d. `df |> filter(age != NA)`
- e. Both a and c
- f. Both b and d
Answer: b
Explanation: The expression `!is.na(age)` filters out rows where `age` is `NA`, keeping only those with non-missing values. Comparisons with `NA` (options c and d) always evaluate to `NA`, so they keep no rows at all.
Question 24
Consider the following data.frame `df3`:
id | value |
---|---|
1 | 15 |
1 | 15 |
2 | 25 |
3 | 35 |
3 | 35 |
4 | 45 |
5 | 55 |
Which of the following code snippets returns a data.frame of unique `id` values from `df3`?
- a. `df3 |> select(id) |> distinct()`
- b. `df3 |> distinct(value)`
- c. `df3 |> distinct(id)`
- d. Both a and c
Answer: d
Explanation:
- Option a: `df3 |> select(id) |> distinct()` first selects the `id` variable from `df3` and then applies `distinct()` to remove duplicate entries, resulting in a data.frame of unique `id` values.
- Option c: `df3 |> distinct(id)` applies `distinct()` directly to the `id` variable, achieving the same result as option a.
- Option d: Both a and c correctly return a data.frame of unique `id` values.
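A quick check, assuming dplyr is installed and rebuilding `df3` from the table above:

```r
library(dplyr)  # assumed to be installed

df3 <- data.frame(id = c(1, 1, 2, 3, 3, 4, 5),
                  value = c(15, 15, 25, 35, 35, 45, 55))
df3 |> distinct(id)  # a one-column data.frame with ids 1, 2, 3, 4, 5
```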
Question 25
Which of the following code snippets correctly renames the variable `name` to `first_name` in `df`?
- a. `df |> rename(first_name = name)`
- b. `df |> rename(name = first_name)`
- c. `df |> rename("name" = "first_name")`
- d. `df |> rename_variable(name = first_name)`
Answer: a
Explanation: Option a, `df |> rename(first_name = name)`, correctly renames the existing column `name` to `first_name` using the `rename()` function from the dplyr package. The format is `new_name = old_name`.
Question 26
Which of the following code snippets correctly removes the `score` variable from `df`?
- a. `df |> select(-score)`
- b. `df |> select(-"score")`
- c. `df |> select(!score)`
- d. `df |> select(, -score)`
- e. `df |> select(desc(score))`
Answer: either a or b
Explanation:
- Option a: `df |> select(-score)` is the correct and most straightforward way to remove the `score` variable from `df`, using the `select()` function with the minus (`-`) sign to indicate exclusion.
- Option b: `df |> select(-"score")` can also work, but it is less conventional and may behave differently depending on the version of dplyr or the specific usage context.
Question 27
Which of the following code snippets filters observations where `age` is not `NA`, then arranges them in ascending order of `age`, and then selects the `name` and `age` variables?
- a. `df |> filter(!is.na(age)) |> arrange(age) |> select(name, age)`
- b. `df |> select(name, age) |> arrange(age) |> filter(!is.na(age))`
- c. `df |> arrange(age) |> filter(!is.na(age)) |> select(name, age)`
- d. `df |> filter(is.na(age)) |> arrange(desc(age)) |> select(name, age)`
- e. All of the above
Answer: a
Explanation: Option a performs all required operations in the specified order:
- `filter(!is.na(age))` removes rows where `age` is `NA`.
- `arrange(age)` sorts the remaining data in ascending order of `age`.
- `select(name, age)` keeps only the `name` and `age` variables.
Question 28
Consider the two related data.frames, `students` and `majors`:
students
student_id | name | age |
---|---|---|
1 | Brad | 20 |
2 | Jason | 22 |
4 | Marcie | 21 |
majors
student_id | major |
---|---|
1 | Business Administration |
2 | Economics |
3 | Data Analytics |
Which of the following R code correctly joins the two related data.frames, `students` and `majors`, to produce the resulting data.frame shown below?
student_id | major | name | age |
---|---|---|---|
1 | Business Administration | Brad | 20 |
2 | Economics | Jason | 22 |
3 | Data Analytics | NA | NA |
- a. `students |> left_join(majors)`
- b. `majors |> left_join(students)`
- c. Both a and b
Answer: b
Explanation:
Option b, `majors |> left_join(students)`, correctly sets `majors` as the left data.frame, ensuring all records from `majors` are retained. For the value of `student_id` 3, which does not exist in `students`, the `name` and `age` variables are filled with `NA`, matching the resulting data.frame shown above. By default, `left_join()` joins on the common variable, here `student_id`.
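The join can be verified with dplyr (assumed to be installed); spelling out `by = "student_id"` makes the join key explicit:

```r
library(dplyr)  # assumed to be installed

students <- data.frame(student_id = c(1, 2, 4),
                       name = c("Brad", "Jason", "Marcie"),
                       age = c(20, 22, 21))
majors <- data.frame(student_id = 1:3,
                     major = c("Business Administration",
                               "Economics",
                               "Data Analytics"))
majors |> left_join(students, by = "student_id")
# student_id 3 has no match in students, so name and age become NA
```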
Section 5. Short Essay
Question 29
In R, what does the function `sd(x)` compute, and why can it be more useful than `var(x)`?
Answer:
The function `sd(x)` in R computes the standard deviation of the numeric vector `x`. Standard deviation measures the amount of variation or dispersion in a set of values around their mean. It can be more useful than `var(x)`, which calculates the variance, because the standard deviation is expressed in the same units as the original data `x`, making it more interpretable. For example, if the data represent heights in centimeters, the standard deviation will also be in centimeters, allowing for a direct understanding of variability. In contrast, variance is in squared units, which can be less intuitive.
Question 30
List at least four applications of data analytics in sports analytics mentioned in the lecture, and briefly describe each one.
Answer:
- Player Performance Analysis:
- Data analytics is used to evaluate individual player statistics, such as scoring efficiency, defensive actions, and endurance. This helps in identifying strengths and areas for improvement, informing coaching decisions and player development programs.
- Injury Prediction and Prevention:
- By analyzing data on player movements, workloads, and physiological metrics, teams can predict potential injuries before they occur. This proactive approach allows for tailored training regimens and rest periods to minimize injury risks.
- Strategic Decision Making:
- Coaches and managers use data analytics to inform game strategies, such as optimal player lineups, tactical adjustments, and in-game decision-making. Analyzing opponents’ data also aids in developing effective game plans.
- Fan Engagement and Marketing:
- Teams leverage data analytics to understand fan behavior and preferences, enabling personalized marketing campaigns, enhanced in-stadium experiences, and targeted promotions. This fosters stronger fan loyalty and increases revenue through merchandise and ticket sales.
- Scouting and Recruitment:
- Data-driven scouting identifies talent by analyzing player statistics, performance metrics, and potential. This objective approach enhances the recruitment process, ensuring that teams acquire players who fit their strategic needs and have the potential for future success.
- Revenue Optimization:
- Teams use analytics to optimize ticket pricing, merchandise sales, and concession offerings. By understanding demand patterns and consumer behavior, they can implement dynamic pricing strategies and tailor products to maximize revenue.