Midterm Exam 1
Classwork 8
Summary of Midterm Exam 1 Performance
[Table of descriptive statistics for each part of the Midterm Exam 1 questions omitted.]
[Plot of the distribution of Midterm Exam 1 scores omitted.]
Section 1. True or False
Question 1
Structured data has a predefined format and fits into traditional databases.
- True
- False
Answer: True
Explanation:
Structured data refers to data that is organized in a fixed format, typically in rows and columns, making it easily stored and queried in traditional relational databases. Its predefined schema ensures consistency and facilitates efficient data management.
Question 2
Dynamic pricing in sports ticket sales is influenced only by fixed factors, such as seating location and time of purchase, and is not heavily affected by real-time factors like demand, weather, or team performance.
- True
- False
Answer: False
Explanation:
Dynamic pricing in sports ticket sales is influenced by both fixed factors (e.g., seating location, time of purchase) and real-time factors (e.g., current demand, weather conditions, team performance). Real-time data allows for adjustments in pricing to maximize revenue and respond to changing circumstances.
Section 2. Multiple Choice
Question 3
In retail analytics, market basket analysis is used to:
- a. Determine the optimal store location.
- b. Understand which products to combine in a bundle offer.
- c. Predict customer churn rates.
- d. Analyze video footage for customer demographics.
Answer: b
Explanation:
Market basket analysis identifies patterns of items that frequently co-occur in transactions. This insight helps retailers create effective bundle offers, enhance cross-selling strategies, and optimize product placements to increase sales.
Question 4
Which of the following best describes a CSV file?
- a. A binary file format for images.
- b. A text file where values are separated by commas.
- c. A proprietary spreadsheet format.
- d. A database file format.
Answer: b
Explanation:
CSV (Comma-Separated Values) files are plain text files that store tabular data, where each line represents a data record and each record consists of fields separated by commas. They are widely used for data exchange between different applications.
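As an illustration, CSV text can be parsed directly in base R with `read.csv()`. The sketch below uses a made-up two-row table; the `text =` argument parses an in-memory string instead of a file.

```r
# Minimal sketch: parsing CSV text with base R's read.csv().
# The data here is invented purely for illustration.
csv_text <- "name,age\nAnna,22\nBen,28"
df <- read.csv(text = csv_text)
df  # a 2-row data.frame with columns 'name' and 'age'
```

The same idea applies to files on disk: each line becomes one row, and the commas delimit the fields.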
Question 5
Which of the following is a challenge associated with the Load stage of the ETL process?
- a. Data validation errors due to missing values
- b. Dealing with heterogeneous data formats
- c. Loading large data volumes can take days
- d. Converting data into a standardized format
Answer: c
Explanation:
The Load stage involves transferring transformed data into the data warehouse. Handling large volumes of data can be time-consuming, potentially taking days to complete, which poses a significant challenge in maintaining timely data availability.
Questions 6-8
For Questions 6-8, consider the following data.frame, `twitter_data`, displayed below:
UserID | Age | Gender | AccountType | Country | FollowersCount | LastLoginHour | AccountAgeDays | SatisfactionLevel | PostsPerWeek | GroupsJoined | IsVerified |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 22 | Female | Standard | USA | 1500 | 22.5 | 365 | Very Satisfied | 5 | 10 | Yes |
2 | 27 | Male | Premium | Canada | 2300 | 14.2 | 730 | Satisfied | 12 | 5 | No |
3 | 34 | Female | Standard | USA | 800 | 9.8 | 180 | Neutral | 3 | 12 | No |
4 | 19 | Male | Premium | UK | 5000 | 18.3 | 1095 | Very Satisfied | 20 | 7 | Yes |
5 | 45 | Female | Standard | Australia | 300 | 2.7 | 60 | Dissatisfied | 1 | 3 | No |
6 | 31 | Male | Standard | USA | 1200 | 12.1 | 540 | Satisfied | 8 | 8 | No |
7 | 28 | Female | Premium | India | 4500 | 16.4 | 850 | Very Satisfied | 15 | 15 | Yes |
8 | 23 | Male | Standard | Canada | 600 | 20.0 | 275 | Neutral | 4 | 4 | No |
9 | 37 | Male | Premium | USA | 3500 | 7.5 | 400 | Satisfied | 18 | 9 | Yes |
10 | 29 | Female | Premium | UK | 900 | 23.0 | 660 | Very Satisfied | 6 | 6 | No |
Description of Variables in `twitter_data`:
- `UserID`: Identifier for each user
- `Age`: Age of the user in years
- `Gender`: Gender of the user
- `AccountType`: Type of social media account
- `Country`: Country of residence
- `FollowersCount`: Number of followers
- `LastLoginHour`: Time of last login in hours since midnight
- `AccountAgeDays`: Age of the account in days
- `SatisfactionLevel`: User satisfaction level
- `PostsPerWeek`: Number of posts per week
- `GroupsJoined`: Number of groups joined
- `IsVerified`: Whether the user account is verified
Question 6
What type of variable is `Country` in the dataset?
- a. Nominal
- b. Ordinal
- c. Interval
- d. Ratio
Answer: a
Explanation:
The `Country` variable categorizes data based on different countries without any inherent order or ranking, making it a nominal variable.
Question 7
What type of variable is `LastLoginHour` in the dataset?
- a. Nominal
- b. Ordinal
- c. Interval
- d. Ratio
Answer: c
Explanation:
`LastLoginHour` represents the time of the last login measured in hours since midnight. While it has meaningful differences between values, it does not have a true zero point in this context, making it an interval variable.
Question 8
What type of variable is `SatisfactionLevel` in the dataset?
- a. Nominal
- b. Ordinal
- c. Interval
- d. Ratio
Answer: b
Explanation:
`SatisfactionLevel` is an ordered categorical variable with levels such as "Very Dissatisfied," "Dissatisfied," "Neutral," "Satisfied," and "Very Satisfied," indicating a ranked relationship among categories.
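For illustration, ordinal variables like `SatisfactionLevel` can be represented in R as ordered factors, which preserve the ranking among categories. The sketch below uses a made-up vector drawn from the levels above.

```r
# Encoding an ordinal variable as an ordered factor preserves its ranking.
satisfaction <- factor(
  c("Very Satisfied", "Satisfied", "Neutral", "Dissatisfied"),
  levels = c("Dissatisfied", "Neutral", "Satisfied", "Very Satisfied"),
  ordered = TRUE
)
# Ordered factors support rank comparisons:
satisfaction[1] > satisfaction[3]  # TRUE: "Very Satisfied" ranks above "Neutral"
```

A plain (unordered) factor would instead model a nominal variable like `Country`, where no such comparison is meaningful.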
Section 3. Filling-in-the-Blanks
Question 9
The Royal Swedish Academy of Sciences has decided to award the 2024 Nobel Prize in Physics to U.S. scientist John J. Hopfield and British-Canadian Geoffrey E. Hinton for discoveries and inventions in machine learning, a field that enables computers to learn from and make predictions or decisions based on data, which paved the way for the artificial intelligence boom.
Explanation:
Machine learning is a subset of artificial intelligence that focuses on the development of algorithms and statistical models that enable computers to perform tasks without explicit instructions. Instead, these systems learn and make decisions based on data input. John J. Hopfield and Geoffrey E. Hinton are renowned for their contributions to this field, particularly in neural networks and deep learning, which have significantly advanced the capabilities of artificial intelligence.
Question 10
In sport analytics, we discussed a machine learning model called decision tree that makes decisions by splitting data into branches based on input variables.
Explanation:
A decision tree is a popular machine learning model used for both classification and regression tasks. In sports analytics, decision trees can help in making strategic decisions, such as predicting player performance, determining optimal game strategies, or identifying key factors that influence game outcomes.
Question 11
The five V’s of big data are Volume, Velocity, Value, Veracity, and Variety.
Explanation:
The five V’s of big data—Volume, Velocity, Value, Veracity, and Variety—describe the challenges and characteristics associated with big data.
- Volume: Refers to the vast amounts of data generated every second.
- Velocity: The speed at which new data is generated and processed.
- Value: The importance of extracting meaningful insights from the data.
- Veracity: The trustworthiness and accuracy of the data.
- Variety: The different types of data (structured, unstructured, semi-structured) and sources from which data is collected.
Understanding these dimensions is crucial for effectively managing and leveraging big data in various applications, including data analytics and business intelligence.
Question 12
The process of inserting transformed data into the data warehouse is part of the load stage in ETL.
Explanation:
ETL stands for Extract, Transform, Load, which are the three fundamental steps in data warehousing and integration.
- Extract: Data is collected from various source systems.
- Transform: The extracted data is cleaned, formatted, and transformed into a suitable structure for analysis.
- Load: The transformed data is inserted into the data warehouse or destination database for storage and future analysis.
The Load stage is critical as it ensures that the prepared data is available in the data warehouse for querying, reporting, and business intelligence purposes.
Question 13
Generative AI refers to a category of artificial intelligence capable of generating new content, such as text, images, videos, music, and code.
Explanation:
Generative AI encompasses a range of artificial intelligence technologies designed to create new content. Unlike traditional AI models that primarily analyze or classify existing data, generative models can produce novel text, images, videos, music, and even code based on the patterns and structures learned from training data. The most popular example of generative AI is large language models like GPT-4. This technology has applications in creative industries, content creation, design, and more, enabling innovative solutions and automations.
Section 4. Data Analysis with R
Question 14
Which of the following R code snippets correctly assigns the data.frame `nycflights13::airlines` to the variable `airlines_df`? (Note that `airlines_df` is simply the name of the R object and can be any valid name in R.)
- a. `nycflights13::airlines <- airlines_df`
- b. `airlines_df <- nycflights13::airlines`
- c. `nycflights13::airlines >= airlines_df`
- d. `airlines_df == nycflights13::airlines`
- e. All of the above
Answer: b
Explanation:
Option b correctly assigns the `airlines` data.frame from the `nycflights13` package to the variable `airlines_df` using the assignment operator `<-`. The other options either reverse the assignment or use comparison operators, which do not perform assignment.
Question 15
Write the R code to create a new variable called `total` and assign to it the sum of 8 and 12 in R.
Answer: `total <- 8 + 12`
Question 16
Given the data.frame `df` with variables `height` and `name`, which of the following expressions returns a vector containing the values in the `height` variable?
- a. `df:height`
- b. `df::height`
- c. `df$height`
- d. Both b and c
Answer: c
Explanation: The `$` operator extracts a specific variable from a data.frame. Option a uses `:`, the sequence operator, and option b uses the `::` operator incorrectly, as it is meant for accessing functions from packages.
Question 17
The expression `as.numeric("456")` will return the numeric value 456.
- True
- False
Answer: True
Explanation: The `as.numeric()` function converts the string "456" to the numeric value 456.
Question 18
What is the result of the expression `(1 + 2 * 3) ^ 2` in R?
- a. 36
- b. 49
- c. 81
Answer: b
Explanation: Multiplication has higher precedence than addition, so the expression evaluates as `(1 + 6) ^ 2 = 7 ^ 2 = 49`.
Question 19
Given vectors `a <- c(2, 4, 6)` and `b <- c(1, 3, 5)`, what is the result of `a + b`?
- a. `c(3, 7, 11)`
- b. `c(2, 4, 6, 1, 3, 5)`
- c. `c(1, 2, 3, 4, 5, 6)`
- d. Error
Answer: a
Explanation: Element-wise addition of vectors `a` and `b` gives 2 + 1 = 3, 4 + 3 = 7, and 6 + 5 = 11.
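The element-wise behavior can be verified directly at the R console:

```r
# Vectors of equal length are added position by position.
a <- c(2, 4, 6)
b <- c(1, 3, 5)
a + b  # c(3, 7, 11)
```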
Question 20
To use the function `read_csv()` from the `readr` package, one of the packages in `tidyverse`, you first need to load the package using the R code ________.
- a. `library(readr)`
- b. `library(skimr)`
- c. `library(tidyverse)`
- d. All of the above
- e. Both a and c
- f. Both b and c
Answer: e (Both a and c)
Explanation: You can load the readr package specifically using `library(readr)`, or load the entire `tidyverse` suite (which includes `readr`) using `library(tidyverse)`.
Question 21
Consider the following data.frame `df0`:
x | y |
---|---|
NA | 7 |
2 | NA |
3 | 9 |
What is the result of `mean(df0$y)`?
- a. 7
- b. NA
- c. 8
- d. 9
Answer: b
Explanation: By default, the `mean()` function in R returns `NA` if there are any missing values (`NA`) in the data (unless the option `na.rm = TRUE` is specified).
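This default can be checked with a small vector mirroring the `y` column:

```r
y <- c(7, NA, 9)
mean(y)                # NA: any missing value propagates by default
mean(y, na.rm = TRUE)  # 8: missing values are dropped before averaging
```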
Questions 22-23
Consider the following data.frame `df` for Questions 22-23:
id | name | age | score |
---|---|---|---|
1 | Anna | 22 | 90 |
2 | Ben | 28 | 85 |
3 | Carl | NA | 95 |
4 | Dana | 35 | NA |
5 | Ella | 40 | 80 |
Question 22
Which of the following code snippets filters observations where `score` is strictly between 85 and 95 (i.e., excluding 85 and 95)?
- a. `df |> filter(score >= 85 | score <= 95)`
- b. `df |> filter(score > 85 | score < 95)`
- c. `df |> filter(score > 85 & score < 95)`
- d. `df |> filter(score >= 85 & score <= 95)`
Answer: c
Explanation: Option c keeps rows where `score` is greater than 85 and less than 95, excluding the boundary values.
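A quick check of option c, assuming the dplyr package is installed and rebuilding `df` from the table above:

```r
library(dplyr)  # assumed to be installed

df <- data.frame(
  id = 1:5,
  name = c("Anna", "Ben", "Carl", "Dana", "Ella"),
  age = c(22, 28, NA, 35, 40),
  score = c(90, 85, 95, NA, 80)
)
df |> filter(score > 85 & score < 95)  # keeps only Anna (score 90)
```

Note that the row with a missing `score` is also dropped, because a comparison with `NA` evaluates to `NA` and `filter()` keeps only rows where the condition is `TRUE`.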
Question 23
Which of the following expressions correctly keeps observations from `df` where the `age` variable does not have any missing values?
- a. `df |> filter(is.na(age))`
- b. `df |> filter(!is.na(age))`
- c. `df |> filter(age == NA)`
- d. `df |> filter(age != NA)`
- e. Both a and c
- f. Both b and d
Answer: b
Explanation: The expression `!is.na(age)` filters out rows where `age` is `NA`, keeping only those with non-missing values. Comparisons with `NA` (options c and d) always evaluate to `NA`, so they keep no rows at all.
Question 24
Consider the following data.frame `df3`:
id | value |
---|---|
1 | 15 |
1 | 15 |
2 | 25 |
3 | 35 |
3 | 35 |
4 | 45 |
5 | 55 |
Which of the following code snippets returns a data.frame of unique `id` values from `df3`?
- a. `df3 |> select(id) |> distinct()`
- b. `df3 |> distinct(value)`
- c. `df3 |> distinct(id)`
- d. Both a and c
Answer: d
Explanation:
- Option a: `df3 |> select(id) |> distinct()` first selects the `id` variable from `df3` and then applies `distinct()` to remove duplicate entries, resulting in a data.frame of unique `id` values.
- Option c: `df3 |> distinct(id)` applies `distinct()` directly to the `id` variable, achieving the same result as option a.
- Option d: Both a and c correctly return a data.frame of unique `id` values.
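A quick check, assuming dplyr is installed and rebuilding `df3` from the table above:

```r
library(dplyr)  # assumed to be installed

df3 <- data.frame(id = c(1, 1, 2, 3, 3, 4, 5),
                  value = c(15, 15, 25, 35, 35, 45, 55))
df3 |> distinct(id)  # a one-column data.frame with ids 1, 2, 3, 4, 5
```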
Question 25
Which of the following code snippets correctly renames the variable `name` to `first_name` in `df`?
- a. `df |> rename(first_name = name)`
- b. `df |> rename(name = first_name)`
- c. `df |> rename("name" = "first_name")`
- d. `df |> rename_variable(name = first_name)`
Answer: a
Explanation: Option a, `df |> rename(first_name = name)`, correctly renames the existing column `name` to `first_name` using the `rename()` function from the dplyr package. The format is `new_name = old_name`.
Question 26
Which of the following code snippets correctly removes the `score` variable from `df`?
- a. `df |> select(-score)`
- b. `df |> select(-"score")`
- c. `df |> select(!score)`
- d. `df |> select(, -score)`
- e. `df |> select(desc(score))`
Answer: either a or b
Explanation:
- Option a: `df |> select(-score)` is the correct and most straightforward way to remove the `score` variable from `df`, using the `select()` function with the minus (`-`) sign to indicate exclusion.
- Option b: `df |> select(-"score")` can also work, but it is less conventional and may behave differently depending on the version of dplyr or the specific usage context.
Question 27
Which of the following code snippets filters observations where `age` is not `NA`, then arranges them in ascending order of `age`, and then selects the `name` and `age` variables?
- a. `df |> filter(!is.na(age)) |> arrange(age) |> select(name, age)`
- b. `df |> select(name, age) |> arrange(age) |> filter(!is.na(age))`
- c. `df |> arrange(age) |> filter(!is.na(age)) |> select(name, age)`
- d. `df |> filter(is.na(age)) |> arrange(desc(age)) |> select(name, age)`
- e. All of the above
Answer: a
Explanation: Option a performs all required operations in the specified order:
- `filter(!is.na(age))` removes rows where `age` is `NA`.
- `arrange(age)` sorts the remaining data in ascending order of `age`.
- `select(name, age)` keeps only the `name` and `age` variables.
Question 28
Consider the two related data.frames, `students` and `majors`:
students
student_id | name | age |
---|---|---|
1 | Brad | 20 |
2 | Jason | 22 |
4 | Marcie | 21 |
majors
student_id | major |
---|---|
1 | Business Administration |
2 | Economics |
3 | Data Analytics |
Which of the following R code correctly joins the two related data.frames, `students` and `majors`, to produce the resulting data.frame shown below?
student_id | major | name | age |
---|---|---|---|
1 | Business Administration | Brad | 20 |
2 | Economics | Jason | 22 |
3 | Data Analytics | NA | NA |
- a. `students |> left_join(majors)`
- b. `majors |> left_join(students)`
- c. Both a and b
Answer: b
Explanation:
Option b, `majors |> left_join(students)`, correctly sets `majors` as the left data.frame, ensuring all records from `majors` are retained. For the value of `student_id` 3, which does not exist in `students`, the `name` and `age` variables are filled with `NA`, matching the resulting data.frame shown above. By default, `left_join()` joins on the common variable, here `student_id`.
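The join can be verified with dplyr (assumed to be installed); spelling out `by = "student_id"` makes the join key explicit:

```r
library(dplyr)  # assumed to be installed

students <- data.frame(student_id = c(1, 2, 4),
                       name = c("Brad", "Jason", "Marcie"),
                       age = c(20, 22, 21))
majors <- data.frame(student_id = 1:3,
                     major = c("Business Administration",
                               "Economics",
                               "Data Analytics"))
majors |> left_join(students, by = "student_id")
# student_id 3 has no match in students, so name and age become NA
```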
Section 5. Short Essay
Question 29
In R, what does the function `sd(x)` compute, and why can it be more useful than `var(x)`?
Answer:
The function `sd(x)` in R computes the standard deviation of the numeric vector `x`. Standard deviation measures the amount of variation or dispersion in a set of values around their mean. It can be more useful than `var(x)`, which calculates the variance, because the standard deviation is expressed in the same units as the original data `x`, making it more interpretable. For example, if the data represent heights in centimeters, the standard deviation will also be in centimeters, allowing for a direct understanding of variability. In contrast, variance is in squared units, which can be less intuitive.
Question 30
List at least four applications of data analytics in sports analytics mentioned in the lecture, and briefly describe each one.
Answer:
- Player Performance Analysis:
- Data analytics is used to evaluate individual player statistics, such as scoring efficiency, defensive actions, and endurance. This helps in identifying strengths and areas for improvement, informing coaching decisions and player development programs.
- Injury Prediction and Prevention:
- By analyzing data on player movements, workloads, and physiological metrics, teams can predict potential injuries before they occur. This proactive approach allows for tailored training regimens and rest periods to minimize injury risks.
- Strategic Decision Making:
- Coaches and managers use data analytics to inform game strategies, such as optimal player lineups, tactical adjustments, and in-game decision-making. Analyzing opponents’ data also aids in developing effective game plans.
- Fan Engagement and Marketing:
- Teams leverage data analytics to understand fan behavior and preferences, enabling personalized marketing campaigns, enhanced in-stadium experiences, and targeted promotions. This fosters stronger fan loyalty and increases revenue through merchandise and ticket sales.
- Scouting and Recruitment:
- Data-driven scouting identifies talent by analyzing player statistics, performance metrics, and potential. This objective approach enhances the recruitment process, ensuring that teams acquire players who fit their strategic needs and have the potential for future success.
- Revenue Optimization:
- Teams use analytics to optimize ticket pricing, merchandise sales, and concession offerings. By understanding demand patterns and consumer behavior, they can implement dynamic pricing strategies and tailor products to maximize revenue.