Homework Assignment 1 - Example Answers

Author

Byeong-Hak Choe

Published

October 3, 2024

Modified

October 3, 2024

Descriptive Statistics

The following provides the descriptive statistics for each part of the homework, as well as the final score of HW1:

R Basics

Question 1

Compute the weighted mean of the vector scores <- c(85, 90, 88, 92, 87) with corresponding weights weights <- c(1, 2, 1, 1, 3).

Answer:

# Define the scores vector
scores <- c(85, 90, 88, 92, 87)

# Define the corresponding weights vector
weights <- c(1, 2, 1, 1, 3)

# Compute the weighted mean by summing the products of scores and weights and dividing by the sum of weights
weighted_mean <- sum(scores * weights) / sum(weights)

Therefore the weighted mean is 88.25.

To clarify the difference between the arithmetic mean and the weighted mean:
- Arithmetic mean: This is the simple average where all values are treated equally. You sum all the values and divide by the number of values.
- Weighted mean: This takes into account the weights assigned to each value. Each value is multiplied by its corresponding weight, and then the sum of these products is divided by the total sum of the weights.
Example:
- Let’s say you have a vector of values: x <- c(2, 4, 6). And corresponding weights: w <- c(1, 2, 3).
- For the arithmetic mean, you would ignore the weights and simply find the average of the values:
  - (Arithmetic mean) = (2 + 4 + 6) / 3 = 4
- For the weighted mean, you would multiply each value by its corresponding weight and then divide by the sum of the weights:
  - (Weighted mean) = ( (1 * 2) + (2 * 4) + (3 * 6) ) / ( 1 + 2 + 3 ) = 4.67

Question 2

Compute the interquartile range (IQR) of the following vector x <- c(5, 7, 6, 9, 100, 8, 5, 7, 6) manually, without using the IQR() function.

Answer:

# Define the data vector
x <- c(5, 7, 6, 9, 100, 8, 5, 7, 6)

# Compute the first quartile (Q1)
q1 <- quantile(x, 0.25)

# Compute the third quartile (Q3)
q3 <- quantile(x, 0.75)

# Calculate the interquartile range (IQR)
iqr_value <- q3 - q1

Therefore the IQR is 2.

Question 3

Detect the outliers in the vector x <- c(5, 7, 6, 9, 100, 8, 5, 7, 6) using the 1.5*IQR rule.

Answer:

# Calculate the lower bound for outliers using Q1 and the IQR
lower_bound <- q1 - 1.5 * iqr_value

# Calculate the upper bound for outliers using Q3 and the IQR
upper_bound <- q3 + 1.5 * iqr_value

# Find any values in the data vector that are below the lower bound (lower outliers)
outliers_lower <- x[ x < lower_bound ]

# Find any values in the data vector that are above the upper bound (upper outliers)
outliers_upper <- x[ x > upper_bound ]

There is one single value, 100, that belongs to outliers_upper. There is no lower outlier.

This rule is described in this lecture slide.
Although we primarily use filter() with data.frame, understanding vector indexing is a fundamental skill in data analysis.
- Lecture 9 covers vector indexing.

Question 4

Calculate the skewness of the vector x <- c(3, 5, 8, 12, 14, 15, 18, 20) without using any external R package. Skewness is defined as: \[ \text{Skewness} \,=\, \frac{n}{(n-1)(n-2)}\sum\left(\frac{x_{i}-\bar{x}}{s}\right)^{3} \] where \(s\) is the standard deviation of the vector x.

Answer:

# Define the data vector
x <- c(3, 5, 8, 12, 14, 15, 18, 20)

# Calculate the number of elements (n) in the vector
n <- length(x)

# Compute the mean of the vector
mean_x <- mean(x)

# Compute the standard deviation of the vector
sd_x <- sd(x)

# Calculate skewness using the formula provided
skewness <- ( n / ( (n-1) * (n-2) ) ) * sum( ( (x - mean_x)/sd_x )^3 )

Therefore skewness is approximately -0.2337.

You do not need to memorize the formula for skewness in our course.
The question is about translating a complex formula into R code.

Question 5

Calculate mode_v, the mode of a numeric vector v <- c(2, 3, 5, 5, 6, 7, 3) using the mfv() function provided by the R package modeest.

Answer:

# Define the vector of data
v <- c(2, 3, 5, 5, 6, 7, 3, 5)

# Use the modeest package's mfv() function to calculate the mode
mode_v <- modeest::mfv(v)

To use modeest::mfv(), we first install the modeest package:

# Install the modeest package
install.packages("modeest")

The notation modeest::mfv() indicates that we are using the mfv() function from the modeest package. The :: operator specifies which package the function or the object is coming from.

Question 6

Calculate, z, the vector for the standardized values in the vector x <- c(10, 20, 30, 40, 50). The standardized value of the individual value, \(z_{i}\), is defined as:

\[ z_{i} \,=\, \frac{x_{i} - \bar{x}}{s}, \] where
\(\quad\) - \(x_{i}\): \(i^{th}\) value in the vector \(x\)
\(\quad\) - \(\bar{x}\): the mean of values in \(x\)
\(\quad\) - \(s\): the standard deviation of values in \(x\)
\(\quad\) - \(z_{i}\): \(i^{th}\) value in the vector \(z\), the vector of the standardized values in \(x\).

Answer:

# Define the data vector
x <- c(10, 20, 30, 40, 50)

# Calculate the standardized values (z-scores) for each element in x
z <- (x - mean(x)) / sd(x)

Please note that the R objects mean_x and sd_x defined in Question 4 are not used in this question.

Short-Answer Questions

True/False

Data analytics aims to replace traditional business and economics courses. (False)
- Data analytics is seen as a complement to traditional business and economics courses, enhancing decision-making with data-driven insights rather than replacing foundational business knowledge.
Data analysts are the only professionals who benefit from data analytics skills. (False)
- Many professionals, including marketers, financial analysts, healthcare professionals, and even educators, benefit from data analytics skills as they apply to various fields.
Python and R are the only programming languages used in data science. (False)
- While Python and R are popular in data science, other languages like SQL, MATLAB, and even JavaScript are also used depending on the application.
The use of generative AI requires a deep understanding of the subject matter to apply it effectively. (True)
- To use generative AI effectively, it is essential to have a solid understanding of the subject area to interpret and apply the results accurately.
Machine learning algorithms need to be explicitly programmed for each task they perform. (False)
- Machine learning algorithms learn patterns from data, meaning they don’t need explicit programming for each task. Instead, they adapt and improve their performance based on the data provided.

Multiple Choice

Which of the following skills is NOT typically covered in traditional business or economics classes?
- Finding and cleaning datasets
- Why?: Data cleaning and preparation are often covered in data analytics but not typically in traditional business or economics courses.
Which of the following is a key reason why R is widely used in data analysis?
- It is specifically designed for statistical computing
- Why?: R was created for statistical analysis and is widely used because of its extensive libraries for data analysis and statistical modeling.
Which question would you ask to analyze season ticket renewals in sports analytics?
- Which type of fan engages most with team merchandise?
- Why?: This question helps to understand fan loyalty and engagement, which are important factors in predicting season ticket renewals.

While the above might be the most appropriate answer to this question, I also give full credit to the following responses:
- What factors drive last-minute individual seat ticket purchases?
  - What are the financial benefits of dynamic pricing?

Which of the following is a benefit of using Git in software development?
- It tracks changes and helps manage multiple versions of a project
- Why?: Git is a version control system that tracks changes and allows developers to manage multiple versions of a project, making collaboration more efficient.
Which of the following is a key application of data analytics in the retail sector?
- Enhancing physical store layouts based on customer behavior
- Why?: Data analytics helps retailers optimize store layouts by analyzing customer movement patterns, ultimately improving the customer experience and sales.

Short Eassay

Why are R and Python considered important tools for data analytics?
- R and Python are popular because they have a wide range of libraries and packages for data manipulation, visualization, and machine learning. They also support a large user community and are open-source, making them accessible and adaptable.
Explain why understanding the output of Generative AI tools like ChatGPT is important for data analysts or data scientists.
- Data analysts need to interpret the output of generative AI tools to ensure that the generated information is relevant, accurate, and aligned with the specific context of their analysis. This understanding helps prevent misinterpretation and misuse of AI-generated insights.
How does dynamic ticket pricing work in sports analytics?
- Dynamic ticket pricing adjusts the price of tickets in real-time based on demand, opponent strength, weather conditions, and other factors. This approach maximizes revenue by charging more when demand is high and less when demand is low.
How do business intelligence (BI) tools assist in decision-making for businesses?
- BI tools help businesses collect, process, and visualize data to support informed decision-making. They provide insights into trends, performance metrics, and areas of improvement, enabling businesses to make data-driven decisions.