Midterm Exam 1

Classwork 7

Author

Byeong-Hak Choe

Published

October 22, 2024

Modified

October 22, 2024

Summary for Midterm Exam 1 Performance

The following provides the descriptive statistics for each part of the Midterm Exam 1 questions:

The following describes the distribution of Midterm Exam 1 score:

Section 1. True or False

Question 1

Data analytics and data science have clear-cut distinctions with no overlap in activities or skill sets.

True
False

Answer: b. False

Explanation: Data analytics and data science overlap in activities and skill sets. While data science may involve more advanced techniques like machine learning, both fields focus on extracting insights from data.

Question 2

Data lakes use a schema-on-write approach, processing data before storage.

True
False

Answer: b. False

Explanation: Data lakes use a schema-on-read approach, storing data in its raw form and applying structure only when the data is read. Schema-on-write, which processes data before storage, is used by data warehouses.

Question 3

In the MapReduce framework in Hadoop, the Reduce phase occurs before the Map phase.

True
False

Answer: b. False

Explanation: In Hadoop’s MapReduce framework, the Map phase occurs first to process the input and generate intermediate data, followed by the Reduce phase, which aggregates the results from the Map phase.
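As a rough single-machine illustration of this ordering (not actual Hadoop code), base R's Map() and Reduce() can mimic the two phases. The docs vector below is made up for the example.

```r
# Hypothetical input: two small "documents".
docs <- c("big data big", "data lake")

# Map phase: process each document independently into an intermediate value
# (here, the number of words in the document).
words_per_doc <- Map(function(doc) length(strsplit(doc, " ")[[1]]), docs)

# Reduce phase: aggregate the intermediate values produced by the Map phase.
total_words <- Reduce(`+`, words_per_doc)

print(total_words)  # 5
```

The Reduce step has nothing to work on until the Map step has produced its intermediate values, which is why the order cannot be reversed.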

Section 2. Multiple Choice

Question 4

Which of the following is NOT an example of a Business Intelligence (BI) tool mentioned in the lecture?

Microsoft Power BI
Tableau
Looker
Eclipse

Answer: d. Eclipse

Explanation: Eclipse is an integrated development environment (IDE) used primarily for software development, not a Business Intelligence (BI) tool, and it was not covered in the lecture.

Question 5

Which of the following is NOT a measure of dispersion?

Range
Variance
Median
Standard Deviation

Answer: c. Median

Explanation: Median is a measure of central tendency. Range, variance, and standard deviation are measures of dispersion in a data set.
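The distinction can be checked in base R. The vector x below is an arbitrary example, not data from the exam.

```r
# Hypothetical sample values.
x <- c(2, 4, 4, 6, 9)

# Central tendency: where the middle of the data lies.
median(x)        # 4

# Dispersion: how spread out the data are.
diff(range(x))   # range as max - min: 7
var(x)           # sample variance: 7
sd(x)            # sample standard deviation: sqrt(7)
```

Note that range(x) itself returns the minimum and maximum; diff() turns that pair into a single spread value.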

Question 6

Which of the following best describes veracity in big data?

The speed at which data is generated
The variety of data formats
The accuracy and reliability of data
The value derived from data

Answer: c. The accuracy and reliability of data

Explanation: Veracity refers to the trustworthiness, accuracy, and reliability of the data in big data contexts.

Question 7

Airbnb’s Dataportal was developed to:

Store user-generated content
Centralize data resources for easier access and analysis
Provide customers with booking recommendations
Replace their traditional database systems

Answer: b. Centralize data resources for easier access and analysis

Explanation: Airbnb developed Dataportal to centralize their data resources, making it easier for employees to access and analyze data efficiently.

Question 8

The stages of the ETL process are:

Extract, Transform, Load
Evaluate, Transform, Load
Extract, Transfer, Link
Extract, Transform, Log

Answer: a. Extract, Transform, Load

Explanation: ETL stands for Extract, Transform, Load: the three stages involved in moving data from source systems to a data warehouse or database.

Question 9

Which statement about data lakes is TRUE?

Data lakes process data using a schema-on-write approach
Data lakes store only structured data
Data lakes store all data types in raw, unprocessed form
Data lakes are less flexible than data warehouses

Answer: c. Data lakes store all data types in raw, unprocessed form

Explanation: Data lakes store structured, semi-structured, and unstructured data in its raw form, allowing for flexible analysis and schema application upon reading the data.

Questions 10-13

For Questions 10-13, consider the following data.frame, netflix_data, displayed below:

UserID	Age	Gender	SubscriptionPlan	FavoriteGenre	HoursWatched	LastLoginTime
1	25	Female	Basic	Drama	15.5	23.5
2	34	Male	Premium	Comedy	20.3	8.2
3	28	Female	Standard	Action	12.0	15.3
4	45	Male	Premium	Horror	25.7	21.0
5	23	Female	Basic	Sci-Fi	8.2	2.5
6	37	Male	Standard	Romance	18.9	13.7
7	31	Female	Premium	Documentary	22.5	9.8
8	29	Male	Basic	Thriller	16.8	18.4
9	41	Male	Standard	Animation	19.4	7.1
10	26	Female	Premium	Fantasy	14.1	12.6

AccountMonths	Satisfaction	nDevices	LastMovieRating	nProfiles	Language
12	5	2	7.8	1	English
24	4	3	6.5	2	Spanish
6	3	1	8.2	1	English
36	5	4	5.9	3	French
4	2	1	9.0	1	English
18	4	2	7.1	2	German
30	5	3	8.5	2	English
9	3	1	6.8	1	Spanish
15	4	2	7.6	2	English
21	5	3	8.0	3	French

Description of Variables in `netflix_data`:

UserID: Identifier for each user
Age: Age of the user in years
Gender: Gender of the user
SubscriptionPlan: Type of Netflix subscription
FavoriteGenre: User’s favorite genre
HoursWatched: Average hours watched per week
LastLoginTime: Time of last login in hours since midnight
AccountMonths: Age of the account in months
Satisfaction: User satisfaction rating (1 to 5 stars)
nDevices: Number of devices connected
LastMovieRating: Rating of the last watched movie (1.0 to 10.0)
nProfiles: Number of profiles on the account
Language: User’s preferred language

Question 10

What type of variable is FavoriteGenre in the dataset?

Nominal
Ordinal
Interval
Ratio

Answer: a. Nominal

Explanation: FavoriteGenre is a categorical variable with no inherent ordering among the categories (e.g., Drama, Comedy, Action). Since there is no ranking or order, it is classified as a nominal variable.

Question 11

What type of variable is SubscriptionPlan in the dataset?

Nominal
Ordinal
Interval
Ratio

Answer: b. Ordinal

Explanation: SubscriptionPlan is an ordinal variable because the plans (Basic, Standard, Premium) have a meaningful order, but the difference between them is not numerically quantified. Ordinal variables have ordered categories, and in this case, “Premium” is higher than “Standard,” which is higher than “Basic.”
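One way to encode this difference in R is with factors. The sketch below uses a few values in the style of netflix_data; the object names plan, plan_ord, and genre are chosen for illustration only.

```r
# A few subscription plans, as plain character strings.
plan <- c("Basic", "Premium", "Standard", "Premium", "Basic")

# Ordinal: an ordered factor records that Basic < Standard < Premium.
plan_ord <- factor(plan,
                   levels = c("Basic", "Standard", "Premium"),
                   ordered = TRUE)

# Order comparisons are meaningful for an ordered factor.
plan_ord < "Premium"   # TRUE FALSE TRUE FALSE TRUE
max(plan_ord)          # Premium

# Nominal: an unordered factor has categories but no ranking.
genre <- factor(c("Drama", "Comedy", "Action"))
is.ordered(genre)      # FALSE
```

Even for the ordered factor, arithmetic such as subtraction is not defined, which matches the point that the gaps between plans are not quantified.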

Question 12

What type of variable is LastLoginTime in the dataset?

Nominal
Ordinal
Interval
Ratio

Answer: c. Interval

Explanation: LastLoginTime is measured in hours since midnight, and the difference between times is meaningful. However, there is no true zero point (midnight is arbitrary), which makes it an interval variable rather than a ratio variable.

Question 13

What type of variable is Satisfaction in the dataset?

Nominal
Ordinal
Interval
Ratio

Answer: b. Ordinal

Explanation: Satisfaction is an ordinal variable because the satisfaction ratings (from 1 to 5 stars) indicate an order, but the differences between them are not necessarily equal or meaningful in the same way. The order is significant, but the numerical spacing is not uniform, making it ordinal.

Section 3. Filling-in-the-Blanks

Question 14:

The Royal Swedish Academy of Sciences has decided to award the 2024 Nobel Prize in Physics to U.S. scientist John J. Hopfield and British-Canadian Geoffrey E. Hinton for discoveries and inventions in ____________________, a field that enables computers to learn from and make predictions or decisions based on data, which paved the way for the artificial intelligence boom.

Answer: “machine learning”

Explanation: Machine learning is a branch of artificial intelligence (AI) that involves the development of algorithms that allow computers to learn from and make predictions or decisions based on data. Geoffrey Hinton is widely known for his contributions to deep learning, a subfield of machine learning, while John Hopfield has contributed to neural network models. These discoveries laid the foundation for modern AI technologies.

Question 15:

RStudio is an ________________________________________ (IDE) for R, providing a console, syntax-highlighting editor, and tools for plotting and debugging.

Answer: “integrated development environment”

Explanation: RStudio is an IDE that offers an interface for writing and executing R code. It provides a user-friendly console, code editor with syntax highlighting, and tools for creating visualizations, debugging, and managing R projects efficiently.

Question 16:

____________________ data refers to data that has a predefined format and fits into traditional databases.

Answer: “Structured”

Explanation: Structured data refers to data that is organized in a specific, predefined format, usually in rows and columns (like a table) that fit into traditional databases such as SQL. It is easy to search and analyze using relational database management systems.

Question 17:

A data ____________________ is a centralized repository that stores data in its raw, unaltered form, accommodating future analytical needs.

Answer: “lake”

Explanation: A data lake is a storage repository that holds large amounts of raw data in its native format until it is needed for analysis. Unlike data warehouses, data lakes can store unstructured, semi-structured, and structured data, allowing flexibility for future data processing and analysis.

Question 18:

In a relational database, each row is uniquely identified by a ____________________.

Answer: “key”

Explanation: In relational databases, a key (more specifically, a primary key) is a column or set of columns whose values uniquely identify each row (or record) in a table. It ensures that each record can be distinguished from all others.

Question 19:

The distributed file system used by Hadoop to store data across multiple servers is called ____________________.

Answer: “HDFS (Hadoop Distributed File System)”

Explanation: HDFS is a distributed file system designed to run on commodity hardware. It is used by Hadoop to store large volumes of data across multiple servers, providing high fault tolerance and scalability. HDFS is a key component in the Hadoop ecosystem for processing big data.

Section 4. Data Analysis with R

Question 20

Which of the following lines of R code correctly assigns the nycflights13::airlines data.frame to the variable df_airlines? (Note that df_airlines is simply the name of the R object and can be any valid name in R.)

nycflights13::airlines <- df_airlines
df_airlines <- nycflights13::airlines
nycflights13::airlines <= df_airlines
df_airlines == nycflights13::airlines

Answer: b. df_airlines <- nycflights13::airlines

Explanation: The correct assignment operator in R is <-. The right-hand side value (nycflights13::airlines) is assigned to the left-hand variable (df_airlines). Option a reverses the direction of the assignment, while <= and == are comparison operators and do not assign anything.

Question 21

Which of the following R expressions correctly calculates the number of elements in the vector x <- c(1,2,3,4,5)?

nrow(x)
sd(x)
sum(x)
length(x)

Answer: d. length(x)

Explanation: The length() function returns the number of elements in a vector. The other options either calculate the number of rows (for data frames or matrices) or the sum or standard deviation of the vector elements.
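A quick check in base R, using the vector from the question, shows how the four options behave:

```r
x <- c(1, 2, 3, 4, 5)

n_elements <- length(x)  # 5: the number of elements
n_rows <- nrow(x)        # NULL: a plain vector has no rows
total <- sum(x)          # 15: the sum of the elements
spread <- sd(x)          # the standard deviation, not a count
```

Because nrow() returns NULL rather than an error for a vector, it can fail silently, so length() is the safer habit for vectors.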

Question 22

Write the R code to create a new variable called result and assign to it the sum of 5 and 7 in R.

Answer: ______________________________________________

Answer: result <- 5 + 7

Explanation: In R, you use the <- operator to assign values to variables. Here, the expression 5 + 7 calculates the sum, which is then assigned to the variable result.

Question 23

Given the data.frame df with variables age and name, which of the following expressions returns a vector containing the values in the age variable?

df:age
df::age
df$age
Both b and c

Answer: c. df$age

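Explanation: In R, the $ operator is used to access specific variables (columns) within a data frame. The df$age expression will return the age column from df as a vector. The other options do not work: : is the sequence operator, and :: looks up an object inside a package, so neither extracts a column.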
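A minimal sketch of column extraction, assuming a small made-up data.frame with the two variables named in the question:

```r
# Hypothetical data.frame with the variables age and name.
df <- data.frame(name = c("Alice", "Bob"), age = c(25, 30))

ages <- df$age            # numeric vector: 25 30
same_ages <- df[["age"]]  # equivalent extraction using [[ ]]

is.vector(ages)           # TRUE
```

The [[ ]] form is handy when the column name is stored in a variable, since $ only accepts a literal name.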

Question 24

The expression as.numeric("123") will return the numeric value 123.

True
False

Answer: a. True

Explanation: The as.numeric() function in R converts character data into numeric format, so “123” will correctly be converted to the numeric value 123.
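The conversion can be verified directly. The second case below, a string that is not a number, is an added illustration and not part of the question.

```r
converted <- as.numeric("123")
converted          # 123
class(converted)   # "numeric"

# A string that cannot be parsed becomes NA (R also issues a warning,
# suppressed here to keep the output clean).
not_a_number <- suppressWarnings(as.numeric("abc"))
is.na(not_a_number)  # TRUE
```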

Question 25

What is the result of the expression (4 + 3) ^ 2 in R?

Answer: d. 49

Explanation: The expression (4 + 3) ^ 2 first adds 4 and 3 to get 7, then raises 7 to the power of 2, which results in 49.

Question 26

Given vectors a <- c(1, 2, 3) and b <- c(4, 5, 6), what is the result of a + b?

c(5, 7, 9)
c(4, 5, 6, 1, 2, 3)
c(1, 2, 3, 4, 5, 6)
Error

Answer: a. c(5, 7, 9)

Explanation: In R, adding two vectors element-wise results in a new vector where each element is the sum of corresponding elements. So, a + b results in c(1 + 4, 2 + 5, 3 + 6) or c(5, 7, 9).
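The difference between element-wise addition and concatenation, which the distractor options describe, can be seen side by side:

```r
a <- c(1, 2, 3)
b <- c(4, 5, 6)

elementwise <- a + b   # 5 7 9: corresponding elements are added
combined <- c(a, b)    # 1 2 3 4 5 6: concatenation, not addition
```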

Question 27

Which of the following functions is part of the tidyverse package and is used to read a CSV file into a data.frame?

read.csv()
read_csv()
read.table()
load()

Answer: b. read_csv()

Explanation: read_csv() is a function from the readr package, which is part of the tidyverse. It is faster and more efficient than base R’s read.csv(). read.csv() is from base R, and read.table() is used to read data from a text file, while load() is used for loading R-specific binary files.

Question 28

To use the function skim() from the skimr package, you first need to load the package using the R code ________.

library(skimr)
load(skimr)
skimr
skimr::skim

Answer: a. library(skimr)

Explanation: The library() function is used to load R packages that are installed. To use the skim() function, you need to load the skimr package with library(skimr).

Question 29

The filter() function can use both logical operators like & and comparison operators like > within the same logical condition.

True
False

Answer: a. True

Explanation: The filter() function from the dplyr package can combine multiple conditions using logical operators (&, |, etc.) and comparison operators (>, <, ==) within the same logical statement.

Question 30

Consider the following data.frame df0:

x	y
1	4
2	NA
NA	6

What is the result of mean(df0$y)?

4
NA
5
6

Answer: b. NA

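Explanation: When calculating the mean of a vector that contains NA values, R returns NA by default. Since df0$y contains an NA, the result is NA. To ignore missing values, set na.rm = TRUE: mean(df0$y, na.rm = TRUE) returns 5. Note that skim() does not remove missing values from the data; it reports summary statistics computed with the missing values excluded.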
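The behavior can be reproduced with the df0 table from the question:

```r
# Rebuild df0 as shown in the question.
df0 <- data.frame(x = c(1, 2, NA), y = c(4, NA, 6))

default_mean <- mean(df0$y)               # NA: one missing value is enough
clean_mean <- mean(df0$y, na.rm = TRUE)   # 5: the mean of 4 and 6
```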

Questions 31-32

Consider the following data.frame df for Questions 31-32:

id	name	age	score
1	Alice	25	85
2	Bob	30	90
3	Charlie	35	75
4	David	NA	80
5	Eve	45	NA

Question 31

Which of the following code snippets keeps observations where score is between 80 and 90 inclusive?

df |> filter(score > 80 & score < 90)
df |> filter(score >= 80 & score <= 90)
df |> filter(score >= 80 | score <= 90)
df |> filter(score > 80 | score < 90)

Answer: b. df |> filter(score >= 80 & score <= 90)

Explanation: The filter() function in tidyverse allows for selecting rows that meet certain conditions. The correct condition for keeping scores between 80 and 90 inclusive requires using >= and <=. Option a is incorrect because it excludes 80 and 90, while options c and d use | (OR), which selects scores either greater than or equal to 80 or less than or equal to 90, which is not the intended condition.

Question 32

Which of the following expressions correctly keeps observations from df where the age variable has missing values?

df |> filter(is.na(age))
df |> filter(!is.na(age))
df |> filter(age == NA)
df |> filter(age != NA)
Both a and c
Both b and d

Answer: a. df |> filter(is.na(age))

Explanation: The function is.na() checks for missing values (NA). To filter observations with missing values in age, we use filter(is.na(age)). Option c is incorrect because any comparison with NA, including age == NA, evaluates to NA rather than TRUE, so filter() keeps no rows; option d fails for the same reason. Option b filters for non-missing values, which is not what the question asks.
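The filters from Questions 31 and 32 can be tried on the df table above. This sketch assumes the dplyr package is installed; the result names are chosen for illustration.

```r
library(dplyr)

# Rebuild df as shown for Questions 31-32.
df <- data.frame(
  id = 1:5,
  name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  age = c(25, 30, 35, NA, 45),
  score = c(85, 90, 75, 80, NA)
)

# Question 31: scores between 80 and 90 inclusive.
# Eve's NA score is dropped because the condition evaluates to NA.
in_range <- df |> filter(score >= 80 & score <= 90)   # Alice, Bob, David

# Question 32: rows where age is missing.
missing_age <- df |> filter(is.na(age))               # David

# Comparing with NA yields NA for every row, so nothing is kept.
no_rows <- df |> filter(age == NA)                    # zero rows
```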

Question 33

The arrange() function can sort data based on multiple variables.

True
False

Answer: a. True

Explanation: The arrange() function from dplyr can sort data based on one or more variables. If multiple variables are specified, the data is sorted by the first variable, and in the case of ties, by the second, and so on.
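A small sketch of sorting by two variables, assuming dplyr is installed. The plans data.frame is made up for illustration.

```r
library(dplyr)

# Hypothetical data with ties in the first sorting variable.
plans <- data.frame(
  plan = c("Basic", "Premium", "Basic", "Premium"),
  hours = c(8.2, 20.3, 15.5, 25.7)
)

# Sort by plan first; within each plan, sort by hours in descending order.
sorted <- plans |> arrange(plan, desc(hours))
sorted$hours   # 15.5 8.2 25.7 20.3
```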

Question 34

Consider the following data.frame df3:

id	value
1	10
2	20
2	20
3	30
4	40
4	40
5	50

Which of the following code snippets returns a data.frame of unique id values from df3?

df3 |> select(id) |> distinct()
df3 |> distinct(value)
df3 |> distinct(id)
Both a and c

Answer: d. Both a and c

Explanation: The distinct() function in tidyverse removes duplicate observations. Both df3 |> select(id) |> distinct() and df3 |> distinct(id) will return a data.frame with unique id values. Option b operates on the value variable, which is not relevant for this question.
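Both approaches can be compared on df3 from the question, assuming dplyr is installed. The result names are illustrative.

```r
library(dplyr)

# Rebuild df3 as shown in the question.
df3 <- data.frame(
  id = c(1, 2, 2, 3, 4, 4, 5),
  value = c(10, 20, 20, 30, 40, 40, 50)
)

# Keep only the id column, then drop duplicate rows.
via_select <- df3 |> select(id) |> distinct()

# Ask distinct() for unique values of id directly.
via_distinct <- df3 |> distinct(id)

identical(via_select$id, via_distinct$id)   # TRUE: both give 1 2 3 4 5
```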

Question 35

Which of the following code snippets correctly renames the variable age to years in df?

df |> rename(years = age)
df |> rename(age = years)
df |> rename("age" = "years")
df |> rename_variable(age = years)

Answer: a. df |> rename(years = age)

Explanation: In tidyverse, the correct syntax for renaming a variable is rename(new_name = old_name). Hence, to rename age to years, you need to use rename(years = age).
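The new_name = old_name order is easy to confirm on a small made-up data.frame, assuming dplyr is installed:

```r
library(dplyr)

# Hypothetical data.frame containing an age variable.
df <- data.frame(name = c("Alice", "Bob"), age = c(25, 30))

# New name on the left, existing name on the right.
renamed <- df |> rename(years = age)
names(renamed)   # "name" "years"
```

Writing rename(age = years) instead would fail here, because df has no column called years to rename.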

Question 36

Which of the following code snippets correctly removes the age variable from df?

df |> select(-age)
df |> select(-"age")
df |> select(!age)
df |> select(, -age)
df |> select(desc(age))

Answer: a. df |> select(-age) and b. df |> select(-"age") (a is preferred though)

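Explanation: To remove a variable in tidyverse, use select() with a minus sign before the variable name. Option a correctly removes age. Option b also works, but quoting the name (-"age") is less idiomatic. Options d and e are invalid. Note that option c, select(!age), also drops age in dplyr 1.0.0 and later, where ! takes the complement of a selection, although a remains the preferred form.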

Question 37

Which of the following code snippets filters observations where age is not NA, then arranges them in descending order of age, and then selects the name and age variables?

df |> filter(!is.na(age)) |> arrange(desc(age)) |> select(name, age)
df |> select(name, age) |> arrange(desc(age)) |> filter(!is.na(age))
df |> arrange(desc(age)) |> filter(!is.na(age)) |> select(name, age)
df |> filter(is.na(age)) |> arrange(age) |> select(name, age)

Answer: a. df |> filter(!is.na(age)) |> arrange(desc(age)) |> select(name, age)

Explanation: This sequence of operations first filters observations where age is not missing (!is.na(age)), arranges them in descending order of age, and then selects the name and age variables. The other options either mix up the sequence or apply incorrect filtering.
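The pipeline in option a can be run on the df table used for Questions 31-32, assuming dplyr is installed:

```r
library(dplyr)

# Rebuild df as shown for Questions 31-32.
df <- data.frame(
  id = 1:5,
  name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  age = c(25, 30, 35, NA, 45),
  score = c(85, 90, 75, 80, NA)
)

result <- df |>
  filter(!is.na(age)) |>    # drop David, whose age is missing
  arrange(desc(age)) |>     # oldest first
  select(name, age)         # keep only the two requested variables

result$name   # "Eve" "Charlie" "Bob" "Alice"
```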

Question 38

Consider the two related data.frames, df_1 and df_2:

df_1

id	name	age
1	Alice	19
2	Bob	21
4	Olivia	20

df_2

id	major
1	Economics
2	Business Administration
3	Data Analytics

Which of the following lines of R code correctly joins the two related data.frames, df_1 and df_2, to produce the resulting data.frame shown below?

id	major	name	age
1	Economics	Alice	19
2	Business Administration	Bob	21
3	Data Analytics	NA	NA

df_1 |> left_join(df_2)
df_2 |> left_join(df_1)
both a and b

Answer: b. df_2 |> left_join(df_1)

Explanation: A left_join() keeps all the observations from the left data frame and matches them with observations from the right data frame. The expected result keeps every id from df_2 (1, 2, and 3), lists major before name and age, and fills name and age with NA for id 3, which has no match in df_1. This means df_2 must be the left table, which makes option b correct.
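Both join directions can be compared on the two tables from the question, assuming dplyr is installed. The by = "id" argument is added here to make the join key explicit; the exam options rely on dplyr detecting the shared id column automatically.

```r
library(dplyr)

# Rebuild the two tables from the question.
df_1 <- data.frame(
  id = c(1, 2, 4),
  name = c("Alice", "Bob", "Olivia"),
  age = c(19, 21, 20)
)
df_2 <- data.frame(
  id = c(1, 2, 3),
  major = c("Economics", "Business Administration", "Data Analytics")
)

# Option b: df_2 on the left keeps ids 1, 2, 3; id 3 gets NA for name and age.
option_b <- df_2 |> left_join(df_1, by = "id")

# Option a: df_1 on the left keeps ids 1, 2, 4; id 4 gets NA for major.
option_a <- df_1 |> left_join(df_2, by = "id")
```

Only option_b has the columns id, major, name, age and the ids 1, 2, 3 shown in the expected result.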

Section 5. Short Essay

Question 39

In R, what does the function sd(x) compute, and why can it be more useful than var(x)?

Answer:

The function sd(x) computes the standard deviation of the values in x, which is a measure of the amount of variation or dispersion in a set of values. It is often more useful than var(x) (which computes the variance) because standard deviation is in the same unit as the data, making it easier to interpret and compare across datasets.
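As an illustration of the units argument, the sketch below uses the HoursWatched values from netflix_data. The object names are chosen for the example.

```r
# HoursWatched from netflix_data (average hours watched per week).
hours <- c(15.5, 20.3, 12.0, 25.7, 8.2, 18.9, 22.5, 16.8, 19.4, 14.1)

hours_var <- var(hours)  # measured in squared hours
hours_sd <- sd(hours)    # measured in hours, like the data itself

# The standard deviation is the square root of the variance.
all.equal(hours_sd, sqrt(hours_var))   # TRUE
```

Because hours_sd is in hours, it can be read directly against the mean of hours, whereas hours_var cannot.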

Question 40

What is the primary limitation of Hadoop’s MapReduce, and how is it addressed by technologies like Apache Storm and Apache Spark?

Answer:

The primary limitation of Hadoop’s MapReduce is its inefficiency for real-time and iterative processing. MapReduce processes data in batch mode, which leads to latency and makes it unsuitable for real-time analytics. This is addressed by technologies like Apache Storm and Apache Spark. Apache Storm enables real-time stream processing, and Apache Spark improves performance for both real-time and iterative tasks by utilizing in-memory processing and optimized execution engines.
