Midterm Exam 1
Classwork 7
Summary for Midterm Exam 1 Performance
Descriptive statistics for each part of the Midterm Exam 1 questions and the distribution of overall scores are summarized in the accompanying tables and figures (omitted here).

Section 1. True or False
Question 1
Data analytics and data science have clear-cut distinctions with no overlap in activities or skill sets.
- True
- False
Explanation: Data analytics and data science overlap in activities and skill sets. While data science may involve more advanced techniques like machine learning, both fields focus on extracting insights from data.
Question 2
Data lakes use a schema-on-write approach, processing data before storage.
- True
- False
Explanation: Data lakes use a schema-on-read approach, storing data in its raw form and applying structure only when the data is read. Schema-on-write, which processes data before storage, is used by data warehouses.
Question 3
In the MapReduce framework in Hadoop, the Reduce phase occurs before the Map phase.
- True
- False
Explanation: In Hadoop’s MapReduce framework, the Map phase occurs first to process and generate intermediate data, followed by the Reduce phase, which aggregates and processes the results from the Map phase.
Section 2. Multiple Choice
Question 4
Which of the following is NOT an example of a Business Intelligence (BI) tool mentioned in the lecture?
- Microsoft Power BI
- Tableau
- Looker
- Eclipse
Explanation: Eclipse is an integrated development environment (IDE) used primarily for software development, not a Business Intelligence (BI) tool, and it was not among the BI tools covered in the lecture.
Question 5
Which of the following is NOT a measure of dispersion?
- Range
- Variance
- Median
- Standard Deviation
Explanation: Median is a measure of central tendency. Range, variance, and standard deviation are measures of dispersion in a data set.
Question 6
Which of the following best describes veracity in big data?
- The speed at which data is generated
- The variety of data formats
- The accuracy and reliability of data
- The value derived from data
Explanation: Veracity refers to the trustworthiness, accuracy, and reliability of the data in big data contexts.
Question 7
Airbnb’s Dataportal was developed to:
- Store user-generated content
- Centralize data resources for easier access and analysis
- Provide customers with booking recommendations
- Replace their traditional database systems
Explanation: Airbnb developed Dataportal to centralize their data resources, making it easier for employees to access and analyze data efficiently.
Question 8
The stages of the ETL process are:
- Extract, Transform, Load
- Evaluate, Transform, Load
- Extract, Transfer, Link
- Extract, Transform, Log
Explanation: ETL stands for Extract, Transform, Load—the three stages involved in moving data from source systems to a data warehouse or database.
Question 9
Which statement about data lakes is TRUE?
- Data lakes process data using a schema-on-write approach
- Data lakes store only structured data
- Data lakes store all data types in raw, unprocessed form
- Data lakes are less flexible than data warehouses
Explanation: Data lakes store structured, semi-structured, and unstructured data in its raw form, allowing for flexible analysis and schema application upon reading the data.
Questions 10-13
For Questions 10-13, consider the following data.frame, `netflix_data`, displayed below:
UserID | Age | Gender | SubscriptionPlan | FavoriteGenre | HoursWatched | LastLoginTime |
---|---|---|---|---|---|---|
1 | 25 | Female | Basic | Drama | 15.5 | 23.5 |
2 | 34 | Male | Premium | Comedy | 20.3 | 8.2 |
3 | 28 | Female | Standard | Action | 12.0 | 15.3 |
4 | 45 | Male | Premium | Horror | 25.7 | 21.0 |
5 | 23 | Female | Basic | Sci-Fi | 8.2 | 2.5 |
6 | 37 | Male | Standard | Romance | 18.9 | 13.7 |
7 | 31 | Female | Premium | Documentary | 22.5 | 9.8 |
8 | 29 | Male | Basic | Thriller | 16.8 | 18.4 |
9 | 41 | Male | Standard | Animation | 19.4 | 7.1 |
10 | 26 | Female | Premium | Fantasy | 14.1 | 12.6 |
AccountMonths | Satisfaction | nDevices | LastMovieRating | nProfiles | Language |
---|---|---|---|---|---|
12 | 5 | 2 | 7.8 | 1 | English |
24 | 4 | 3 | 6.5 | 2 | Spanish |
6 | 3 | 1 | 8.2 | 1 | English |
36 | 5 | 4 | 5.9 | 3 | French |
4 | 2 | 1 | 9.0 | 1 | English |
18 | 4 | 2 | 7.1 | 2 | German |
30 | 5 | 3 | 8.5 | 2 | English |
9 | 3 | 1 | 6.8 | 1 | Spanish |
15 | 4 | 2 | 7.6 | 2 | English |
21 | 5 | 3 | 8.0 | 3 | French |
Description of Variables in `netflix_data`:
- `UserID`: Identifier for each user
- `Age`: Age of the user in years
- `Gender`: Gender of the user
- `SubscriptionPlan`: Type of Netflix subscription
- `FavoriteGenre`: User’s favorite genre
- `HoursWatched`: Average hours watched per week
- `LastLoginTime`: Time of last login in hours since midnight
- `AccountMonths`: Age of the account in months
- `Satisfaction`: User satisfaction rating (1 to 5 stars)
- `nDevices`: Number of devices connected
- `LastMovieRating`: Rating of the last watched movie (1.0 to 10.0)
- `nProfiles`: Number of profiles on the account
- `Language`: User’s preferred language
Question 10
What type of variable is `FavoriteGenre` in the dataset?
- Nominal
- Ordinal
- Interval
- Ratio
Explanation: `FavoriteGenre` is a categorical variable with no inherent ordering among the categories (e.g., Drama, Comedy, Action). Since there is no ranking or order, it is classified as a nominal variable.
Question 11
What type of variable is `SubscriptionPlan` in the dataset?
- Nominal
- Ordinal
- Interval
- Ratio
Explanation: `SubscriptionPlan` is an ordinal variable because the plans (Basic, Standard, Premium) have a meaningful order, but the differences between them are not numerically quantified. Ordinal variables have ordered categories, and in this case, “Premium” is higher than “Standard,” which is higher than “Basic.”
Question 12
What type of variable is `LastLoginTime` in the dataset?
- Nominal
- Ordinal
- Interval
- Ratio
Explanation: `LastLoginTime` is measured in hours since midnight, and the difference between times is meaningful. However, there is no true zero point (midnight is arbitrary), which makes it an interval variable rather than a ratio variable.
Question 13
What type of variable is `Satisfaction` in the dataset?
- Nominal
- Ordinal
- Interval
- Ratio
Explanation: `Satisfaction` is an ordinal variable because the satisfaction ratings (1 to 5 stars) indicate an order, but the differences between adjacent ratings are not necessarily equal. The order is meaningful, but the numerical spacing is not, making it ordinal.
Section 3. Filling-in-the-Blanks
Question 14:
The Royal Swedish Academy of Sciences has decided to award the 2024 Nobel Prize in Physics to U.S. scientist John J. Hopfield and British-Canadian Geoffrey E. Hinton for discoveries and inventions in ____________________, a field that enables computers to learn from and make predictions or decisions based on data, which paved the way for the artificial intelligence boom.
Answer: “machine learning”
Explanation: Machine learning is a branch of artificial intelligence (AI) that involves the development of algorithms that allow computers to learn from and make predictions or decisions based on data. Geoffrey Hinton is widely known for his contributions to deep learning, a subfield of machine learning, while John Hopfield has contributed to neural network models. These discoveries laid the foundation for modern AI technologies.
Question 15:
RStudio is an ________________________________________ (IDE) for R, providing a console, syntax-highlighting editor, and tools for plotting and debugging.
Answer: “integrated development environment”
Explanation: RStudio is an IDE that offers an interface for writing and executing R code. It provides a user-friendly console, code editor with syntax highlighting, and tools for creating visualizations, debugging, and managing R projects efficiently.
Question 16:
____________________ data refers to data that has a predefined format and fits into traditional databases.
Answer: “Structured”
Explanation: Structured data refers to data that is organized in a specific, predefined format, usually in rows and columns (like a table) that fit into traditional databases such as SQL. It is easy to search and analyze using relational database management systems.
Question 17:
A data ____________________ is a centralized repository that stores data in its raw, unaltered form, accommodating future analytical needs.
Answer: “lake”
Explanation: A data lake is a storage repository that holds large amounts of raw data in its native format until it is needed for analysis. Unlike data warehouses, data lakes can store unstructured, semi-structured, and structured data, allowing flexibility for future data processing and analysis.
Question 18:
In a relational database, each row is uniquely identified by a ____________________.
Answer: “key”
Explanation: In relational databases, a key is a unique identifier for each row (or record) in a table. It ensures that each record can be uniquely distinguished from others.
Question 19:
The distributed file system used by Hadoop to store data across multiple servers is called ____________________.
Answer: “HDFS (Hadoop Distributed File System)”
Explanation: HDFS is a distributed file system designed to run on commodity hardware. It is used by Hadoop to store large volumes of data across multiple servers, providing high fault tolerance and scalability. HDFS is a key component in the Hadoop ecosystem for processing big data.
Section 4. Data Analysis with R
Question 20
Which of the following R code snippets correctly assigns the `nycflights13::airlines` data.frame to the variable `df_airlines`? (Note that `df_airlines` is simply the name of the R object and can be any valid name in R.)
a. `nycflights13::airlines <- df_airlines`
b. `df_airlines <- nycflights13::airlines`
c. `nycflights13::airlines <= df_airlines`
d. `df_airlines == nycflights13::airlines`
Answer: b. `df_airlines <- nycflights13::airlines`
Explanation: The assignment operator in R is `<-`. The right-hand side value (`nycflights13::airlines`) is assigned to the left-hand variable (`df_airlines`). The other choices either use incorrect operators (`<=` is a comparison, `==` tests equality) or reverse the assignment direction.
Question 21
Which of the following R code snippets correctly calculates the number of elements in the vector `x <- c(1, 2, 3, 4, 5)`?
a. `nrow(x)`
b. `sd(x)`
c. `sum(x)`
d. `length(x)`
Answer: d. `length(x)`
Explanation: The `length()` function returns the number of elements in a vector. The other options calculate the number of rows (for data frames or matrices), the standard deviation, or the sum of the vector elements.
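As a quick check, here is a minimal sketch using the vector from the question:

```r
# Vector from the question
x <- c(1, 2, 3, 4, 5)

# length() returns the number of elements in a vector
n <- length(x)
n                 # 5

# For comparison: sum() adds the elements, and nrow() applies to
# matrices/data.frames, so it returns NULL for a plain vector
sum(x)            # 15
is.null(nrow(x))  # TRUE
```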
Question 22
Write the R code to create a new variable called `result` and assign to it the sum of 5 and 7.
Answer: ______________________________________________
Answer: `result <- 5 + 7`
Explanation: In R, you use the `<-` operator to assign values to variables. Here, the expression `5 + 7` calculates the sum, which is then assigned to the variable `result`.
Question 23
Given the data.frame `df` with variables `age` and `name`, which of the following expressions returns a vector containing the values in the `age` variable?
a. `df:age`
b. `df::age`
c. `df$age`
d. Both b and c
Answer: c. `df$age`
Explanation: In R, the `$` operator is used to access specific variables (columns) within a data frame. The expression `df$age` returns the `age` column from `df`. The other options use invalid syntax.
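A minimal sketch with a small hypothetical `df` (the column values are made up for illustration):

```r
# Hypothetical data.frame with the variables named in the question
df <- data.frame(name = c("Alice", "Bob"), age = c(25, 30))

# $ extracts a single column as a vector
ages <- df$age
ages             # 25 30
is.vector(ages)  # TRUE
```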
Question 24
The expression `as.numeric("123")` will return the numeric value 123.
a. True
b. False
Answer: a. True
Explanation: The `as.numeric()` function in R converts character data into numeric format, so `"123"` is correctly converted to the numeric value 123.
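This can be verified directly at the console:

```r
# as.numeric() converts a character string to a number
val <- as.numeric("123")
val              # 123
is.numeric(val)  # TRUE

# A string with no numeric interpretation converts to NA (with a warning)
bad <- suppressWarnings(as.numeric("abc"))
is.na(bad)       # TRUE
```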
Question 25
What is the result of the expression `(4 + 3) ^ 2` in R?
a. 3.5
b. 9
c. 14
d. 49
Answer: d. 49
Explanation: The expression `(4 + 3) ^ 2` first adds 4 and 3 to get 7, then raises 7 to the power of 2, which results in 49.
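The operator precedence can be checked directly:

```r
# Parentheses are evaluated first, then exponentiation
res <- (4 + 3) ^ 2
res        # 49

# Without parentheses, ^ binds tighter than +
4 + 3 ^ 2  # 13
```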
Question 26
Given vectors `a <- c(1, 2, 3)` and `b <- c(4, 5, 6)`, what is the result of `a + b`?
a. `c(5, 7, 9)`
b. `c(4, 5, 6, 1, 2, 3)`
c. `c(1, 2, 3, 4, 5, 6)`
d. Error
Answer: a. `c(5, 7, 9)`
Explanation: In R, adding two vectors of equal length is element-wise: each element of the result is the sum of the corresponding elements. So `a + b` evaluates to `c(1 + 4, 2 + 5, 3 + 6)`, or `c(5, 7, 9)`.
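A quick demonstration:

```r
a <- c(1, 2, 3)
b <- c(4, 5, 6)

# Addition of equal-length vectors is element-wise
s <- a + b
s  # 5 7 9
```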
Question 27
Which of the following functions is part of the tidyverse package and is used to read a CSV file into a data.frame?
a. `read.csv()`
b. `read_csv()`
c. `read.table()`
d. `load()`
Answer: b. `read_csv()`
Explanation: `read_csv()` comes from the readr package, which is part of the tidyverse, and is faster and more efficient than base R’s `read.csv()`. `read.csv()` and `read.table()` are base R functions for reading delimited text files, while `load()` loads R-specific binary files.
Question 28
To use the function `skim()` from the `skimr` package, you first need to load the package using the R code ________.
a. `library(skimr)`
b. `load(skimr)`
c. `skimr`
d. `skimr::skim`
Answer: a. `library(skimr)`
Explanation: The `library()` function is used to load installed R packages. To use the `skim()` function, you need to load the `skimr` package with `library(skimr)`.
Question 29
The `filter()` function can use both logical operators like `&` and comparison operators like `>` within the same logical condition.
a. True
b. False
Answer: a. True
Explanation: The `filter()` function from the dplyr package can combine multiple conditions using logical operators (`&`, `|`, etc.) and comparison operators (`>`, `<`, `==`) within the same logical statement.
Question 30
Consider the following data.frame `df0`:
x | y |
---|---|
1 | 4 |
2 | NA |
NA | 6 |
What is the result of `mean(df0$y)`?
a. 4
b. NA
c. 5
d. 6
Answer: b. NA
Explanation: When calculating the mean of a vector that contains `NA` values, R returns `NA` by default. Since `df0$y` contains an `NA`, the result is `NA`. To ignore missing values, use the `na.rm` argument: `mean(df0$y, na.rm = TRUE)` returns 5.
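A minimal sketch, rebuilding `df0` from the question:

```r
# df0 from the question (note the NA values)
df0 <- data.frame(x = c(1, 2, NA), y = c(4, NA, 6))

# By default, mean() propagates NA
mean(df0$y)                # NA

# na.rm = TRUE drops missing values before averaging
mean(df0$y, na.rm = TRUE)  # 5
```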
Questions 31-32
Consider the following data.frame `df` for Questions 31-32:
id | name | age | score |
---|---|---|---|
1 | Alice | 25 | 85 |
2 | Bob | 30 | 90 |
3 | Charlie | 35 | 75 |
4 | David | NA | 80 |
5 | Eve | 45 | NA |
Question 31
Which of the following code snippets keeps observations where `score` is between 80 and 90 inclusive?
a. `df |> filter(score > 80 & score < 90)`
b. `df |> filter(score >= 80 & score <= 90)`
c. `df |> filter(score >= 80 | score <= 90)`
d. `df |> filter(score > 80 | score < 90)`
Answer: b. `df |> filter(score >= 80 & score <= 90)`
Explanation: The `filter()` function in the tidyverse selects rows that meet certain conditions. Keeping scores between 80 and 90 inclusive requires `>=` and `<=` combined with `&`. Option a is incorrect because it excludes 80 and 90. Options c and d use `|` (OR); since every score satisfies at least one of the two conditions, they keep all rows with non-missing scores rather than only those in the intended range.
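A sketch of the difference, assuming the dplyr package is installed (the data.frame is rebuilt from the table above):

```r
library(dplyr)  # assumed installed

df <- data.frame(
  id    = 1:5,
  name  = c("Alice", "Bob", "Charlie", "David", "Eve"),
  age   = c(25, 30, 35, NA, 45),
  score = c(85, 90, 75, 80, NA)
)

# Inclusive range: >= and <= combined with & (AND)
kept <- df |> filter(score >= 80 & score <= 90)
kept$name  # "Alice" "Bob" "David"

# With | (OR), every non-missing score passes the condition
nrow(df |> filter(score >= 80 | score <= 90))  # 4
```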
Question 32
Which of the following expressions correctly keeps observations from `df` where the `age` variable has missing values?
a. `df |> filter(is.na(age))`
b. `df |> filter(!is.na(age))`
c. `df |> filter(age == NA)`
d. `df |> filter(age != NA)`
e. Both a and c
f. Both b and d
Answer: a. `df |> filter(is.na(age))`
Explanation: The function `is.na()` checks for missing values (`NA`). To filter observations with missing values in `age`, we use `filter(is.na(age))`. Option c does not work because comparisons with `NA`, such as `age == NA`, evaluate to `NA` rather than `TRUE` (use `is.na()` instead). Option b filters for non-missing values, which is the opposite of what the question asks.
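Again assuming dplyr is installed, the contrast between `is.na()` and `== NA` can be seen directly:

```r
library(dplyr)  # assumed installed

df <- data.frame(
  id   = 1:5,
  name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  age  = c(25, 30, 35, NA, 45)
)

# is.na() is the correct test for missing values
missing_age <- df |> filter(is.na(age))
missing_age$name  # "David"

# age == NA evaluates to NA for every row, so no rows are kept
nrow(df |> filter(age == NA))  # 0
```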
Question 33
The `arrange()` function can sort data based on multiple variables.
a. True
b. False
Answer: a. True
Explanation: The `arrange()` function from dplyr can sort data based on one or more variables. If multiple variables are specified, the data is sorted by the first variable and, in the case of ties, by the second, and so on.
Question 34
Consider the following data.frame `df3`:
id | value |
---|---|
1 | 10 |
2 | 20 |
2 | 20 |
3 | 30 |
4 | 40 |
4 | 40 |
5 | 50 |
Which of the following code snippets returns a data.frame of unique `id` values from `df3`?
a. `df3 |> select(id) |> distinct()`
b. `df3 |> distinct(value)`
c. `df3 |> distinct(id)`
d. Both a and c
Answer: d. Both a and c
Explanation: The `distinct()` function in the tidyverse removes duplicate observations. Both `df3 |> select(id) |> distinct()` and `df3 |> distinct(id)` return a data.frame with unique `id` values. Option b operates on the `value` variable, which is not what the question asks for.
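A sketch assuming dplyr is installed, rebuilding `df3` from the table above:

```r
library(dplyr)  # assumed installed

df3 <- data.frame(
  id    = c(1, 2, 2, 3, 4, 4, 5),
  value = c(10, 20, 20, 30, 40, 40, 50)
)

# Both forms return one row per unique id
u1 <- df3 |> select(id) |> distinct()
u2 <- df3 |> distinct(id)
u1$id  # 1 2 3 4 5
u2$id  # 1 2 3 4 5
```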
Question 35
Which of the following code snippets correctly renames the variable `age` to `years` in `df`?
a. `df |> rename(years = age)`
b. `df |> rename(age = years)`
c. `df |> rename("age" = "years")`
d. `df |> rename_variable(age = years)`
Answer: a. `df |> rename(years = age)`
Explanation: In the tidyverse, the syntax for renaming a variable is `rename(new_name = old_name)`. Hence, to rename `age` to `years`, you use `rename(years = age)`. Options b and c reverse the order, and `rename_variable()` is not a dplyr function.
Question 36
Which of the following code snippets correctly removes the `age` variable from `df`?
a. `df |> select(-age)`
b. `df |> select(-"age")`
c. `df |> select(!age)`
d. `df |> select(, -age)`
e. `df |> select(desc(age))`
Answer: a. `df |> select(-age)` and b. `df |> select(-"age")` (option a is preferred)
Explanation: To remove a variable in the tidyverse, use `select()` with a minus sign `-` before the variable name. Option a is the idiomatic form. Option b also works, but quoting the name (`-"age"`) is less conventional, and the other options use invalid approaches.
Question 37
Which of the following code snippets filters observations where `age` is not `NA`, then arranges them in descending order of `age`, and then selects the `name` and `age` variables?
a. `df |> filter(!is.na(age)) |> arrange(desc(age)) |> select(name, age)`
b. `df |> select(name, age) |> arrange(desc(age)) |> filter(!is.na(age))`
c. `df |> arrange(desc(age)) |> filter(!is.na(age)) |> select(name, age)`
d. `df |> filter(is.na(age)) |> arrange(age) |> select(name, age)`
Answer: a. `df |> filter(!is.na(age)) |> arrange(desc(age)) |> select(name, age)`
Explanation: This sequence first filters observations where `age` is not missing (`!is.na(age)`), then arranges them in descending order of `age`, and finally selects the `name` and `age` variables. The other options either mix up the sequence or apply incorrect filtering.
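The full pipeline, assuming dplyr is installed and rebuilding the `df` from Questions 31-32:

```r
library(dplyr)  # assumed installed

df <- data.frame(
  id    = 1:5,
  name  = c("Alice", "Bob", "Charlie", "David", "Eve"),
  age   = c(25, 30, 35, NA, 45),
  score = c(85, 90, 75, 80, NA)
)

# Drop missing ages, sort oldest first, keep only name and age
out <- df |>
  filter(!is.na(age)) |>
  arrange(desc(age)) |>
  select(name, age)
out$name  # "Eve" "Charlie" "Bob" "Alice"
```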
Question 38
Consider the two related data.frames, `df_1` and `df_2`:
`df_1`
id | name | age |
---|---|---|
1 | Alice | 19 |
2 | Bob | 21 |
4 | Olivia | 20 |
`df_2`
id | major |
---|---|
1 | Economics |
2 | Business Administration |
3 | Data Analytics |
Which of the following R code snippets correctly joins the two related data.frames, `df_1` and `df_2`, to produce the resulting data.frame shown below?
id | major | name | age |
---|---|---|---|
1 | Economics | Alice | 19 |
2 | Business Administration | Bob | 21 |
3 | Data Analytics | NA | NA |
a. `df_1 |> left_join(df_2)`
b. `df_2 |> left_join(df_1)`
c. Both a and b
Answer: b. `df_2 |> left_join(df_1)`
Explanation: A `left_join()` keeps all observations from the left data frame and matches them with observations from the right data frame. The desired result contains all three rows of `df_2`, including id 3 (Data Analytics), which has no match in `df_1` and therefore gets `NA` for `name` and `age`. This means `df_2` must be the left table, making option b correct.
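A sketch assuming dplyr is installed (the `by = "id"` argument is written out explicitly here; `left_join()` would otherwise infer it from the common column name):

```r
library(dplyr)  # assumed installed

df_1 <- data.frame(id   = c(1, 2, 4),
                   name = c("Alice", "Bob", "Olivia"),
                   age  = c(19, 21, 20))
df_2 <- data.frame(id    = c(1, 2, 3),
                   major = c("Economics", "Business Administration", "Data Analytics"))

# Keep every row of df_2; unmatched ids get NA for name and age
joined <- df_2 |> left_join(df_1, by = "id")
nrow(joined)    # 3
joined$name[3]  # NA (id 3 has no match in df_1)
```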
Section 5. Short Essay
Question 39
In R, what does the function `sd(x)` compute, and why can it be more useful than `var(x)`?
Answer:
The function `sd(x)` computes the standard deviation of the values in `x`, a measure of the amount of variation or dispersion in a set of values. It is often more useful than `var(x)` (which computes the variance) because the standard deviation is in the same units as the data, making it easier to interpret and compare across datasets.
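The relationship between the two can be checked numerically (the vector here is made up for illustration):

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# sd() is the square root of the sample variance var()
sd(x)
sqrt(var(x))
isTRUE(all.equal(sd(x), sqrt(var(x))))  # TRUE
```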
Question 40
What is the primary limitation of Hadoop’s MapReduce, and how is it addressed by technologies like Apache Storm and Apache Spark?
Answer:
The primary limitation of Hadoop’s MapReduce is its inefficiency for real-time and iterative processing. MapReduce processes data in batch mode, which leads to latency and makes it unsuitable for real-time analytics. This is addressed by technologies like Apache Storm and Apache Spark. Apache Storm enables real-time stream processing, and Apache Spark improves performance for both real-time and iterative tasks by utilizing in-memory processing and optimized execution engines.