x | y |
---|---|
1 | 4 |
2 | NA |
NA | 6 |
Midterm Exam 1
Fall 2024
Section 2. Multiple Choice
Question 4
Which of the following is NOT an example of a Business Intelligence (BI) tool mentioned in the lecture?
- Microsoft Power BI
- Tableau
- Looker
- Eclipse
- Eclipse
Explanation: Eclipse is an integrated development environment (IDE) used primarily for software development, not a Business Intelligence (BI) tool. We did not cover this.
Question 5
Which of the following is NOT a measure of dispersion?
- Range
- Variance
- Median
- Standard Deviation
- Median
Explanation: Median is a measure of central tendency. Range, variance, and standard deviation are measures of dispersion in a data set.
Section 3. Filling-in-the-Blanks
Question 15
RStudio is an ________________________________________ (IDE) for R, providing a console, syntax-highlighting editor, and tools for plotting and debugging.
"integrated development environment"
Explanation: RStudio is an IDE that offers an interface for writing and executing R code. It provides a user-friendly console, code editor with syntax highlighting, and tools for creating visualizations, debugging, and managing R projects efficiently.
Section 4. Data Analysis with R
Question 20
Which of the following R code snippets correctly assigns the `nycflights13::airlines` data.frame to the variable `df_airlines`? (Note that `df_airlines` is simply the name of the R object and can be any valid name in R.)
- `nycflights13::airlines <- df_airlines`
- `df_airlines <- nycflights13::airlines`
- `nycflights13::airlines <= df_airlines`
- `df_airlines == nycflights13::airlines`
- `df_airlines <- nycflights13::airlines`
Explanation: The assignment operator in R is `<-`. The right-hand side data.frame (`nycflights13::airlines`) is assigned to the left-hand object name (`df_airlines`). The other choices either use a comparison operator (`<=`, `==`) or assign in the wrong direction.
Question 21
Which of the following R code snippets correctly calculates the number of elements in a vector `x <- c(1,2,3,4,5)`?
- `nrow(x)`
- `sd(x)`
- `sum(x)`
- `length(x)`
- `length(x)`
Explanation: The `length()` function returns the number of elements in a vector. The other options calculate the number of rows (for a data.frame), the standard deviation, or the sum of the vector elements.
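As a quick check of the distinction above, the following sketch applies each of the four functions to the vector from the question. The object names on the left are chosen here for illustration.

```r
# Vector from Question 21
x <- c(1, 2, 3, 4, 5)

n_elements <- length(x)  # number of elements in the vector
n_rows     <- nrow(x)    # NULL: a plain vector has no row dimension
total      <- sum(x)     # sum of the elements
spread     <- sd(x)      # sample standard deviation of the elements
```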
Question 22
Write the R code to create a new variable called `result` and assign to it the sum of 5 and 7.
`result <- 5 + 7`
Explanation: In R, you use the `<-` operator to assign values to variables. Here, the expression `5 + 7` calculates the sum, which is then assigned to the variable `result`.
Question 23
Given the data.frame `df` with variables `age` and `name`, which of the following expressions returns a vector containing the values in the `age` variable?
- `df:age`
- `df::age`
- `df$age`
- Both b and c
- `df$age`
Explanation: In R, the `$` operator is used to access specific variables (columns) within a data frame. The `df$age` expression will return the `age` column from `df`. The other options use invalid syntax.
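A minimal sketch of column extraction, assuming a small made-up data.frame with the two variables named in the question (the values are illustrative only). It also shows `[[`, a base R alternative that returns the same vector.

```r
# Hypothetical data.frame with the two variables named in the question
df <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age  = c(25, 30, 35)
)

ages_dollar  <- df$age        # `$` extracts the column as a vector
ages_bracket <- df[["age"]]   # `[[` with the column name does the same

is.vector(ages_dollar)        # TRUE: the result is a plain numeric vector
```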
Question 24
The expression `as.numeric("123")` will return the numeric value 123.
- True
- False
- True
Explanation: The `as.numeric()` function in R converts character data into numeric format, so `"123"` will correctly be converted to the numeric value 123.
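The conversion can be checked directly. The sketch below also shows, as an added illustration, what happens when the string does not represent a number.

```r
num <- as.numeric("123")   # character "123" becomes the number 123
class(num)                 # "numeric"

# A string that does not look like a number cannot be converted:
# R returns NA and issues a warning, silenced here with suppressWarnings()
not_num <- suppressWarnings(as.numeric("abc"))
is.na(not_num)             # TRUE
```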
Question 25
What is the result of the expression `(4 + 3) ^ 2` in R?
- 3.5
- 9
- 14
- 49
- 49
Explanation: The expression `(4 + 3) ^ 2` first adds 4 and 3 to get 7, then raises 7 to the power of 2, which results in 49.
Question 26
Given vectors `a <- c(1, 2, 3)` and `b <- c(4, 5, 6)`, what is the result of `a + b`?
- `c(5, 7, 9)`
- `c(4, 5, 6, 1, 2, 3)`
- `c(1, 2, 3, 4, 5, 6)`
- Error
- `c(5, 7, 9)`
Explanation: In R, adding two vectors element-wise results in a new vector where each element is the sum of corresponding elements. So, `a + b` results in `c(1 + 4, 2 + 5, 3 + 6)` or `c(5, 7, 9)`.
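The sketch below contrasts element-wise addition with concatenation, which is what the third option describes. The object names are chosen here for illustration.

```r
a <- c(1, 2, 3)
b <- c(4, 5, 6)

elementwise  <- a + b     # adds matching positions: 1 + 4, 2 + 5, 3 + 6
concatenated <- c(a, b)   # joins the two vectors end to end instead
```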
Question 27
Which of the following functions is part of the tidyverse package and is used to read a CSV file into a data.frame?
- `read.csv()`
- `read_csv()`
- `read.table()`
- `load()`
- `read_csv()`
Explanation: `read_csv()` is a function from the readr package, which is part of the tidyverse. It is typically faster than base R's `read.csv()`. `read.csv()` is from base R, `read.table()` is used to read data from a text file, and `load()` is used for loading R-specific binary files.
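A rough sketch of the two CSV readers side by side. The file contents are made up for illustration, and the readr call runs only if that package is installed, which is an assumption about your setup.

```r
# Write a small CSV to a temporary file so the example is self-contained
# (file contents are made up for illustration)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,score", "1,85", "2,90"), csv_path)

# Base R reader: returns a data.frame
df_base <- read.csv(csv_path)

# readr reader: returns a tibble, which is also a data.frame;
# skipped when readr is not installed
if (requireNamespace("readr", quietly = TRUE)) {
  df_readr <- readr::read_csv(csv_path)
}
```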
Question 28
To use the function `skim()` from the `skimr` package, you first need to load the package using the R code ________.
- `library(skimr)`
- `load(skimr)`
- `skimr`
- `skimr::skim`
- `library(skimr)`
Explanation: The `library()` function is used to load R packages that are installed. To use the `skim()` function, you need to load the `skimr` package with `library(skimr)`.
Question 29
The `filter()` function can use both logical operators like `&` and comparison operators like `>` within the same logical condition.
- True
- False
- True
Explanation: The `filter()` function from the dplyr package can combine multiple conditions using logical operators (`&`, `|`, etc.) and comparison operators (`>`, `<`, `==`) within the same logical statement.
Question 30
Consider the following data.frame `df0`:
x | y |
---|---|
1 | 4 |
2 | NA |
NA | 6 |
What is the result of `mean(df0$y)`?
- 4
- `NA`
- 5
- 6
- `NA`
Explanation: When calculating the mean of a vector that contains `NA` values, R returns `NA` by default. Since `df0$y` contains an `NA`, the result is `NA`. To ignore missing values when calculating the mean, set `na.rm = TRUE`, as in `mean(df0$y, na.rm = TRUE)`; `skim()` likewise reports the mean of the non-missing values.
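To make the default behavior concrete, this sketch computes the mean of the `y` values twice, once as written in the question and once with base R's `na.rm = TRUE` argument.

```r
# y column of df0 from Question 30
y <- c(4, NA, 6)

mean_default <- mean(y)                # NA: the missing value propagates
mean_no_na   <- mean(y, na.rm = TRUE)  # 5: mean of the non-missing values 4 and 6
```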
Questions 31-32
Consider the following data.frame `df` for Questions 31-32:
id | name | age | score |
---|---|---|---|
1 | Alice | 25 | 85 |
2 | Bob | 30 | 90 |
3 | Charlie | 35 | 75 |
4 | David | NA | 80 |
5 | Eve | 45 | NA |
Question 31
Which of the following code snippets keeps observations where `score` is between 80 and 90 inclusive?
- `df |> filter(score > 80 & score < 90)`
- `df |> filter(score >= 80 & score <= 90)`
- `df |> filter(score >= 80 | score <= 90)`
- `df |> filter(score > 80 | score < 90)`
- `df |> filter(score >= 80 & score <= 90)`
Explanation: The `filter()` function in `tidyverse` keeps rows that meet certain conditions. Keeping scores between 80 and 90 inclusive requires `>=` and `<=` joined by `&`. Option a is incorrect because it excludes 80 and 90, while options c and d use `|` (OR), which keeps scores that are either above the lower bound or below the upper bound, a condition that every non-missing score satisfies.
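The following sketch rebuilds the data.frame shown for Questions 31-32 and runs three of the options, assuming dplyr is installed. The object names `inclusive`, `exclusive`, and `either` are chosen here for illustration.

```r
library(dplyr)

# The data.frame shown for Questions 31-32
df <- data.frame(
  id    = 1:5,
  name  = c("Alice", "Bob", "Charlie", "David", "Eve"),
  age   = c(25, 30, 35, NA, 45),
  score = c(85, 90, 75, 80, NA)
)

inclusive <- df |> filter(score >= 80 & score <= 90)  # keeps scores 85, 90, and 80
exclusive <- df |> filter(score > 80 & score < 90)    # keeps only score 85
either    <- df |> filter(score >= 80 | score <= 90)  # keeps every non-missing score

# filter() drops rows where the condition is NA, so Eve (score NA) never appears
```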
Question 32
Which of the following expressions correctly keeps observations from `df` where the `age` variable has missing values?
- `df |> filter(is.na(age))`
- `df |> filter(!is.na(age))`
- `df |> filter(age == NA)`
- `df |> filter(age != NA)`
- Both a and c
- Both b and d
- `df |> filter(is.na(age))`
Explanation: The function `is.na()` checks for missing values (`NA`). To filter observations with missing values in `age`, we use `filter(is.na(age))`. Option c is incorrect because `age == NA` evaluates to `NA` for every row rather than `TRUE` or `FALSE`, so `filter()` keeps no rows (use `is.na()` instead). Option b filters for non-missing values, which is not what the question asks.
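The difference between comparing with `NA` and testing with `is.na()` can be seen in base R alone. This sketch uses the `age` values from the table for Questions 31-32.

```r
# age column from the data.frame used in Questions 31-32
age <- c(25, 30, 35, NA, 45)

cmp_result <- age == NA     # NA for every element, never TRUE
na_flags   <- is.na(age)    # TRUE only where the value is missing

which(na_flags)             # position of the missing value
```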
Question 33
The `arrange()` function can sort data based on multiple variables.
- True
- False
- True
Explanation: The `arrange()` function from dplyr can sort data based on one or more variables. If multiple variables are specified, the data is sorted by the first variable, and in the case of ties, by the second, and so on.
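A small sketch of tie-breaking with two sort keys, assuming dplyr is installed. The data.frame `people` and its values are hypothetical and were chosen so that `age` contains ties.

```r
library(dplyr)

# Hypothetical data with ties in age
people <- data.frame(
  name  = c("Ann", "Ben", "Cara", "Dan"),
  age   = c(30, 25, 30, 25),
  score = c(70, 90, 85, 60)
)

# Sort by age first; within the same age, sort by score from high to low
sorted <- people |> arrange(age, desc(score))
sorted$name   # Ben, Dan (age 25), then Cara, Ann (age 30)
```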
Question 34
Consider the following data.frame `df3`:
id | value |
---|---|
1 | 10 |
2 | 20 |
2 | 20 |
3 | 30 |
4 | 40 |
4 | 40 |
5 | 50 |
Which of the following code snippets returns a data.frame of unique `id` values from `df3`?
- `df3 |> select(id) |> distinct()`
- `df3 |> distinct(value)`
- `df3 |> distinct(id)`
- Both a and c
- Both a and c
Explanation: The `distinct()` function in `tidyverse` removes duplicate observations. Both `df3 |> select(id) |> distinct()` and `df3 |> distinct(id)` will return a data.frame with unique `id` values. Option b operates on the `value` variable, which is not relevant for this question.
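The sketch below rebuilds `df3` from the table and checks that both accepted snippets give the same `id` column, assuming dplyr is installed. The result names `via_select` and `via_distinct` are chosen here for illustration.

```r
library(dplyr)

# df3 as shown in the table for Question 34
df3 <- data.frame(
  id    = c(1, 2, 2, 3, 4, 4, 5),
  value = c(10, 20, 20, 30, 40, 40, 50)
)

via_select   <- df3 |> select(id) |> distinct()  # keep id, then drop duplicate rows
via_distinct <- df3 |> distinct(id)              # distinct() on id keeps only that column

identical(via_select$id, via_distinct$id)        # TRUE
```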
Question 35
Which of the following code snippets correctly renames the variable `age` to `years` in `df`?
- `df |> rename(years = age)`
- `df |> rename(age = years)`
- `df |> rename("age" = "years")`
- `df |> rename_variable(age = years)`
- `df |> rename(years = age)`
Explanation: In tidyverse, the correct syntax for renaming a variable is `rename(new_name = old_name)`. Hence, to rename `age` to `years`, you need to use `rename(years = age)`.
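A brief sketch of the `new_name = old_name` order, assuming dplyr is installed and using a cut-down, illustrative version of `df`.

```r
library(dplyr)

# Cut-down version of df for illustration
df <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age  = c(25, 30, 35)
)

renamed <- df |> rename(years = age)  # new name on the left, old name on the right
names(renamed)                        # "name" "years"
```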
Question 36
Which of the following code snippets correctly removes the `age` variable from `df`?
- `df |> select(-age)`
- `df |> select(-"age")`
- `df |> select(!age)`
- `df |> select(, -age)`
- `df |> select(desc(age))`
- a. `df |> select(-age)` and b. `df |> select(-"age")` (a is preferred, though)
Explanation: To remove a variable in tidyverse, you can use `select()` with the minus sign `-` before the variable name. Option a correctly removes `age`. Option b works but has less favorable syntax (`-"age"`). In current versions of dplyr, option c (`select(!age)`) also drops `age`, because `!` takes the complement of a selection; options d and e use invalid approaches.
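A minimal sketch of dropping a column with the minus sign, assuming dplyr is installed and using a cut-down, illustrative version of `df`.

```r
library(dplyr)

# Cut-down version of df for illustration
df <- data.frame(
  id   = 1:3,
  name = c("Alice", "Bob", "Charlie"),
  age  = c(25, 30, 35)
)

without_age <- df |> select(-age)  # keep every column except age
names(without_age)                 # "id" "name"
```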
Question 37
Which of the following code snippets filters observations where `age` is not `NA`, then arranges them in descending order of `age`, and then selects the `name` and `age` variables?
- `df |> filter(!is.na(age)) |> arrange(desc(age)) |> select(name, age)`
- `df |> select(name, age) |> arrange(desc(age)) |> filter(!is.na(age))`
- `df |> arrange(desc(age)) |> filter(!is.na(age)) |> select(name, age)`
- `df |> filter(is.na(age)) |> arrange(age) |> select(name, age)`
- `df |> filter(!is.na(age)) |> arrange(desc(age)) |> select(name, age)`
Explanation: This sequence of operations first filters observations where `age` is not missing (`!is.na(age)`), arranges them in descending order of `age`, and then selects the `name` and `age` variables. The other options either mix up the sequence or apply incorrect filtering.
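The three steps can be run on the data.frame from Questions 31-32, assuming dplyr is installed. The sketch writes each step of the pipeline on its own line.

```r
library(dplyr)

# The data.frame shown for Questions 31-32
df <- data.frame(
  id    = 1:5,
  name  = c("Alice", "Bob", "Charlie", "David", "Eve"),
  age   = c(25, 30, 35, NA, 45),
  score = c(85, 90, 75, 80, NA)
)

result <- df |>
  filter(!is.na(age)) |>   # step 1: drop David, whose age is missing
  arrange(desc(age)) |>    # step 2: oldest first
  select(name, age)        # step 3: keep only name and age

result$name                # Eve, Charlie, Bob, Alice
```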
Section 5. Short Essay
Question 39
In R, what does the function `sd(x)` compute, and why can it be more useful than `var(x)`?
The function `sd(x)` computes the standard deviation of the values in `x`, which is a measure of how far each data point deviates from the mean, on average. It is often more useful than `var(x)` (which computes the variance) because standard deviation is in the same unit as the data, making it easier to interpret and compare across datasets.
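The relationship between the two functions can be checked numerically. The sketch below uses hypothetical scores measured in points, so the variance is in squared points and the standard deviation is in points.

```r
# Hypothetical exam scores, measured in points
x <- c(85, 90, 75, 80)

variance <- var(x)   # in squared points
std_dev  <- sd(x)    # in points, the same unit as x

all.equal(std_dev, sqrt(variance))   # TRUE: sd is the square root of the variance
```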