Midterm Exam I

Version A

Published

October 8, 2025

Section 1. Multiple Choice

Question 1

Which tool is an Integrated Development Environment (IDE) that you can install on your computer to develop programs primarily using the R programming language?

  1. Posit Cloud
  2. Google Colab
  3. Jupyter Notebook
  4. RStudio

d

Explanation: RStudio (now under Posit) is a dedicated desktop IDE for R that provides a console, script editor, package management, plotting, and a full debugging workflow. Posit Cloud and Google Colab are browser-based environments (great for portability, but not desktop IDEs), and Jupyter Notebook is a notebook interface mainly used with Python (though R kernels exist, it is not the primary R IDE).

Question 2

Which version-control tool allows users to track, compare, and merge code changes?

  1. GitHub
  2. Git
  3. R
  4. Stack Overflow

b

Explanation: Git is the version-control system that performs the actual tracking of changes, branching, merging, and diffing. GitHub is a hosting platform built around Git repositories (remote collaboration, pull requests, issues). R is a programming language, and Stack Overflow is a Q&A site.

Question 3

A key distinction between Traditional Programming and Machine Learning (ML) in the context of image recognition (like for a cat) is that ML:

  1. Requires programmers to explicitly define rules for “cat” features (pointy ears, whiskers).
  2. Is less effective at finding hidden patterns in pixels than traditional methods.
  3. Learns the patterns from thousands of labeled cat/not-cat pictures rather than relying on fixed, explicit rules.
  4. Can be used for image recognition but not for classification tasks like spam detection.

c

Explanation: In ML, models discover patterns from labeled data (e.g., many cat vs. not-cat images) rather than relying on hand-crafted rules. Traditional programming would require explicit feature rules (option a). ML often outperforms manual rules on high-dimensional pattern recognition (contradicting b) and is widely used for both image recognition and many other classification tasks such as spam detection (contradicting d).

Question 4

What was the key contribution of Moneyball to sports analytics?

  1. It created new baseball rules for team selection.
  2. It replaced human coaches with AI models.
  3. It popularized data-driven decision-making in sports management.
  4. It eliminated the use of scouts.

c

Explanation: Moneyball demonstrated that rigorous statistical analysis (e.g., on-base percentage) could uncover undervalued players and improve roster-building decisions. It did not change MLB rules, eliminate scouts, or replace coaches with AI; instead, it shifted culture toward evidence-based decision-making.

Question 5

Which statement best captures the distinctive role of Business Intelligence (BI) compared to other data-driven systems?

  1. BI focuses on automating decisions through artificial intelligence and predictive modeling.
  2. BI emphasizes descriptive and diagnostic insights that help managers understand why outcomes occurred.
  3. BI replaces human judgment by forecasting business outcomes using statistical learning.
  4. BI primarily stores large volumes of unprocessed data for future retrieval.

b

Explanation: BI primarily supports descriptive and diagnostic analytics (dashboards, reporting, drill-downs) to help stakeholders understand what happened and why. Predictive/automated decision-making is more aligned with data science and ML (contradicting a and c), and large-scale raw data storage is the role of data lakes/warehouses (contradicting d).

Question 6

What distinguishes Deep Learning from general Machine Learning?

  1. Deep learning is the overarching field of AI, while machine learning is just a sub-area.
  2. Deep learning is an advanced machine learning methodology that is uniquely suited for complex tasks and primarily relies on artificial neural networks.
  3. Deep learning models are capable of making predictions or generating outputs, while machine learning models are not.
  4. Deep learning is the practice of designing structured inputs to guide generative AI systems.

b

Explanation: Deep learning is a subfield of machine learning that uses multi-layer (deep) neural networks, often excelling on unstructured data (images, audio, text). ML as a whole includes many methods (decision trees, regression models). Options a and c invert relationships, and d describes prompt engineering, not deep learning.

Question 7

In the context of a Large Language Model (LLM), what is the function of Pre-training?

  1. To specialize the model for a specific task like medical Q&A or legal summarization.
  2. To have humans rank or score model answers to align them with preferences.
  3. To read a vast amount of text to learn general language patterns, resulting in a foundation model with broad knowledge.
  4. To reduce model biases by explicitly removing skewed data from the training corpus.

c

Explanation: Pre-training exposes the model to large corpora so it learns broad statistical patterns of language (syntax, semantics, world knowledge). Fine-tuning specializes models for specific tasks (a), RLHF uses human preference signals (b), and while data curation can address some biases (d), that is not the core function of pre-training.

Question 8

Which of the following best describes an “embedding” in the context of GPT models?

  1. A single number representing a word’s position in a sentence.
  2. A long list of numbers that captures a word’s meaning and allows words with similar meanings to have similar numerical representations.
  3. The final text output generated by the decoder.
  4. A specific part of the training dataset used for fine-tuning.

b

Explanation: An embedding is a dense vector representation capturing semantic relationships among tokens; similar meanings tend to have nearby vectors. It is not a single scalar (a), not the model’s generated text (c), and not a dataset subset (d).

Section 2. Fill-in-the-Blanks

Question 9

The practice of designing clear, structured inputs to guide generative AI systems toward accurate, useful, and context-appropriate outputs is known as _________________________.

prompt engineering

Explanation: Prompt engineering focuses on crafting instructions, constraints, and context so generative models produce reliable, relevant outputs. It includes strategies like role prompting, exemplars, delimiters, and stepwise guidance.

Question 10

A ________________________________ is the smallest unit of text a Large Language Model (LLM) processes, which can be a single character, a whole word, or a part of a word.

token

Explanation: LLMs operate on tokens, which result from tokenization rules; in English they often correspond to subwords. Token granularity affects sequence length limits and cost/performance tradeoffs.

Question 11

In RLHF, ________________________________ is used to incorporate human preferences and guide the model’s behavior.

a reward model

Explanation: RLHF first trains a reward model using human preference data, such as ranked outputs. The language model is then further adjusted to produce responses that score higher under this reward model, aligning its behavior with human judgments.

Question 12

In GPT, ________________________________ encodes the order of words in a sentence.

positional encoding

Explanation: Transformers do not naturally track the order of tokens, so positional encodings inject sequence-order information, allowing the model to distinguish where each token appears and attend to relative relationships properly.

Question 13

Before the ________________________________, other language models struggled; the ________________________________ solved these issues by utilizing a(n) ________________________________, which allows the AI to concentrate on the most relevant parts of a text.

transformer; transformer; attention mechanism

Explanation: The transformer architecture uses self-attention instead of step-by-step processing, allowing the model to directly compare all tokens in a sequence at once. By focusing computational weight on the most relevant tokens, it improves both performance and scalability.

Section 3. Data Analysis with R

Question 14

You are working in R and want to use the table1 dataset that is included in the tidyr package. You have not yet loaded tidyr with library(), although you loaded the tidyverse package earlier in your session. To ensure that you explicitly use the dataset from tidyr, which command is the most reliable way to access it?

  1. table1
  2. tidyverse::table1
  3. tidyr$table1
  4. tidyr::table1
  5. library(tidyr::table1)

d

Explanation: Using the namespace operator tidyr::table1 accesses the object explicitly from the tidyr package without attaching it. tidyverse::table1 is invalid because tidyverse is a meta-package and does not export table1. tidyr$table1 is not how exported objects are accessed, and library(tidyr::table1) is invalid syntax.
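A minimal sketch of the two access patterns (assuming tidyr is installed):

# explicit namespace access: no library() call needed
tidyr::table1

# attaching the package first also works, but is less explicit
library(tidyr)
table1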

Question 15

The comment character in R is _______________________________, and the keyboard shortcut to add or remove a comment in Posit Cloud on a Windows or Mac machine is ______________________________________________.

#; Windows: Ctrl+Shift+C, Mac: Cmd+Shift+C

Explanation: Lines beginning with # are treated as comments in R. In RStudio/Posit Cloud, the toggle-comment shortcut applies or removes # for the current line or selected block on both Windows (Ctrl+Shift+C) and macOS (Cmd+Shift+C).

Questions 16-17

Consider the following two vectors, a and b:

a <- c(5, 20)
b <- c(20, 5)

Question 16

What does a + b return?

  1. c(5, 20, 5)
  2. c(5, 20, 20, 5)
  3. c(25, 25, 25, 25)
  4. c(25, 25, 25)
  5. c(25, 25)

e

Explanation: Vector addition in R operates element-wise. So the operation calculates c(5+20, 20+5) = c(25, 25). The result has the same length as the input vectors, and each element is the sum of the corresponding elements from a and b.

Question 17

What does sqrt( a / b ) return?

c(0.5, 2)

Explanation: First, R performs element-wise division: a / b returns c(5/20, 20/5), i.e., c(0.25, 4). Then sqrt() is applied to each element, yielding c(0.5, 2). This demonstrates how vectorized operations compose in R.
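Both results can be verified in the console:

a <- c(5, 20)
b <- c(20, 5)

a + b        # c(25, 25): element-wise sum
a / b        # c(0.25, 4): element-wise division
sqrt(a / b)  # c(0.5, 2): square root applied to each element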

Questions 18-19

Suppose you create a factor variable, major:

major <- as.factor(c("ECON", "DANL", "ECON", "MGMT", "DANL"))

Question 18

What does levels(major) return?

c("DANL", "ECON", "MGMT")

Explanation: When converting a character vector to a factor, R extracts unique values and sorts them alphabetically by default. Therefore, the levels are “DANL”, “ECON”, and “MGMT”, regardless of the original order in the input.

Question 19

What does nlevels(major) return?

3

Explanation: Since the factor has three unique categories (levels), R returns 3 when calling nlevels(). This function counts the number of unique factor levels, not the number of values.
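Both answers can be checked directly:

major <- as.factor(c("ECON", "DANL", "ECON", "MGMT", "DANL"))

levels(major)   # "DANL" "ECON" "MGMT": unique values, sorted alphabetically
nlevels(major)  # 3: the number of levels
length(major)   # 5: the number of values, for contrast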

Question 20

Suppose the absolute pathname for the CSV file custdata.csv uploaded to your Posit Cloud project is:

/cloud/project/mydata/custdata.csv

The working directory for your Posit Cloud project is:

/cloud/project

Using the file’s relative pathname, write R code to read the CSV file as a data.frame and assign it to an object named df.

df <- read_csv("mydata/custdata.csv")

Explanation: Since the working directory is /cloud/project, the relative path to the CSV file omits /cloud/project and starts from the next folder level: “mydata/custdata.csv”. readr’s read_csv() (loaded with the tidyverse) reads the file, and the result is assigned to the object df; base R’s read.csv() with the same path would also work.
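Either reader accepts the same relative path; a minimal sketch (read_csv() assumes readr or the tidyverse is loaded):

# tidyverse reader: returns a tibble, which is a data.frame
library(readr)
df <- read_csv("mydata/custdata.csv")

# base R equivalent: returns a plain data.frame, no package needed
df <- read.csv("mydata/custdata.csv")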

Question 21

Consider the following data.frame df0:

x y
1 7
NA 2
3 NA

What does is.na(df0$x * df0$y) return?

  1. c(FALSE, TRUE, FALSE)
  2. c(FALSE, FALSE, TRUE)
  3. c(FALSE, FALSE, FALSE)
  4. c(FALSE, TRUE, TRUE)
  5. Error

d

Explanation: R multiplies element-wise, and NA propagates through arithmetic:
- Row 1: 1 * 7 = 7 → not NA → FALSE
- Row 2: NA * 2 = NA → TRUE
- Row 3: 3 * NA = NA → TRUE

Thus, is.na() returns c(FALSE, TRUE, TRUE).
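A sketch reproducing the result, with df0 rebuilt from the table above:

df0 <- data.frame(x = c(1, NA, 3),
                  y = c(7, 2, NA))

df0$x * df0$y           # c(7, NA, NA): NA propagates through arithmetic
is.na(df0$x * df0$y)    # c(FALSE, TRUE, TRUE)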

Questions 22-23

Consider the following data.frame df for Questions 22-23:

id name age score
1 Anna 22 90
2 Ben 28 85
3 Carl NA 95
4 Dana 35 NA
5 Ella 40 80

Question 22

Which of the following code snippets filters observations where score is strictly between 85 and 95 (i.e., excluding 85 and 95)?

  1. df |> filter(score >= 85 | score <= 95)
  2. df |> filter(score => 85 | score =< 95)
  3. df |> filter(score > 85 | score < 95)
  4. df |> filter(score > 85 & score < 95)
  5. df |> filter(score >= 85 & score <= 95)
  6. df |> filter(score => 85 & score =< 95)

d

Explanation: Strictly between 85 and 95 means score > 85 and score < 95. The & operator enforces both conditions simultaneously. Option c uses | which would include values less than 85 or greater than 95, and other options misuse syntax or include boundary values.
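A sketch with df rebuilt from the table above (dplyr provides filter()):

library(dplyr)

df <- data.frame(id = 1:5,
                 name = c("Anna", "Ben", "Carl", "Dana", "Ella"),
                 age = c(22, 28, NA, 35, 40),
                 score = c(90, 85, 95, NA, 80))

df |> filter(score > 85 & score < 95)  # keeps only Anna (score 90)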

Question 23

Which of the following expressions correctly keeps observations from df where the age variable does not have any missing values?

  1. df |> filter(is.na(age))
  2. df |> filter(!is.na(age))
  3. df |> filter(age == NA)
  4. df |> filter(age != NA)
  5. Both a and c
  6. Both b and d

b

Explanation: To filter out missing values, use !is.na(age). Using == NA or != NA does not work for missing values in R because NA represents an unknown and cannot be compared directly.
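Continuing with the df built in the sketch for Question 22:

df |> filter(!is.na(age))  # drops Carl's row, where age is NA
df |> filter(age == NA)    # zero rows: NA == NA evaluates to NA, never TRUE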

Questions 24–25

Consider the following data.frame flights_df for Questions 24–25:

origin dest dep_delay
JFK LAX 10
LGA ORD 45
JFK LAX 5
EWR MIA 60
LGA ORD 45
EWR SEA 20

The data type of each variable is as follows:
- origin: character
- dest: character
- dep_delay: numeric

Question 24

Which of the following code snippets arranges the observations first by origin in ascending order, and then by dep_delay in descending order?

  1. flights_df |> arrange(origin, -dep_delay)
  2. flights_df |> arrange(origin, desc(dep_delay))
  3. flights_df |> arrange(desc(origin), dep_delay)
  4. flights_df |> arrange(desc(origin), desc(dep_delay))
  5. Both a and b

e

Explanation: Both -dep_delay and desc(dep_delay) sort in descending order. When combined with origin in ascending order (default), both (a) and (b) produce the correct ordering. Options (c) and (d) incorrectly sort origin in descending order.
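A sketch with flights_df rebuilt from the table above (dplyr still loaded):

flights_df <- data.frame(
  origin    = c("JFK", "LGA", "JFK", "EWR", "LGA", "EWR"),
  dest      = c("LAX", "ORD", "LAX", "MIA", "ORD", "SEA"),
  dep_delay = c(10, 45, 5, 60, 45, 20)
)

flights_df |> arrange(origin, desc(dep_delay))
# EWR rows first (60, then 20), then JFK (10, 5), then LGA (45, 45)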

Question 25

Which of the following expressions correctly returns all unique origin–destination combinations from flights_df?

  1. flights_df |> distinct(origin, dest)
  2. flights_df |> select(origin, dest)
  3. flights_df |> filter(!is.na(origin), !is.na(dest))
  4. flights_df |> distinct()
  5. Both a and d

a

Explanation: distinct(origin, dest) explicitly returns the unique combinations of those two columns. distinct() with no arguments deduplicates across all variables, so the two JFK–LAX rows both survive (their dep_delay values differ) and the result is not limited to unique origin–dest pairs. select() only extracts variables and does not remove duplicates.
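Continuing with the flights_df built above:

flights_df |> distinct(origin, dest)  # 4 rows: JFK-LAX and LGA-ORD deduplicated
flights_df |> distinct()              # 5 rows: only the repeated LGA-ORD row drops;
                                      # the JFK-LAX rows differ in dep_delay
flights_df |> select(origin, dest)    # 6 rows: duplicates are kept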

Question 26

Which of the following code snippets correctly renames the variable score in df to exam_score?

  1. df |> rename(score = exam_score)
  2. df |> rename(exam_score = score)
  3. df |> rename("exam_score" = "score")
  4. df |> rename(df, exam_score = score)
  5. Both b and c

e (either b or c alone also earns full credit)

Explanation: The correct syntax in rename() is new_name = old_name. Both (b) and (c) follow this correctly (quoted names are allowed). Option (a) reverses the direction, and option (d) passes df twice, once via the pipe and once as an argument, which is invalid.
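A quick check of the new_name = old_name convention, using the df from Question 22's sketch:

df |> rename(exam_score = score)    # the score column is now named exam_score
# df |> rename(score = exam_score)  # option (a): errors, no column exam_score exists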

Question 27

Which of the following code snippets filters observations where age is 30 or older and score is below 90?

  1. df |> filter(age >= 30 | score < 90)
  2. df |> filter(age >= 30 & score < 90)
  3. df |> filter(age > 30 & score > 90)
  4. df |> filter(age >= 30, score < 90)
  5. Both b and d

e

Explanation: To enforce that both conditions must be satisfied simultaneously, use &. Option (a) uses |, which would allow observations where only one condition is met. Option (c) uses the wrong inequality for score. Option (d) is equivalent to option (b).
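The equivalence of (b) and (d) can be verified with the same df:

df |> filter(age >= 30 & score < 90)  # Ella only (age 40, score 80)
df |> filter(age >= 30, score < 90)   # identical: commas in filter() act as &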

Question 28

Using the nycflights13::flights data.frame, which of the following code snippets correctly counts how many unique destination airports (dest) exist for each origin airport?

a.

df <- nycflights13::flights |> 
  distinct(origin, dest)

df_EWR <- df |> filter(origin == "EWR")
df_JFK <- df |> filter(origin == "JFK")
df_LGA <- df |> filter(origin == "LGA")

nrow(df_EWR)
nrow(df_JFK)
nrow(df_LGA)

b.

df <- nycflights13::flights |> 
  filter(origin == "EWR" | origin == "JFK" | origin == "LGA") |> 
  distinct(dest)

nrow(df)

c.

df <- nycflights13::flights |> 
  filter(origin == "EWR" & origin == "JFK" & origin == "LGA") |> 
  distinct(dest)

nrow(df)

d.

df <- nycflights13::flights |> 
  distinct(dest)

nrow(df)

e. Both a and b

f. Both a and c

a

Explanation: Option (a) first extracts the unique (origin, dest) combinations and then counts the unique dest values for each origin separately. Option (b) pools all three origins together before counting distinct destinations, losing the per-origin breakdown. Option (c) filters with &, which can never be true because a flight has exactly one origin. Option (d) ignores origin entirely.
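For reference, a more compact grouped alternative (not among the answer options) produces the same per-origin counts in one step, assuming dplyr is loaded:

library(dplyr)

nycflights13::flights |>
  group_by(origin) |>
  summarise(n_dest = n_distinct(dest))
# one row per origin (EWR, JFK, LGA), each with its count of unique destinations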

Section 4. Short Essay

Question 29

Why is the median often preferred over the mean as a measure of central tendency when a dataset contains outliers?

The median is less sensitive to extreme values because it depends only on the middle position in an ordered dataset, not on the magnitude of all values. Outliers can pull the mean sharply in one direction, distorting the central tendency and giving a misleading picture of the “typical” value. In skewed distributions or datasets with extreme highs/lows, the median provides a more robust and representative summary. For this reason, analysts often use the median for income, housing price, or other economic data known to contain large outliers.
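A small numeric illustration in R, using hypothetical income values:

incomes <- c(40, 45, 50, 55, 1000)  # one extreme outlier, in $1,000s

mean(incomes)    # 238: pulled far upward by the single outlier
median(incomes)  # 50: depends only on the middle position, so it is unaffected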

Question 30

  • Define AI Alignment and explain why it is hard, referencing the failure mode of a “single-objective optimizer.”
  • Analyze why companies alone and governments alone cannot solve the alignment challenge, citing at least two reasons for each, and explain what is needed for an effective solution.

AI Alignment refers to the challenge of ensuring that powerful AI systems reliably act according to human values, ethical norms, and societal goals. It is difficult because advanced AI systems can optimize objectives in ways that technically satisfy a metric but violate human intent—this is the “single-objective optimizer” failure mode. When a model pushes one goal to an extreme without broader context or value constraints, it can generate harmful or unintended outcomes even while maximizing its target metric.

Why companies alone cannot solve it:
1. Companies face pressure to deploy quickly for competitive advantage, which may lead to cutting corners on long-term safety and value alignment.
2. Corporate profit incentives do not necessarily align with broader public welfare and ethical standards, especially in global contexts beyond their direct accountability.

Why governments alone cannot solve it:
1. Governments often lack the technical expertise and agility to regulate fast-moving AI developments effectively.
2. Global AI deployment crosses jurisdictions, and national policies cannot fully enforce alignment across private-sector labs or international competitors without global coordination.

What is needed:
An effective solution requires collaboration between researchers, companies, governments, and international institutions. This includes shared safety standards, transparent evaluation protocols, incentive structures that reward responsible development, and oversight mechanisms that span beyond national or corporate boundaries.
