Midterm Exam I

Version B

Published

October 8, 2025

Section 1. Multiple Choice

Question 1

Which tool is an Integrated Development Environment (IDE) that you can install on your computer to develop programs primarily using the Python programming language?

  a. Posit Cloud
  b. Google Colab
  c. Jupyter Notebook
  d. MATLAB

c

Explanation: Jupyter Notebook is a locally installable environment that lets users write, execute, and manage Python code interactively in code cells. Although its interface runs in a web browser, the application itself is installed on a local machine as a Python-focused IDE. Posit Cloud and Google Colab are cloud-based environments that require no installation, and MATLAB is a standalone IDE that is not primarily designed for Python development.

Question 2

Which combination correctly matches the tool with its role?

        Tool     Primary Role
  I     GitHub   Code sharing
  II    RStudio  IDE for R
  III   Python   General-purpose language
  a. Only I and II
  b. Only II and III
  c. I, II, and III
  d. Only I and III

c

Explanation: GitHub is a platform for sharing and collaborating on code, RStudio is an IDE for R programming, and Python is a general-purpose programming language widely used for analytics, automation, AI, and more. Since all three pairings are correct, the correct answer is that I, II, and III are all correctly matched.

Question 3

Which of the following best describes the core mechanism by which a machine learning model predicts a new output, as described in the text?

  a. It receives explicit, pre-written instructions for every possible data combination.
  b. It identifies patterns by processing a large quantity of historical input/output data sets.
  c. It relies on human intervention to classify new data points in real time.
  d. It performs data filtering and sorting tasks without needing a modifiable math function.

b

Explanation: Machine learning models work by detecting statistical patterns in large amounts of historical input-output data. Through training, the model adjusts its internal mathematical parameters so that its output better matches expected results. Unlike traditional programming, the model does not require explicit rules for every case and does not depend on human supervision at inference time.
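For intuition only, here is a minimal R sketch of this mechanism; the data values are made up for illustration. Fitting a model "learns" its parameters from historical input/output pairs rather than from explicit rules:

# Fitting adjusts the model's parameters (intercept and slope) so its
# output matches the historical outputs as closely as possible.
x <- c(1, 2, 3, 4, 5)             # historical inputs
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)   # historical outputs
fit <- lm(y ~ x)                  # "training": estimate the parameters
predict(fit, newdata = data.frame(x = 6))  # predict a new output (~11.9)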

Question 4

Which of the following best describes the primary objective of sports analytics in modern organizations?

  a. Automating coaching and managerial decisions through advanced machine learning algorithms
  b. Collecting and analyzing data to generate actionable insights for both athletic performance and organizational strategy
  c. Using data exclusively to evaluate athlete recruitment and compensation decisions
  d. Measuring fan sentiment and satisfaction through periodic surveys and social-media monitoring

b

Explanation: Modern sports analytics goes beyond recruitment and includes performance tracking, injury prevention, game strategy, financial planning, and fan engagement—all informed by data analysis. The goal is not to automate decisions entirely but to provide coaches and managers with data-driven insights that support high-level strategy across both performance and business operations.

Question 5

Which of the following activities falls outside the primary scope of Business Intelligence (BI) as traditionally defined?

  a. Summarizing and reporting historical business performance through dashboards and KPIs
  b. Identifying data patterns and market trends that inform managerial decisions
  c. Providing automated prescriptive recommendations and executing future actions without human input
  d. Supporting strategic planning by visualizing performance metrics and uncovering inefficiencies

c

Explanation: Traditional BI focuses mainly on descriptive and diagnostic analytics—summarizing past performance, generating dashboards, and helping stakeholders interpret why certain outcomes occurred. Fully automated prescriptive systems that execute future actions without human involvement fall more under advanced analytics and AI/ML-driven decision automation, not classic BI.

Question 6

Which of the following best defines deep learning?

  a. Any algorithm that predicts future events using labeled data
  b. A rule-based expert system using human-written logic
  c. A subset of machine learning that uses multi-layered neural networks to model complex, unstructured data
  d. A database search algorithm optimized for large text corpora

c

Explanation: Deep learning refers to neural network architectures with multiple hidden layers designed to handle complex data such as images, audio, and natural language. It is a specialized branch of machine learning—not a rule-based system or a simple database method—and its strength comes from learning rich patterns directly from data.

Question 7

In the context of a Large Language Model (LLM), what is the function of Pre-training?

  a. The model is trained for specific tasks before deployment
  b. The model learns from large amounts of general text data before fine-tuning
  c. The model learns only through reinforcement learning with human feedback
  d. The model is trained only on labeled datasets

b

Explanation: Pre-training exposes the model to massive amounts of text so it can learn grammar, semantics, factual knowledge, and generalized language structure. Only after pre-training can the model be fine-tuned or aligned with human preferences. It does not only rely on labeled datasets or RLHF during this stage.

Question 8

What is the primary function of the positional encoding component in GPT's transformer architecture?

  a. To capture the meaning of words by turning them into numbers.
  b. To decide which words matter most to each other in context.
  c. To preserve the order of words in a sentence, as the transformer processes them in parallel.
  d. To produce the output text one token at a time.

c

Explanation: Transformers do not process words sequentially; they handle all tokens at once. Without positional encoding, the model would have no inherent sense of order. Positional encoding adds information about token position into the embeddings so that relationships like “first”, “next”, and “last” are preserved during computation.
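As a sketch only, the sinusoidal scheme from the original transformer paper can be computed in a few lines of R; d_model and the positions here are illustrative choices, not values from the course material:

# Each dimension pair uses sine/cosine waves of a different frequency,
# giving every position a unique numeric signature.
d_model   <- 8
positions <- 0:3
pe <- sapply(0:(d_model - 1), function(i) {
  angle <- positions / 10000^((2 * (i %/% 2)) / d_model)
  if (i %% 2 == 0) sin(angle) else cos(angle)
})
round(pe, 3)  # one row per position, one column per embedding dimension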

Section 2. Fill-in-the-Blanks

Question 9

The research paper “Attention Is All You Need” (2017) introduced the _____________________ architecture that powers modern large language models.

transformer

Explanation: The paper “Attention Is All You Need” introduced the transformer architecture, which relies entirely on self-attention mechanisms rather than older sequence-processing approaches. This design allowed for much more efficient training and laid the groundwork for modern large language models such as GPT and others.

Question 10

A ________________________________ is a numerical parameter in a neural network that determines the strength of a connection between neurons and is updated during training to improve the model’s accuracy.

weight

Explanation: Weights are adjustable parameters within a neural network. During training, the model updates these weights to reduce error and improve how closely its predictions match the expected outputs. Weights determine how strongly one neuron influences another.

Question 11

In LLMs, ________________________________ is the process of further improving a pretrained model using smaller, targeted datasets and human input to guide outputs, making the model more helpful, accurate, and better aligned with specific needs.

fine-tuning

Explanation: Fine-tuning adapts a pretrained foundation model by training it further on domain-specific data or using techniques such as human feedback ranking. This allows the model to specialize in tasks like legal summarization, customer service chat, or medical Q&A, while also improving alignment with human expectations.

Question 12

In GPT, the numerical representation that captures the meaning and relationships among words is called an ________________________________.

embedding

Explanation: An embedding is a vector representation that encodes semantic meaning and relationships between words or tokens. Words with similar meanings tend to have similar embeddings, enabling the model to generalize linguistic relationships in high-dimensional space.

Question 13

The philosophical concept of an AI becoming as smart, capable, and flexible as a human is called a(n) ________________________________. The moment a(n) ________________________________ surpasses human intelligence, it is referred to as a(n) ________________________________.

Artificial General Intelligence (AGI); AGI; Artificial Super Intelligence (ASI) (also accepted: technological singularity)

Explanation: Artificial General Intelligence refers to a system with human-level cognitive flexibility across tasks. If such an AGI grows beyond human intelligence and continues to improve itself at an accelerating rate, it transitions into Artificial Super Intelligence (ASI). The moment this rapid escalation surpasses human control or understanding is referred to as the technological singularity.

Section 3. Data Analysis with R

Question 14

Consider two packages, pkgA and pkgB, both of which contain a function named summarize(). You have not run library(pkgA) or library(pkgB). Which syntax is the recommended way to call the summarize() function specifically from pkgB?

  a. library(pkgB::summarize)
  b. pkgB$summarize()
  c. pkgB::summarize()
  d. summarize(pkgB)
  e. library("pkgB::summarize()")

c

Explanation: The namespace operator pkgB::summarize() calls the exported function from pkgB without attaching the whole package. Using library(pkgB::summarize) or library("pkgB::summarize()") is invalid; library() loads packages, not individual functions. The $ operator extracts components from lists and data.frames, not exported package functions.
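For a concrete example with a real package (assuming dplyr is installed but not attached), the same pattern calls its exported summarize() directly:

dplyr::summarize(mtcars, avg_mpg = mean(mpg))  # no library(dplyr) needed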

Question 15

The most popular assignment operator in R is ________________________, and the shortcut to type it in Posit Cloud on a Windows or Mac machine is ________________________.

<-; Windows: Alt + - | Mac: Option + -

Explanation: Although = also assigns, idiomatic R code uses the left arrow <-. In RStudio/Posit Cloud, the editor inserts <- with Alt+- (Windows) or Option+- (Mac), improving speed and consistency.

Questions 16-17

Consider the following two vectors, a and b:

a <- c(2, 4, 6, 8)
b <- c(1, 2, 2, 2)

Question 16

What does a * b return?

  a. c(3, 6, 9, 12)
  b. c(2, 4, 6, 8, 1, 2, 2, 2)
  c. c(1, 2, 4, 6)
  d. c(2, 2, 3, 4)
  e. c(2, 8, 12, 16)

e

Explanation: a * b = c(2*1, 4*2, 6*2, 8*2) = c(2, 8, 12, 16). R multiplies vectors element-wise when they are the same length. Each position in a is multiplied by the corresponding position in b.

Question 17

What does sum(a / b) return?

11

Explanation: sum(c(2/1, 4/2, 6/2, 8/2)) = sum(c(2, 2, 3, 4)) = 11. Division is also element-wise: a/b = (2, 2, 3, 4). Summing these values yields 11.
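Both results are easy to verify in the console:

a <- c(2, 4, 6, 8)
b <- c(1, 2, 2, 2)
a * b       # 2 8 12 16  (element-wise product)
sum(a / b)  # 11         (element-wise division, then sum)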

Questions 18-19

Suppose you create a factor variable, year_level_fct:

year_level <- c("Freshman", "Sophomore", "Junior", "Senior", 
                "Junior", "Senior", "Freshman")
year_level_fct <- as.factor(year_level)

Question 18

What does levels(year_level_fct) return?

c("Freshman", "Junior", "Senior", "Sophomore")

Explanation: By default, factor levels are the unique values sorted alphabetically. The unique class names are sorted to “Freshman”, “Junior”, “Senior”, “Sophomore”.

Question 19

What does nlevels(year_level_fct) return?

4

Explanation: There are four distinct class years. nlevels() counts unique factor levels, not the number of values.
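A quick console check for both questions:

year_level <- c("Freshman", "Sophomore", "Junior", "Senior",
                "Junior", "Senior", "Freshman")
year_level_fct <- as.factor(year_level)
levels(year_level_fct)   # "Freshman" "Junior" "Senior" "Sophomore"
nlevels(year_level_fct)  # 4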

Question 20

The working directory for your Posit Cloud project is:

/cloud/project

Suppose the relative pathname for the CSV file custdata.csv uploaded to your Posit Cloud project is:

mydata/custdata.csv

Using the file’s absolute pathname, write R code to read the CSV file as a data.frame and assign it to an object named df.

df <- readr::read_csv("/cloud/project/mydata/custdata.csv")

Explanation: An absolute pathname spells out the full path from the root directory, so the working directory /cloud/project is prepended to the relative pathname mydata/custdata.csv. The readr:: prefix calls read_csv() without attaching the package.
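Because the working directory is already /cloud/project, the relative pathname reads the same file (a sketch, assuming the readr package is installed):

df <- readr::read_csv("mydata/custdata.csv")  # resolved relative to /cloud/project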

Question 21

Consider the following data.frame df0:

   x   y
  NA   7
   2  NA
   3   9

What does is.na(df0$x * df0$y) return?

  a. c(TRUE, TRUE, TRUE)
  b. c(TRUE, TRUE, FALSE)
  c. c(TRUE, FALSE, TRUE)
  d. c(TRUE, FALSE, FALSE)
  e. Error

b

Explanation: Any arithmetic with NA returns NA, and is.na() checks for NA values element-wise:

- Row 1: NA * 7 = NA → TRUE
- Row 2: 2 * NA = NA → TRUE
- Row 3: 3 * 9 = 27 → not NA → FALSE

So the result is c(TRUE, TRUE, FALSE).
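A minimal reproduction in the console:

df0 <- data.frame(x = c(NA, 2, 3), y = c(7, NA, 9))
df0$x * df0$y         # NA NA 27
is.na(df0$x * df0$y)  # TRUE TRUE FALSE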

Questions 22-24

Consider the following data.frame df for Questions 22-24:

  id  name  age  score
   1  Anna   22     90
   2  Ben    28     85
   3  Carl   NA     95
   4  Dana   35     NA
   5  Ella   40     80

Question 22

Which of the following code snippets filters observations where score is strictly between 85 and 95 (i.e., excluding 85 and 95)?

  a. df |> filter(score >= 85 | score <= 95)
  b. df |> filter(score => 85 | score =< 95)
  c. df |> filter(score > 85 | score < 95)
  d. df |> filter(score > 85 & score < 95)
  e. df |> filter(score >= 85 & score <= 95)
  f. df |> filter(score => 85 & score =< 95)

d

Explanation: “Strictly between” means greater than 85 and less than 95, so use & to enforce both conditions simultaneously. Option c uses |, which keeps every observation with a non-missing score, since any value satisfies at least one of the two conditions.
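A quick check (assuming dplyr is attached; note that filter() also drops Dana, whose score is NA, because the condition evaluates to NA):

library(dplyr)

df <- data.frame(id    = 1:5,
                 name  = c("Anna", "Ben", "Carl", "Dana", "Ella"),
                 age   = c(22, 28, NA, 35, 40),
                 score = c(90, 85, 95, NA, 80))

df |> filter(score > 85 & score < 95)  # keeps only Anna (score 90)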

Question 23

Which of the following expressions correctly keeps observations from df where the age variable does not have any missing values?

  a. df |> filter(is.na(age))
  b. df |> filter(!is.na(age))
  c. df |> filter(age == NA)
  d. df |> filter(age != NA)
  e. Both a and c
  f. Both b and d

b

Explanation: The expression !is.na(age) returns only observations with valid numeric values. Comparisons like age == NA fail because NA cannot be compared using equality operators in R.
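Continuing with the df constructed in the snippet above:

df |> filter(!is.na(age))  # drops Carl (age is NA); keeps the other 4 rows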

Question 24

Which of the following code snippets correctly keeps only the name and score variables from df?

  a. df |> select(name, score)
  b. df |> select(-id, -age)
  c. df |> select("name", "score")
  d. df |> select(df, name, score)
  e. Both a and c

e (a, b, or c deserves full credit)

Explanation: Both select(name, score) and select("name", "score") return only those two variables. Option b removes id and age, which works here only because df contains exactly these four variables. Option d is invalid syntax; select() receives the data.frame from the pipe, not as an additional argument.
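Again with the same df, the accepted variants side by side:

df |> select(name, score)      # option a
df |> select(-id, -age)        # option b: same result here, but fragile
df |> select("name", "score")  # option c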

Questions 25-26

Consider the following data.frame inventory_df for Questions 25-26:

  Location     Item_SKU  Stock
  Warehouse A  X100        500
  Store B      Y200         10
  Warehouse A  X100        500
  Store C      Z300         75
  Store B      X100         50

The data type of each variable is:

- Location: character
- Item_SKU: character
- Stock: numeric

Question 25

Which of the following code snippets arranges the observations first by Location in ascending (alphabetical) order, and then by Stock in descending order to prioritize locations with the most stock?

  a. inventory_df |> arrange(Location, Stock)
  b. inventory_df |> arrange(Location, -Stock)
  c. inventory_df |> arrange(Location, desc(Stock))
  d. inventory_df |> arrange(desc(Location), desc(Stock))
  e. Both b and c

e

Explanation: Both desc(Stock) and -Stock sort Stock in descending order. Combined with the default ascending sort on Location, both (b) and (c) produce the desired result.
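A quick check (assuming dplyr is attached):

inventory_df <- data.frame(
  Location = c("Warehouse A", "Store B", "Warehouse A", "Store C", "Store B"),
  Item_SKU = c("X100", "Y200", "X100", "Z300", "X100"),
  Stock    = c(500, 10, 500, 75, 50)
)

inventory_df |> arrange(Location, desc(Stock))  # equivalent to -Stock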

Question 26

Which of the following expressions correctly removes the duplicate entry for the full observation (Warehouse A, X100, 500) to return all unique observations in inventory_df?

  a. inventory_df |> distinct(Location, Item_SKU)
  b. inventory_df |> select(-Item_SKU, -Location)
  c. inventory_df |> distinct()
  d. inventory_df |> arrange(Stock)
  e. Both a and c

c

Explanation: distinct() without specifying columns checks for duplicates across all variables. Option a would only check combinations of Location and Item_SKU, ignoring Stock. Option b drops variables, and (d) only reorders observations.
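Continuing with inventory_df from the snippet above:

inventory_df |> distinct()  # the duplicate (Warehouse A, X100, 500) row drops out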

Question 27

Which of the following code snippets returns a data.frame that keeps only unique combinations of name and score, renames score to final_score, and selects only these two variables?

  a. df |> distinct(name, score) |> rename(final_score = score)
  b. df |> select(name, final_score) |> rename(final_score = score) |> distinct()
  c. df |> rename(final_score = score) |> distinct(name, final_score)
  d. df |> distinct(name, final_score) |> rename(score = final_score)
  e. Both a and c

e

Explanation: Option (a) extracts the unique name–score pairs and then renames score to final_score. Option (c) renames first and then keeps the unique combinations of name and final_score, producing the same two-variable result. Therefore both (a) and (c) satisfy the requirement.
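Using the df from Questions 22-24, the two accepted pipelines side by side:

df |> distinct(name, score) |> rename(final_score = score)        # option a
df |> rename(final_score = score) |> distinct(name, final_score)  # option c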

Question 28

Using the nycflights13::flights data.frame, which of the following code snippets correctly counts how many unique destination airports (dest) exist for each origin airport?

a.

df <- nycflights13::flights |> 
  distinct(origin, dest)

df_EWR <- df |> filter(origin == "EWR")
df_JFK <- df |> filter(origin == "JFK")
df_LGA <- df |> filter(origin == "LGA")

nrow(df_EWR)
nrow(df_JFK)
nrow(df_LGA)

b.

df <- nycflights13::flights |> 
  filter(origin == "EWR" | origin == "JFK" | origin == "LGA") |> 
  distinct(dest)

nrow(df)

c.

df <- nycflights13::flights |> 
  filter(origin == "EWR" & origin == "JFK" & origin == "LGA") |> 
  distinct(dest)

nrow(df)

d.

df <- nycflights13::flights |> 
  distinct(dest)

nrow(df)

e. Both a and b

f. Both a and c

a

Explanation: Option (a) first extracts the unique (origin, dest) combinations and then counts the rows for each origin, giving the number of unique destinations per origin airport. Option (b) pools all three origins together before counting distinct destinations, losing the per-origin grouping. Option (c) filters with &, which can never be true because each flight has exactly one origin. Option (d) ignores origin entirely.
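For reference, a more compact equivalent (not among the options) that returns all three per-origin counts in one pipeline, assuming dplyr and nycflights13 are installed:

nycflights13::flights |>
  dplyr::distinct(origin, dest) |>
  dplyr::count(origin)  # one row per origin with its number of unique destinations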

Section 4. Short Essay

Question 29

Why is the median often preferred over the mean as a measure of central tendency when a dataset contains outliers?

The median is less sensitive to extreme values because it depends only on the middle position in an ordered dataset, not on the magnitude of all values. Outliers can pull the mean sharply in one direction, distorting the central tendency and giving a misleading picture of the “typical” value. In skewed distributions or datasets with extreme highs/lows, the median provides a more robust and representative summary. For this reason, analysts often use the median for income, housing price, or other economic data known to contain large outliers.

Question 30

  • Define AI Alignment and explain why it is hard, referencing the failure mode of a “single-objective optimizer.”
  • Analyze why companies alone and governments alone cannot solve the alignment challenge, citing at least two reasons for each, and explain what is needed for an effective solution.

AI Alignment refers to the challenge of ensuring that powerful AI systems reliably act according to human values, ethical norms, and societal goals. It is difficult because advanced AI systems can optimize objectives in ways that technically satisfy a metric but violate human intent—this is the “single-objective optimizer” failure mode. When a model pushes one goal to an extreme without broader context or value constraints, it can generate harmful or unintended outcomes even while maximizing its target metric.

Why companies alone cannot solve it:
1. Companies face pressure to deploy quickly for competitive advantage, which may lead to cutting corners on long-term safety and value alignment.
2. Corporate profit incentives do not necessarily align with broader public welfare and ethical standards, especially in global contexts beyond their direct accountability.

Why governments alone cannot solve it:
1. Governments often lack the technical expertise and agility to regulate fast-moving AI developments effectively.
2. Global AI deployment crosses jurisdictions, and national policies cannot fully enforce alignment across private-sector labs or international competitors without global coordination.

What is needed:
An effective solution requires collaboration between researchers, companies, governments, and international institutions. This includes shared safety standards, transparent evaluation protocols, incentive structures that reward responsible development, and oversight mechanisms that span beyond national or corporate boundaries.
