Midterm Exam I
Version C
Section 1. Multiple Choice
Question 1
Which platform provides a cloud-based environment for writing and executing Python notebooks without local installation?
- RStudio
- Jupyter Notebook
- Google Colab
- Cursor
c
Explanation:
Google Colab is a cloud-based platform that allows users to run Python notebooks without local installation.
Question 2
Which of the following best defines a dashboard in data analytics?
- A statistical model used to predict future business performance
- A database that stores large volumes of raw transactional data
- A programming language used for building web-based data systems
- A visual interface that displays key data, metrics, and trends for quick interpretation and decision-making
d
Explanation:
A dashboard is a visual interface showing key metrics and trends for decision-making.
Question 3
Which of the following best describes the goal of unsupervised learning in machine learning?
- To train a model using labeled input–output pairs to predict future outcomes
- To uncover hidden patterns or groupings in data without using predefined labels
- To optimize an agent’s actions through rewards and penalties over time
- To evaluate the accuracy of supervised models using cross-validation
b
Explanation:
Unsupervised learning uncovers hidden patterns in unlabeled data.
Question 4
In sports analytics, a decision tree model is trained to predict whether a football team will run or pass in the next play based on variables like Off_Pers, n_th_Down, and distance_to_next_down.
Off_Pers: Offensive Personnel (e.g., Value “11” meaning 1 running back, 1 tight end, and 3 wide receivers)
Which statement best describes how this model makes its predictions?
- It averages all input variables to estimate the probability of a run play.
- It groups plays into clusters of similar offensive formations without using the run/pass label.
- It calculates overall win probabilities using a single regression equation combining all predictors linearly.
- It divides the dataset into branches using threshold-based rules (e.g., “if distance_to_next_down < 5 yards”) to classify outcomes step by step.
d
Explanation:
A decision tree makes predictions by splitting data into branches using rule-based thresholds.
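For illustration only (not required for the answer), a minimal sketch of fitting such a tree in R with the rpart package, assuming a hypothetical plays data frame that contains a run/pass label play_type and the predictors named in the question:

# Illustrative sketch -- `plays` and `play_type` are hypothetical names.
library(rpart)
fit <- rpart(
  play_type ~ Off_Pers + n_th_Down + distance_to_next_down,
  data = plays,        # hypothetical play-by-play data frame
  method = "class"     # classification: run vs. pass
)
print(fit)             # the printed splits are threshold rules, e.g. distance_to_next_down < 5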
Question 5
Which of the following statements best reflects the “Co-Intelligence” principles for effective AI use discussed in class?
- Always let AI handle routine decisions independently to save time and reduce human error.
- Treat AI like a sentient collaborator that can verify facts and reason about truth on its own.
- Avoid using AI until the technology matures further, since today’s systems are unreliable and likely to be replaced soon.
- Use AI broadly for idea generation and analysis, but always review outputs critically, add human judgment, and document what works.
d
Explanation:
Co-Intelligence emphasizes using AI with human oversight, judgment, and reflection.
Question 6
Why are descriptive statistics important in data analytics?
- They make raw data more interpretable by summarizing key characteristics such as central tendency and variability.
- They automatically identify causal relationships between independent and dependent variables.
- They are mainly used for visualizing data, not for numerical summaries.
- They replace the need for further inferential or predictive analysis.
a
Explanation:
Descriptive statistics summarize key features of a dataset, making it easier to interpret.
Question 7
Which of the following best defines supervised learning?
- Algorithms that discover hidden patterns in unlabeled data
- Algorithms that optimize decisions by trial and error with rewards
- Algorithms trained using labeled input–output pairs
- Algorithms that cluster observations based on distance metrics
c
Explanation:
Supervised learning uses labeled input–output pairs for training predictive models.
Question 8
Which statement most accurately captures why the Transformer architecture revolutionized natural language processing?
- It eliminated the need for large datasets by using manually defined linguistic rules.
- It processes inputs strictly in order, ensuring each token depends on the one immediately before it.
- It encodes position information by sorting words alphabetically before training.
- It replaced sequential token processing with an attention mechanism that models relationships between all words in a sequence in parallel.
d
Explanation:
Transformers use attention to model relationships between all words in parallel.
Section 2. Filling-in-the-Blanks
Question 9
The class of models that generate images from text prompts by gradually transforming random noise into visual outputs are called ________________________________.
The newer family of models that can interpret and generate across text and vision (e.g., describing or creating images) are known as ________________________________ models.
Diffusion models; Multimodal models
Explanation:
Diffusion models generate images by denoising random noise; multimodal models work across text, images, and other modalities.
Question 10
An ________________________________ is an AI system that can plan, act, and learn autonomously using external tools and memory, turning a single prompt into a multi-step workflow.
AI agent
Explanation:
AI agents execute multi-step tasks autonomously using planning, tools, and memory.
Question 11
At large scales, LLMs can display unexpected skills—such as writing code or expressing creativity—that were never directly programmed.
These are known as ________________________________.
Emergent abilities
Explanation:
Emergent abilities arise when model scale enables capabilities not explicitly engineered.
Question 12
What does the acronym GPT stand for?
Provide both meanings covered in lecture:
1. ________________________________
2. ________________________________
- Generative Pre-trained Transformer
- General Purpose Technology
Explanation:
- Generative Pre-trained Transformer describes the model architecture and training method: it generates outputs, is pre-trained on large corpora, and uses the transformer architecture.
- General Purpose Technology refers to innovations with wide-reaching economic and technological influence, capable of transforming many sectors over time, similar to electricity or the internet.
Question 13
A neural network consists of interconnected nodes organized into layers.
The ________________________________ layer serves as the entry point for data,
the ________________________________ layers transform and learn internal representations,
and the ________________________________ layer produces the final prediction or result.
Input layer; hidden layers; output layer
Explanation:
Input → hidden → output is the standard neural network architecture.
Section 3. Data Analysis with R
Question 14
During an R session, a student loads both the dplyr and MASS packages.
Both contain a function named select().
After loading both, the student runs select(df, name, score) and gets an unexpected error.
Which explanation best describes what likely happened?
- The MASS package disables dplyr automatically when both are loaded.
- R randomly chooses between the two functions based on which package was installed most recently.
- The select() function can only be used after attaching the tidyverse metapackage.
- The dplyr version of select() was masked by the MASS version, so R is using the wrong function for data-frame column selection.
d
Explanation:
When both packages are attached, the select() from the package loaded last masks the one loaded earlier. If MASS was loaded after dplyr, a plain select() call dispatches to MASS::select(), which is not designed for data-frame column selection, so the call fails.
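As a quick illustration of one common fix (beyond the scope of the question), calling the function with an explicit namespace avoids relying on package load order:

# Qualify the call so the dplyr version is used regardless of masking.
dplyr::select(df, name, score)
# Base R can also report which objects on the search path are masked:
conflicts()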
Question 15
In R, what is the difference between a parameter and an argument in a function?
- A parameter is the actual input value passed into a function, while an argument is the variable name used inside the function.
- A parameter is defined in the function’s declaration, while an argument is the actual value supplied when the function is called.
- Both terms mean the same thing and can be used interchangeably in R.
- A parameter refers only to numeric inputs, while arguments refer to text inputs.
b
Explanation:
Parameters are the names defined in the function's declaration; arguments are the actual values supplied when the function is called.
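A minimal sketch of the distinction:

# `price` and `rate` are parameters (names in the function declaration);
# 100 and 0.07 are the arguments (values supplied in the call).
add_tax <- function(price, rate) {
  price * (1 + rate)
}
add_tax(100, rate = 0.07)   # returns 107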
Question 16
Using the native pipe in R (R ≥ 4.1), write two equivalent one-liners that return all unique values of a variable named var from a data frame called df.
Answer 1: ______________________________________________
Answer 2: ______________________________________________
Answer 1: df |> select(var) |> distinct()
Answer 2: df |> distinct(var)
Explanation:
Both return the unique values of var: the first pipes the data through select() and then distinct(), while the second passes the variable directly to distinct(), which is more concise.
Questions 17-18
Consider the following two vectors, x and y:
x <- c(10, 20, 30, 40, 50)
y <- c(1, 2, 3, 4, 5)
Question 17
What does x[ y > 3 ] return?
- c(3, 4, 5)
- c(TRUE, TRUE, TRUE, FALSE, FALSE)
- c(10, 20, 30)
- c(40, 50)
- c(30, 40, 50)
d
Explanation:
y > 3 evaluates to c(FALSE, FALSE, FALSE, TRUE, TRUE), so x[y > 3] keeps the last two elements of x: c(40, 50).
Question 18
What does sum( (x * y)[y < 3] ) return?
Answer: ______________________________________________
50
Explanation:
x * y = c(10, 40, 90, 160, 250); y < 3 selects the first two elements, so the sum is 10 + 40 = 50.
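A quick check of Questions 17–18 using the vectors given above:

x <- c(10, 20, 30, 40, 50)
y <- c(1, 2, 3, 4, 5)
x[y > 3]              # c(40, 50)
sum((x * y)[y < 3])   # 50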
Question 19
The working directory for your Posit Cloud project is:
/cloud/project
Suppose the relative pathname for the CSV file custdata.csv uploaded to your Posit Cloud project is:
mydata/custdata.csv
Using the file’s absolute pathname, write R code to read the CSV file as a data.frame with the readr package and assign it to an object named df_customers.
Answer: ______________________________________________
df_customers <- readr::read_csv("/cloud/project/mydata/custdata.csv")
Explanation:
The absolute pathname prepends the working directory /cloud/project to the relative path mydata/custdata.csv.
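For comparison (not what the question asks for), the same file could be read with the relative pathname, since the working directory is already /cloud/project:

# Equivalent read using the relative path
df_customers <- readr::read_csv("mydata/custdata.csv")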
Questions 20–21
Consider the following two vectors, submitted_1 and submitted_2:
submitted_1 <- c(TRUE, FALSE, NA, TRUE, NA, FALSE, TRUE)
submitted_2 <- c(TRUE, FALSE, TRUE, FALSE, TRUE)
Question 20
What does sum(!is.na(submitted_1)) return?
Answer: ______________________________________________
5
Explanation:
There are five non-NA values.
Question 21
What does sum(as.numeric(submitted_2)) return?
Answer: ______________________________________________
3
Explanation:
as.numeric() coerces TRUE to 1 and FALSE to 0, so the sum of c(1, 0, 1, 0, 1) is 3.
Questions 22–24
Consider the following data.frame students for Questions 22–24:
| sid | name | credits | gpa |
|---|---|---|---|
| 101 | Ava | 12 | 3.2 |
| 102 | Blake | 15 | 2.9 |
| 103 | Ava | NA | 3.8 |
| 104 | Diego | 18 | NA |
| 105 | Eli | 9 | 3.5 |
Question 22
Which code filters observations where gpa is between 3.0 and 3.7, inclusive?
- students |> filter(gpa >= 3.0 | gpa <= 3.7)
- students |> filter(gpa > 3.0 & gpa < 3.7)
- students |> filter(gpa > 3.0 | gpa < 3.7)
- students |> filter(gpa >= 3.0 & gpa <= 3.7)
- students |> filter(gpa => 3.0 & gpa =< 3.7)
d
Explanation:
The correct range-inclusive filter is gpa >= 3.0 & gpa <= 3.7.
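A minimal sketch to verify the answer, reconstructing the students data frame from the table above:

library(dplyr)
students <- data.frame(
  sid     = c(101, 102, 103, 104, 105),
  name    = c("Ava", "Blake", "Ava", "Diego", "Eli"),
  credits = c(12, 15, NA, 18, 9),
  gpa     = c(3.2, 2.9, 3.8, NA, 3.5)
)
# Keeps Ava (3.2) and Eli (3.5); rows with gpa outside [3.0, 3.7] or with gpa NA are dropped.
students |> filter(gpa >= 3.0 & gpa <= 3.7)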
Question 23
Which expression keeps only observations where credits is not missing or name is “Ava”?
- students |> filter(credits == NA & name == "Ava")
- students |> filter(credits == NA | name == "Ava")
- students |> filter(credits != NA & name == "Ava")
- students |> filter(credits != NA | name == "Ava")
- students |> filter(is.na(credits) & name == "Ava")
- students |> filter(is.na(credits) | name == "Ava")
- students |> filter(!is.na(credits) & name == "Ava")
- students |> filter(!is.na(credits) | name == "Ava")
h
Explanation:
Keep rows where credits is not NA or name == "Ava". Note that comparisons with NA (credits == NA or credits != NA) always return NA, so is.na() / !is.na() must be used instead.
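Continuing the sketch from Question 22, a quick check:

# !is.na(credits) keeps rows 101, 102, 104, and 105; name == "Ava" additionally keeps 103,
# so in this particular data all five rows are returned.
students |> filter(!is.na(credits) | name == "Ava")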
Question 24
Which code keeps only the name and gpa variables?
- students |> select(name, gpa)
- students |> select(-sid, -credits)
- students |> select("name", -"credits")
- students |> select(students, name, gpa)
- Both a and b
e
Explanation:
- Option a selects name and gpa directly.
- Option b removes sid and credits, leaving only name and gpa.
Thus, both are correct.
Questions 25–26
Consider the following data.frame sales_df for Questions 25–26:
| Region | Product | Revenue |
|---|---|---|
| East | A100 | 250 |
| West | B200 | 400 |
| East | A100 | 250 |
| South | C300 | 300 |
| West | A100 | 200 |
| South | B200 | 300 |
The data type of each variable is as follows:
- Region: character
- Product: character
- Revenue: numeric
Question 25
Which of the following code snippets arranges the observations first by Region alphabetically, then by Revenue in descending order?
- sales_df |> arrange(Region, Revenue)
- sales_df |> arrange(Region, -Revenue)
- sales_df |> arrange(Region, desc(Revenue))
- sales_df |> arrange(desc(Region), desc(Revenue))
- Both b and c
e
Explanation:
- Option b uses -Revenue, which sorts Revenue in descending order.
- Option c uses desc(Revenue), which also sorts Revenue in descending order.
Since both arrange alphabetically by Region first and then sort Revenue from highest to lowest, both are correct.
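A minimal sketch to verify, reconstructing sales_df from the table above:

library(dplyr)
sales_df <- data.frame(
  Region  = c("East", "West", "East", "South", "West", "South"),
  Product = c("A100", "B200", "A100", "C300", "A100", "B200"),
  Revenue = c(250, 400, 250, 300, 200, 300)
)
# Both calls give the same ordering: Region A to Z, then Revenue high to low within each Region.
sales_df |> arrange(Region, -Revenue)
sales_df |> arrange(Region, desc(Revenue))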
Question 26
Which code correctly returns all unique Region–Product–Revenue combinations?
- sales_df |> distinct(Region, Product)
- sales_df |> select(-Product, -Region)
- sales_df |> distinct()
- sales_df |> select(-Product, -Region) |> distinct()
- Both a and c
- Both a and d
- Both c and d
c
Explanation:
distinct() with no arguments returns the unique combinations of all columns (here Region, Product, and Revenue), so it drops the duplicated East/A100/250 row.
Question 27
Consider the following data.frame orders:
| id | cust | total |
|---|---|---|
| 1 | A | 120 |
| 2 | B | NA |
| 3 | A | 50 |
| 4 | C | 200 |
| 5 | B | 120 |
| 6 | A | NA |
The data type of each variable is as follows:
- id: numeric
- cust: character
- total: numeric
Which code snippet correctly replicates the subset of orders as shown below:
| id | cust | total |
|---|---|---|
| 4 | C | 200 |
| 5 | B | 120 |
| 1 | A | 120 |
a.
orders |>
  filter(total >= 120 | is.na(total)) |>
  arrange(total, cust)
b.
orders |>
  filter(!is.na(total) & total >= 120) |>
  arrange(-total, desc(cust))
c.
orders |>
  filter(total >= 120 & is.na(total)) |>
  arrange(total, desc(cust))
d.
orders |>
  filter(!is.na(total) & total > 120) |>
  arrange(desc(total), cust)
b
Explanation:
The target subset keeps rows with a non-missing total of at least 120 (ids 1, 4, 5), sorted by total in descending order with the 120 tie broken by cust in reverse alphabetical order (B before A). Option b does exactly this; option d's total > 120 drops the 120 rows, option a keeps the NA rows, and option c's condition can never be true.
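A minimal sketch to verify, reconstructing orders from the table above:

library(dplyr)
orders <- data.frame(
  id    = c(1, 2, 3, 4, 5, 6),
  cust  = c("A", "B", "A", "C", "B", "A"),
  total = c(120, NA, 50, 200, 120, NA)
)
# Returns ids 4, 5, 1: totals 200, 120, 120, with the 120 tie broken by cust in reverse order (B before A).
orders |>
  filter(!is.na(total) & total >= 120) |>
  arrange(-total, desc(cust))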
Question 28
Using the nycflights13::flights data.frame, which of the following code snippets correctly counts how many distinct airlines (carrier) operate from each origin airport? Select all that apply.
The values of the carrier variable are shown below:
| carrier |
|---|
| UA |
| AA |
| B6 |
| DL |
| EV |
| MQ |
| US |
| WN |
| VX |
| FL |
| AS |
| 9E |
| F9 |
| HA |
| YV |
| OO |
a.
df <- nycflights13::flights |>
filter(origin == "EWR" & origin == "JFK" & origin == "LGA") |>
distinct(carrier)
nrow(df)
b.
df <- nycflights13::flights |>
filter(origin == "EWR" | origin == "JFK" | origin == "LGA") |>
distinct(carrier)
nrow(df)
c.
df <- nycflights13::flights
df_EWR <- df |> filter(origin == "EWR")
df_JFK <- df |> filter(origin == "JFK")
df_LGA <- df |> filter(origin == "LGA")
nrow(df_EWR)
nrow(df_JFK)
nrow(df_LGA)
d.
df <- nycflights13::flights |>
distinct(origin, carrier)
df_EWR <- df |> filter(origin == "EWR")
df_JFK <- df |> filter(origin == "JFK")
df_LGA <- df |> filter(origin == "LGA")
nrow(df_EWR)
nrow(df_JFK)
nrow(df_LGA)
e.
df <- nycflights13::flights |>
filter(origin == "EWR" | origin == "JFK" | origin == "LGA") |>
distinct(carrier)
nrow(df)
f.
df <- nycflights13::flights |>
filter(origin == "EWR" | origin == "JFK" | origin == "LGA") |>
distinct(origin, carrier)
nrow(df)
g.
df <- nycflights13::flights |>
filter(origin == "EWR" & origin == "JFK" & origin == "LGA") |>
distinct(origin, carrier)
nrow(df)
d, f
Explanation:
To count distinct carriers per origin, you need the distinct (origin, carrier) pairs.
- Option d does this correctly by first generating the distinct pairs, then filtering and counting per origin.
- Option f also works because it filters the origins first, then keeps the distinct (origin, carrier) combinations.
The others are incorrect: they combine the origin conditions with & (which matches no rows, since origin cannot equal two values at once), count total flights instead of distinct carriers, or pool carriers across origins rather than counting them per airport.
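As an aside (not one of the listed options), the per-origin counts can also be obtained in a single pipeline with dplyr, assuming the nycflights13 package is installed:

library(dplyr)
# One row per origin with the number of distinct carriers operating from it.
nycflights13::flights |>
  group_by(origin) |>
  summarise(n_carriers = n_distinct(carrier))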
Section 4. Short Answer
Question 29
What is the interquartile range (IQR)? What is the standard deviation (SD)? Explain why the IQR is often preferred over the SD when summarizing the dispersion of a dataset that contains outliers.
IQR vs SD
Interquartile Range (IQR):
The IQR is the difference between the 75th percentile (Q3) and the 25th percentile (Q1). It measures the spread of the middle 50% of the data.
Standard Deviation (SD):
The SD measures the average distance of each data point from the mean. It reflects overall variability, including extreme values.
Why IQR is preferred with outliers:
- SD is sensitive to outliers because squaring deviations magnifies extreme values.
- IQR is robust—it ignores the lowest 25% and highest 25% of data, making it less influenced by extreme or skewed values.
- For skewed distributions or datasets with outliers, IQR provides a more stable summary of spread.
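A small illustration in R, using made-up numbers, of how a single outlier inflates the SD while leaving the IQR unchanged:

# Hypothetical data: the same values with and without one extreme outlier.
no_outlier   <- c(4, 5, 5, 6, 6, 7, 7, 8)
with_outlier <- c(4, 5, 5, 6, 6, 7, 7, 80)
sd(no_outlier);  sd(with_outlier)    # SD jumps from about 1.3 to about 26
IQR(no_outlier); IQR(with_outlier)   # IQR stays at 2 in both cases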
Question 30
Explain the following concepts in the context of Large Language Models (LLMs):
- What is pre-training and what is its main purpose?
- What is fine-tuning and how does it differ from pre-training?
- How can bias arise in an LLM’s output, and what methods can be used to reduce it?
- What are some ethical or legal concerns associated with the pre-training and fine-tuning processes?
Pre-training:
- The model is trained on very large datasets (such as text from websites, books, and code repositories).
- Purpose: to learn general patterns of language, including structure, grammar, and relationships between words and concepts.
- The goal is to build a broad foundation of capabilities.
Fine-tuning:
- The model is further trained using a smaller, focused dataset designed for a specific domain or task.
- Purpose: refine the model’s behavior for targeted applications (e.g., summarization, safety alignment, medical or legal tasks) and address issues such as bias, harmful outputs, or inappropriate responses from pre-training.
- Difference from pre-training: pre-training is broad, while fine-tuning is narrower and guided, allowing the model to better align with desired performance, norms, or safety standards.
How bias arises:
- Training data may reflect cultural, demographic, or ideological imbalances.
- LLMs can learn stereotypes, misinformation, or harmful associations.
- User prompts may trigger biased patterns learned from data.
Methods to reduce bias
- Data filtering and dataset curation: This involves carefully selecting, cleaning, and filtering training data to remove harmful, misleading, or unbalanced patterns before they influence the model.
- Reinforcement Learning from Human Feedback (RLHF): Human evaluators guide model behavior by rating outputs, helping the system learn responses that are safer, more helpful, and less biased.
- Safety alignment training: Additional training phases are used to refine the model toward safer and more responsible behavior, reducing harmful or biased outputs.
- Bias audits, evaluations, and dataset diversification: Regular assessments help identify problematic behaviors, while diversifying input data reduces overrepresentation or underrepresentation of groups.
- Constitutional or rule-based fine-tuning: Models are optimized using predefined rules or principles that promote fairness, safety, and neutrality, helping reduce harmful or biased tendencies.
Ethical and legal concerns
- Copyright and data ownership issues: Large datasets may include copyrighted material, raising questions about legal rights and appropriate usage.
- Exposure to toxic or harmful content: Pre-training may include offensive or harmful data, which can influence model outputs if not mitigated.
- Privacy risks from scraped data: Training data may inadvertently contain personal or sensitive information, creating potential privacy and confidentiality concerns.
- Propagation of biases and unequal representation: Models can reinforce biases present in the data, leading to unfair or distorted outputs that disadvantage certain groups.
- Potential misuse of generated content: Generative systems can enable creation of misleading or harmful content such as deepfakes or misinformation, leading to ethical risks and societal harm.