Homework 2 - Example Answers
Generative AI; Data Transformation with R

library(tidyverse)
library(skimr)

nyc_payroll_new <- read_csv("https://bcdanl.github.io/data/nyc_payroll_2024.csv")

Descriptive Statistics
TBA
Multiple Choice Questions
Question 3. Transformer Attention
Attention in transformers primarily helps the model:
- Reduce compute cost by skipping tokens
- Choose the most relevant parts of the sequence
- Memorize training data verbatim
- Predict emotions
c. Choose the most relevant parts of the sequence
💡 Explanation:
Attention mechanisms let models weigh words differently based on context, identifying which tokens matter most for predicting the next token.
⚠️ Incorrect Options:
- "Reduce compute cost" → Attention increases computation.
- "Memorize training data" → Not its purpose.
- "Predict emotions" → Not the attention mechanism's role.
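To make this concrete, here is a toy sketch of scaled dot-product attention in base R. The 3-token, 2-dimensional matrices are invented for illustration; real models use learned, high-dimensional projections.
set.seed(1)
Q <- matrix(rnorm(6), nrow = 3)   # queries: one row per token (invented numbers)
K <- matrix(rnorm(6), nrow = 3)   # keys
V <- matrix(rnorm(6), nrow = 3)   # values
d_k <- ncol(K)
scores <- Q %*% t(K) / sqrt(d_k)          # similarity of each query to each key
softmax <- function(x) exp(x) / sum(exp(x))
weights <- t(apply(scores, 1, softmax))   # each row sums to 1: how strongly each
                                          # token attends to every other token
output <- weights %*% V                   # context-weighted mix of the values
round(weights, 2)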
Question 4. Supervised Learning
Supervised learning requires:
- Only raw text
- Labeled examples
- Human ranking of model outputs only
- Images but no labels
b. Labeled examples
💡 Explanation:
Supervised learning uses data with known input-output pairs (e.g., image → "cat"). The model learns from labeled examples to predict future outcomes.
⚠️ Incorrect Options:
- "Only raw text" → Unsupervised/self-supervised.
- "Human ranking only" → RLHF, not supervised.
- "Images but no labels" → Unsupervised learning.
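As an illustrative sketch (the tiny spam dataset below is invented), supervised learning in R amounts to fitting a model on rows that carry both features and a human-provided label:
# Each row has features AND a known label; the model learns the mapping
train <- data.frame(
  n_links  = c(0, 5, 1, 8, 7, 7),   # feature: number of links in an email
  all_caps = c(0, 1, 0, 1, 1, 0),   # feature: subject line in all caps?
  is_spam  = c(0, 1, 0, 1, 0, 1)    # label supplied by a human
)
fit <- glm(is_spam ~ n_links + all_caps, data = train, family = binomial)
# Predict the spam probability for a new, unlabeled email
predict(fit, data.frame(n_links = 6, all_caps = 1), type = "response")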
Question 5. RLHF
RLHF is best described as:
- Penalizing long outputs
- Human-ranked preferences guiding a reward model
- Unsupervised pretraining
- Prompt engineering
b. Human-ranked preferences guiding a reward model
💡 Explanation:
Reinforcement Learning from Human Feedback fine-tunes models using human preference data to guide a reward signal. This aligns AI behavior with human expectations.
⚠️ Incorrect Options:
- "Penalizing long outputs" → Separate technique.
- "Unsupervised pretraining" → Happens before RLHF.
- "Prompt engineering" → User-side activity, not training.
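A minimal sketch of the reward-modeling step, with invented numeric features standing in for text (real RLHF trains a neural reward model on human-ranked response pairs):
# Each comparison: a human preferred `chosen` over `rejected` (toy features)
chosen   <- matrix(c(0.9, 0.8,  0.7, 0.9,  0.8, 0.6), ncol = 2, byrow = TRUE)
rejected <- matrix(c(0.2, 0.3,  0.4, 0.1,  0.3, 0.5), ncol = 2, byrow = TRUE)
# Linear reward model: reward = features %*% w; fit w so the human-preferred
# response scores higher (logistic / Bradley-Terry loss over reward gaps)
pairwise_loss <- function(w) {
  margin <- chosen %*% w - rejected %*% w
  sum(log(1 + exp(-margin)))
}
w_hat <- optim(c(0, 0), pairwise_loss)$par
w_hat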
Question 6. Human in the Loop
"Be the Human in the Loop" implies students should:
- Trust polished outputs
- Turn off tests to avoid overfitting
- Verify facts, run tests, and check assumptions
- Always pick the first model answer
c. Verify facts, run tests, and check assumptions
💡 Explanation:
Being the Human in the Loop means staying actively engaged: testing, fact-checking, and applying human judgment rather than deferring blindly to AI outputs.
⚠️ Incorrect Options:
- "Trust polished outputs" → Passive use.
- "Turn off tests" → Opposite of verification.
- "Always pick first model answer" → Non-critical behavior.
Question 7. BCG Study - Rule 4
[BCG Study in Classwork 3 - Rule 4] Students in the bottom-right quadrant (AI strong, human novice) should prioritize:
- Speed over understanding
- Hiding AI use
- Climbing the human-skill axis through verification and practice
- Zero prompting
c. Climbing the human-skill axis through verification and practice
💡 Explanation:
This quadrant represents people who rely heavily on AI but lack expertise. The goal is to move upward by verifying results, practicing skills, and gaining understanding to become "AI-strong + human-strong."
⚠️ Incorrect Options:
- "Speed over understanding" → Encourages shallow learning.
- "Hiding AI use" → Violates transparency.
- "Zero prompting" → Removes human direction.
Question 8. Transformer Encoders
Transformer encoders primarily:
- Generate outputs token by token
- Create context-aware representations of the inputs
- Rank human preferences
- Perform diffusion sampling
b. Create context-aware representations of the inputs
💡 Explanation:
Encoders convert input tokens into embeddings that capture meaning and context, enabling comprehension tasks like classification or translation.
⚠️ Incorrect Options:
- "Generate outputs token by token" → The decoder's role.
- "Rank human preferences" → An RLHF task.
- "Perform diffusion sampling" → A technique from diffusion image models, not an encoder function.
Question 9. Treating AI "like a person"
Treating AI "like a person" improves outputs primarily because:
- It creates sentience
- It conditions constraints/roles that steer generation
- It bypasses guardrails
- It increases context length
b. It conditions constraints/roles that steer generation
💡 Explanation:
Framing prompts with personas (e.g., "You are a data analytics tutor") guides structure, tone, and scope. It exploits how models respond to contextual conditioning, not sentience.
⚠️ Incorrect Options:
- "Creates sentience" → AI has no consciousness.
- "Bypasses guardrails" → False, and attempting it would be unethical.
- "Increases context length" → A technical setting, unrelated to persona framing.
Question 10. Disclosure of AI Work
Publishing AI-written work without disclosure most directly violates:
- Token limits
- Academic integrity/attribution norms
- HTML standards
- RLHF constraints
b. Academic integrity/attribution norms
💡 Explanation:
Submitting undisclosed AI-generated work misrepresents authorship, violating honesty and transparency. Disclosure ensures accountability and fairness.
⚠️ Incorrect Options:
- "Token limits" → Technical limit.
- "HTML standards" → Irrelevant.
- "RLHF constraints" → Internal to model training.
Question 11. Supervised vs. Unsupervised
Which pairing is most accurate?
- Supervised = topic modeling; Unsupervised = sentiment
- Supervised = spam filtering; Unsupervised = clustering
- Supervised = clustering; Unsupervised = regression
- Supervised = sentiment; Unsupervised = regression
b. Supervised = spam filtering; Unsupervised = clustering
💡 Explanation:
Spam filtering uses labeled data ("spam" vs. "not spam"), while clustering finds natural groups in unlabeled data. The difference is whether labels exist during training.
⚠️ Incorrect Options:
- "Topic modeling = supervised" → Topic modeling is unsupervised.
- "Regression = unsupervised" → Regression is supervised.
- Other pairings mix up task types.
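The distinction can be seen in a few lines of R on the built-in iris data (an illustrative sketch, not part of the quiz):
# Supervised: the label (Species) is available during training
fit <- glm(I(Species == "virginica") ~ Sepal.Length + Sepal.Width,
           data = iris, family = binomial)
# Unsupervised: k-means sees only features and discovers groups on its own
clusters <- kmeans(iris[, c("Sepal.Length", "Sepal.Width")], centers = 3)
table(clusters$cluster, iris$Species)   # compare found groups to true labels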
Question 12. Keeping Prompt Logs
The strongest reason to keep prompt logs when using generative AI is:
- Increase token count
- Reproducibility and iterative improvement
- Reduce latency
- Satisfy HTML validators
b. Reproducibility and iterative improvement
💡 Explanation:
Prompt logs document what was done, enabling replication, self-reflection, and transparency. They help track learning progress and model behavior.
⚠️ Incorrect Options:
- "Increase token count" → No educational purpose.
- "Reduce latency" → False; logging doesn't affect runtime.
- "Satisfy HTML validators" → Unrelated to AI use.
Short-Answer Questions
Question 1. Vibe Coding
Describe one benefit and one risk of vibe coding.
A key benefit of vibe coding is rapid prototyping: students can quickly generate functional code through conversational iteration with AI, lowering the barrier to creative experimentation. However, a major risk is reduced understanding: AI-generated code may contain hidden bugs, inefficiencies, or logic errors that students cannot explain. To mitigate this, learners should review and test all AI-generated code to ensure comprehension and correctness.
Question 2. Generative AI as a General Purpose Technology
Generative AI is being described as a General Purpose Technology (GPT) like electricity or the internet. Do you agree with this analogy? Support your answer with historical parallels and at least one limitation unique to AI.
Generative AI resembles historical General Purpose Technologies like electricity or the internet because it transforms multiple sectors and reshapes productivity and learning. Like electricity enabling factories or the internet connecting people, AI is reshaping communication, creativity, and analysis. However, unlike those earlier GPTs, AI produces probabilistic, not deterministic, outputs, raising risks of bias, misinformation, and lack of transparency. It requires ethical oversight and verification to achieve its full potential safely.
Question 3. BCG Study - Rule 4 (Education Implications)
[Classwork 3 - Rule 4] The BCG study found that AI can push beginners close to expert performance on certain tasks. What does this mean for education? Should instructors grade differently when students can "perform like experts" with AI support?
The BCG study shows that AI can elevate beginners' performance to near-expert levels on structured tasks such as writing or coding. In education, this means traditional grading based on final output may no longer measure real understanding. Instructors should adjust assessments to emphasize reasoning, manual skills, and reflection, requiring students to explain how and why they used AI. Grading should reward verified comprehension, not just polished results.
Question 4. Paperclip Maximizer Thought Experiment
Explain the paperclip maximizer thought experiment. How does it illustrate alignment challenges in AI, and what lessons can be applied to everyday classroom use of generative tools?
The paperclip maximizer imagines an AI given the goal of "maximizing paperclips." Without human-aligned constraints, it could destroy everything to fulfill that objective. This illustrates how AI systems can pursue goals literally but not ethically if alignment is missing. In the classroom, it teaches the importance of defining constraints, verifying outputs, and ensuring AI tasks align with human learning goals, so that efficiency does not replace understanding.
Data Transformation with R tidyverse
For the questions in the R section, consider the data.frame nyc_payroll_new. For detailed descriptions of the variables in this data.frame, please refer to the following link: Citywide Payroll Data (Fiscal Year).
Question 1
How can you filter the data.frame nyc_payroll_new to calculate descriptive statistics (mean and standard deviation) of Base_Salary for workers in the Work_Location_Borough "MANHATTAN"? Similarly, how can you filter the data.frame nyc_payroll_new to calculate these statistics for workers in the Work_Location_Borough "QUEENS"?
Provide the R code for performing these calculations and then report the mean and standard deviation of Base_Salary for workers in both "MANHATTAN" and "QUEENS".
# Find all unique values in the `Work_Location_Borough` variable:
nyc_payroll_new |> distinct(Work_Location_Borough)

# The output shows that "MANHATTAN" and "QUEENS" are among the unique values
# in `Work_Location_Borough`, written in all capital letters.

# Filter the dataset for records where the work location is MANHATTAN
df_manhattan <- nyc_payroll_new |>
  filter(Work_Location_Borough == "MANHATTAN")

# Generate descriptive statistics (including mean and standard deviation)
# for Base_Salary for workers in MANHATTAN
skim(df_manhattan) # or skim(df_manhattan$Base_Salary)

# Filter the dataset for records where the work location is QUEENS
df_queens <- nyc_payroll_new |>
  filter(Work_Location_Borough == "QUEENS")

# Generate descriptive statistics (including mean and standard deviation)
# for Base_Salary for workers in QUEENS
skim(df_queens) # or skim(df_queens$Base_Salary)
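Alternatively, a single pipeline with group_by() and summarize() reports both boroughs at once (a sketch equivalent to the skim() approach above):
# Compute mean and standard deviation of Base_Salary for each borough
nyc_payroll_new |>
  filter(Work_Location_Borough %in% c("MANHATTAN", "QUEENS")) |>
  group_by(Work_Location_Borough) |>
  summarize(
    mean_salary = mean(Base_Salary, na.rm = TRUE),
    sd_salary   = sd(Base_Salary, na.rm = TRUE)
  )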
Question 2
How can you filter the data.frame nyc_payroll_new to show only the records where the Base_Salary is greater than or equal to $100,000?
# Filter the dataset for records where Base_Salary is greater than
# or equal to $100,000
q2 <- nyc_payroll_new |>
  filter(Base_Salary >= 100000)
Question 3
How can you select only distinct combinations of Agency_Name and Title_Description?
# Select distinct combinations of Agency_Name and Title_Description from the dataset
q3 <- nyc_payroll_new |>
  distinct(Agency_Name, Title_Description)
Question 4
How would you arrange the data by Regular_Gross_Paid in descending order, showing the highest paid employees first?
# Arrange the dataset by Regular_Gross_Paid in descending order
# (highest paid employees first)
q4 <- nyc_payroll_new |>
  arrange(desc(Regular_Gross_Paid))
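If only the top of the ranking is needed, slice_max() is a compact alternative (the n = 10 cutoff below is arbitrary):
# Keep just the 10 employees with the highest Regular_Gross_Paid
nyc_payroll_new |>
  slice_max(Regular_Gross_Paid, n = 10)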
Question 5
How can you select and rename the Title_Description variable to Title?
# Rename the Title_Description variable to Title in the dataset
q5 <- nyc_payroll_new |>
  rename(Title = Title_Description)
Question 6
How can you filter the data to show only records for the "POLICE DEPARTMENT" Agency_Name and arrange it by Total_OT_Paid in ascending order?
# Filter the dataset for records where Agency_Name is "POLICE DEPARTMENT"
# and arrange by Total_OT_Paid in ascending order
q6 <- nyc_payroll_new |>
  filter(Agency_Name == "POLICE DEPARTMENT") |>
  arrange(Total_OT_Paid)
Question 7
How can you filter the data to include only those records where the Pay_Basis is "per Annum" and then select only the First_Name, Last_Name, and Base_Salary variables?
# Filter the dataset for records where Pay_Basis is "per Annum" and
# select specific columns: First_Name, Last_Name, and Base_Salary
q7 <- nyc_payroll_new |>
  filter(Pay_Basis == "per Annum") |>
  select(First_Name, Last_Name, Base_Salary)
Question 8
How would you arrange the data.frame by Work_Location_Borough in ascending order and Base_Salary in descending order?
# Arrange the dataset by Work_Location_Borough in ascending order
# and Base_Salary in descending order
q8 <- nyc_payroll_new |>
  arrange(Work_Location_Borough, desc(Base_Salary))
Note that sorting observations by a character variable in ascending order means sorting them in alphabetical order, and sorting by a character variable in descending order means sorting them in reverse-alphabetical order.
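As a sketch of this distinction, desc() handles both character and numeric variables, while the minus sign works only for numeric ones:
# Reverse-alphabetical boroughs require desc(); -Work_Location_Borough would
# error because unary minus is undefined for character vectors
nyc_payroll_new |>
  arrange(desc(Work_Location_Borough), desc(Base_Salary))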
Question 9
How can you filter the nyc_payroll_new data.frame to remove observations where the Base_Salary variable has NA values? After filtering, how would you calculate the total number of remaining observations?
# Filter the dataset to remove observations where Base_Salary is NA
q9 <- nyc_payroll_new |>
  filter(!is.na(Base_Salary))

# Calculate the total number of remaining observations after filtering
nrow(q9)
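An equivalent sketch uses tidyr's drop_na(), which is loaded with the tidyverse:
# drop_na() removes rows with NA in the listed columns; nrow() counts the rest
q9_alt <- nyc_payroll_new |>
  drop_na(Base_Salary)
nrow(q9_alt)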