Homework 2 - Example Answers
Generative AI; Data Transformation with R

library(tidyverse)
library(skimr)

nyc_payroll_new <- read_csv("https://bcdanl.github.io/data/nyc_payroll_2024.csv")

Descriptive Statistics
TBA
Multiple Choice Questions
Question 3. Transformer Attention
Attention in transformers primarily helps the model:
- Reduce compute cost by skipping tokens
- Choose the most relevant parts of the sequence
- Memorize training data verbatim
- Predict emotions
c. Choose the most relevant parts of the sequence
💡 Explanation:
Attention mechanisms let models weigh words differently based on context, identifying which tokens matter most for predicting the next token.
⚠️ Incorrect Options:
- "Reduce compute cost" → Attention increases computation.
- "Memorize training data" → Not its purpose.
- "Predict emotions" → Not the attention mechanism's role.
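To make this concrete, here is a toy sketch of scaled dot-product attention in base R. The 3-token, 2-dimensional matrices are invented for illustration; real models use learned, high-dimensional projections.
set.seed(1)
Q <- matrix(rnorm(6), nrow = 3)   # queries: one row per token (invented numbers)
K <- matrix(rnorm(6), nrow = 3)   # keys
V <- matrix(rnorm(6), nrow = 3)   # values
d_k <- ncol(K)
scores <- Q %*% t(K) / sqrt(d_k)          # similarity of each query to each key
softmax <- function(x) exp(x) / sum(exp(x))
weights <- t(apply(scores, 1, softmax))   # each row sums to 1: how strongly each
                                          # token attends to every other token
output <- weights %*% V                   # context-weighted mix of the values
round(weights, 2)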
Question 4. Supervised Learning
Supervised learning requires:
- Only raw text
- Labeled examples
- Human ranking of model outputs only
- Images but no labels
b. Labeled examples
💡 Explanation:
Supervised learning uses data with known input-output pairs (e.g., image → "cat"). The model learns from labeled examples to predict future outcomes.
⚠️ Incorrect Options:
- "Only raw text" → Unsupervised/self-supervised.
- "Human ranking only" → RLHF, not supervised.
- "Images but no labels" → Unsupervised learning.
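As an illustrative sketch (the tiny spam dataset below is invented), supervised learning in R amounts to fitting a model on rows that carry both features and a human-provided label:
# Each row has features AND a known label; the model learns the mapping
train <- data.frame(
  n_links  = c(0, 5, 1, 8, 7, 7),   # feature: number of links in an email
  all_caps = c(0, 1, 0, 1, 1, 0),   # feature: subject line in all caps?
  is_spam  = c(0, 1, 0, 1, 0, 1)    # label supplied by a human
)
fit <- glm(is_spam ~ n_links + all_caps, data = train, family = binomial)
# Predict the spam probability for a new, unlabeled email
predict(fit, data.frame(n_links = 6, all_caps = 1), type = "response")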
Question 5. RLHF
RLHF is best described as:
- Penalizing long outputs
- Human-ranked preferences guiding a reward model
- Unsupervised pretraining
- Prompt engineering
b. Human-ranked preferences guiding a reward model
💡 Explanation:
Reinforcement Learning from Human Feedback fine-tunes models using human preference data to guide a reward signal. This aligns AI behavior with human expectations.
⚠️ Incorrect Options:
- "Penalizing long outputs" → Separate technique.
- "Unsupervised pretraining" → Happens before RLHF.
- "Prompt engineering" → User-side activity, not training.
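A minimal sketch of the reward-modeling step, with invented numeric features standing in for text (real RLHF trains a neural reward model on human-ranked response pairs):
# Each comparison: a human preferred `chosen` over `rejected` (toy features)
chosen   <- matrix(c(0.9, 0.8,  0.7, 0.9,  0.8, 0.6), ncol = 2, byrow = TRUE)
rejected <- matrix(c(0.2, 0.3,  0.4, 0.1,  0.3, 0.5), ncol = 2, byrow = TRUE)
# Linear reward model: reward = features %*% w; fit w so the human-preferred
# response scores higher (logistic / Bradley-Terry loss over reward gaps)
pairwise_loss <- function(w) {
  margin <- chosen %*% w - rejected %*% w
  sum(log(1 + exp(-margin)))
}
w_hat <- optim(c(0, 0), pairwise_loss)$par
w_hat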
Question 6. Human in the Loop
"Be the Human in the Loop" implies students should:
- Trust polished outputs
- Turn off tests to avoid overfitting
- Verify facts, run tests, and check assumptions
- Always pick the first model answer
c. Verify facts, run tests, and check assumptions
💡 Explanation:
Being the Human in the Loop means staying actively engaged: testing, fact-checking, and applying human judgment rather than deferring blindly to AI outputs.
⚠️ Incorrect Options:
- "Trust polished outputs" → Passive use.
- "Turn off tests" → Opposite of verification.
- "Always pick first model answer" → Non-critical behavior.
Question 7. BCG Study - Rule 4
[BCG Study in Classwork 3 - Rule 4] Students in the bottom-right quadrant (AI strong, human novice) should prioritize:
- Speed over understanding
- Hiding AI use
- Climbing the human-skill axis through verification and practice
- Zero prompting
c. Climbing the human-skill axis through verification and practice
💡 Explanation:
This quadrant represents people who rely heavily on AI but lack expertise. The goal is to move upward by verifying results, practicing skills, and gaining understanding to become "AI-strong + human-strong."
⚠️ Incorrect Options:
- "Speed over understanding" → Encourages shallow learning.
- "Hiding AI use" → Violates transparency.
- "Zero prompting" → Removes human direction.
Question 8. Transformer Encoders
Transformer encoders primarily:
- Generate outputs token by token
- Create context-aware representations of the inputs
- Rank human preferences
- Perform diffusion sampling
b. Create context-aware representations of the inputs
💡 Explanation:
Encoders convert input tokens into embeddings that capture meaning and context, enabling comprehension tasks like classification or translation.
⚠️ Incorrect Options:
- "Generate outputs token by token" → The decoder's role.
- "Rank human preferences" → An RLHF task.
- "Perform diffusion sampling" → A technique from diffusion image models, not an encoder function.
Question 9. Treating AI "like a person"
Treating AI "like a person" improves outputs primarily because:
- It creates sentience
- It conditions constraints/roles that steer generation
- It bypasses guardrails
- It increases context length
b. It conditions constraints/roles that steer generation
💡 Explanation:
Framing prompts with personas (e.g., "You are a data analytics tutor") guides structure, tone, and scope. It exploits how models respond to contextual conditioning, not sentience.
⚠️ Incorrect Options:
- "Creates sentience" → AI has no consciousness.
- "Bypasses guardrails" → False, and attempting it would be unethical.
- "Increases context length" → A technical setting, unrelated to persona framing.
Question 10. Disclosure of AI Work
Publishing AI-written work without disclosure most directly violates:
- Token limits
- Academic integrity/attribution norms
- HTML standards
- RLHF constraints
b. Academic integrity/attribution norms
💡 Explanation:
Submitting undisclosed AI-generated work misrepresents authorship, violating honesty and transparency. Disclosure ensures accountability and fairness.
⚠️ Incorrect Options:
- "Token limits" → Technical limit.
- "HTML standards" → Irrelevant.
- "RLHF constraints" → Internal to model training.
Question 11. Supervised vs. Unsupervised
Which pairing is most accurate?
- Supervised = topic modeling; Unsupervised = sentiment
- Supervised = spam filtering; Unsupervised = clustering
- Supervised = clustering; Unsupervised = regression
- Supervised = sentiment; Unsupervised = regression
b. Supervised = spam filtering; Unsupervised = clustering
💡 Explanation:
Spam filtering uses labeled data ("spam" vs. "not spam"), while clustering finds natural groups in unlabeled data. The difference is whether labels exist during training.
⚠️ Incorrect Options:
- "Topic modeling = supervised" → Topic modeling is unsupervised.
- "Regression = unsupervised" → Regression is supervised.
- Other pairings mix up task types.
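The distinction can be seen in a few lines of R on the built-in iris data (an illustrative sketch, not part of the quiz):
# Supervised: the label (Species) is available during training
fit <- glm(I(Species == "virginica") ~ Sepal.Length + Sepal.Width,
           data = iris, family = binomial)
# Unsupervised: k-means sees only features and discovers groups on its own
clusters <- kmeans(iris[, c("Sepal.Length", "Sepal.Width")], centers = 3)
table(clusters$cluster, iris$Species)   # compare found groups to true labels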
Question 12. Keeping Prompt Logs
The strongest reason to keep prompt logs when using generative AI is:
- Increase token count
- Reproducibility and iterative improvement
- Reduce latency
- Satisfy HTML validators
b. Reproducibility and iterative improvement
💡 Explanation:
Prompt logs document what was done, enabling replication, self-reflection, and transparency. They help track learning progress and model behavior.
⚠️ Incorrect Options:
- "Increase token count" → No educational purpose.
- "Reduce latency" → False; logging doesn't affect runtime.
- "Satisfy HTML validators" → Unrelated to AI use.
Short-Answer Questions
Question 1. Vibe Coding
Describe one benefit and one risk of vibe coding.
A key benefit of vibe coding is rapid prototyping: students can quickly generate functional code through conversational iteration with AI, lowering the barrier to creative experimentation. However, a major risk is reduced understanding: AI-generated code may contain hidden bugs, inefficiencies, or logic errors that students cannot explain. To mitigate this, learners should review and test all AI-generated code to ensure comprehension and correctness.
Question 2. Generative AI as a General Purpose Technology
Generative AI is being described as a General Purpose Technology (GPT) like electricity or the internet. Do you agree with this analogy? Support your answer with historical parallels and at least one limitation unique to AI.
Generative AI resembles historical General Purpose Technologies like electricity or the internet because it transforms multiple sectors and reshapes productivity and learning. Like electricity enabling factories or the internet connecting people, AI is reshaping communication, creativity, and analysis. However, unlike those earlier GPTs, AI produces probabilistic, not deterministic, outputs, raising risks of bias, misinformation, and lack of transparency. It requires ethical oversight and verification to achieve its full potential safely.
Question 3. BCG Study - Rule 4 (Education Implications)
[Classwork 3 - Rule 4] The BCG study found that AI can push beginners close to expert performance on certain tasks. What does this mean for education? Should instructors grade differently when students can "perform like experts" with AI support?
The BCG study shows that AI can elevate beginners' performance to near-expert levels on structured tasks such as writing or coding. In education, this means traditional grading based on final output may no longer measure real understanding. Instructors should adjust assessments to emphasize reasoning, manual skills, and reflection, requiring students to explain how and why they used AI. Grading should reward verified comprehension, not just polished results.
Question 4. Paperclip Maximizer Thought Experiment
Explain the paperclip maximizer thought experiment. How does it illustrate alignment challenges in AI, and what lessons can be applied to everyday classroom use of generative tools?
The paperclip maximizer imagines an AI given the goal of "maximizing paperclips." Without human-aligned constraints, it could destroy everything to fulfill that objective. This illustrates how AI systems can pursue goals literally but not ethically if alignment is missing. In the classroom, it teaches the importance of defining constraints, verifying outputs, and ensuring AI tasks align with human learning goals, so that efficiency does not replace understanding.
Data Transformation with R tidyverse
For the questions in the R section, consider the data.frame nyc_payroll_new. For detailed descriptions of the variables in this data.frame, please refer to the following link: Citywide Payroll Data (Fiscal Year).
Question 1
How can you filter the data.frame nyc_payroll_new to calculate descriptive statistics (mean and standard deviation) of Base_Salary for workers in the Work_Location_Borough "MANHATTAN"? Similarly, how can you filter the data.frame nyc_payroll_new to calculate these statistics for workers in the Work_Location_Borough "QUEENS"?
Provide the R code for performing these calculations and then report the mean and standard deviation of Base_Salary for workers in both "MANHATTAN" and "QUEENS".
# Find all unique values in the `Work_Location_Borough` variable:
nyc_payroll_new |> distinct(Work_Location_Borough)

# The output shows that "MANHATTAN" and "QUEENS" are among the unique values
# in `Work_Location_Borough`, written in all capital letters.

# Filter the dataset for records where the work location is MANHATTAN
df_manhattan <- nyc_payroll_new |>
  filter(Work_Location_Borough == "MANHATTAN")

# Generate descriptive statistics (including mean and standard deviation)
# for Base_Salary for workers in MANHATTAN
skim(df_manhattan) # or skim(df_manhattan$Base_Salary)

# Filter the dataset for records where the work location is QUEENS
df_queens <- nyc_payroll_new |>
  filter(Work_Location_Borough == "QUEENS")

# Generate descriptive statistics (including mean and standard deviation)
# for Base_Salary for workers in QUEENS
skim(df_queens) # or skim(df_queens$Base_Salary)
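Alternatively, a single pipeline with group_by() and summarize() reports both boroughs at once (a sketch equivalent to the skim() approach above):
# Compute mean and standard deviation of Base_Salary for each borough
nyc_payroll_new |>
  filter(Work_Location_Borough %in% c("MANHATTAN", "QUEENS")) |>
  group_by(Work_Location_Borough) |>
  summarize(
    mean_salary = mean(Base_Salary, na.rm = TRUE),
    sd_salary   = sd(Base_Salary, na.rm = TRUE)
  )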
Question 2
How can you filter the data.frame nyc_payroll_new to show only the records where the Base_Salary is greater than or equal to $100,000?
# Filter the dataset for records where Base_Salary is greater than
# or equal to $100,000
q2 <- nyc_payroll_new |>
  filter(Base_Salary >= 100000)
Question 3
How can you select only distinct combinations of Agency_Name and Title_Description?
# Select distinct combinations of Agency_Name and Title_Description from the dataset
q3 <- nyc_payroll_new |>
  distinct(Agency_Name, Title_Description)
Question 4
How would you arrange the data by Regular_Gross_Paid in descending order, showing the highest paid employees first?
# Arrange the dataset by Regular_Gross_Paid in descending order
# (highest paid employees first)
q4 <- nyc_payroll_new |>
  arrange(desc(Regular_Gross_Paid))
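If only the top of the ranking is needed, slice_max() is a compact alternative (the n = 10 cutoff below is arbitrary):
# Keep just the 10 employees with the highest Regular_Gross_Paid
nyc_payroll_new |>
  slice_max(Regular_Gross_Paid, n = 10)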
Question 5
How can you select and rename the Title_Description variable to Title?
# Rename the Title_Description variable to Title in the dataset
q5 <- nyc_payroll_new |>
  rename(Title = Title_Description)
Question 6
How can you filter the data to show only records for the "POLICE DEPARTMENT" Agency_Name and arrange it by Total_OT_Paid in ascending order?
# Filter the dataset for records where Agency_Name is "POLICE DEPARTMENT"
# and arrange by Total_OT_Paid in ascending order
q6 <- nyc_payroll_new |>
  filter(Agency_Name == "POLICE DEPARTMENT") |>
  arrange(Total_OT_Paid)
Question 7
How can you filter the data to include only those records where the Pay_Basis is "per Annum" and then select only the First_Name, Last_Name, and Base_Salary variables?
# Filter the dataset for records where Pay_Basis is "per Annum" and
# select specific columns: First_Name, Last_Name, and Base_Salary
q7 <- nyc_payroll_new |>
  filter(Pay_Basis == "per Annum") |>
  select(First_Name, Last_Name, Base_Salary)
Question 8
How would you arrange the data.frame by Work_Location_Borough in ascending order and Base_Salary in descending order?
# Arrange the dataset by Work_Location_Borough in ascending order
# and Base_Salary in descending order
q8 <- nyc_payroll_new |>
  arrange(Work_Location_Borough, desc(Base_Salary))
Note that sorting observations by a character variable in ascending order means sorting them in alphabetical order, and sorting by a character variable in descending order means sorting them in reverse-alphabetical order.
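As a sketch of this distinction, desc() handles both character and numeric variables, while the minus sign works only for numeric ones:
# Reverse-alphabetical boroughs require desc(); -Work_Location_Borough would
# error because unary minus is undefined for character vectors
nyc_payroll_new |>
  arrange(desc(Work_Location_Borough), desc(Base_Salary))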
Question 9
How can you filter the nyc_payroll_new data.frame to remove observations where the Base_Salary variable has NA values? After filtering, how would you calculate the total number of remaining observations?
# Filter the dataset to remove observations where Base_Salary is NA
q9 <- nyc_payroll_new |>
  filter(!is.na(Base_Salary))

# Calculate the total number of remaining observations after filtering
nrow(q9)
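An equivalent sketch uses tidyr's drop_na(), which is loaded with the tidyverse:
# drop_na() removes rows with NA in the listed columns; nrow() counts the rest
q9_alt <- nyc_payroll_new |>
  drop_na(Base_Salary)
nrow(q9_alt)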