Homework 4

Data Transfomration with pandas I

Author

Byeong-Hak Choe

Published

April 10, 2026

Modified

April 23, 2026

📌 Directions

  • Submit your Jupyter Notebook (*.ipynb) to Brightspace using this format:
    • danl-210-hw4-LASTNAME-FIRSTNAME.ipynb
      (e.g., danl-210-hw4-choe-byeonghak.ipynb)
  • Due: April 20, 2026, 11:59 P.M.
  • Questions? Email Prof. Choe:



Part 1. NYC Payroll Data



Below is nyc_payroll DataFrame that reads the file nyc_payroll.csv containing data of how the New York City’s budget is being spent on salary and overtime pay for all municipal employees (Source: NYC OpenData).

nyc_payroll = pd.read_csv('https://bcdanl.github.io/data/nyc_payroll_2025.zip')

Variable Description

  • Fiscal_Year: Fiscal Year;
  • Payroll_Number: Payroll Number;
  • Agency_Name: The Payroll agency that the employee works for;
  • Last_Name: Last name of employee;
  • First_Name: First name of employee;
  • Mid_Init: Middle initial of employee;
  • Agency_Start_Date: Date which employee began working for their current agency;
  • Work_Location_Borough: Borough of employee’s primary work location;
  • Title_Description: Civil service title description of the employee;
  • Leave_Status_as_of_June_30: Status of employee as of the close of the relevant fiscal year;
  • Base_Salary: Base Salary assigned to the employee;
  • Pay_Basis: Lists whether the employee is paid on an hourly, per diem or annual basis;
  • Regular_Hours: Number of regular hours employee worked in the fiscal year;
  • Regular_Gross_Paid: The amount paid to the employee for base salary during the fiscal year;
  • OT_Hours: Overtime Hours worked by employee in the fiscal year;
  • Total_OT_Paid: Total overtime pay paid to the employee in the fiscal year;
  • Total_Other_Pay: Includes any compensation in addition to gross salary and overtime pay, ie Differentials, lump sums, uniform allowance, meal allowance, retroactive pay increases, settlement amounts, and bonus pay, if applicable.

For more details in the variable description, see: Citywide Payroll Data (Fiscal Year) on NYC Open Data.



Question 1

Select “First_Name”, “Last_Name”, “Base_Salary”, and “Total_OT_Paid”, then sort the DataFrame with these selected variables by “Base_Salary” in descending order and display the top 10 entries.

# 1. .sort_values() with head()
(
  nyc_payroll[['First_Name', 'Last_Name', 'Base_Salary', 'Total_OT_Paid']]
  .sort_values('Base_Salary', ascending = False)
  .head(10)
)
# 2. nlargest()
(
  nyc_payroll[['First_Name', 'Last_Name', 'Base_Salary', 'Total_OT_Paid']]
  .nlargest(10, 'Base_Salary', keep = 'all')
)



Question 2

Using set_index(), change the DataFrame’s index to “Last_Name”, then locate the data for a specific last name, say “BROWN”, and display their “Agency_Name”, “Base_Salary”, and “Total_OT_Paid”.

(
  nyc_payroll
  .set_index('Last_Name')
  .loc["BROWN"][['Agency_Name', 'Base_Salary', 'Total_OT_Paid']]
)



Question 3

Find the 5 employees with the highest “Regular_Gross_Paid” and calculate their average “OT_Hours”. Also, reset the index if you have changed it previously.

(
  nyc_payroll
  .nlargest(5, 'Regular_Gross_Paid', keep = "all")['OT_Hours']
  .mean()
)



Question 4

Sort the DataFrame by “Fiscal_Year” and “Total_Other_Pay” in descending order, then set “First_Name” as the index and use the loc accessor to retrieve the “Total_Other_Pay” for a specific first name, say “MICHAEL”.

(
    nyc_payroll
    .sort_values(['Fiscal_Year', 'Total_Other_Pay'], ascending=False)
    .set_index('First_Name')
    .loc["MICHAEL"][['Total_Other_Pay']]
)



Question 5

Sort the DataFrame first by “Work_Location_Borough” alphabetically, and then by “Total_Compensation” (sum of “Base_Salary” and “Total_OT_Paid”) in descending order within each borough.

nyc_payroll.columns

nyc_payroll['Total_Compensation'] = nyc_payroll['Base_Salary'] + nyc_payroll['Total_OT_Paid']

(
    nyc_payroll
    .sort_values(['Work_Location_Borough', 'Total_Compensation'], ascending=[True, False])
    [['Work_Location_Borough', 'Total_Compensation',     # This is to relocate variables
      'Fiscal_Year', 'Payroll_Number', 'Agency_Name', 'Last_Name',
       'First_Name', 'Mid_Init', 'Agency_Start_Date',
       'Title_Description', 'Leave_Status_as_of_June_30', 'Base_Salary',
       'Pay_Basis', 'Regular_Hours', 'Regular_Gross_Paid', 'OT_Hours',
       'Total_OT_Paid', 'Total_Other_Pay']]
)



Question 6

  • Select employees who have “OT_Hours” greater than 0, calculate their “OT_Rate” (“Total_OT_Paid” / “OT_Hours”), and then find the employee with the highest “OT_Rate”. Display their full name and “OT_Rate”.
q6 = (
    nyc_payroll
    .query("OT_Hours > 0")
)

q6['OT_Rate'] = q6['Total_OT_Paid'] / q6['OT_Hours']

(
    q6
    .nlargest(1, "OT_Rate", keep = 'all')
    [['First_Name', 'Last_Name', 'OT_Rate']]
)



Question 7

Create a new DataFrame that includes employees from the “DEPARTMENT OF EDUCATION ADMIN” agency where the variables are “First_Name”, “Last_Name”, “Title_Description”, “Base_Salary”, and “Total_OT_Paid”. Additionally, include a new variable “Total_Compensation” which is the sum of “Base_Salary” and “Total_OT_Paid”.

(
    nyc_payroll
    .query("Agency_Name == 'DEPARTMENT OF EDUCATION ADMIN'")
    [['First_Name', 'Last_Name', 'Title_Description', 'Base_Salary', 'Total_OT_Paid', 'Total_Compensation']]
)



Question 8

  • How many employees have a “Base_Salary” within the top 10% of the DataFrame?
top10 = nyc_payroll['Base_Salary'].quantile(.9)

nyc_payroll[ nyc_payroll['Base_Salary'] >= top10 ].shape[0]



Question 9

Filter the DataFrame for employees who have “OT_Hours” greater than 0 but less than 100, and their “Leave_Status_as_of_June_30” is “ACTIVE”.

(
    nyc_payroll
    .query("OT_Hours > 0 & OT_Hours < 100 & Leave_Status_as_of_June_30 == 'ACTIVE'")
)



Question 10

Find the unique job titles in the “DEPARTMENT OF EDUCATION ADMIN” agency and count how many there are.

# 1. value_counts()
(
    nyc_payroll
    .query('Agency_Name == "DEPARTMENT OF EDUCATION ADMIN"')['Title_Description']
    .value_counts()
    .reset_index()
    .shape[0]
)
# 2. nunique()
(
    nyc_payroll
    .query('Agency_Name == "DEPARTMENT OF EDUCATION ADMIN"')['Title_Description']
    .nunique()
)



Question 11

  • Identify the employee(s) with the highest “Total_OT_Paid” in the DataFrame.
    • Include their “First_Name”, “Last_Name”, and “Total_OT_Paid”.
(
    nyc_payroll
    .nlargest(1, 'Total_OT_Paid', keep="all")[['First_Name', 'Last_Name', 'Total_OT_Paid']]
)



Question 12

  • What percentage of the values is missing for each variable?
nyc_payroll.info()

nyc_payroll['Payroll_Number'].isna().sum() / nyc_payroll.shape[0]
for var in nyc_payroll.columns:
  print(var, ':', 100 * nyc_payroll[var].isna().sum() / nyc_payroll.shape[0])



Question 13

  • Fill missing values in the “Last_Name” variable with “UNKNOWN”.
nyc_payroll['Last_Name'] = nyc_payroll['Last_Name'].fillna("UNKNOWN")




Part 2. Spotify Data



Below is spotify DataFrame that reads the file spotify_all.csv containing data of Spotify users’ playlist information (Source: Spotify Million Playlist Dataset Challenge).

spotify = pd.read_csv('https://bcdanl.github.io/data/spotify_all.csv')


Variable Description

  • pid: playlist ID; unique ID for playlist
  • playlist_name: a name of playlist
  • pos: a position of the track within a playlist (starting from 0)
  • artist_name: name of the track’s primary artist
  • track_name: name of the track
  • duration_ms: duration of the track in milliseconds
  • album_name: name of the track’s album

Definition of a Song

  • In Part 2, a song is defined as a combination of a artist_name value and a track_name value.

  • E.g., the following provides the 12 distinct songs with the same track_nameI Love You:


Question 14

  • Write a Python code to identify the top five songs with the highest frequency of appearances in the spotify DataFrame.
songs_top_5 = (
    spotify[['artist_name', 'track_name']]
    .value_counts()
    .reset_index()
    .nlargest(5, 'count', keep="all")
    )



Question 15

  • Write a Python code to create a DataFrame that contains information about how often each song occurs within the spotify DataFrame.
    • In this new DataFrame, each observation should represent a distinct song.
  • Then, write a Python code to identify top 5% songs based on their frequency of appearances.
songs_freq = (
    spotify[['artist_name', 'track_name']]
    .value_counts()
    .reset_index()
    )

cond = songs_freq['count'] >= songs_freq['count'].quantile(.95)
songs_top_5_pct = songs_freq[ cond ]



Question 16

  • Write a Python code to list all artists who have more than 50 unique songs in the spotify DataFrame.
artists_50_songs = (
    spotify[['artist_name', 'track_name']]
    .value_counts()
    .reset_index()['artist_name']
    .value_counts()
    .reset_index()
    .query('count > 50')
    )



Question 17

  • Write a Python code to create a DataFrame that identifies all the playlists featuring the song “One Dance” by Drake.
playlist_onedance = (
    spotify
    .query('artist_name == "Drake" & track_name == "One Dance"')
    )

playlist_songs_w_onedance = (
    spotify[ spotify['pid'].isin(playlist_onedance['pid']) ]
    )



Question 18

  • Write a Python code to identify the longest and shortest duration songs (based on duration_ms) for each unique artist.
# Shortest song per artist
shortest_songs = (
    spotify
    .sort_values(["artist_name", "duration_ms"])
    .drop_duplicates(subset=["artist_name"], keep="first")
    [['artist_name', 'track_name', 'duration_ms']]
    )

# Longest song per artist
longest_songs = (
    spotify
    .sort_values(["artist_name", "duration_ms"])
    .drop_duplicates(subset=["artist_name"], keep="last")
    [['artist_name', 'track_name', 'duration_ms']]
)



Question 19

Write a Python code to find the same song(s) appearing more than once in the same playlist.

# 1. Using drop_duplicates() with index-based filtering.
non_dup = spotify.drop_duplicates(subset=["pid", "artist_name", "track_name"], keep=False)
dup_T = ~spotify.index.isin( non_dup.index )
duplicates = spotify[ dup_T ]
# Using duplicated() with standard filtering.
dup_T = spotify.duplicated(subset=["pid", "artist_name", "track_name"], keep = False)
duplicates = (
    spotify[ dup_T ]
    )



Question 20

  • Write a Python code to filter all songs that appear in more than 100 different playlists.
songs_100_plist = (
    spotify
    .drop_duplicates(subset=["pid", "artist_name", "track_name"], keep=False)[['artist_name', 'track_name']]
    .value_counts()
    .reset_index()
    .query('count > 100')
    )




✅ End of Homework 4

Back to top