Homework 2

Pandas Basics

Author

Byeong-Hak Choe

Published

February 24, 2025

Modified

February 24, 2025

Direction

  • Please submit your Jupyter Notebook for Part 1, Part 2, and Part 3 in Homework 2 to Brightspace with the name below:

    • danl_210_hw2_LASTNAME_FIRSTNAME.ipynb
      ( e.g., danl_210_hw2_choe_byeonghak.ipynb )
  • The due is March 3, 2025, 10:30 A.M.

  • Please send Byeong-Hak an email (bchoe@geneseo.edu) if you have any questions.







Part 1. NYC Payroll



Below is nyc_payroll DataFrame that reads the file nyc_payroll.csv containing data of how the New York City’s budget is being spent on salary and overtime pay for all municipal employees (Source: NYC OpenData).

nyc_payroll = pd.read_csv('https://bcdanl.github.io/data/nyc_payroll_2024.csv')

Variable Description

  • Fiscal_Year: Fiscal Year;
  • Payroll_Number: Payroll Number;
  • Agency_Name: The Payroll agency that the employee works for;
  • Last_Name: Last name of employee;
  • First_Name: First name of employee;
  • Mid_Init: Middle initial of employee;
  • Agency_Start_Date: Date which employee began working for their current agency;
  • Work_Location_Borough: Borough of employee’s primary work location;
  • Title_Description: Civil service title description of the employee;
  • Leave_Status_as_of_June_30: Status of employee as of the close of the relevant fiscal year;
  • Base_Salary: Base Salary assigned to the employee;
  • Pay_Basis: Lists whether the employee is paid on an hourly, per diem or annual basis;
  • Regular_Hours: Number of regular hours employee worked in the fiscal year;
  • Regular_Gross_Paid: The amount paid to the employee for base salary during the fiscal year;
  • OT_Hours: Overtime Hours worked by employee in the fiscal year;
  • Total_OT_Paid: Total overtime pay paid to the employee in the fiscal year;
  • Total_Other_Pay: Includes any compensation in addition to gross salary and overtime pay, ie Differentials, lump sums, uniform allowance, meal allowance, retroactive pay increases, settlement amounts, and bonus pay, if applicable.

Question 1

Select “First_Name”, “Last_Name”, “Base_Salary”, and “Total_OT_Paid”, then sort the DataFrame with these selected variables by “Base_Salary” in descending order and display the top 10 entries.

Answer



Question 2

Using set_index(), change the DataFrame’s index to “Last_Name”, then locate the data for a specific last name, say “BROWN”, and display their “Agency_Name”, “Base_Salary”, and “Total_OT_Paid”.

Answer



Question 3

Find the 5 employees with the highest “Regular_Gross_Paid” and calculate their average “OT_Hours”. Also, reset the index if you have changed it previously.

Answer



Question 4

Sort the DataFrame by “Fiscal_Year” and “Total_Other_Pay” in descending order, then set “First_Name” as the index and use the loc accessor to retrieve the “Total_Other_Pay” for a specific first name, say “MICHAEL”.

Answer



Question 5

Sort the DataFrame first by “Work_Location_Borough” alphabetically, and then by “Total_Compensation” (sum of “Base_Salary” and “Total_OT_Paid”) in descending order within each borough.

Answer



Question 6

  • Select employees who have “OT_Hours” greater than 0, calculate their “OT_Rate” (“Total_OT_Paid” / “OT_Hours”), and then find the employee with the highest “OT_Rate”. Display their full name and “OT_Rate”.

Answer



Question 7

Create a new DataFrame that includes employees from the “DEPARTMENT OF EDUCATION ADMIN” agency where the variables are “First_Name”, “Last_Name”, “Title_Description”, “Base_Salary”, and “Total_OT_Paid”. Additionally, include a new variable “Total_Compensation” which is the sum of “Base_Salary” and “Total_OT_Paid”.

Answer



Question 8

  • How many employees have a “Base_Salary” within the top 10% of the DataFrame?

Answer





Question 9

Filter the DataFrame for employees who have “OT_Hours” greater than 0 but less than 100, and their “Leave_Status_as_of_June_30” is “ACTIVE”.

Answer



Question 10

Find the unique job titles in the “DEPARTMENT OF EDUCATION ADMIN” agency and count how many there are.

Answer



Question 11

  • Identify the employee(s) with the highest “Total_OT_Paid” in the DataFrame.
    • Include their “First_Name”, “Last_Name”, and “Total_OT_Paid”.

Answer



Question 12

  • What percentage of the values is missing for each variable?

Answer



Question 13

  • Fill missing values in the “Last_Name” variable with “UNKNOWN”.

Answer




Part 2. NFL



  • The following is the DataFrame for Part 2.
#| echo: true
NFL2022_stuffs = pd.read_csv('https://bcdanl.github.io/data/NFL2022_stuffs.csv')
  • NFL2022_stuffs is the DataFrame that contains information about NFL games in year 2022, in which the unit of observation is a single play for each drive in a NFL game.

Variable description

  • play_id: Numeric play identifier that when used with game_id and drive provides the unique identifier for a single play
  • game_id: Ten digit identifier for NFL game.
  • drive: Numeric drive number in the game.
  • week: Season week.
  • posteam: String abbreviation for the team with possession.
  • qtr: Quarter of the game (5 is overtime).
  • half_seconds_remaining: Numeric seconds remaining in the half.
  • down: The down for the given play.
    • Basically you get four attempts (aka downs) to move the ball 10 yards (by either running with it or passing it).
    • If you make 10 yards then you get another set of four downs.
  • pass: Binary indicator if the play was a pass play.
  • wp: Estimated winning probability for the posteam given the current situation at the start of the given play.


Question 14

In DataFrame, NFL2022_stuffs, remove observations for which the value of posteam is missing.

Answer:


Question 15

  • Calculate the mean value of pass for the BUF posteam when all the following conditions hold:
    1. wp is greater than 20% and less than 75%;
    2. down is less than or equal to 2; and
    3. half_seconds_remaining is greater than 120.

Answer:


Question 16

  • Consider the following DataFrame, NFL2022_epa:
NFL2022_epa = pd.read_csv('https://bcdanl.github.io/data/NFL2022_epa.csv')

Variable Description for NFL2022_epa

  • play_id: Numeric play identifier that when used with game_id and drive provides the unique identifier for a single play
  • game_id: Ten digit identifier for NFL game.
  • drive: Numeric drive number in the game.
  • posteam: String abbreviation for the team with possession.
  • passer: Name of the player who passed a ball to a receiver by initially taking a three-step drop and backpedaling into the pocket to make a pass. (Mostly, they are quarterbacks)
  • receiver: Name of the receiver.
  • epa: Expected points added (EPA) by the posteam for the given play.


  • Create the following DataFrame, NFL2022_stuffs_EPA, that includes
    1. All the variables in the DataFrame, NFL2022_stuffs;
    2. The variables, passer, receiver, and epa, from the DataFrame, NFL2022_epa by joining the two DataFrames.
  • In the resulting DataFrame, NFL2022_stuffs_EPA, remove observations with NA in passer after the join.

Answer:




Part 3. Mr. Trash Wheel



Mr. Trash Wheel is a semi-autonomous trash interceptor that is placed at the end of a river, stream or other outfall.

Far too lazy to chase trash around the ocean, Mr. Trash Wheel stays put and waits for the waste to flow to him.

Sustainably powered and built to withstand the biggest storms, Mr. Trash Wheel uses a unique blend of solar and hydro power to pull hundreds of tons of trash out of the water each year.

See more how Mr. Trash Wheel works.

  • The following is the DataFrame for Part 3.
trashwheel = pd.read_csv('https://bcdanl.github.io/data/trashwheel.csv')

Variable Description

variable type description
Name string Name of the Trash Wheel
Month string Month
Year numeric Year
Date string Date (Daily)
Weight numeric Weight in tons
Volume numeric Volume in cubic yards
PlasticBottles numeric Number of plastic bottles
Polystyrene numeric Number of polystyrene items
CigaretteButts numeric Number of cigarette butts
GlassBottles numeric Number of glass bottles
PlasticBags numeric Number of plastic bags
Wrappers numeric Number of wrappers
SportsBalls numeric Number of sports balls
HomesPowered numeric Homes Powered - Each ton of trash equates to on average 500 kilowatts of electricity. An average household will use 30 kilowatts per day.

Meet the Mr. Trash Wheel Family


Mister Trash Wheel
  • Installed: May 9, 2014
  • Location: Jones Falls stream, Inner Harbor, Baltimore, MD

Professor Trash Wheel
  • Installed: December 4, 2016
  • Location: Harris Creek, Canton neighborhood, Baltimore, MD

Captain Trash Wheel


  • Installed: June 5, 2018
  • Location: Masonville Cove, Baltimore, MD

Gwynnda Trash Wheel


  • Installed: June 3, 2021
  • Location: Gwynns Falls, West Baltimore, MD



Question 17

  • Reshape the trashwheel DataFrame into a DataFrame called trashwheel_long that includes variables for “Name”, “Date”, “Trash_Type”, and “Number”.
    • The “Trash_Type” variable should indicate the type of trash from the original DataFrame, and “Number” should contain the corresponding values.
    • Finally, sort trashwheel_long by “Name” and “Date” in ascending order.
    • The following displays the trashwheel_long DataFrame:




Part 4. Jupyter Notebook Blogging



Below is spotify DataFrame that reads the file spotify_all.csv containing data of Spotify users’ playlist information (Source: Spotify Million Playlist Dataset Challenge).

spotify = pd.read_csv('https://bcdanl.github.io/data/spotify_all.csv')
rmarkdown::paged_table(readr::read_csv('https://bcdanl.github.io/data/spotify_all.csv'),
                       options = list(rows.print = 25))


Variable Description

  • pid: playlist ID; unique ID for playlist
  • playlist_name: a name of playlist
  • pos: a position of the track within a playlist (starting from 0)
  • artist_name: name of the track’s primary artist
  • track_name: name of the track
  • duration_ms: duration of the track in milliseconds
  • album_name: name of the track’s album


  • Write a blog post about your favorite artist(s) in the spotify DataFrame using Jupyter Notebook, and add it to your online blog.
    • In your blog post, utilize counting, sorting, indexing, and filtering methods.


Back to top