Click to Check the Answer!
import pandas as pd
from itables import init_notebook_mode, show # for displaying an interactive DataFrame
Pandas Basics
Byeong-Hak Choe
March 12, 2025
March 12, 2025
Please submit your Jupyter Notebook for Part 1, Part 2, and Part 3 in Homework 2 to Brightspace with the name below:
danl_210_hw2_LASTNAME_FIRSTNAME.ipynb
danl_210_hw2_choe_byeonghak.ipynb
)The due is March 3, 2025, 10:30 A.M.
Please send Prof. Choe an email (bchoe@geneseo.edu) if you have any questions.
Below is nyc_payroll
DataFrame
that reads the file nyc_payroll.csv
containing data of how the New York City’s budget is being spent on salary and overtime pay for all municipal employees (Source: NYC OpenData).
Fiscal_Year
: Fiscal Year;Payroll_Number
: Payroll Number;Agency_Name
: The Payroll agency that the employee works for;Last_Name
: Last name of employee;First_Name
: First name of employee;Mid_Init
: Middle initial of employee;Agency_Start_Date
: Date which employee began working for their current agency;Work_Location_Borough
: Borough of employee’s primary work location;Title_Description
: Civil service title description of the employee;Leave_Status_as_of_June_30
: Status of employee as of the close of the relevant fiscal year;Base_Salary
: Base Salary assigned to the employee;Pay_Basis
: Lists whether the employee is paid on an hourly, per diem or annual basis;Regular_Hours
: Number of regular hours employee worked in the fiscal year;Regular_Gross_Paid
: The amount paid to the employee for base salary during the fiscal year;OT_Hours
: Overtime Hours worked by employee in the fiscal year;Total_OT_Paid
: Total overtime pay paid to the employee in the fiscal year;Total_Other_Pay
: Includes any compensation in addition to gross salary and overtime pay, ie Differentials, lump sums, uniform allowance, meal allowance, retroactive pay increases, settlement amounts, and bonus pay, if applicable. Select “First_Name
”, “Last_Name
”, “Base_Salary
”, and “Total_OT_Paid
”, then sort the DataFrame with these selected variables by “Base_Salary
” in descending order and display the top 10 entries.
Answer
First_Name Last_Name Base_Salary Total_OT_Paid
2090 YVONNE MILEWSKI 221268.0 2820.39
1825 KENNETH GODINER 221000.0 0.00
3964 BRUCE JORDAN 215861.0 0.00
1062 MELINDA KATZ 212800.0 0.00
277 MELANIE HARTZOG 207518.0 0.00
2580 ERIC BOORSTYN 193968.0 0.00
370 JOANNE RUSSELL 191468.0 0.00
2007 PATRICK KELLY 185886.0 0.00
1412 DEIRDRE SNYDER 184250.0 0.00
2087 OLGA MALUF 183201.0 0.00
First_Name Last_Name Base_Salary Total_OT_Paid
2090 YVONNE MILEWSKI 221268.0 2820.39
1825 KENNETH GODINER 221000.0 0.00
3964 BRUCE JORDAN 215861.0 0.00
1062 MELINDA KATZ 212800.0 0.00
277 MELANIE HARTZOG 207518.0 0.00
2580 ERIC BOORSTYN 193968.0 0.00
370 JOANNE RUSSELL 191468.0 0.00
2007 PATRICK KELLY 185886.0 0.00
1412 DEIRDRE SNYDER 184250.0 0.00
2087 OLGA MALUF 183201.0 0.00
Using set_index()
, change the DataFrame’s index to “Last_Name
”, then locate the data for a specific last name, say “BROWN”, and display their “Agency_Name
”, “Base_Salary
”, and “Total_OT_Paid
”.
Answer
Agency_Name Base_Salary Total_OT_Paid
Last_Name
BROWN POLICE DEPARTMENT 51574.00 0.00
BROWN DEPT OF HEALTH/MENTAL HYGIENE 35538.00 0.00
BROWN DEPT OF ED PER SESSION TEACHER 33.18 0.00
BROWN DEPT OF ED PEDAGOGICAL 55670.00 0.00
BROWN DEPT. OF HOMELESS SERVICES 39438.00 0.00
BROWN NYC HOUSING AUTHORITY 40201.00 471.35
BROWN BOARD OF ELECTION POLL WORKERS 1.00 0.00
BROWN DEPT OF HEALTH/MENTAL HYGIENE 111288.00 0.00
BROWN DEPT OF ED HRLY SUPPORT STAFF 12.66 0.00
BROWN DEPARTMENT OF SANITATION 77318.00 10669.03
BROWN DEPT OF ED HRLY SUPPORT STAFF 16.19 0.00
BROWN DEPARTMENT OF EDUCATION ADMIN 68664.00 0.00
BROWN DEPT OF ED PEDAGOGICAL 69977.00 0.00
BROWN DEPT OF ED PARA PROFESSIONALS 26946.00 1.76
BROWN POLICE DEPARTMENT 125531.00 35904.74
BROWN DEPARTMENT OF CORRECTION 92073.00 3253.36
BROWN DEPARTMENT OF EDUCATION ADMIN 65663.00 0.00
Find the 5 employees with the highest “Regular_Gross_Paid
” and calculate their average “OT_Hours
”. Also, reset the index if you have changed it previously.
Answer
5.0
Sort the DataFrame by “Fiscal_Year
” and “Total_Other_Pay
” in descending order, then set “First_Name
” as the index and use the loc
accessor to retrieve the “Total_Other_Pay
” for a specific first name, say “MICHAEL”.
Answer
Total_Other_Pay
First_Name
MICHAEL 27959.35
MICHAEL 17884.91
MICHAEL 9331.02
MICHAEL 2529.38
MICHAEL 475.00
... ...
MICHAEL 0.00
MICHAEL 0.00
MICHAEL 0.00
MICHAEL 0.00
MICHAEL 0.00
[74 rows x 1 columns]
Sort the DataFrame first by “Work_Location_Borough
” alphabetically, and then by “Total_Compensation
” (sum of “Base_Salary
” and “Total_OT_Paid
”) in descending order within each borough.
Answer
Index(['Fiscal_Year', 'Payroll_Number', 'Agency_Name', 'Last_Name',
'First_Name', 'Mid_Init', 'Agency_Start_Date', 'Work_Location_Borough',
'Title_Description', 'Leave_Status_as_of_June_30', 'Base_Salary',
'Pay_Basis', 'Regular_Hours', 'Regular_Gross_Paid', 'OT_Hours',
'Total_OT_Paid', 'Total_Other_Pay'],
dtype='object')
nyc_payroll['Total_Compensation'] = nyc_payroll['Base_Salary'] + nyc_payroll['Total_OT_Paid']
(
nyc_payroll
.sort_values(['Work_Location_Borough', 'Total_Compensation'], ascending=[True, False])
[['Work_Location_Borough', 'Total_Compensation', # This is to relocate variables
'Fiscal_Year', 'Payroll_Number', 'Agency_Name', 'Last_Name',
'First_Name', 'Mid_Init', 'Agency_Start_Date',
'Title_Description', 'Leave_Status_as_of_June_30', 'Base_Salary',
'Pay_Basis', 'Regular_Hours', 'Regular_Gross_Paid', 'OT_Hours',
'Total_OT_Paid', 'Total_Other_Pay']]
)
Work_Location_Borough Total_Compensation ... Total_OT_Paid Total_Other_Pay
3813 BRONX 190591.42 ... 55080.42 27374.03
2068 BRONX 180327.00 ... 0.00 26914.90
1873 BRONX 179200.00 ... 0.00 1000.00
1677 BRONX 177400.00 ... 0.00 0.00
864 BRONX 174928.76 ... 91463.76 21167.33
... ... ... ... ... ...
3881 NaN 1.00 ... 0.00 0.00
4223 NaN 1.00 ... 0.00 0.00
4278 NaN 1.00 ... 0.00 0.00
4891 NaN 1.00 ... 0.00 0.00
4940 NaN 1.00 ... 0.00 0.00
[5000 rows x 18 columns]
OT_Rate
” (“Total_OT_Paid
” / “OT_Hours
”), and then find the employee with the highest “OT_Rate
”. Display their full name and “OT_Rate
”.Answer
First_Name Last_Name OT_Rate
4224 SHARON GOMEZ 155.836
Create a new DataFrame that includes employees from the “DEPARTMENT OF EDUCATION ADMIN” agency where the variables are “First_Name
”, “Last_Name
”, “Title_Description
”, “Base_Salary
”, and “Total_OT_Paid
”. Additionally, include a new variable “Total_Compensation
” which is the sum of “Base_Salary
” and “Total_OT_Paid
”.
Answer
First_Name Last_Name ... Total_OT_Paid Total_Compensation
79 BETH SHIELS ... 0.00 81186.00
318 SWETA GANDHI ... 16722.97 64094.97
340 CLARE PURCELL ... 0.00 45384.00
401 NYESHA EVANS ... 19.79 67702.79
443 SAMANTHA ELIAS ... 0.00 76385.00
... ... ... ... ... ...
4798 RENEI JOHNSON ... 48.14 29366.14
4835 MICHELLE TIFLINSKY ... 9.72 64049.72
4870 REBEKHA ASKEW ... 0.00 108841.00
4903 NICOLA SINCLAIR ... 37.48 81223.48
4942 ELLEN MORELLO ... 0.00 67888.00
[135 rows x 6 columns]
Base_Salary
” within the top 10% of the DataFrame?Answer
104178.40000000002
500
Filter the DataFrame for employees who have “OT_Hours
” greater than 0 but less than 100, and their “Leave_Status_as_of_June_30
” is “ACTIVE”.
Answer
Fiscal_Year Payroll_Number ... Total_Other_Pay Total_Compensation
6 2014 841.0 ... 3819.73 57082.37
39 2015 NaN ... 20076.23 118450.47
71 2020 906.0 ... 7993.00 68302.10
91 2019 806.0 ... 135.71 60413.66
104 2020 806.0 ... 0.00 38500.85
... ... ... ... ... ...
4903 2022 740.0 ... 14172.85 81223.48
4911 2021 56.0 ... 19385.83 89314.93
4949 2020 827.0 ... 3729.19 79611.47
4979 2015 NaN ... 25296.48 118245.95
4992 2018 69.0 ... 2036.87 45600.59
[339 rows x 18 columns]
Find the unique job titles in the “DEPARTMENT OF EDUCATION ADMIN” agency and count how many there are.
Answer
37
37
Total_OT_Paid
” in the DataFrame.
First_Name
”, “Last_Name
”, and “Total_OT_Paid
”.Answer
First_Name Last_Name
1228 DANIEL HOCK, JR.
Answer
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Fiscal_Year 5000 non-null int64
1 Payroll_Number 3460 non-null float64
2 Agency_Name 5000 non-null object
3 Last_Name 4985 non-null object
4 First_Name 4985 non-null object
5 Mid_Init 2920 non-null object
6 Agency_Start_Date 5000 non-null object
7 Work_Location_Borough 4602 non-null object
8 Title_Description 5000 non-null object
9 Leave_Status_as_of_June_30 5000 non-null object
10 Base_Salary 5000 non-null float64
11 Pay_Basis 5000 non-null object
12 Regular_Hours 5000 non-null float64
13 Regular_Gross_Paid 5000 non-null float64
14 OT_Hours 5000 non-null float64
15 Total_OT_Paid 5000 non-null float64
16 Total_Other_Pay 5000 non-null float64
17 Total_Compensation 5000 non-null float64
dtypes: float64(8), int64(1), object(9)
memory usage: 703.3+ KB
Fiscal_Year : 0.0
Payroll_Number : 30.8
Agency_Name : 0.0
Last_Name : 0.3
First_Name : 0.3
Mid_Init : 41.6
Agency_Start_Date : 0.0
Work_Location_Borough : 7.96
Title_Description : 0.0
Leave_Status_as_of_June_30 : 0.0
Base_Salary : 0.0
Pay_Basis : 0.0
Regular_Hours : 0.0
Regular_Gross_Paid : 0.0
OT_Hours : 0.0
Total_OT_Paid : 0.0
Total_Other_Pay : 0.0
Total_Compensation : 0.0
Last_Name
” variable with “UNKNOWN
”.Answer
NFL2022_stuffs
is the DataFrame that contains information about NFL games in year 2022, in which the unit of observation is a single play for each drive in a NFL game.play_id
: Numeric play identifier that when used with game_id
and drive
provides the unique identifier for a single playgame_id
: Ten digit identifier for NFL game.drive
: Numeric drive number in the game.week
: Season week.posteam
: String abbreviation for the team with possession.qtr
: Quarter of the game (5 is overtime).half_seconds_remaining
: Numeric seconds remaining in the half.down
: The down for the given play.
pass
: Binary indicator if the play was a pass play.wp
: Estimated winning probability for the posteam
given the current situation at the start of the given play.In DataFrame, NFL2022_stuffs
, remove observations for which the value of posteam
is missing.
Answer:
50147
46427
pass
for the BUF posteam
when all the following conditions hold:
wp
is greater than 20% and less than 75%;down
is less than or equal to 2; andhalf_seconds_remaining
is greater than 120.Answer:
0.6043956043956044
NFL2022_epa
:NFL2022_epa
play_id
: Numeric play identifier that when used with game_id
and drive
provides the unique identifier for a single playgame_id
: Ten digit identifier for NFL game.drive
: Numeric drive number in the game.posteam
: String abbreviation for the team with possession.passer
: Name of the player who passed a ball to a receiver by initially taking a three-step drop and backpedaling into the pocket to make a pass. (Mostly, they are quarterbacks)receiver
: Name of the receiver.epa
: Expected points added (EPA) by the posteam
for the given play.NFL2022_stuffs_EPA
, that includes
NFL2022_stuffs
;passer
, receiver
, and epa
, from the DataFrame, NFL2022_epa
by joining the two DataFrames.NFL2022_stuffs_EPA
, please remove observations with NaN
in passer
after the join.Answer:
Index(['play_id', 'game_id', 'drive', 'week', 'posteam', 'qtr', 'down',
'half_seconds_remaining', 'pass', 'wp'],
dtype='object')
Index(['play_id', 'game_id', 'drive', 'posteam', 'receiver', 'passer', 'epa'], dtype='object')
play_id game_id drive ... receiver passer epa
2 89 2022_01_BAL_NYJ 1.0 ... Mi.Carter J.Flacco -0.492192
4 136 2022_01_BAL_NYJ 1.0 ... NaN J.Flacco -2.402200
6 202 2022_01_BAL_NYJ 2.0 ... R.Bateman L.Jackson 0.075127
7 230 2022_01_BAL_NYJ 2.0 ... D.Duvernay L.Jackson -0.105120
10 301 2022_01_BAL_NYJ 2.0 ... D.Duvernay L.Jackson 0.411132
... ... ... ... ... ... ... ...
46415 3789 2022_22_KC_PHI 17.0 ... T.Kelce P.Mahomes 0.710100
46417 3834 2022_22_KC_PHI 17.0 ... NaN P.Mahomes 1.913129
46419 3876 2022_22_KC_PHI 17.0 ... J.Smith-Schuster P.Mahomes -0.903265
46420 3907 2022_22_KC_PHI 17.0 ... J.Smith-Schuster P.Mahomes 1.403649
46426 4072 2022_22_KC_PHI 18.0 ... NaN J.Hurts -0.070803
[22457 rows x 13 columns]
Mr. Trash Wheel is a semi-autonomous trash interceptor that is placed at the end of a river, stream or other outfall.
Far too lazy to chase trash around the ocean, Mr. Trash Wheel stays put and waits for the waste to flow to him.
Sustainably powered and built to withstand the biggest storms, Mr. Trash Wheel uses a unique blend of solar and hydro power to pull hundreds of tons of trash out of the water each year.
See more how Mr. Trash Wheel works.
variable | type | description |
---|---|---|
Name |
string | Name of the Trash Wheel |
Dumpster |
integer | Dumpster number over time; The Trash Wheel can have multiple dumpsters in a day |
Month |
string | Month |
Year |
integer | Year |
Date |
string | Date (Daily) |
Weight |
float | Weight in tons |
Volume |
integer | Volume in cubic yards |
PlasticBottles |
float | Number of plastic bottles |
Polystyrene |
float | Number of polystyrene items |
CigaretteButts |
float | Number of cigarette butts |
GlassBottles |
float | Number of glass bottles |
PlasticBags |
float | Number of plastic bags |
Wrappers |
float | Number of wrappers |
SportsBalls |
float | Number of sports balls |
HomesPowered |
integer | Homes Powered - Each ton of trash equates to on average 500 kilowatts of electricity. An average household will use 30 kilowatts per day. |
trashwheel
DataFrame into a DataFrame called trashwheel_long
that includes variables for “Name
”, “Date
”, “Dumpster
”, “Trash_Type
”, and “Number
”.
Trash_Type
” variable should indicate the type of trash from the original DataFrame, and “Number
” should contain the corresponding values.trashwheel_long
by “Name
”, “Date
”, “Dumpster
”, and “Trash_Type
” in ascending order.trashwheel_long
DataFrame:Answer:
Index(['ID', 'Name', 'Dumpster', 'Month', 'Year', 'Date', 'Weight', 'Volume',
'PlasticBottles', 'Polystyrene', 'CigaretteButts', 'GlassBottles',
'PlasticBags', 'Wrappers', 'SportsBalls', 'HomesPowered'],
dtype='object')
trashwheel['Date'] = trashwheel['Date'].astype('datetime64[ns]')
trashwheel_long = (
trashwheel[['Name', 'Date', 'Dumpster',
'PlasticBottles', 'Polystyrene', 'CigaretteButts', 'GlassBottles',
'PlasticBags', 'Wrappers', 'SportsBalls']]
.melt(
id_vars=['Name', 'Date', 'Dumpster'],
var_name= 'Trash_Type',
value_name= 'Number'
)
.sort_values(['Name', 'Date', 'Dumpster', 'Trash_Type'])
)
trashwheel_long
Name Date Dumpster Trash_Type Number
2729 Captain Trash Wheel 2018-06-30 1 CigaretteButts 8200.0
3722 Captain Trash Wheel 2018-06-30 1 GlassBottles NaN
4715 Captain Trash Wheel 2018-06-30 1 PlasticBags 1420.0
743 Captain Trash Wheel 2018-06-30 1 PlasticBottles 1150.0
1736 Captain Trash Wheel 2018-06-30 1 Polystyrene 940.0
... ... ... ... ... ...
4714 Professor Trash Wheel 2023-12-05 114 PlasticBags 360.0
742 Professor Trash Wheel 2023-12-05 114 PlasticBottles 4000.0
1735 Professor Trash Wheel 2023-12-05 114 Polystyrene 380.0
6700 Professor Trash Wheel 2023-12-05 114 SportsBalls NaN
5707 Professor Trash Wheel 2023-12-05 114 Wrappers 3900.0
[6951 rows x 5 columns]
Below is spotify
DataFrame
that reads the file spotify_all.csv
containing data of Spotify users’ playlist information (Source: Spotify Million Playlist Dataset Challenge).
pid
: playlist ID; unique ID for playlistplaylist_name
: a name of playlistpos
: a position of the track within a playlist (starting from 0)artist_name
: name of the track’s primary artisttrack_name
: name of the trackduration_ms
: duration of the track in millisecondsalbum_name
: name of the track’s albumspotify
DataFrame using Jupyter Notebook, and add it to your online blog.