Homework 2

Pandas Basics

Author

Byeong-Hak Choe

Published

March 12, 2025

Modified

March 12, 2025

Click to Check the Answer!
import pandas as pd
from itables import init_notebook_mode, show # for displaying an interactive DataFrame

Direction

  • Please submit your Jupyter Notebook for Part 1, Part 2, and Part 3 in Homework 2 to Brightspace with the name below:

    • danl_210_hw2_LASTNAME_FIRSTNAME.ipynb
      ( e.g., danl_210_hw2_choe_byeonghak.ipynb )
  • The due is March 3, 2025, 10:30 A.M.

  • Please send Prof. Choe an email (bchoe@geneseo.edu) if you have any questions.







Part 1. NYC Payroll



Below is nyc_payroll DataFrame that reads the file nyc_payroll.csv containing data of how the New York City’s budget is being spent on salary and overtime pay for all municipal employees (Source: NYC OpenData).

nyc_payroll = pd.read_csv('https://bcdanl.github.io/data/nyc_payroll_2024.csv')

Variable Description

  • Fiscal_Year: Fiscal Year;
  • Payroll_Number: Payroll Number;
  • Agency_Name: The Payroll agency that the employee works for;
  • Last_Name: Last name of employee;
  • First_Name: First name of employee;
  • Mid_Init: Middle initial of employee;
  • Agency_Start_Date: Date which employee began working for their current agency;
  • Work_Location_Borough: Borough of employee’s primary work location;
  • Title_Description: Civil service title description of the employee;
  • Leave_Status_as_of_June_30: Status of employee as of the close of the relevant fiscal year;
  • Base_Salary: Base Salary assigned to the employee;
  • Pay_Basis: Lists whether the employee is paid on an hourly, per diem or annual basis;
  • Regular_Hours: Number of regular hours employee worked in the fiscal year;
  • Regular_Gross_Paid: The amount paid to the employee for base salary during the fiscal year;
  • OT_Hours: Overtime Hours worked by employee in the fiscal year;
  • Total_OT_Paid: Total overtime pay paid to the employee in the fiscal year;
  • Total_Other_Pay: Includes any compensation in addition to gross salary and overtime pay, ie Differentials, lump sums, uniform allowance, meal allowance, retroactive pay increases, settlement amounts, and bonus pay, if applicable.

Question 1

Select “First_Name”, “Last_Name”, “Base_Salary”, and “Total_OT_Paid”, then sort the DataFrame with these selected variables by “Base_Salary” in descending order and display the top 10 entries.

Answer

Click to Check the Answer!
# 1. .sort_values() with head()
(
  nyc_payroll[['First_Name', 'Last_Name', 'Base_Salary', 'Total_OT_Paid']]
  .sort_values('Base_Salary', ascending = False)
  .head(10)
)
     First_Name Last_Name  Base_Salary  Total_OT_Paid
2090     YVONNE  MILEWSKI     221268.0        2820.39
1825    KENNETH   GODINER     221000.0           0.00
3964      BRUCE    JORDAN     215861.0           0.00
1062    MELINDA      KATZ     212800.0           0.00
277     MELANIE   HARTZOG     207518.0           0.00
2580       ERIC  BOORSTYN     193968.0           0.00
370      JOANNE   RUSSELL     191468.0           0.00
2007    PATRICK     KELLY     185886.0           0.00
1412    DEIRDRE    SNYDER     184250.0           0.00
2087       OLGA     MALUF     183201.0           0.00
Click to Check the Answer!
# 2. nlargest()
(
  nyc_payroll[['First_Name', 'Last_Name', 'Base_Salary', 'Total_OT_Paid']]
  .nlargest(10, 'Base_Salary', keep = 'all')
)
     First_Name Last_Name  Base_Salary  Total_OT_Paid
2090     YVONNE  MILEWSKI     221268.0        2820.39
1825    KENNETH   GODINER     221000.0           0.00
3964      BRUCE    JORDAN     215861.0           0.00
1062    MELINDA      KATZ     212800.0           0.00
277     MELANIE   HARTZOG     207518.0           0.00
2580       ERIC  BOORSTYN     193968.0           0.00
370      JOANNE   RUSSELL     191468.0           0.00
2007    PATRICK     KELLY     185886.0           0.00
1412    DEIRDRE    SNYDER     184250.0           0.00
2087       OLGA     MALUF     183201.0           0.00



Question 2

Using set_index(), change the DataFrame’s index to “Last_Name”, then locate the data for a specific last name, say “BROWN”, and display their “Agency_Name”, “Base_Salary”, and “Total_OT_Paid”.

Answer

Click to Check the Answer!
(
  nyc_payroll
  .set_index('Last_Name')
  .loc["BROWN"][['Agency_Name', 'Base_Salary', 'Total_OT_Paid']]
)
                              Agency_Name  Base_Salary  Total_OT_Paid
Last_Name                                                            
BROWN                   POLICE DEPARTMENT     51574.00           0.00
BROWN       DEPT OF HEALTH/MENTAL HYGIENE     35538.00           0.00
BROWN      DEPT OF ED PER SESSION TEACHER        33.18           0.00
BROWN              DEPT OF ED PEDAGOGICAL     55670.00           0.00
BROWN          DEPT. OF HOMELESS SERVICES     39438.00           0.00
BROWN               NYC HOUSING AUTHORITY     40201.00         471.35
BROWN      BOARD OF ELECTION POLL WORKERS         1.00           0.00
BROWN       DEPT OF HEALTH/MENTAL HYGIENE    111288.00           0.00
BROWN       DEPT OF ED HRLY SUPPORT STAFF        12.66           0.00
BROWN            DEPARTMENT OF SANITATION     77318.00       10669.03
BROWN       DEPT OF ED HRLY SUPPORT STAFF        16.19           0.00
BROWN       DEPARTMENT OF EDUCATION ADMIN     68664.00           0.00
BROWN              DEPT OF ED PEDAGOGICAL     69977.00           0.00
BROWN       DEPT OF ED PARA PROFESSIONALS     26946.00           1.76
BROWN                   POLICE DEPARTMENT    125531.00       35904.74
BROWN            DEPARTMENT OF CORRECTION     92073.00        3253.36
BROWN       DEPARTMENT OF EDUCATION ADMIN     65663.00           0.00



Question 3

Find the 5 employees with the highest “Regular_Gross_Paid” and calculate their average “OT_Hours”. Also, reset the index if you have changed it previously.

Answer

Click to Check the Answer!
(
  nyc_payroll
  .nlargest(5, 'Regular_Gross_Paid', keep = "all")['OT_Hours']
  .mean()
)
5.0



Question 4

Sort the DataFrame by “Fiscal_Year” and “Total_Other_Pay” in descending order, then set “First_Name” as the index and use the loc accessor to retrieve the “Total_Other_Pay” for a specific first name, say “MICHAEL”.

Answer

Click to Check the Answer!
(
    nyc_payroll
    .sort_values(['Fiscal_Year', 'Total_Other_Pay'], ascending=False)
    .set_index('First_Name')
    .loc["MICHAEL"][['Total_Other_Pay']]
)
            Total_Other_Pay
First_Name                 
MICHAEL            27959.35
MICHAEL            17884.91
MICHAEL             9331.02
MICHAEL             2529.38
MICHAEL              475.00
...                     ...
MICHAEL                0.00
MICHAEL                0.00
MICHAEL                0.00
MICHAEL                0.00
MICHAEL                0.00

[74 rows x 1 columns]



Question 5

Sort the DataFrame first by “Work_Location_Borough” alphabetically, and then by “Total_Compensation” (sum of “Base_Salary” and “Total_OT_Paid”) in descending order within each borough.

Answer

nyc_payroll.columns
Index(['Fiscal_Year', 'Payroll_Number', 'Agency_Name', 'Last_Name',
       'First_Name', 'Mid_Init', 'Agency_Start_Date', 'Work_Location_Borough',
       'Title_Description', 'Leave_Status_as_of_June_30', 'Base_Salary',
       'Pay_Basis', 'Regular_Hours', 'Regular_Gross_Paid', 'OT_Hours',
       'Total_OT_Paid', 'Total_Other_Pay'],
      dtype='object')
Click to Check the Answer!
nyc_payroll['Total_Compensation'] = nyc_payroll['Base_Salary'] + nyc_payroll['Total_OT_Paid']

(
    nyc_payroll
    .sort_values(['Work_Location_Borough', 'Total_Compensation'], ascending=[True, False])
    [['Work_Location_Borough', 'Total_Compensation',     # This is to relocate variables
      'Fiscal_Year', 'Payroll_Number', 'Agency_Name', 'Last_Name',
       'First_Name', 'Mid_Init', 'Agency_Start_Date',
       'Title_Description', 'Leave_Status_as_of_June_30', 'Base_Salary',
       'Pay_Basis', 'Regular_Hours', 'Regular_Gross_Paid', 'OT_Hours',
       'Total_OT_Paid', 'Total_Other_Pay']]
)
     Work_Location_Borough  Total_Compensation  ...  Total_OT_Paid  Total_Other_Pay
3813                 BRONX           190591.42  ...       55080.42         27374.03
2068                 BRONX           180327.00  ...           0.00         26914.90
1873                 BRONX           179200.00  ...           0.00          1000.00
1677                 BRONX           177400.00  ...           0.00             0.00
864                  BRONX           174928.76  ...       91463.76         21167.33
...                    ...                 ...  ...            ...              ...
3881                   NaN                1.00  ...           0.00             0.00
4223                   NaN                1.00  ...           0.00             0.00
4278                   NaN                1.00  ...           0.00             0.00
4891                   NaN                1.00  ...           0.00             0.00
4940                   NaN                1.00  ...           0.00             0.00

[5000 rows x 18 columns]



Question 6

  • Select employees who have “OT_Hours” greater than 0, calculate their “OT_Rate” (“Total_OT_Paid” / “OT_Hours”), and then find the employee with the highest “OT_Rate”. Display their full name and “OT_Rate”.

Answer

Click to Check the Answer!
q6 = (
    nyc_payroll
    .query("OT_Hours > 0")
)

q6['OT_Rate'] = q6['Total_OT_Paid'] / q6['OT_Hours']

(
    q6
    .nlargest(1, "OT_Rate", keep = 'all')
    [['First_Name', 'Last_Name', 'OT_Rate']]
)
     First_Name Last_Name  OT_Rate
4224     SHARON     GOMEZ  155.836



Question 7

Create a new DataFrame that includes employees from the “DEPARTMENT OF EDUCATION ADMIN” agency where the variables are “First_Name”, “Last_Name”, “Title_Description”, “Base_Salary”, and “Total_OT_Paid”. Additionally, include a new variable “Total_Compensation” which is the sum of “Base_Salary” and “Total_OT_Paid”.

Answer

Click to Check the Answer!
(
    nyc_payroll
    .query("Agency_Name == 'DEPARTMENT OF EDUCATION ADMIN'")
    [['First_Name', 'Last_Name', 'Title_Description', 'Base_Salary', 'Total_OT_Paid', 'Total_Compensation']]
)
     First_Name  Last_Name  ... Total_OT_Paid  Total_Compensation
79         BETH     SHIELS  ...          0.00            81186.00
318       SWETA     GANDHI  ...      16722.97            64094.97
340       CLARE    PURCELL  ...          0.00            45384.00
401      NYESHA      EVANS  ...         19.79            67702.79
443    SAMANTHA      ELIAS  ...          0.00            76385.00
...         ...        ...  ...           ...                 ...
4798      RENEI    JOHNSON  ...         48.14            29366.14
4835   MICHELLE  TIFLINSKY  ...          9.72            64049.72
4870    REBEKHA      ASKEW  ...          0.00           108841.00
4903     NICOLA   SINCLAIR  ...         37.48            81223.48
4942      ELLEN    MORELLO  ...          0.00            67888.00

[135 rows x 6 columns]



Question 8

  • How many employees have a “Base_Salary” within the top 10% of the DataFrame?

Answer

Click to Check the Answer!
top10 = nyc_payroll['Base_Salary'].quantile(.9)
top10
104178.40000000002
Click to Check the Answer!
nyc_payroll[ nyc_payroll['Base_Salary'] >= top10 ].shape[0]
500



Question 9

Filter the DataFrame for employees who have “OT_Hours” greater than 0 but less than 100, and their “Leave_Status_as_of_June_30” is “ACTIVE”.

Answer

Click to Check the Answer!
(
    nyc_payroll
    .query("OT_Hours > 0 & OT_Hours < 100 & Leave_Status_as_of_June_30 == 'ACTIVE'")
)
      Fiscal_Year  Payroll_Number  ... Total_Other_Pay Total_Compensation
6            2014           841.0  ...         3819.73           57082.37
39           2015             NaN  ...        20076.23          118450.47
71           2020           906.0  ...         7993.00           68302.10
91           2019           806.0  ...          135.71           60413.66
104          2020           806.0  ...            0.00           38500.85
...           ...             ...  ...             ...                ...
4903         2022           740.0  ...        14172.85           81223.48
4911         2021            56.0  ...        19385.83           89314.93
4949         2020           827.0  ...         3729.19           79611.47
4979         2015             NaN  ...        25296.48          118245.95
4992         2018            69.0  ...         2036.87           45600.59

[339 rows x 18 columns]



Question 10

Find the unique job titles in the “DEPARTMENT OF EDUCATION ADMIN” agency and count how many there are.

Answer

Click to Check the Answer!
# 1. value_counts()
(
    nyc_payroll
    .query('Agency_Name == "DEPARTMENT OF EDUCATION ADMIN"')['Title_Description']
    .value_counts()
    .reset_index()
    .shape[0]
)
37
Click to Check the Answer!
# 2. nunique()
(
    nyc_payroll
    .query('Agency_Name == "DEPARTMENT OF EDUCATION ADMIN"')['Title_Description']
    .nunique()
)
37



Question 11

  • Identify the employee(s) with the highest “Total_OT_Paid” in the DataFrame.
    • Include their “First_Name”, “Last_Name”, and “Total_OT_Paid”.

Answer

Click to Check the Answer!
(
    nyc_payroll
    .nlargest(1, 'Total_OT_Paid', keep="all")[['First_Name', 'Last_Name']]
)
     First_Name  Last_Name
1228     DANIEL  HOCK, JR.



Question 12

  • What percentage of the values is missing for each variable?

Answer

Click to Check the Answer!
nyc_payroll.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 18 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Fiscal_Year                 5000 non-null   int64  
 1   Payroll_Number              3460 non-null   float64
 2   Agency_Name                 5000 non-null   object 
 3   Last_Name                   4985 non-null   object 
 4   First_Name                  4985 non-null   object 
 5   Mid_Init                    2920 non-null   object 
 6   Agency_Start_Date           5000 non-null   object 
 7   Work_Location_Borough       4602 non-null   object 
 8   Title_Description           5000 non-null   object 
 9   Leave_Status_as_of_June_30  5000 non-null   object 
 10  Base_Salary                 5000 non-null   float64
 11  Pay_Basis                   5000 non-null   object 
 12  Regular_Hours               5000 non-null   float64
 13  Regular_Gross_Paid          5000 non-null   float64
 14  OT_Hours                    5000 non-null   float64
 15  Total_OT_Paid               5000 non-null   float64
 16  Total_Other_Pay             5000 non-null   float64
 17  Total_Compensation          5000 non-null   float64
dtypes: float64(8), int64(1), object(9)
memory usage: 703.3+ KB
Click to Check the Answer!
nyc_payroll['Payroll_Number'].isna().sum() / nyc_payroll.shape[0]
0.308
Click to Check the Answer!
for var in nyc_payroll.columns:
  print(var, ':', 100 * nyc_payroll[var].isna().sum() / nyc_payroll.shape[0])
Fiscal_Year : 0.0
Payroll_Number : 30.8
Agency_Name : 0.0
Last_Name : 0.3
First_Name : 0.3
Mid_Init : 41.6
Agency_Start_Date : 0.0
Work_Location_Borough : 7.96
Title_Description : 0.0
Leave_Status_as_of_June_30 : 0.0
Base_Salary : 0.0
Pay_Basis : 0.0
Regular_Hours : 0.0
Regular_Gross_Paid : 0.0
OT_Hours : 0.0
Total_OT_Paid : 0.0
Total_Other_Pay : 0.0
Total_Compensation : 0.0



Question 13

  • Fill missing values in the “Last_Name” variable with “UNKNOWN”.

Answer

Click to Check the Answer!
nyc_payroll['Last_Name'] = nyc_payroll['Last_Name'].fillna("UNKNOWN")




Part 2. NFL



  • The following is the DataFrame for Part 2.
NFL2022_stuffs = pd.read_csv('https://bcdanl.github.io/data/NFL2022_stuffs.csv')
  • NFL2022_stuffs is the DataFrame that contains information about NFL games in year 2022, in which the unit of observation is a single play for each drive in a NFL game.

Variable description

  • play_id: Numeric play identifier that when used with game_id and drive provides the unique identifier for a single play
  • game_id: Ten digit identifier for NFL game.
  • drive: Numeric drive number in the game.
  • week: Season week.
  • posteam: String abbreviation for the team with possession.
  • qtr: Quarter of the game (5 is overtime).
  • half_seconds_remaining: Numeric seconds remaining in the half.
  • down: The down for the given play.
    • Basically you get four attempts (aka downs) to move the ball 10 yards (by either running with it or passing it).
    • If you make 10 yards then you get another set of four downs.
  • pass: Binary indicator if the play was a pass play.
  • wp: Estimated winning probability for the posteam given the current situation at the start of the given play.


Question 14

In DataFrame, NFL2022_stuffs, remove observations for which the value of posteam is missing.

Answer:

Click to Check the Answer!
NFL2022_stuffs.shape[0]
50147
Click to Check the Answer!
NFL2022_stuffs = NFL2022_stuffs.dropna(subset = ['posteam'])
NFL2022_stuffs.shape[0]
46427


Question 15

  • Calculate the mean value of pass for the BUF posteam when all the following conditions hold:
    1. wp is greater than 20% and less than 75%;
    2. down is less than or equal to 2; and
    3. half_seconds_remaining is greater than 120.

Answer:

Click to Check the Answer!
(
    NFL2022_stuffs
    .query('wp > .2 & wp < .75 & down <= 2 & half_seconds_remaining > 120 & posteam == "BUF"')['pass']
    .mean()
)
0.6043956043956044


Question 16

  • Consider the following DataFrame, NFL2022_epa:
NFL2022_epa = pd.read_csv('https://bcdanl.github.io/data/NFL2022_epa.csv')

Variable Description for NFL2022_epa

  • play_id: Numeric play identifier that when used with game_id and drive provides the unique identifier for a single play
  • game_id: Ten digit identifier for NFL game.
  • drive: Numeric drive number in the game.
  • posteam: String abbreviation for the team with possession.
  • passer: Name of the player who passed a ball to a receiver by initially taking a three-step drop and backpedaling into the pocket to make a pass. (Mostly, they are quarterbacks)
  • receiver: Name of the receiver.
  • epa: Expected points added (EPA) by the posteam for the given play.


  • Create the following DataFrame, NFL2022_stuffs_EPA, that includes
    1. All the variables and the observations in the DataFrame, NFL2022_stuffs;
    2. The variables, passer, receiver, and epa, from the DataFrame, NFL2022_epa by joining the two DataFrames.
  • In the resulting DataFrame, NFL2022_stuffs_EPA, please remove observations with NaN in passer after the join.

Answer:

NFL2022_stuffs.columns
Index(['play_id', 'game_id', 'drive', 'week', 'posteam', 'qtr', 'down',
       'half_seconds_remaining', 'pass', 'wp'],
      dtype='object')
NFL2022_epa.columns
Index(['play_id', 'game_id', 'drive', 'posteam', 'receiver', 'passer', 'epa'], dtype='object')
Click to Check the Answer!
NFL2022_stuffs_EPA = (
    NFL2022_stuffs
    .merge(NFL2022_epa,
           on = ['play_id', 'game_id', 'drive', 'posteam'],
           how = 'left')
    .dropna(subset = ['passer'])
)

NFL2022_stuffs_EPA
       play_id          game_id  drive  ...          receiver     passer       epa
2           89  2022_01_BAL_NYJ    1.0  ...         Mi.Carter   J.Flacco -0.492192
4          136  2022_01_BAL_NYJ    1.0  ...               NaN   J.Flacco -2.402200
6          202  2022_01_BAL_NYJ    2.0  ...         R.Bateman  L.Jackson  0.075127
7          230  2022_01_BAL_NYJ    2.0  ...        D.Duvernay  L.Jackson -0.105120
10         301  2022_01_BAL_NYJ    2.0  ...        D.Duvernay  L.Jackson  0.411132
...        ...              ...    ...  ...               ...        ...       ...
46415     3789   2022_22_KC_PHI   17.0  ...           T.Kelce  P.Mahomes  0.710100
46417     3834   2022_22_KC_PHI   17.0  ...               NaN  P.Mahomes  1.913129
46419     3876   2022_22_KC_PHI   17.0  ...  J.Smith-Schuster  P.Mahomes -0.903265
46420     3907   2022_22_KC_PHI   17.0  ...  J.Smith-Schuster  P.Mahomes  1.403649
46426     4072   2022_22_KC_PHI   18.0  ...               NaN    J.Hurts -0.070803

[22457 rows x 13 columns]




Part 3. Mr. Trash Wheel



Mr. Trash Wheel is a semi-autonomous trash interceptor that is placed at the end of a river, stream or other outfall.

Far too lazy to chase trash around the ocean, Mr. Trash Wheel stays put and waits for the waste to flow to him.

Sustainably powered and built to withstand the biggest storms, Mr. Trash Wheel uses a unique blend of solar and hydro power to pull hundreds of tons of trash out of the water each year.

See more how Mr. Trash Wheel works.

  • The following is the DataFrame for Part 3.
trashwheel = pd.read_csv('https://bcdanl.github.io/data/trashwheel.csv')

Variable Description

variable type description
Name string Name of the Trash Wheel
Dumpster integer Dumpster number over time; The Trash Wheel can have multiple dumpsters in a day
Month string Month
Year integer Year
Date string Date (Daily)
Weight float Weight in tons
Volume integer Volume in cubic yards
PlasticBottles float Number of plastic bottles
Polystyrene float Number of polystyrene items
CigaretteButts float Number of cigarette butts
GlassBottles float Number of glass bottles
PlasticBags float Number of plastic bags
Wrappers float Number of wrappers
SportsBalls float Number of sports balls
HomesPowered integer Homes Powered - Each ton of trash equates to on average 500 kilowatts of electricity. An average household will use 30 kilowatts per day.



Meet the Mr. Trash Wheel Family


Mister Trash Wheel
  • Installed: May 9, 2014
  • Location: Jones Falls stream, Inner Harbor, Baltimore, MD

Professor Trash Wheel
  • Installed: December 4, 2016
  • Location: Harris Creek, Canton neighborhood, Baltimore, MD

Captain Trash Wheel


  • Installed: June 5, 2018
  • Location: Masonville Cove, Baltimore, MD

Gwynnda Trash Wheel


  • Installed: June 3, 2021
  • Location: Gwynns Falls, West Baltimore, MD



Question 17

  • Reshape the trashwheel DataFrame into a DataFrame called trashwheel_long that includes variables for “Name”, “Date”, “Dumpster”, “Trash_Type”, and “Number”.
    • The “Trash_Type” variable should indicate the type of trash from the original DataFrame, and “Number” should contain the corresponding values.
    • Finally, sort trashwheel_long by “Name”, “Date”, “Dumpster”, and “Trash_Type” in ascending order.
    • The following displays the trashwheel_long DataFrame:

Answer:

Click to Check the Answer!
trashwheel.columns
Index(['ID', 'Name', 'Dumpster', 'Month', 'Year', 'Date', 'Weight', 'Volume',
       'PlasticBottles', 'Polystyrene', 'CigaretteButts', 'GlassBottles',
       'PlasticBags', 'Wrappers', 'SportsBalls', 'HomesPowered'],
      dtype='object')
Click to Check the Answer!
trashwheel['Date'] = trashwheel['Date'].astype('datetime64[ns]')


trashwheel_long = (
    trashwheel[['Name', 'Date', 'Dumpster',
                'PlasticBottles', 'Polystyrene', 'CigaretteButts', 'GlassBottles',
                'PlasticBags', 'Wrappers', 'SportsBalls']]
    .melt(
        id_vars=['Name', 'Date', 'Dumpster'],
        var_name= 'Trash_Type',
        value_name= 'Number'
    )
    .sort_values(['Name', 'Date', 'Dumpster', 'Trash_Type'])
)

trashwheel_long
                       Name       Date  Dumpster      Trash_Type  Number
2729    Captain Trash Wheel 2018-06-30         1  CigaretteButts  8200.0
3722    Captain Trash Wheel 2018-06-30         1    GlassBottles     NaN
4715    Captain Trash Wheel 2018-06-30         1     PlasticBags  1420.0
743     Captain Trash Wheel 2018-06-30         1  PlasticBottles  1150.0
1736    Captain Trash Wheel 2018-06-30         1     Polystyrene   940.0
...                     ...        ...       ...             ...     ...
4714  Professor Trash Wheel 2023-12-05       114     PlasticBags   360.0
742   Professor Trash Wheel 2023-12-05       114  PlasticBottles  4000.0
1735  Professor Trash Wheel 2023-12-05       114     Polystyrene   380.0
6700  Professor Trash Wheel 2023-12-05       114     SportsBalls     NaN
5707  Professor Trash Wheel 2023-12-05       114        Wrappers  3900.0

[6951 rows x 5 columns]




Part 4. Jupyter Notebook Blogging



Below is spotify DataFrame that reads the file spotify_all.csv containing data of Spotify users’ playlist information (Source: Spotify Million Playlist Dataset Challenge).

spotify = pd.read_csv('https://bcdanl.github.io/data/spotify_all.csv')


Variable Description

  • pid: playlist ID; unique ID for playlist
  • playlist_name: a name of playlist
  • pos: a position of the track within a playlist (starting from 0)
  • artist_name: name of the track’s primary artist
  • track_name: name of the track
  • duration_ms: duration of the track in milliseconds
  • album_name: name of the track’s album


  • Write a blog post about your favorite artist(s) in the spotify DataFrame using Jupyter Notebook, and add it to your online blog.
    • In your blog post, utilize counting, sorting, indexing, and filtering methods.


Back to top