Homework 4

Pandas Basics - Sorting and Filtering Data

Author

Byeong-Hak Choe

Published

February 27, 2024

Modified

February 28, 2024

Direction

  • Please submit your Jupyter Notebook for Homework 4 to the Brightspace with the name below:

    • danl-m1-hw4-LASTNAME-FIRSTNAME.ipynb
      ( e.g., danl-m1-hw4-choe-byeonghak.ipynb )
  • The due is March 5, 2024, 7:00 P.M.

  • Please send Byeong-Hak an email (bchoe@geneseo.edu) if you have any questions.

  • Please prepare a Jupyter/Python Notebook (*.ipynb) to address all questions.

  • Make at least some simple comment (# ...) in each question.

  • Make one text cell to explain things in each question.


Load Libraries

Import the pandas library.

import pandas as pd


Part 1


Below is nyc_payroll DataFrame that reads the file nyc_payroll.csv containing data of how the New York City’s budget is being spent on salary and overtime pay for all municipal employees (Source: NYC OpenData).

nyc_payroll = pd.read_csv('https://bcdanl.github.io/data/nyc_payroll_2022_2023.csv')

Variable Description

  • Fiscal Year: Fiscal Year;
  • Payroll Number: Payroll Number;
  • Agency Name: The Payroll agency that the employee works for;
  • Last Name: Last name of employee;
  • First Name: First name of employee;
  • Mid Init: Middle initial of employee;
  • Agency Start Date: Date which employee began working for their current agency;
  • Work Location Borough: Borough of employee’s primary work location;
  • Title Description: Civil service title description of the employee;
  • Leave Status as of June 30: Status of employee as of the close of the relevant fiscal year;
  • Base Salary: Base Salary assigned to the employee;
  • Pay Basis: Lists whether the employee is paid on an hourly, per diem or annual basis;
  • Regular Hours: Number of regular hours employee worked in the fiscal year;
  • Regular Gross Paid: The amount paid to the employee for base salary during the fiscal year;
  • OT Hours: Overtime Hours worked by employee in the fiscal year;
  • Total OT Paid: Total overtime pay paid to the employee in the fiscal year;
  • Total Other Pay: Includes any compensation in addition to gross salary and overtime pay, ie Differentials, lump sums, uniform allowance, meal allowance, retroactive pay increases, settlement amounts, and bonus pay, if applicable.

Question 1

  • How many employees have a “Base Salary” within the top 10% of the DataFrame?
    • Hint: The .quantile() method can be useful.

Answer





Question 2

Filter the DataFrame for employees who have “OT Hours” greater than 0 but less than 100, and their “Leave Status as of June 30” is “ACTIVE”.

Answer



Question 3

Find the unique job titles in the “DEPARTMENT OF EDUCATION ADMIN” agency and count how many there are.

Answer



Question 4

  • Identify the employee(s) with the highest “Total OT Paid” in the DataFrame.
    • Include their “First Name”, “Last Name”, and “Total OT Paid”.

Answer




Part 2


Below is spotify DataFrame that reads the file spotify_all.csv containing data of Spotify users’ playlist information (Source: Spotify Million Playlist Dataset Challenge).

spotify = pd.read_csv('https://bcdanl.github.io/data/spotify_all.csv')


Variable Description

  • pid: playlist ID; unique ID for playlist

  • playlist_name: a name of playlist

  • pos: a position of the track within a playlist (starting from 0)

  • artist_name: name of the track’s primary artist

  • track_name: name of the track

  • duration_ms: duration of the track in milliseconds

  • album_name: name of the track’s album


Question 5

  • Find the five most popular song.
    • A value of a song is defined as a combination of a artist_name value and a track_name value.
    • Who are artists for those five most popular song?

Answer



Question 6

  • Create a DataFrame that only contains observations from playlists featuring the song “One Dance” by Drake.
    • Next, identify the song with the highest popularity after Drake’s “One Dance” in the newly created DataFrame.

Answer



Back to top