Final Exam Practice

Data Wrangling with pandas

Published

May 6, 2025

Part 1. Pandas with Mr. Trash Wheel Family

Mr. Trash Wheel is a semi-autonomous trash interceptor that is placed at the end of a river, stream or other outfall.

Far too lazy to chase trash around the ocean, Mr. Trash Wheel stays put and waits for the waste to flow to him.

Sustainably powered and built to withstand the biggest storms, Mr. Trash Wheel uses a unique blend of solar and hydro power to pull hundreds of tons of trash out of the water each year.

See more how Mr. Trash Wheel works.

The following is the DataFrame for Part 1.

import pandas as pd
import numpy as np
trashwheel = pd.read_csv('https://bcdanl.github.io/data/trashwheel.csv')

Variable Description

variable	type	description
`Name`	string	Name of the Trash Wheel
`Dumpster`	integer	Dumpster number over time; The Trash Wheel can have multiple dumpsters in a day
`Month`	string	Month
`Year`	integer	Year
`Date`	string	Date (Daily)
`Weight`	float	Weight in tons
`Volume`	integer	Volume in cubic yards
`PlasticBottles`	float	Number of plastic bottles
`Polystyrene`	float	Number of polystyrene items
`CigaretteButts`	float	Number of cigarette butts
`GlassBottles`	float	Number of glass bottles
`PlasticBags`	float	Number of plastic bags
`Wrappers`	float	Number of wrappers
`SportsBalls`	float	Number of sports balls
`HomesPowered`	integer	Number of homes powered

Each ton of trash equates to on average 500 kilowatts of electricity.
An average household will use 30 kilowatts per day.

Meet the Mr. Trash Wheel Family

Installed: May 9, 2014
Location: Jones Falls stream, Inner Harbor, Baltimore, MD

Installed: December 4, 2016
Location: Harris Creek, Canton neighborhood, Baltimore, MD

Installed: June 5, 2018
Location: Masonville Cove, Baltimore, MD

Installed: June 3, 2021
Location: Gwynns Falls, West Baltimore, MD

Question 1

Write code to count the number of observations in the Name variable.

Click to Check the Answer!

trashwheel['Name'].value_counts()

Question 2

Write code to convert
- Data type of Name variable to category;
- Data type of Month variable to category; and
- Data type of Date variable to datetimes in the trashwheel DataFrame.

Click to Check the Answer!

trashwheel = (
    trashwheel
    .astype(
        {
            'Name':'category',
            'Month':'category',
            'Date':'datetime64',
        }
    )
)

Question 3

Write Python code to replace NaN values in the GlassBottles variable with the mean within each Year group of the trashwheel DataFrame.

Click to Check the Answer!

trashwheel['GlassBottles'] = (
    trashwheel
    .groupby('Year')['GlassBottles']
    .transform(lambda x: x.fillna(x.mean()))
)

Question 4

Write Python code to identify which Trash Wheel provided electricity to the greatest number of homes for each Year–Month combination..
- Please make sure your code manages any ties for the top values of HomesPowered while making selections.

Click to Check the Answer!

q4 = (
    trashwheel
    .groupby(['Name', 'Year', 'Month'])[['HomesPowered']]
    .sum()
    .reset_index()
    .groupby(['Year', 'Month'])
    .apply(lambda df: df.nlargest(1, 'HomesPowered', keep = "all"))
    .reset_index(drop = True)
)

Question 5

Each ton of trash equates to 500 kilowatts of electricity.
Write code to determine the average monthly electricity supplied to homes by each Trash Wheel.

Click to Check the Answer!

q5 = (
    trashwheel
    .groupby(['Name', 'Year', 'Month'])[['Weight', 'HomesPowered']]
    .sum()
    .reset_index()
    .assign(
        electricity = lambda x: x['Weight'] * 500
    )
    .query('HomesPowered > 0') # optional
)

q5['avg_elec_home'] = q5['electricity'] / q5['HomesPowered']

Part 2. Pandas with NYC weather

The nyc_weather DataFrame contains data for daily NYC weather in 2022 and 2023 from the National Centers for Environmental Information (NCEI) https://www.ncdc.noaa.gov/cdo-web/webservices/v2, observed from the weather stations in the NYC Metropolitan area.

import pandas as pd
import numpy as np
nyc_weather = pd.read_csv('https://bcdanl.github.io/data/nyc_weather_2022_2023.csv')

Variable	Description
`year`	The calendar year when the weather observation was recorded (values: 2022 or 2023).
`month`	The calendar month when the weather observation was recorded (values: 1–12, where 1 = January, 12 = December).
`day`	The day of the month when the weather observation was recorded (values: 1–31, depending on the month).
`id`	The unique identifier of the weather station where the data was collected (e.g., `GHCND:US1CTFR0039`). This likely includes NYC area stations.
`attributes`	Weather observation details, possibly including temperature, precipitation, wind, time, and quality flags. Likely needs parsing.
`type`	Type of weather measurement (e.g., PRCP for precipitation, SNOW, SNWD)
`weather_value`	Measured value for the given `type` (e.g., amount of precipitation in tenths of mm (0.1 mm))

Question 6

Using the nyc_weather DataFrame, write code to create the weather DataFrame containing only the observations where the type variable has one of the following values: ‘PRCP’, ‘SNOW’, ‘SNWD’, ‘TMIN’, ‘TMAX’, ‘AWND’, or ‘TAVG’.

Below explains what each value means:
- PRCP: Precipitation (tenths of mm)
  - PRCP covers any other form of moisture that falls to the ground (e.g., rain, snow, sleet, hail)
- SNOW: Snowfall (mm)
- SNWD: Snow depth (mm)
- TMAX: Maximum temperature (tenths of degrees C)
- TMIN: Minimum temperature (tenths of degrees C)
- AWND: Average daily wind speed (tenths of meters per second)
- TAVG: Average temperature (tenths of degrees C)

Click to Check the Answer!

weather =(
      nyc_weather[
        nyc_weather['type'].isin(['PRCP', 'SNOW', 'SNWD', 'TMIN', 'TMAX', 'AWND', 'TAVG'])
        ]
      )

Question 7

Write code to calculate the monthly average value for each type at each weather station in the weather DataFrame.

Click to Check the Answer!

q7 = (
       weather.groupby(['id', 'type', 'year', 'month'])['weather_value']
       .mean()
       .reset_index(name = 'mean_value')
       )

Question 8

Write code to identify the names of the top 10 weather stations with the highest mean ‘TMAX’ values in June, July, and August, respectively. (Here, top 10 refers to the stations with the hottest average temperatures.)
Write code to identify the names of the top 10 weather stations with the lowest mean ‘TMIN’ values in December, January, and February, respectively. (Here, top 10 refers to the stations with the coldest average temperatures.)

Click to Check the Answer!

q8_tmax = (
       weather
       .query('month > 5 & month < 9')
       .query('type == "TMAX"')
       .groupby(['month', 'id'])
       .agg(tmax_mean = ('weather_value', 'mean'))
       .reset_index()
       .groupby('month')
       .apply(lambda x: x.nlargest(10, 'tmax_mean', keep = 'all'))
       .reset_index(drop = True)
       )

q8_tmin = (
       weather
       .query('month == 11 | month == 1 | month == 2')
       .query('type == "TMIN"')
       .groupby(['month', 'id'])
       .agg(tmin_mean = ('weather_value', 'mean'))
       .reset_index()
       .groupby('month')
       .apply(lambda x: x.nsmallest(10, 'tmin_mean', keep = 'all'))
       .reset_index(drop = True)
       )

Question 9

The stations DataFrame contains information about each weather station in the NYC area, including its id (unique identifier of the weather station), name (the name of the weather station, including its lo), latitude, longitude, and elevation.

stations = pd.read_csv('https://bcdanl.github.io/data/weather_stations.csv')

Write code to perform the following tasks:
- Identify the names of the top 10 weather stations with the highest number of days with precipitation in each year.
  - Determine how many days recorded measurable precipitation at the highest-ranked weather station each year.
- For each of the top 10 weather stations in each year, generate a summary of descriptive statistics for precipitation values.

Click to Check the Answer!

prcp_sorted = (
       weather
       .query('type == "PRCP"')
       .query('weather_value > 0') 
       .groupby(['year', 'id'])['type']
       .size()
       .reset_index(name = "n_days")
       .merge(stations,
              on = 'id',
              how='left')
       .sort_values('n_days', ascending = False)
       .reset_index(drop=True)
       )
       
top10stations = (
    prcp_sorted
    .groupby('year')
    .apply(lambda df: df.nlargest(10, 'n_days', keep = 'all')['id'])
    .reset_index()
)


top10prcp_summary = (
    weather[weather['id'].isin(top10stations['id'])]
    .query('type == "PRCP"')
    .groupby(['id','year'])['weather_value']
    .describe()
    )