import pandas as pd
import numpy as np
trashwheel = pd.read_csv('https://bcdanl.github.io/data/trashwheel.csv')Final Exam Practice
Data Wrangling with pandas
Part 1. Pandas with Mr. Trash Wheel Family
Mr. Trash Wheel is a semi-autonomous trash interceptor that is placed at the end of a river, stream or other outfall.
Far too lazy to chase trash around the ocean, Mr. Trash Wheel stays put and waits for the waste to flow to him.
Sustainably powered and built to withstand the biggest storms, Mr. Trash Wheel uses a unique blend of solar and hydro power to pull hundreds of tons of trash out of the water each year.
See more how Mr. Trash Wheel works.
- The following is the DataFrame for Part 1.
Variable Description
| variable | type | description |
|---|---|---|
Name |
string | Name of the Trash Wheel |
Dumpster |
integer | Dumpster number over time; The Trash Wheel can have multiple dumpsters in a day |
Month |
string | Month |
Year |
integer | Year |
Date |
string | Date (Daily) |
Weight |
float | Weight in tons |
Volume |
integer | Volume in cubic yards |
PlasticBottles |
float | Number of plastic bottles |
Polystyrene |
float | Number of polystyrene items |
CigaretteButts |
float | Number of cigarette butts |
GlassBottles |
float | Number of glass bottles |
PlasticBags |
float | Number of plastic bags |
Wrappers |
float | Number of wrappers |
SportsBalls |
float | Number of sports balls |
HomesPowered |
integer | Number of homes powered |
- Each ton of trash equates to on average 500 kilowatts of electricity.
- An average household will use 30 kilowatts per day.
Meet the Mr. Trash Wheel Family

- Installed: May 9, 2014
- Location: Jones Falls stream, Inner Harbor, Baltimore, MD

- Installed: December 4, 2016
- Location: Harris Creek, Canton neighborhood, Baltimore, MD

- Installed: June 5, 2018
- Location: Masonville Cove, Baltimore, MD

- Installed: June 3, 2021
- Location: Gwynns Falls, West Baltimore, MD
Question 1
- Write code to count the number of observations in the
Namevariable.
Click to Check the Answer!
trashwheel['Name'].value_counts()Question 2
- Write code to convert
- Data type of
Namevariable tocategory; - Data type of
Monthvariable tocategory; and - Data type of
Datevariable todatetimesin thetrashwheelDataFrame.
- Data type of
Click to Check the Answer!
trashwheel = (
trashwheel
.astype(
{
'Name':'category',
'Month':'category',
'Date':'datetime64',
}
)
)Question 3
- Write Python code to replace NaN values in the
GlassBottlesvariable with the mean within eachYeargroup of thetrashwheelDataFrame.
Click to Check the Answer!
trashwheel['GlassBottles'] = (
trashwheel
.groupby('Year')['GlassBottles']
.transform(lambda x: x.fillna(x.mean()))
)Question 4
- Write Python code to identify which Trash Wheel provided electricity to the greatest number of homes for each
Year–Monthcombination..- Please make sure your code manages any ties for the top values of
HomesPoweredwhile making selections.
- Please make sure your code manages any ties for the top values of
Click to Check the Answer!
q4 = (
trashwheel
.groupby(['Name', 'Year', 'Month'])[['HomesPowered']]
.sum()
.reset_index()
.groupby(['Year', 'Month'])
.apply(lambda df: df.nlargest(1, 'HomesPowered', keep = "all"))
.reset_index(drop = True)
)Question 5
Each ton of trash equates to 500 kilowatts of electricity.
Write code to determine the average monthly electricity supplied to homes by each Trash Wheel.
Click to Check the Answer!
q5 = (
trashwheel
.groupby(['Name', 'Year', 'Month'])[['Weight', 'HomesPowered']]
.sum()
.reset_index()
.assign(
electricity = lambda x: x['Weight'] * 500
)
.query('HomesPowered > 0') # optional
)
q5['avg_elec_home'] = q5['electricity'] / q5['HomesPowered']Part 2. Pandas with NYC weather
The nyc_weather DataFrame contains data for daily NYC weather in 2022 and 2023 from the National Centers for Environmental Information (NCEI) https://www.ncdc.noaa.gov/cdo-web/webservices/v2, observed from the weather stations in the NYC Metropolitan area.
import pandas as pd
import numpy as np
nyc_weather = pd.read_csv('https://bcdanl.github.io/data/nyc_weather_2022_2023.csv')| Variable | Description |
|---|---|
year |
The calendar year when the weather observation was recorded (values: 2022 or 2023). |
month |
The calendar month when the weather observation was recorded (values: 1–12, where 1 = January, 12 = December). |
day |
The day of the month when the weather observation was recorded (values: 1–31, depending on the month). |
id |
The unique identifier of the weather station where the data was collected (e.g., GHCND:US1CTFR0039). This likely includes NYC area stations. |
attributes |
Weather observation details, possibly including temperature, precipitation, wind, time, and quality flags. Likely needs parsing. |
type |
Type of weather measurement (e.g., PRCP for precipitation, SNOW, SNWD) |
weather_value |
Measured value for the given type (e.g., amount of precipitation in tenths of mm (0.1 mm)) |
Question 6
Using the nyc_weather DataFrame, write code to create the weather DataFrame containing only the observations where the type variable has one of the following values: ‘PRCP’, ‘SNOW’, ‘SNWD’, ‘TMIN’, ‘TMAX’, ‘AWND’, or ‘TAVG’.
- Below explains what each value means:
- PRCP: Precipitation (tenths of mm)
- PRCP covers any other form of moisture that falls to the ground (e.g., rain, snow, sleet, hail)
- SNOW: Snowfall (mm)
- SNWD: Snow depth (mm)
- TMAX: Maximum temperature (tenths of degrees C)
- TMIN: Minimum temperature (tenths of degrees C)
- AWND: Average daily wind speed (tenths of meters per second)
- TAVG: Average temperature (tenths of degrees C)
- PRCP: Precipitation (tenths of mm)
Click to Check the Answer!
weather =(
nyc_weather[
nyc_weather['type'].isin(['PRCP', 'SNOW', 'SNWD', 'TMIN', 'TMAX', 'AWND', 'TAVG'])
]
)Question 7
Write code to calculate the monthly average value for each type at each weather station in the weather DataFrame.
Click to Check the Answer!
q7 = (
weather.groupby(['id', 'type', 'year', 'month'])['weather_value']
.mean()
.reset_index(name = 'mean_value')
)Question 8
- Write code to identify the names of the top 10 weather stations with the highest mean ‘TMAX’ values in June, July, and August, respectively. (Here, top 10 refers to the stations with the hottest average temperatures.)
- Write code to identify the names of the top 10 weather stations with the lowest mean ‘TMIN’ values in December, January, and February, respectively. (Here, top 10 refers to the stations with the coldest average temperatures.)
Click to Check the Answer!
q8_tmax = (
weather
.query('month > 5 & month < 9')
.query('type == "TMAX"')
.groupby(['month', 'id'])
.agg(tmax_mean = ('weather_value', 'mean'))
.reset_index()
.groupby('month')
.apply(lambda x: x.nlargest(10, 'tmax_mean', keep = 'all'))
.reset_index(drop = True)
)
q8_tmin = (
weather
.query('month == 11 | month == 1 | month == 2')
.query('type == "TMIN"')
.groupby(['month', 'id'])
.agg(tmin_mean = ('weather_value', 'mean'))
.reset_index()
.groupby('month')
.apply(lambda x: x.nsmallest(10, 'tmin_mean', keep = 'all'))
.reset_index(drop = True)
)Question 9
The stations DataFrame contains information about each weather station in the NYC area, including its id (unique identifier of the weather station), name (the name of the weather station, including its lo), latitude, longitude, and elevation.
stations = pd.read_csv('https://bcdanl.github.io/data/weather_stations.csv')- Write code to perform the following tasks:
- Identify the names of the top 10 weather stations with the highest number of days with precipitation in each year.
- Determine how many days recorded measurable precipitation at the highest-ranked weather station each year.
- For each of the top 10 weather stations in each year, generate a summary of descriptive statistics for precipitation values.
- Identify the names of the top 10 weather stations with the highest number of days with precipitation in each year.
Click to Check the Answer!
prcp_sorted = (
weather
.query('type == "PRCP"')
.query('weather_value > 0')
.groupby(['year', 'id'])['type']
.size()
.reset_index(name = "n_days")
.merge(stations,
on = 'id',
how='left')
.sort_values('n_days', ascending = False)
.reset_index(drop=True)
)
top10stations = (
prcp_sorted
.groupby('year')
.apply(lambda df: df.nlargest(10, 'n_days', keep = 'all')['id'])
.reset_index()
)
top10prcp_summary = (
weather[weather['id'].isin(top10stations['id'])]
.query('type == "PRCP"')
.groupby(['id','year'])['weather_value']
.describe()
)