import pandas as pd
import numpy as np
= pd.read_csv('https://bcdanl.github.io/data/trashwheel.csv') trashwheel
Final Exam Practice
Data Wrangling with pandas
Part 1. Pandas with Mr. Trash Wheel Family
Mr. Trash Wheel is a semi-autonomous trash interceptor that is placed at the end of a river, stream or other outfall.
Far too lazy to chase trash around the ocean, Mr. Trash Wheel stays put and waits for the waste to flow to him.
Sustainably powered and built to withstand the biggest storms, Mr. Trash Wheel uses a unique blend of solar and hydro power to pull hundreds of tons of trash out of the water each year.
See more how Mr. Trash Wheel works.
- The following is the DataFrame for Part 1.
Variable Description
variable | type | description |
---|---|---|
Name |
string | Name of the Trash Wheel |
Dumpster |
integer | Dumpster number over time; The Trash Wheel can have multiple dumpsters in a day |
Month |
string | Month |
Year |
integer | Year |
Date |
string | Date (Daily) |
Weight |
float | Weight in tons |
Volume |
integer | Volume in cubic yards |
PlasticBottles |
float | Number of plastic bottles |
Polystyrene |
float | Number of polystyrene items |
CigaretteButts |
float | Number of cigarette butts |
GlassBottles |
float | Number of glass bottles |
PlasticBags |
float | Number of plastic bags |
Wrappers |
float | Number of wrappers |
SportsBalls |
float | Number of sports balls |
HomesPowered |
integer | Number of homes powered |
- Each ton of trash equates to on average 500 kilowatts of electricity.
- An average household will use 30 kilowatts per day.
Meet the Mr. Trash Wheel Family
- Installed: May 9, 2014
- Location: Jones Falls stream, Inner Harbor, Baltimore, MD
- Installed: December 4, 2016
- Location: Harris Creek, Canton neighborhood, Baltimore, MD
- Installed: June 5, 2018
- Location: Masonville Cove, Baltimore, MD
- Installed: June 3, 2021
- Location: Gwynns Falls, West Baltimore, MD
Question 1
- Write code to count the number of observations in the
Name
variable.
Click to Check the Answer!
'Name'].value_counts() trashwheel[
Question 2
- Write code to convert
- Data type of
Name
variable tocategory
; - Data type of
Month
variable tocategory
; and - Data type of
Date
variable todatetimes
in thetrashwheel
DataFrame.
- Data type of
Click to Check the Answer!
= (
trashwheel
trashwheel
.astype(
{'Name':'category',
'Month':'category',
'Date':'datetime64',
}
) )
Question 3
- Write Python code to replace NaN values in the
GlassBottles
variable with the mean within eachYear
group of thetrashwheel
DataFrame.
Click to Check the Answer!
'GlassBottles'] = (
trashwheel[
trashwheel'Year')['GlassBottles']
.groupby(lambda x: x.fillna(x.mean()))
.transform( )
Question 4
- Write Python code to identify which Trash Wheel provided electricity to the greatest number of homes for each
Year
–Month
combination..- Please make sure your code manages any ties for the top values of
HomesPowered
while making selections.
- Please make sure your code manages any ties for the top values of
Click to Check the Answer!
= (
q4
trashwheel'Name', 'Year', 'Month'])[['HomesPowered']]
.groupby([sum()
.
.reset_index()'Year', 'Month'])
.groupby([apply(lambda df: df.nlargest(1, 'HomesPowered', keep = "all"))
.= True)
.reset_index(drop )
Question 5
Each ton of trash equates to 500 kilowatts of electricity.
Write code to determine the average monthly electricity supplied to homes by each Trash Wheel.
Click to Check the Answer!
= (
q5
trashwheel'Name', 'Year', 'Month'])[['Weight', 'HomesPowered']]
.groupby([sum()
.
.reset_index()
.assign(= lambda x: x['Weight'] * 500
electricity
)'HomesPowered > 0') # optional
.query(
)
'avg_elec_home'] = q5['electricity'] / q5['HomesPowered'] q5[
Part 2. Pandas with NYC weather
The nyc_weather
DataFrame contains data for daily NYC weather in 2022 and 2023 from the National Centers for Environmental Information (NCEI) https://www.ncdc.noaa.gov/cdo-web/webservices/v2, observed from the weather stations in the NYC Metropolitan area.
import pandas as pd
import numpy as np
= pd.read_csv('https://bcdanl.github.io/data/nyc_weather_2022_2023.csv') nyc_weather
Variable | Description |
---|---|
year |
The calendar year when the weather observation was recorded (values: 2022 or 2023). |
month |
The calendar month when the weather observation was recorded (values: 1–12, where 1 = January, 12 = December). |
day |
The day of the month when the weather observation was recorded (values: 1–31, depending on the month). |
id |
The unique identifier of the weather station where the data was collected (e.g., GHCND:US1CTFR0039 ). This likely includes NYC area stations. |
attributes |
Weather observation details, possibly including temperature, precipitation, wind, time, and quality flags. Likely needs parsing. |
type |
Type of weather measurement (e.g., PRCP for precipitation, SNOW, SNWD) |
weather_value |
Measured value for the given type (e.g., amount of precipitation in tenths of mm (0.1 mm)) |
Question 6
Using the nyc_weather
DataFrame, write code to create the weather
DataFrame containing only the observations where the type
variable has one of the following values: ‘PRCP’, ‘SNOW’, ‘SNWD’, ‘TMIN’, ‘TMAX’, ‘AWND’, or ‘TAVG’.
- Below explains what each value means:
- PRCP: Precipitation (tenths of mm)
- PRCP covers any other form of moisture that falls to the ground (e.g., rain, snow, sleet, hail)
- SNOW: Snowfall (mm)
- SNWD: Snow depth (mm)
- TMAX: Maximum temperature (tenths of degrees C)
- TMIN: Minimum temperature (tenths of degrees C)
- AWND: Average daily wind speed (tenths of meters per second)
- TAVG: Average temperature (tenths of degrees C)
- PRCP: Precipitation (tenths of mm)
Click to Check the Answer!
=(
weather
nyc_weather['type'].isin(['PRCP', 'SNOW', 'SNWD', 'TMIN', 'TMAX', 'AWND', 'TAVG'])
nyc_weather[
] )
Question 7
Write code to calculate the monthly average value
for each type
at each weather station in the weather
DataFrame.
Click to Check the Answer!
= (
q7 'id', 'type', 'year', 'month'])['weather_value']
weather.groupby([
.mean()= 'mean_value')
.reset_index(name )
Question 8
- Write code to identify the names of the top 10 weather stations with the highest mean ‘TMAX’ values in June, July, and August, respectively. (Here, top 10 refers to the stations with the hottest average temperatures.)
- Write code to identify the names of the top 10 weather stations with the lowest mean ‘TMIN’ values in December, January, and February, respectively. (Here, top 10 refers to the stations with the coldest average temperatures.)
Click to Check the Answer!
= (
q8_tmax
weather'month > 5 & month < 9')
.query('type == "TMAX"')
.query('month', 'id'])
.groupby([= ('weather_value', 'mean'))
.agg(tmax_mean
.reset_index()'month')
.groupby(apply(lambda x: x.nlargest(10, 'tmax_mean', keep = 'all'))
.= True)
.reset_index(drop
)
= (
q8_tmin
weather'month == 11 | month == 1 | month == 2')
.query('type == "TMIN"')
.query('month', 'id'])
.groupby([= ('weather_value', 'mean'))
.agg(tmin_mean
.reset_index()'month')
.groupby(apply(lambda x: x.nsmallest(10, 'tmin_mean', keep = 'all'))
.= True)
.reset_index(drop )
Question 9
The stations
DataFrame contains information about each weather station in the NYC area, including its id
(unique identifier of the weather station), name
(the name of the weather station, including its lo), latitude
, longitude
, and elevation
.
= pd.read_csv('https://bcdanl.github.io/data/weather_stations.csv') stations
- Write code to perform the following tasks:
- Identify the names of the top 10 weather stations with the highest number of days with precipitation in each year.
- Determine how many days recorded measurable precipitation at the highest-ranked weather station each year.
- For each of the top 10 weather stations in each year, generate a summary of descriptive statistics for precipitation values.
- Identify the names of the top 10 weather stations with the highest number of days with precipitation in each year.
Click to Check the Answer!
= (
prcp_sorted
weather'type == "PRCP"')
.query('weather_value > 0')
.query('year', 'id'])['type']
.groupby([
.size()= "n_days")
.reset_index(name
.merge(stations,= 'id',
on ='left')
how'n_days', ascending = False)
.sort_values(=True)
.reset_index(drop
)
= (
top10stations
prcp_sorted'year')
.groupby(apply(lambda df: df.nlargest(10, 'n_days', keep = 'all')['id'])
.
.reset_index()
)
= (
top10prcp_summary 'id'].isin(top10stations['id'])]
weather[weather['type == "PRCP"')
.query('id','year'])['weather_value']
.groupby([
.describe() )