Midterm Exam II
Spring 2025, DANL 210-01
Descriptive Statistics
The distribution of scores for Midterm Exam II is shown below:

The distribution of raw scores for Data Collection in Midterm Exam II is shown below:

The distribution of raw scores for Pandas in Midterm Exam II is shown below:

The following provides descriptive statistics for each part of Midterm Exam II:
Section 1. Data Collection I
# %%
# =============================================================================
# Setting up
# =============================================================================
import pandas as pd
import os
import time
import random
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()

# Set the working directory path
wd_path = '/Users/bchoe/My Drive/suny-geneseo/spring2025/lecture-code'
os.chdir(wd_path)

url = 'http://quotes.toscrape.com/tableful/'
First/front page of the website for the `url`:

Last page of the website for the `url`:

Corresponding tags:
Question 1 (Points: 4)
- Write `pandas` code that reads the HTML table from the front page of the website URL stored in `url` into a DataFrame called `df`.
Click to Check the Answer!
df = pd.read_html(url)[0]
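Note: `pd.read_html()` returns a list of all tables found on the page, so `[0]` selects the first table as a DataFrame.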
Question 2 (Points: 4)
- Write code to load the first/front page of the website for the `url` on the Chrome browser that is being controlled by automated test software, called `selenium`.
Click to Check the Answer!
driver.get(url)
Question 3 (Points: 20)
- Write Python code to scrape every quote from a paginated website (unknown number of pages, 10 quotes per page) by repeatedly clicking the Next button until it disappears.
- On each page, extract each quote and its author into a column `quote_author` and its associated tags into `tags`, appending them to a DataFrame named `df_clean`.
- After loading each page, pause execution for a random 1–2 second interval.
- Examine the XPath examples below for `quote_author` (rows 2, 4, 6, …) and `tags` (rows 3, 5, 7, …), identify the pattern of even vs. odd row numbers, and use it to build your f-strings for locating each element.
# Table body - quote with author
xpath_quote_author_01 = '/html/body/div/table/tbody/tr[2]/td'
xpath_quote_author_02 = '/html/body/div/table/tbody/tr[4]/td'
xpath_quote_author_03 = '/html/body/div/table/tbody/tr[6]/td'
xpath_quote_author_04 = '/html/body/div/table/tbody/tr[8]/td'
xpath_quote_author_05 = '/html/body/div/table/tbody/tr[10]/td'
xpath_quote_author_06 = '/html/body/div/table/tbody/tr[12]/td'
xpath_quote_author_07 = '/html/body/div/table/tbody/tr[14]/td'
xpath_quote_author_08 = '/html/body/div/table/tbody/tr[16]/td'
xpath_quote_author_09 = '/html/body/div/table/tbody/tr[18]/td'
xpath_quote_author_10 = '/html/body/div/table/tbody/tr[20]/td'

# Table body - tags
xpath_tags_01 = '/html/body/div/table/tbody/tr[3]/td'
xpath_tags_02 = '/html/body/div/table/tbody/tr[5]/td'
xpath_tags_03 = '/html/body/div/table/tbody/tr[7]/td'
xpath_tags_04 = '/html/body/div/table/tbody/tr[9]/td'
xpath_tags_05 = '/html/body/div/table/tbody/tr[11]/td'
xpath_tags_06 = '/html/body/div/table/tbody/tr[13]/td'
xpath_tags_07 = '/html/body/div/table/tbody/tr[15]/td'
xpath_tags_08 = '/html/body/div/table/tbody/tr[17]/td'
xpath_tags_09 = '/html/body/div/table/tbody/tr[19]/td'
xpath_tags_10 = '/html/body/div/table/tbody/tr[21]/td'
Click to Check the Answer!
df_clean = pd.DataFrame()

while True:
    # Locate the Next button; on the last page it no longer exists
    try:
        btn = driver.find_element(By.PARTIAL_LINK_TEXT, "Next")
    except:
        btn = []

    # Scrape the 10 quote/tags pairs on the current page
    for i in range(1, 11):
        j = i*2        # even rows hold the quote and its author
        k = i*2 + 1    # odd rows hold the tags

        xpath_quote_author = f'/html/body/div/table/tbody/tr[{j}]/td'
        xpath_tags = f'/html/body/div/table/tbody/tr[{k}]/td'

        quote_author = driver.find_element(By.XPATH, xpath_quote_author).text
        tags = driver.find_element(By.XPATH, xpath_tags).text

        lst = [quote_author, tags]
        obs = pd.DataFrame([lst])
        df_clean = pd.concat([df_clean, obs])

    # Move to the next page and pause for a random 1-2 seconds; otherwise stop
    if btn != []:
        btn.click()
        time.sleep(random.uniform(1, 2))
    else:
        break

df_clean.columns = ['quote_author', 'tags']
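For reference, a variant sketch that avoids hard-coding 10 quotes per page: it collects every table row at once with `find_elements` and walks them in quote/tags pairs, assuming the row layout shown in the XPath examples above (quotes in rows 2, 4, …; tags in rows 3, 5, …; the final row holding the pager).

# A sketch, not the graded answer. rows is a 0-based list, so rows[j]
# corresponds to tr[j+1] in XPath: quote+author text sits at odd list
# indices and its tags at the following even index. The range stops
# before the final row, assumed to hold the Next pager.
rows = driver.find_elements(By.XPATH, '/html/body/div/table/tbody/tr')
for j in range(1, len(rows) - 1, 2):
    quote_author = rows[j].text
    tags = rows[j + 1].text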
Question 4 (Points: 4)
- Write one line of code to export the `df_clean` DataFrame as a CSV file named `table_quotes.csv` inside the `data` subfolder of the current working directory given by `wd_path`.
- Ensure that the CSV does not include the row index of the `df_clean` DataFrame.
Click to Check the Answer!
df_clean.to_csv('data/table_quotes.csv', index=False, encoding='utf-8-sig')
Question 5 (Points: 4)
- Write one line of code to quit the Chrome browser that is being controlled by automated test software, called `selenium`.
Click to Check the Answer!
driver.quit()
Section 2. Data Collection II
Question 6 (Points: 4)
- In the client–server model of the web, which statement is true?
a. A client hosts webpages, and a server displays them to users.
b. A client requests data, and a server responds with data.
c. Clients and servers are the same machine communicating via HTTPS.
d. A server initiates requests, and a client responds with data.
Click to Check the Answer!
b. A client requests data, and a server responds with data.
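For reference, a minimal sketch of this exchange with `requests`, using the quotes website from Section 1: the client (this script) sends an HTTP GET request, and the server responds with data and a status code.

import requests

response = requests.get('http://quotes.toscrape.com/tableful/')  # the client requests data
response.status_code  # 200 when the server responds successfully
response.text[:100]   # the first 100 characters of the data the server returned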
Question 7 (Points: 10)
- Fill in the following 5 blanks to make a request to the FRED API for collecting the U.S. unemployment rate (`series_id` = "UNRATE"):
import requests
import pandas as pd
param_dicts = {
    'api_key': 'YOUR_FRED_API_KEY', ## Change to your own key
    'file_type': 'json',
    'series_id': ___BLANK_1___  ## ID for series data from the FRED
}

api_endpoint = "https://api.stlouisfed.org/fred/series/observations"
response = ___BLANK_2___

# Check if the server successfully sends the requested content
if ___BLANK_3___:
    # Convert the JSON response to a Python dictionary
    content = ___BLANK_4___

    # Create a DataFrame from the value of the 'observations' key in the content dictionary
    df = pd.DataFrame( ___BLANK_5___ )
else:
    print("The server cannot find the requested content")
Key-value pairs in the `content` dictionary:
Answer for ___BLANK_1___:
Click to Check the Answer!
"UNRATE"
Answer for ___BLANK_2___:
Click to Check the Answer!
requests.get(api_endpoint, params=param_dicts)
Answer for ___BLANK_3___:
Click to Check the Answer!
response.status_code == 200
Answer for ___BLANK_4___:
Click to Check the Answer!
response.json()
Answer for ___BLANK_5___:
Click to Check the Answer!
content['observations']
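For reference, filling the five blanks back into the skeleton gives the completed request:

import requests
import pandas as pd

param_dicts = {
    'api_key': 'YOUR_FRED_API_KEY', ## Change to your own key
    'file_type': 'json',
    'series_id': 'UNRATE'
}

api_endpoint = "https://api.stlouisfed.org/fred/series/observations"
response = requests.get(api_endpoint, params=param_dicts)

if response.status_code == 200:
    content = response.json()
    df = pd.DataFrame(content['observations'])
else:
    print("The server cannot find the requested content")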
Section 3. Pandas Basics I
import pandas as pd
import numpy as np
Below is the `spotify` DataFrame, which reads the file `spotify_all.csv` containing data on Spotify users' playlist information (Source: Spotify Million Playlist Dataset Challenge).

spotify = pd.read_csv('https://bcdanl.github.io/data/spotify_all.csv')
Variable Description

- `pid`: playlist ID; unique ID for a playlist
- `playlist_name`: name of the playlist
- `pos`: position of the track within a playlist (starting from 0)
- `artist_name`: name of the track's primary artist
- `track_name`: name of the track
- `duration_ms`: duration of the track in milliseconds
Definition of a Song

In this section, a song is defined as a combination of an `artist_name` value and a `track_name` value. E.g., the following provides the 12 distinct songs with the same `track_name`, "I Love You":
Index | artist_name | track_name |
---|---|---|
0 | Alex & Sierra | I Love You |
1 | Alex Clare | I Love You |
2 | Axwell /\ Ingrosso | I Love You |
3 | Beat Happening | I Love You |
4 | Climax Blues Band | I Love You |
5 | Céline Dion | I Love You |
6 | Donna Summer | I Love You |
7 | Eli Young Band | I Love You |
8 | Faith Evans | I Love You |
9 | Mary J. Blige | I Love You |
10 | OMFG | I Love You |
11 | Said The Whale | I Love You |
Question 8 (Points: 4)
- Complete the following line so that `num_columns` is assigned the total number of columns in the `spotify` DataFrame as an integer:

num_columns = spotify._______________________________________________
Click to Check the Answer!
num_columns = spotify.shape[1]
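Note: `spotify.shape` is a (rows, columns) tuple, so `shape[1]` is the column count; `len(spotify.columns)` would return the same integer.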
Question 9 (Points: 4)
- Write code to count how many tracks each playlist contains.
Click to Check the Answer!
spotify[['pid', 'playlist_name']].value_counts().reset_index()
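An equivalent sketch with `groupby`, for comparison; `size()` counts the rows (tracks) in each (`pid`, `playlist_name`) group:

spotify.groupby(['pid', 'playlist_name']).size().reset_index(name='count')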
Question 10 (Points: 5)
- Write code to identify playlists that include at least 100 different artists.
Click to Check the Answer!
plist_100 = (
    spotify[['pid', 'playlist_name', 'artist_name']]
    .drop_duplicates()
    .value_counts()
    .reset_index()[['pid', 'playlist_name']]
    .value_counts()
    .reset_index()
    .query('count >= 100')
)
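An alternative sketch that counts the distinct artists per playlist directly with `nunique`:

(
    spotify
    .groupby(['pid', 'playlist_name'])['artist_name']
    .nunique()                       # number of different artists per playlist
    .reset_index(name='n_artists')
    .query('n_artists >= 100')
)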
Question 11 (Points: 5)
- Write code to calculate the proportion of playlists in `spotify` that contain at least one duplicated song, i.e., (the number of playlists with at least 1 duplicate song) divided by (the total number of distinct playlists).
Click to Check the Answer!
dup_T = spotify.duplicated(subset=["pid", "artist_name", "track_name"])

duplicate_plist = (
    spotify[ dup_T ]
    .drop_duplicates(subset=['pid', 'playlist_name'])
)

all_plists = (
    spotify
    .drop_duplicates(subset=['pid', 'playlist_name'])
)

duplicate_plist.shape[0] / all_plists.shape[0]
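To see what the `dup_T` mask marks, here is a toy illustration (the rows are hypothetical): only the second and later occurrences of a (`pid`, `artist_name`, `track_name`) combination are flagged True.

demo = pd.DataFrame({'pid': [0, 0, 0],
                     'artist_name': ['Coldplay', 'Coldplay', 'Adele'],
                     'track_name': ['Yellow', 'Yellow', 'Hello']})
demo.duplicated(subset=['pid', 'artist_name', 'track_name'])
# 0    False
# 1     True
# 2    False
# dtype: bool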
Question 12 (Points: 6)
- Write code to add a new column, `duration_cat`, to the `spotify` DataFrame that categorizes `duration_ms` into:
  - "short" for tracks under 3 minutes (<180000 ms)
  - "medium" for tracks between 3 and 5 minutes (180000–300000 ms)
  - "long" for tracks over 5 minutes (>300000 ms)
Click to Check the Answer!
spotify['duration_cat'] = np.where(spotify['duration_ms'] < 180*1000,
                                   "short",
                                   "")

spotify['duration_cat'] = np.where(spotify['duration_ms'] > 300*1000,
                                   "long",
                                   spotify['duration_cat'])

spotify['duration_cat'] = np.where(spotify['duration_cat'] == '',
                                   "medium",
                                   spotify['duration_cat'])
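An alternative sketch using `pd.cut`; note one boundary difference: with left-closed bins, a track of exactly 300000 ms lands in "long" here, whereas the `np.where` version above labels it "medium".

spotify['duration_cat'] = pd.cut(
    spotify['duration_ms'],
    bins=[0, 180000, 300000, np.inf],
    labels=['short', 'medium', 'long'],
    right=False  # left-closed bins: [0, 180000), [180000, 300000), [300000, inf)
)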
Question 13 (Points: 7)
- Write code to find the second most frequently appearing song(s) across all playlists that include "Viva La Vida" (`track_name`) by "Coldplay" (`artist_name`).
Click to Check the Answer!
coldplay_list = (
    spotify
    .query('artist_name == "Coldplay" & track_name == "Viva La Vida"')
)

after_coldplay = (
    spotify[ spotify['pid'].isin(coldplay_list['pid']) ]
    [['artist_name', 'track_name']]
    .value_counts()
    .reset_index()
    .nlargest(2, 'count', keep="all")
)
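Since every selected playlist contains "Viva La Vida", the top-ranked row is typically that song itself, and the second-ranked row(s) give the answer. Assuming the top two ranks have different counts, a follow-up sketch to isolate only the second-ranked song(s):

second_count = after_coldplay['count'].unique()[1]  # counts are sorted in descending order
after_coldplay[after_coldplay['count'] == second_count]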
Question 14 (Points: 7)
- Consider a DataFrame called `songs`, derived from `spotify`, that adds a new variable `song` by concatenating each row's `artist_name` and `track_name`:
Index | pid | playlist_name | song |
---|---|---|---|
0 | 0 | Throwbacks | Missy Elliott - Lose Control (feat. Ciara & Fat Man Scoop) |
1 | 0 | Throwbacks | Britney Spears - Toxic |
2 | 0 | Throwbacks | Beyoncé - Crazy In Love |
..... |
Note that the `songs` DataFrame has only the three variables, `pid`, `playlist_name`, and `song`, as shown above.
- Write code to determine which two different songs appear together in playlists more often than any other pair, as shown below.
- (Hint: self-merge the `songs` DataFrame on `pid` to get `song_x`-`song_y` pairs, which is a many-to-many join.)
Index | song_x | song_y | count |
---|---|---|---|
76 | Kendrick Lamar – HUMBLE. | Lil Uzi Vert – XO TOUR Llif3 | 67 |
Below is how the `songs` DataFrame is created:

songs = (
    spotify
    .drop_duplicates(subset=['pid', 'playlist_name', 'artist_name', 'track_name'])
    [['pid', 'playlist_name', 'artist_name', 'track_name']]
)

songs['song'] = songs['artist_name'] + ' – ' + songs['track_name']
songs = songs[['pid', 'playlist_name', 'song']]
Click to Check the Answer!
pairs = songs.merge(songs, on='pid')

pair_counts = (
    pairs[['song_x', 'song_y']]
    .value_counts()
    .reset_index()
    .query('song_x < song_y')
    .nlargest(1, 'count', keep = "all")
)

# Note: the self-merge produces each pair twice (once in each order), so the
# lexicographic filter song_x < song_y is necessary to treat (song_x, song_y)
# and (song_y, song_x) as the same pair; it also drops the song_x == song_y
# self-pairs.
Section 4. Pandas Basics II
Below is the `shoes` DataFrame, which reads the file `onlinestore_shoes_simple.csv` containing data on "shoes" search results from an online store.

shoes = pd.read_csv('https://bcdanl.github.io/data/onlinestore_shoes_simple.csv')
Variable Description

Variable | Description
---|---
id | Unique identifier for the product
brandId | Unique identifier for the brand
brandName | Name of the brand
name | Product name or title
reviewCount | Number of customer reviews
reviewStarRating | Average star rating from customer reviews
current_p | Current price (the column appears as `currernt_p` in the CSV output and in the answer below)
clearance_p | Clearance price (NA if not available for clearance)
Question 15 (Points: 5)
- Write code to identify the top deal product(s), where a deal is defined by all of the following:
  - On clearance: the product's `current_p` equals its `clearance_p`.
  - Highly reviewed: `reviewCount` is at or above the 95th percentile of all `reviewCount` values.
  - Best rated: among these clearance, highly-reviewed products, it has the maximum `reviewStarRating` (including any ties).
Index | id | brandId | brandName | name | reviewCount | reviewStarRating | currernt_p | clearance_p |
---|---|---|---|---|---|---|---|---|
92 | 6545910 | 9423 | Tory Burch | Ballet Loafer | 1008 | 4.8 | 178 | 178 |
393 | 7248724 | 9423 | Tory Burch | Howell Court Sneaker | 1044 | 4.8 | 88 | 88 |
6318 | 5654262 | 15361 | See Kai Run | Stevie II Sneaker | 113 | 4.8 | 35 | 35 |
Click to Check the Answer!
review_top5 = shoes['reviewCount'].quantile(.95)

shoes_promo = (
    shoes[ (shoes['clearance_p'] == shoes['currernt_p']) & (shoes['reviewCount'] >= review_top5) ]
    .nlargest(1, 'reviewStarRating', keep = 'all')
)
Question 16 (Points: 7)
- Write pandas code to:
  - Count products per brand: determine how many items each brand offers.
  - Find the 90th-percentile threshold of those brand-level counts.
  - Select "popular" brands whose product counts meet or exceed that threshold.
  - Compute clearance rates for each of these popular brands: the fraction of their items with a non-missing `clearance_p`.
  - List the top five brands by clearance rate, reporting each brand and its rate, as shown in the DataFrame below.
Index | brandName | count_x | count_y | rate |
---|---|---|---|---|
4 | MIA | 51 | 76 | 0.6710526 |
1 | Fly London | 76 | 125 | 0.6080000 |
8 | Vince Camuto | 41 | 77 | 0.5324675 |
13 | Keds® | 37 | 70 | 0.5285714 |
12 | Cole Haan | 37 | 85 | 0.4352941 |
Click to Check the Answer!
# 1. Count total products per brand, keeping only "popular" brands
#    whose counts are at or above the 90th percentile
total_per_brand = shoes['brandName'].value_counts().reset_index()

total_per_brand_n = total_per_brand['count'].quantile(.9)
total_per_brand = total_per_brand[total_per_brand['count'] >= total_per_brand_n]

# 2. Count clearance products per brand
shoes['clearance_available'] = shoes['clearance_p'].notna()

total_per_brand_c = (
    shoes
    .query('clearance_available == True')['brandName']
    .value_counts()
    .reset_index()
)

# 3. Join the two counts, compute the clearance rate, and take the top five
df = total_per_brand_c.merge(total_per_brand, on='brandName', how = 'left')
df = df[~df['count_y'].isna()]
df['rate'] = df['count_x'] / df['count_y']
df_top5 = df.nlargest(5, 'rate', keep = 'all')