Midterm Exam II

Spring 2025, DANL 210-01

Published

April 21, 2025

Descriptive Statistics

The distribution of scores for Midterm Exam II is shown below:

The distribution of raw scores for Data Collection in Midterm Exam II is shown below:

The distribution of raw scores for Pandas in Midterm Exam II is shown below:


The following provides the descriptive statistics for each part of Midterm Exam II:


Section 1. Data Collection I

# %%
# =============================================================================
# Setting up
# =============================================================================
import pandas as pd
import os
import time
import random
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome() 

# Set the working directory path
wd_path = '/Users/bchoe/My Drive/suny-geneseo/spring2025/lecture-code'
os.chdir(wd_path)  

url = 'http://quotes.toscrape.com/tableful/'

First/Front page of the website for the url:


Last page of the website for the url:

A single quote with its author:


Corresponding tags:

Question 1 (Points: 4)

  • Write pandas code that reads the HTML table from the front page of the website URL stored in url into a DataFrame called df.
Click to Check the Answer!
df = pd.read_html(url)[0]
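As a side note (a minimal sketch, not required for credit): pd.read_html returns a list of every table it finds on the page, so the [0] selects the first (and, on this page, only) table.

import pandas as pd

url = 'http://quotes.toscrape.com/tableful/'
tables = pd.read_html(url)   # list of DataFrames, one per table found on the page
print(len(tables))           # how many tables the page contains
df = tables[0]               # the quotes table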


Question 2 (Points: 4)

  • Write code to load the first/front page of the website for the url in the Chrome browser that is controlled by the automated test software Selenium.
Click to Check the Answer!
driver.get(url)


Question 3 (Points: 20)

  • Write Python code to scrape every quote from a paginated website (unknown number of pages, 10 quotes per page) by repeatedly clicking the Next button until it disappears.
    • On each page, extract each quote and its author into a column quote_author and its associated tags into a column tags, appending them to a DataFrame named df_clean.
    • After loading each page, pause execution for a random 1–2 second interval.
    • Examine the XPath examples below for quote_author (rows 2, 4, 6, …) and tags (rows 3, 5, 7, …), identify the pattern of even vs. odd row numbers, and use it to build f-strings for locating each element.
# Table body - quote with author
xpath_quote_author_01 = '/html/body/div/table/tbody/tr[2]/td'
xpath_quote_author_02 = '/html/body/div/table/tbody/tr[4]/td'
xpath_quote_author_03 = '/html/body/div/table/tbody/tr[6]/td'
xpath_quote_author_04 = '/html/body/div/table/tbody/tr[8]/td'
xpath_quote_author_05 = '/html/body/div/table/tbody/tr[10]/td'
xpath_quote_author_06 = '/html/body/div/table/tbody/tr[12]/td'
xpath_quote_author_07 = '/html/body/div/table/tbody/tr[14]/td'
xpath_quote_author_08 = '/html/body/div/table/tbody/tr[16]/td'
xpath_quote_author_09 = '/html/body/div/table/tbody/tr[18]/td'
xpath_quote_author_10 = '/html/body/div/table/tbody/tr[20]/td'

# Table body - tags
xpath_tags_01 = '/html/body/div/table/tbody/tr[3]/td'
xpath_tags_02 = '/html/body/div/table/tbody/tr[5]/td'
xpath_tags_03 = '/html/body/div/table/tbody/tr[7]/td'
xpath_tags_04 = '/html/body/div/table/tbody/tr[9]/td'
xpath_tags_05 = '/html/body/div/table/tbody/tr[11]/td'
xpath_tags_06 = '/html/body/div/table/tbody/tr[13]/td'
xpath_tags_07 = '/html/body/div/table/tbody/tr[15]/td'
xpath_tags_08 = '/html/body/div/table/tbody/tr[17]/td'
xpath_tags_09 = '/html/body/div/table/tbody/tr[19]/td'
xpath_tags_10 = '/html/body/div/table/tbody/tr[21]/td'
Click to Check the Answer!
from selenium.common.exceptions import NoSuchElementException

df_clean = pd.DataFrame()
while True:

    # Locate the Next button; it is absent on the last page
    try:
        btn = driver.find_element(By.PARTIAL_LINK_TEXT, "Next")
    except NoSuchElementException:
        btn = None

    # Quotes sit in even rows (2, 4, ..., 20); tags in the odd rows after them
    for i in range(1, 11):
        j = i*2
        k = i*2 + 1

        xpath_quote_author = f'/html/body/div/table/tbody/tr[{j}]/td'
        xpath_tags = f'/html/body/div/table/tbody/tr[{k}]/td'

        quote_author = driver.find_element(By.XPATH, xpath_quote_author).text
        tags = driver.find_element(By.XPATH, xpath_tags).text

        lst = [quote_author, tags]
        obs = pd.DataFrame([lst])
        df_clean = pd.concat([df_clean, obs])

    # Click Next and pause 1-2 seconds; stop once the button is gone
    if btn is not None:
        btn.click()
        time.sleep(random.uniform(1, 2))
    else:
        break

df_clean.columns = ['quote_author', 'tags']
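An optional tidy-up after the loop (a sketch, not part of the required answer): reset the row index and sanity-check the row count, which should be a multiple of 10 since the question states each page holds 10 quotes.

df_clean = df_clean.reset_index(drop=True)
assert df_clean.shape[0] % 10 == 0   # 10 quotes per page
print(df_clean.shape)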


Question 4 (Points: 4)

  • Write one line of code to export the df_clean DataFrame as a CSV file named table_quotes.csv inside the data subfolder of the current working directory given by wd_path.
    • Ensure that the CSV does not include the row index of the df_clean DataFrame.
Click to Check the Answer!
df_clean.to_csv('data/table_quotes.csv', index=False, encoding='utf-8-sig')
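A quick verification sketch (optional): read the file back and confirm its shape matches df_clean.

check = pd.read_csv('data/table_quotes.csv', encoding='utf-8-sig')
print(check.shape == df_clean.shape)   # expect True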


Question 5 (Points: 4)

  • Write one line of code to quit the Chrome browser that is controlled by the automated test software Selenium.
Click to Check the Answer!
driver.quit()



Section 2. Data Collection II

Question 6 (Points: 4)

  • In the client–server model of the web, which statement is true?
  1. A client hosts webpages, and a server displays them to users.
  2. A client requests data, and a server responds with data.
  3. Clients and servers are the same machine communicating via HTTPS.
  4. A server initiates requests, and a client responds with data.
Click to Check the Answer!
2. A client requests data, and a server responds with data.


Question 7 (Points: 10)

  • Fill in the following 5 blanks to make a request to the FRED API for collecting the U.S. unemployment rate (series_id = “UNRATE”).
import requests
import pandas as pd
param_dicts = {
  'api_key': 'YOUR_FRED_API_KEY', ## Change to your own key
  'file_type': 'json',
  'series_id': ___BLANK_1___    ## ID for series data from the FRED
}
api_endpoint = "https://api.stlouisfed.org/fred/series/observations"
response = ___BLANK_2___

# Check if the server successfully sends the requested content
if ___BLANK_3___:
    # Convert JSON response to Python dictionary.
    content = ___BLANK_4___

    # Create a DataFrame from the value of the 'observations' key in the content dictionary
    df = pd.DataFrame( ___BLANK_5___ )
else:
    print("The server cannot find the requested content")

Key-value pairs in the content dictionary:


Answer for ___BLANK_1___:

Click to Check the Answer!
"UNRATE"


Answer for ___BLANK_2___:

Click to Check the Answer!
requests.get(api_endpoint, params=param_dicts)


Answer for ___BLANK_3___:

Click to Check the Answer!
response.status_code == 200


Answer for ___BLANK_4___:

Click to Check the Answer!
response.json()


Answer for ___BLANK_5___:

Click to Check the Answer!
content['observations']
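Putting the five answers together, the completed request looks like this ('YOUR_FRED_API_KEY' remains a placeholder for a personal key):

import requests
import pandas as pd

param_dicts = {
  'api_key': 'YOUR_FRED_API_KEY',  # placeholder; use your own key
  'file_type': 'json',
  'series_id': 'UNRATE'            # U.S. unemployment rate series
}
api_endpoint = "https://api.stlouisfed.org/fred/series/observations"
response = requests.get(api_endpoint, params=param_dicts)

# Check if the server successfully sends the requested content
if response.status_code == 200:
    content = response.json()
    df = pd.DataFrame(content['observations'])
else:
    print("The server cannot find the requested content")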




Section 3. Pandas Basics I

import pandas as pd
import numpy as np


Below is the spotify DataFrame, created by reading the file spotify_all.csv, which contains data on Spotify users’ playlists (Source: Spotify Million Playlist Dataset Challenge).

spotify = pd.read_csv('https://bcdanl.github.io/data/spotify_all.csv')

Variable Description

  • pid: playlist ID; unique ID for a playlist
  • playlist_name: name of the playlist
  • pos: position of the track within a playlist (starting from 0)
  • artist_name: name of the track’s primary artist
  • track_name: name of the track
  • duration_ms: duration of the track in milliseconds

Definition of a Song

  • In this section, a song is defined as a combination of an artist_name value and a track_name value.

  • E.g., the following provides 12 distinct songs that share the track_name I Love You:

Index artist_name track_name
0 Alex & Sierra I Love You
1 Alex Clare I Love You
2 Axwell /\ Ingrosso I Love You
3 Beat Happening I Love You
4 Climax Blues Band I Love You
5 Céline Dion I Love You
6 Donna Summer I Love You
7 Eli Young Band I Love You
8 Faith Evans I Love You
9 Mary J. Blige I Love You
10 OMFG I Love You
11 Said The Whale I Love You

Question 8 (Points: 4)

  • Complete the following line so that num_columns is assigned the total number of columns in the spotify DataFrame as an integer:
num_columns = spotify._______________________________________________
Click to Check the Answer!
num_columns = spotify.shape[1]
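An equivalent alternative (standard pandas) that would also return the column count as an integer:

num_columns = len(spotify.columns)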


Question 9 (Points: 4)

  • Write code to count how many tracks each playlist contains.
Click to Check the Answer!
spotify[['pid', 'playlist_name']].value_counts().reset_index()
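A groupby-based sketch that produces the same per-playlist track counts, assuming each row of spotify is one track in one playlist (the name track_counts is illustrative):

track_counts = (
    spotify
    .groupby(['pid', 'playlist_name'])
    .size()
    .reset_index(name='count')
)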


Question 10 (Points: 5)

  • Write code to identify playlists that include at least 100 different artists.
Click to Check the Answer!
plist_100 = (
 spotify[['pid', 'playlist_name', 'artist_name']]
 .drop_duplicates()
 .value_counts()
 .reset_index()[['pid', 'playlist_name']]
 .value_counts()
 .reset_index()
 .query('count >= 100')
 )
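An equivalent, arguably more direct sketch using nunique to count distinct artists per playlist (variable names here are illustrative):

artist_counts = (
    spotify
    .groupby(['pid', 'playlist_name'])['artist_name']
    .nunique()
    .reset_index(name='n_artists')
)
plist_100_alt = artist_counts.query('n_artists >= 100')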


Question 11 (Points: 5)

  • Write code to calculate the proportion of playlists in spotify that contain at least one duplicated song, i.e. (the number of playlists with at least 1 duplicate song) divided by (the total number of distinct playlists)
Click to Check the Answer!
# Boolean flag for rows that repeat a song within the same playlist
dup_T = spotify.duplicated(subset=["pid", "artist_name", "track_name"])

# Playlists that contain at least one duplicated song
duplicate_plist = (
    spotify[ dup_T ]
    .drop_duplicates(subset = ['pid', 'playlist_name'])
    )

# All distinct playlists
all_plists = (
    spotify
    .drop_duplicates(subset = ['pid', 'playlist_name'])
    )

duplicate_plist.shape[0] / all_plists.shape[0]
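The same proportion can be computed in two lines (a sketch; dup_flag is an illustrative name): flag duplicate song rows, ask per playlist whether any row is flagged, and average the resulting True/False values.

dup_flag = spotify.duplicated(subset=['pid', 'artist_name', 'track_name'])
prop = dup_flag.groupby(spotify['pid']).any().mean()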


Question 12 (Points: 6)

  • Write code to add a new column, duration_cat, to the spotify DataFrame that categorizes duration_ms into:
    • “short” for tracks under 3 minutes (< 180000 ms)
    • “medium” for tracks between 3 and 5 minutes (180000–300000 ms)
    • “long” for tracks over 5 minutes (> 300000 ms)
Click to Check the Answer!
# Label tracks under 180000 ms as "short"; leave the rest as "" for now
spotify['duration_cat'] = np.where(spotify['duration_ms'] < 180*1000,
                                   "short", 
                                   "")

# Overwrite tracks over 300000 ms with "long"
spotify['duration_cat'] = np.where(spotify['duration_ms'] > 300*1000,
                                   "long", 
                                   spotify['duration_cat'])

# Everything still unlabeled is "medium"
spotify['duration_cat'] = np.where(spotify['duration_cat'] == '',
                                   "medium", 
                                   spotify['duration_cat'])
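An equivalent single-step sketch with pd.cut, producing a categorical column rather than plain strings. Since duration_ms is integer-valued, the bin edges below reproduce the same boundary behavior (exactly 180000 ms and exactly 300000 ms both land in "medium"):

spotify['duration_cat'] = pd.cut(
    spotify['duration_ms'],
    bins=[0, 180000, 300001, float('inf')],  # [0,180000), [180000,300001), [300001,inf)
    labels=['short', 'medium', 'long'],
    right=False
)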


Question 13 (Points: 7)

  • Write code to find the second most frequently appearing song(s) across all playlists that include “Viva La Vida” (track_name) by “Coldplay” (artist_name).
Click to Check the Answer!
# Playlists that contain "Viva La Vida" by Coldplay
coldplay_list = (
    spotify.query('artist_name == "Coldplay" & track_name == "Viva La Vida"')
    )

# Count songs across those playlists; the top song is "Viva La Vida" itself,
# so nlargest(2, ..., keep="all") also captures the second most frequent song(s)
after_coldplay = (
    spotify[ spotify['pid'].isin(coldplay_list['pid']) ]
    [ ['artist_name', 'track_name'] ]
    .value_counts()
    .reset_index()
    .nlargest(2, 'count', keep="all")
    )
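Since the top row of after_coldplay is normally “Viva La Vida” itself, a follow-up filter (illustrative) isolates the second most frequent song(s):

second_most = after_coldplay.query(
    'artist_name != "Coldplay" or track_name != "Viva La Vida"'
)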


Question 14 (Points: 7)

  • Consider a DataFrame called songs, derived from spotify, that adds a new variable song by concatenating each row’s artist_name and track_name:
Index pid playlist_name song
0 0 Throwbacks Missy Elliott - Lose Control (feat. Ciara & Fat Man Scoop)
1 0 Throwbacks Britney Spears - Toxic
2 0 Throwbacks Beyoncé - Crazy In Love
.....
  • Note that the songs DataFrame has only the three variables—pid, playlist_name, and song, as shown above.

  • Write code to determine which two different songs appear together in playlists more often than any other pair, as shown below.

    • (Hint: self‑merge the songs DataFrame on pid to get song_x-song_y pairs, which is a many-to-many join.)
Index song_x song_y count
76 Kendrick Lamar – HUMBLE. Lil Uzi Vert – XO TOUR Llif3 67

Below is how songs DataFrame is created:

songs = (
    spotify
    .drop_duplicates(subset = ['pid', 'playlist_name', 'artist_name', 'track_name'])
    [['pid', 'playlist_name', 'artist_name', 'track_name']]
    )

songs['song'] = songs['artist_name'] + ' – ' + songs['track_name']
songs = songs[['pid', 'playlist_name', 'song']]
Click to Check the Answer!
pairs = songs.merge(songs, on='pid')

pair_counts = (
    pairs[['song_x','song_y']]
    .value_counts()
    .reset_index()
    # Lexicographic filter: drops self-pairs and keeps only one ordering of
    # each pair, so (song_x, song_y) and (song_y, song_x) are not double-counted
    .query('song_x < song_y')
    .nlargest(1, 'count', keep = "all")  
)



Section 4. Pandas Basics II

Below is the shoes DataFrame, created by reading the file onlinestore_shoes_simple.csv, which contains data on “shoes” search results from an online store.

shoes = pd.read_csv('https://bcdanl.github.io/data/onlinestore_shoes_simple.csv')

Variable Description

  • id: unique identifier for the product
  • brandId: unique identifier for the brand
  • brandName: name of the brand
  • name: product name or title
  • reviewCount: number of customer reviews
  • reviewStarRating: average star rating from customer reviews
  • current_p: current price (the column appears as currernt_p in the CSV)
  • clearance_p: clearance price (NA if not available for clearance)


Question 15 (Points: 5)

  • Write code to identify the top deal product(s), where a deal is defined by all of the following:
    1. On clearance: the product’s current_p equals its clearance_p.
    2. Highly reviewed: reviewCount is at or above the 95th percentile of all reviewCount values.
    3. Best rated: among these clearance, highly‑reviewed products, it has the maximum reviewStarRating (including any ties).
Index id brandId brandName name reviewCount reviewStarRating currernt_p clearance_p
92 6545910 9423 Tory Burch Ballet Loafer 1008 4.8 178 178
393 7248724 9423 Tory Burch Howell Court Sneaker 1044 4.8 88 88
6318 5654262 15361 See Kai Run Stevie II Sneaker 113 4.8 35 35
Click to Check the Answer!
# 95th percentile of reviewCount
review_top5 = shoes['reviewCount'].quantile(.95)

# Clearance + highly reviewed products, keeping the best-rated (with ties)
shoes_promo = (
    shoes[(shoes['clearance_p'] == shoes['currernt_p']) & (shoes['reviewCount'] >= review_top5) ]
    .nlargest(1, 'reviewStarRating', keep = 'all')
    )
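An equivalent query-based sketch (the column name is kept as it appears in the data; @review_top5 references the local variable; shoes_promo_alt is an illustrative name):

shoes_promo_alt = (
    shoes
    .query('clearance_p == currernt_p and reviewCount >= @review_top5')
    .nlargest(1, 'reviewStarRating', keep='all')
)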


Question 16 (Points: 7)

  • Write pandas code to:
    1. Count products per brand: determine how many items each brand offers.
    2. Find the 90th‑percentile threshold of those brand‑level counts.
    3. Select “popular” brands whose product counts exceed that threshold.
    4. Compute clearance rates for each of these popular brands: the fraction of their items with a non‑missing clearance_p.
    5. List the top five brands by clearance rate, reporting each brand and its rate, as shown in the DataFrame below.
Index brandName count_x count_y rate
4 MIA 51 76 0.6710526
1 Fly London 76 125 0.6080000
8 Vince Camuto 41 77 0.5324675
13 Keds® 37 70 0.5285714
12 Cole Haan 37 85 0.4352941
Click to Check the Answer!
# 1. Count total products per brand
total_per_brand = shoes['brandName'].value_counts().reset_index()

# 2. Keep "popular" brands whose product counts reach the 90th-percentile threshold
total_per_brand_n = total_per_brand['count'].quantile(.9)
total_per_brand = total_per_brand[total_per_brand['count'] >= total_per_brand_n]

# 3. Count clearance products per brand
shoes['clearance_available'] = shoes['clearance_p'].notna()

total_per_brand_c = (
    shoes
    .query('clearance_available == True')['brandName']
    .value_counts()
    .reset_index()
    )

# 4. Join, compute each popular brand's clearance rate, and take the top five
df = total_per_brand_c.merge(total_per_brand, on='brandName', how = 'left')
df = df[~df['count_y'].isna()]
df['rate'] = df['count_x'] / df['count_y'] 
df_top5 = df.nlargest(5, 'rate', keep = 'all')
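A more direct groupby/agg sketch that computes the same rates (names such as brand_stats are illustrative):

brand_stats = (
    shoes
    .assign(on_clearance=shoes['clearance_p'].notna())
    .groupby('brandName')
    .agg(total=('id', 'size'), clearance=('on_clearance', 'sum'))
    .reset_index()
)
threshold = brand_stats['total'].quantile(.9)
popular = brand_stats.query('total >= @threshold').copy()
popular['rate'] = popular['clearance'] / popular['total']
top5 = popular.nlargest(5, 'rate', keep='all')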