Homework 3
Python selenium & Pandas Basics
Descriptive Statistics
The following provides the descriptive statistics for each part of the Homework 3:
import pandas as pdDirection
Please submit your Python Script for Part 1 and Part 2 in Homework 3 to Brightspace with the name below:
danl_210_hw3_LASTNAME_FIRSTNAME.py
( e.g.,danl_210_hw3_choe_byeonghak.py)
The due is April 9, 2025, 11:59 P.M.
Please send Prof. Choe an email (bchoe@geneseo.edu) if you have any questions.
Understanding duplicated() and drop_duplicates() in Pandas
duplicated()
The duplicated() function returns a boolean Series indicating whether each row is a duplicate based on specified columns.
Parameters:
subset: Columns to consider when identifying duplicateskeep:'first'(default): Marks duplicates after the first asTrue'last': Marks duplicates before the last asTrueFalse: Marks all duplicates asTrue
🔢 Example DataFrame
df = pd.DataFrame({
"pid": [1, 1, 2, 2, 2, 3],
"artist_name": ["A", "A", "B", "B", "B", "C"],
"track_name": ["x", "x", "y", "y", "y", "z"]
})df.duplicated(subset=["pid", "artist_name", "track_name"], keep='first')0 False
1 True
2 False
3 True
4 True
5 False
dtype: bool
df.duplicated(subset=["pid", "artist_name", "track_name"], keep='last')0 True
1 False
2 True
3 True
4 False
5 False
dtype: bool
df.duplicated(subset=["pid", "artist_name", "track_name"], keep=False)0 True
1 True
2 True
3 True
4 True
5 False
dtype: bool
drop_duplicates()
The drop_duplicates() function removes duplicate rows based on specified columns. The behavior is controlled by the keep parameter.
Parameters:
subset: Columns to consider when identifying duplicateskeep:'first'(default): Keeps the first occurrence, removes the rest'last': Keeps the last occurrence, removes the restFalse: Removes all duplicates, keeping only unique entries
🔢 Example DataFrame
df = pd.DataFrame({
"pid": [1, 1, 2, 2, 2, 3],
"artist_name": ["A", "A", "B", "B", "B", "C"],
"track_name": ["x", "x", "y", "y", "y", "z"]
})df.drop_duplicates(subset=["pid", "artist_name", "track_name"], keep='first') pid artist_name track_name
0 1 A x
2 2 B y
5 3 C z
df.drop_duplicates(subset=["pid", "artist_name", "track_name"], keep='last') pid artist_name track_name
1 1 A x
4 2 B y
5 3 C z
df.drop_duplicates(subset=["pid", "artist_name", "track_name"], keep=False) pid artist_name track_name
5 3 C z
Part 1. Collecting the IMDb Data using Python Selenium
Question 1
- Go to the following Internet Movie Database (IMDb) webpage for
- Provide your Python Selenium code to scrape the following information about Top 200 Family Movies and TV Shows by Popularity.
- You should create the following variables in the DataFrame:
ranking(e.g., 1, 2, 3)title(e.g., Snow White, Freakier Friday)year(e.g., 2025, 2025-)runtime(e.g., 1h 56m, 1h 40m)rating(e.g., PG, TV-G, G)imdb_score(e.g., 7.6, 6.5)votes(e.g., 1.4k, 70k)metascore(e.g., 56, 74)plot(e.g., A princess joins forces with seven dwarfs to liberate her kingdom from her cruel stepmother the Evil Queen.)
- You should create the following variables in the DataFrame:
- Suppose
rankingandtitleare initially combined in a single column calledranking_title, with values like:- “1. Snow White”
- “2. Freakier Friday”
- You can split this column into
rankingandtitleusing the following code:
# Split the 'ranking_title' column into two new variables, 'ranking' and 'title'
# by the first occurrence of a period (dot).
# n=1 ensures the split to be limited to the first dot.
# expand=True tells pandas to return the results in a DataFrame
# If expand were False, the result would be a Series of lists.
df[['ranking', 'title']] = df['ranking_title'].str.split('.', n=1, expand=True)
df = df.drop(columns=['ranking_title'])- Finally, export the resulting DataFrame to the CSV file.
Click to Check the Answer!
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Tue Apr 15 14:05:39 2025
@author: bchoe
"""
# %%
# =============================================================================
# Libraries
# =============================================================================
import pandas as pd
import numpy as np
import os
import time
import random
# from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium import webdriver
# from selenium.webdriver.chrome.service import Service
# %%
# =============================================================================
# working directory
# =============================================================================
wd_path = 'PATHNAME_OF_YOUR_WORKING_DIRECTORY'
os.chdir(wd_path)
# os.getcwd()
# %%
# =============================================================================
# Selenium driver
# =============================================================================
driver = webdriver.Chrome()
url = 'https://www.imdb.com/search/title/?genres=family&sort=popularity,desc&count=200'
driver.get(url)
time.sleep(random.uniform(5,8))
# %%
# =============================================================================
# Patterns
# =============================================================================
# <h3 class="ipc-title__text">1. A Minecraft Movie</h3>
# ranking_titles[0].text
# ranking_titles[1].text
# ranking_titles[200].text
# //*[@id="__next"]/main/div[2]/div[3]/section/section/div/section/section/div[2]/div/section/div[2]/div[2]/ul/li[1]/div/div/div/div[1]/div[2]/div[2]/span[1]
# //*[@id="__next"]/main/div[2]/div[3]/section/section/div/section/section/div[2]/div/section/div[2]/div[2]/ul/li[2]/div/div/div/div[1]/div[2]/div[2]/span[1]
# <span class="ipc-rating-star--rating">7.2</span>
# <span class="ipc-rating-star--voteCount"> (<!-- -->149K<!-- -->)</span>
# <div class="ipc-html-content-inner-div" role="presentation">A princess joins forces with seven dwarfs and a group of rebels to liberate her kingdom from her cruel stepmother the Evil Queen.</div>
# plots[0].text
# plots[199].text
# class_imdb_scores = "ipc-rating-star--rating"
# imdb_scores = driver.find_elements(By.CLASS_NAME, class_imdb_scores)
# <span aria-label="IMDb rating: 6.6" class="ipc-rating-star ipc-rating-star--base ipc-rating-star--imdb ratingGroup--imdb-rating" data-testid="ratingGroup--imdb-rating"><svg width="24" height="24" xmlns="http://www.w3.org/2000/svg" class="ipc-icon ipc-icon--star-inline" viewBox="0 0 24 24" fill="currentColor" role="presentation"><path d="M12 20.1l5.82 3.682c1.066.675 2.37-.322 2.09-1.584l-1.543-6.926 5.146-4.667c.94-.85.435-2.465-.799-2.567l-6.773-.602L13.29.89a1.38 1.38 0 0 0-2.581 0l-2.65 6.53-6.774.602C.052 8.126-.453 9.74.486 10.59l5.147 4.666-1.542 6.926c-.28 1.262 1.023 2.26 2.09 1.585L12 20.099z"></path></svg><span class="ipc-rating-star--rating">6.6</span><span class="ipc-rating-star--voteCount"> (<!-- -->100K<!-- -->)</span></span>
# <span class="ipc-rating-star ipc-rating-star--base ipc-rating-star--placeholder ratingGroup--placeholder standalone-star" data-testid="ratingGroup--placeholder" aria-hidden="true"><svg width="24" height="24" xmlns="http://www.w3.org/2000/svg" class="ipc-icon ipc-icon--star-inline" viewBox="0 0 24 24" fill="currentColor" role="presentation"><path d="M12 20.1l5.82 3.682c1.066.675 2.37-.322 2.09-1.584l-1.543-6.926 5.146-4.667c.94-.85.435-2.465-.799-2.567l-6.773-.602L13.29.89a1.38 1.38 0 0 0-2.581 0l-2.65 6.53-6.774.602C.052 8.126-.453 9.74.486 10.59l5.147 4.666-1.542 6.926c-.28 1.262 1.023 2.26 2.09 1.585L12 20.099z"></path></svg></span>
# class_imdb_scores = "ipc-rating-star--base"
# imdb_scores = driver.find_elements(By.CLASS_NAME, class_imdb_scores)
# imdb_scores[0].text
# class_votes = "ipc-rating-star--voteCount"
# votes = driver.find_elements(By.CLASS_NAME, class_votes)
# %%
# =============================================================================
# Loop
# =============================================================================
class_ranking_titles = "ipc-title__text"
ranking_titles = driver.find_elements(By.CLASS_NAME, class_ranking_titles)
class_plots = "ipc-html-content-inner-div"
plots = driver.find_elements(By.CLASS_NAME, class_plots)
df = pd.DataFrame()
for item in range(1, len(ranking_titles)-1):
ranking_title = ranking_titles[item].text
xpath_year = f'//*[@id="__next"]/main/div[2]/div[3]/section/section/div/section/section/div[2]/div/section/div[2]/div[2]/ul/li[{item}]/div/div/div/div[1]/div[2]/div[2]/span[1]'
year = driver.find_element(By.XPATH, xpath_year).text
xpath_items = f'//*[@id="__next"]/main/div[2]/div[3]/section/section/div/section/section/div[2]/div/section/div[2]/div[2]/ul/li[{item}]/div/div/div/div[1]/div[2]/div[2]'
items_txt = driver.find_element(By.XPATH, xpath_items).text
# `"TV Series" in items_txt` returns True if "TV Series" is a substring of items_txt; False otherwise.
if "TV Series" in items_txt:
runtime = pd.NA # This is a missing value; Empty string ('') can also be used instead.
xpath_rating = f'//*[@id="__next"]/main/div[2]/div[3]/section/section/div/section/section/div[2]/div/section/div[2]/div[2]/ul/li[{item}]/div/div/div/div[1]/div[2]/div[2]/span[2]'
rating = driver.find_element(By.XPATH, xpath_rating).text
metascore = pd.NA
type_movie = "TV Series"
elif "TV Movie" in items_txt:
xpath_runtime = f'//*[@id="__next"]/main/div[2]/div[3]/section/section/div/section/section/div[2]/div/section/div[2]/div[2]/ul/li[{item}]/div/div/div/div[1]/div[2]/div[2]/span[2]'
runtime = driver.find_element(By.XPATH, xpath_runtime).text
xpath_rating = f'//*[@id="__next"]/main/div[2]/div[3]/section/section/div/section/section/div[2]/div/section/div[2]/div[2]/ul/li[{item}]/div/div/div/div[1]/div[2]/div[2]/span[3]'
rating = driver.find_element(By.XPATH, xpath_rating).text
metascore = pd.NA
type_movie = "TV Movie"
else:
try:
xpath_runtime = f'//*[@id="__next"]/main/div[2]/div[3]/section/section/div/section/section/div[2]/div/section/div[2]/div[2]/ul/li[{item}]/div/div/div/div[1]/div[2]/div[2]/span[2]'
runtime = driver.find_element(By.XPATH, xpath_runtime).text
except:
runtime = pd.NA
try:
xpath_rating = f'//*[@id="__next"]/main/div[2]/div[3]/section/section/div/section/section/div[2]/div/section/div[2]/div[2]/ul/li[{item}]/div/div/div/div[1]/div[2]/div[2]/span[3]'
rating = driver.find_element(By.XPATH, xpath_rating).text
except:
rating = pd.NA
try:
xpath_metascore = f'//*[@id="__next"]/main/div[2]/div[3]/section/section/div/section/section/div[2]/div/section/div[2]/div[2]/ul/li[{item}]/div/div/div/div[1]/div[2]/div[2]/span[4]/span[1]'
metascore = driver.find_element(By.XPATH, xpath_metascore).text
except:
metascore = pd.NA
type_movie = "Movie"
try:
xpath_imdb_score = f'//*[@id="__next"]/main/div[2]/div[3]/section/section/div/section/section/div[2]/div/section/div[2]/div[2]/ul/li[{item}]/div/div/div/div[1]/div[2]/span/div/span/span[1]'
imdb_score = driver.find_element(By.XPATH, xpath_imdb_score).text
except:
imdb_score = pd.NA
try:
xpath_vote = f'//*[@id="__next"]/main/div[2]/div[3]/section/section/div/section/section/div[2]/div/section/div[2]/div[2]/ul/li[{item}]/div/div/div/div[1]/div[2]/span/div/span/span[2]'
vote = driver.find_element(By.XPATH, xpath_vote).text
except:
vote = pd.NA
plot = plots[item-1].text
lst = [ranking_title, year, runtime, rating, metascore, type_movie, imdb_score, vote, plot]
obs = pd.DataFrame([lst])
df = pd.concat([df, obs], ignore_index=True)
df.columns = ['ranking_title', 'year', 'runtime', 'rating', 'metascore', 'type_movie', 'imdb_score', 'vote', 'plot']
# %%
# =============================================================================
# Cleaning
# =============================================================================
df[['ranking', 'title']] = df['ranking_title'].str.split('.', n=1, expand=True)
df = df.drop(columns=['ranking_title'])
df['ranking'] = df['ranking'].astype('int')
df['vote'] = df['vote'].str.replace(' (', '')
df['vote'] = df['vote'].str.replace(')', '')
df[['runtime_h', 'runtime_m']] = df['runtime'].str.split(' ', n=1, expand=True)
df['runtime_h'] = df['runtime_h'].str.replace('h', '')
df['runtime_m'] = df['runtime_m'].str.replace('m', '')
df['runtime_h'] = pd.to_numeric(df['runtime_h'], errors='coerce')
df['runtime_m'] = pd.to_numeric(df['runtime_m'], errors='coerce')
df['runtime'] = df['runtime_h'] * 60 + df['runtime_m']
df = df.drop(columns=['runtime_h', 'runtime_m'])
df['metascore'] = pd.to_numeric(df['metascore'], errors='coerce') # errors='coerce' to avoid an error
df['imdb_score'] = pd.to_numeric(df['imdb_score'], errors='coerce')
df['is_k'] = df['vote'].str.endswith('K')
df['is_m'] = df['vote'].str.endswith('M')
df['vote'] = df['vote'].str.replace('K', '').str.replace('M', '')
df['vote'] = pd.to_numeric(df['vote'], errors='coerce')
df['vote'] = np.where(df['is_k'] == True, df['vote'] * 1000, df['vote'])
df['vote'] = np.where(df['is_m'] == True, df['vote'] * 1000000, df['vote'])
df = df.drop(columns=['is_k', 'is_m'])
df[['year_start', 'year_end']] = df['year'].str.split('–', n=1, expand=True)
df['year_start'] = df['year_start'].astype('int')
df['year_end'] = np.where(df['year_end'] == '', 'present', df['year_end'])
# df.columns
df = df[['ranking', 'title', 'year', 'year_start', 'year_end', 'runtime', 'rating', 'metascore', 'type_movie', 'imdb_score', 'vote', 'plot']]
# Export df as CSV
df.to_csv('data/imdb_family_2025_0415.csv', index = False, encoding = 'utf-8-sig')Part 2. Pandas Basic
Below is spotify DataFrame that reads the file spotify_all.csv containing data of Spotify users’ playlist information (Source: Spotify Million Playlist Dataset Challenge).
spotify = pd.read_csv('https://bcdanl.github.io/data/spotify_all.csv')Variable Description
pid: playlist ID; unique ID for playlistplaylist_name: a name of playlistpos: a position of the track within a playlist (starting from 0)artist_name: name of the track’s primary artisttrack_name: name of the trackduration_ms: duration of the track in millisecondsalbum_name: name of the track’s album
Definition of a Song
In Part 2, a song is defined as a combination of a
artist_namevalue and atrack_namevalue.E.g., the following provides the 12 distinct songs with the same
track_name—I Love You:
Question 2
- Write a Python code to identify the top five songs with the highest frequency of appearances in the
spotifyDataFrame.
Click to Check the Answer!
songs_top_5 = (
spotify[['artist_name', 'track_name']]
.value_counts()
.reset_index()
.nlargest(5, 'count', keep="all")
)Question 3
- Write a Python code to create a DataFrame that contains information about how often each song occurs within the
spotifyDataFrame.- In this new DataFrame, each observation should represent a distinct song.
- Then, write a Python code to identify top 5% songs based on their frequency of appearances.
Click to Check the Answer!
songs_freq = (
spotify[['artist_name', 'track_name']]
.value_counts()
.reset_index()
)
cond = songs_freq['count'] >= songs_freq['count'].quantile(.95)
songs_top_5_pct = songs_freq[ cond ]Question 4
- Write a Python code to list all artists who have more than 50 unique songs in the
sportifyDataFrame.
Click to Check the Answer!
artists_50_songs = (
spotify[['artist_name', 'track_name']]
.value_counts()
.reset_index()['artist_name']
.value_counts()
.reset_index()
.query('count > 50')
)Question 5
- Write a Python code to create a DataFrame that identifies all the playlists featuring the song “One Dance” by Drake.
Click to Check the Answer!
playlist_onedance = (
spotify
.query('artist_name == "Drake" & track_name == "One Dance"')
)
playlist_songs_w_onedance = (
spotify[ spotify['pid'].isin(playlist_onedance['pid']) ]
)Question 6
- Write a Python code to identify the longest and shortest duration songs (based on
duration_ms) for each unique artist.
Click to Check the Answer!
# Shortest song per artist
shortest_songs = (
spotify
.sort_values(["artist_name", "duration_ms"])
.drop_duplicates(subset=["artist_name"], keep="first")
[['artist_name', 'track_name', 'duration_ms']]
)
# Longest song per artist
longest_songs = (
spotify
.sort_values(["artist_name", "duration_ms"])
.drop_duplicates(subset=["artist_name"], keep="last")
[['artist_name', 'track_name', 'duration_ms']]
)Question 7
Write a Python code to find the same song(s) appearing more than once in the same playlist.
- Using
drop_duplicates()with index-based filtering.
Click to Check the Answer!
non_dup = spotify.drop_duplicates(subset=["pid", "artist_name", "track_name"], keep=False)
dup_T = ~spotify.index.isin( non_dup.index )
duplicates = spotify[ dup_T ]- Using
duplicated()with standard filtering.- The
subsetandkeepparameters work similarly to those indrop_duplicates()
- The
Click to Check the Answer!
dup_T = spotify.duplicated(subset=["pid", "artist_name", "track_name"], keep = False)
duplicates = (
spotify[ dup_T ]
)Question 8
- Write a Python code to filter all songs that appear in more than 100 different playlists.
Click to Check the Answer!
songs_100_plist = (
spotify
.drop_duplicates(subset=["pid", "artist_name", "track_name"], keep=False)[['artist_name', 'track_name']]
.value_counts()
.reset_index()
.query('count > 100')
)Part 3. Jupyter Notebook Blogging
holiday_movies = pd.read_csv("https://bcdanl.github.io/data/holiday_movies.csv")The DataFrame
holiday_moviescomes from the IMDb.The following is the DataFrame,
holiday_movies.
Variable description
tconst: alphanumeric unique identifier of the titletitle_type: the type/format of the title- (movie, video, or tvMovie)
primary_title: the more popular title / the title used by the filmmakers on promotional materials at the point of releasesimple_title: the title in lowercase, with punctuation removed, for easier filtering and groupingyear: the release year of a titleruntime_minutes: primary runtime of the title, in minutesaverage_rating: weighted average of all the individual user ratings on IMDbnum_votes: number of votes the title has received on IMDb (titles with fewer than 10 votes were not included in this dataset)
- The following is another DataFrame
holiday_movie_genresthat is related with the DataFrameholiday_movies:
holiday_movie_genres = pd.read_csv("https://bcdanl.github.io/data/holiday_movie_genres.csv")- The DataFrame
holiday_movie_genresinclude up to three genres associated with the titles that appear in the DataFrame.
Variable description
tconst: alphanumeric unique identifier of the titlegenres: genres associated with the title, one row per genre
- Using the provided DataFrames, write a blog post about Christmas movies in a Jupyter Notebook and publish it on your online blog.
- In your analysis, make use of counting, sorting, indexing, filtering, and joining techniques.