= pd.DataFrame({
df "pid": [1, 1, 2, 2, 2, 3],
"artist_name": ["A", "A", "B", "B", "B", "C"],
"track_name": ["x", "x", "y", "y", "y", "z"]
})
Homework 3
Python selenium
& Pandas Basics
import pandas as pd
Direction
Please submit your Python Script for Part 1 and Part 2 in Homework 3 to Brightspace with the name below:
danl_210_hw3_LASTNAME_FIRSTNAME.py
( e.g.,danl_210_hw3_choe_byeonghak.py
)
The due is April 9, 2025, 11:59 P.M.
Please send Prof. Choe an email (bchoe@geneseo.edu) if you have any questions.
Understanding duplicated()
and drop_duplicates()
in Pandas
duplicated()
The duplicated()
function returns a boolean Series indicating whether each row is a duplicate based on specified columns.
Parameters:
subset
: Columns to consider when identifying duplicateskeep
:'first'
(default): Marks duplicates after the first asTrue
'last'
: Marks duplicates before the last asTrue
False
: Marks all duplicates asTrue
🔢 Example DataFrame
=["pid", "artist_name", "track_name"], keep='first') df.duplicated(subset
0 False
1 True
2 False
3 True
4 True
5 False
dtype: bool
=["pid", "artist_name", "track_name"], keep='last') df.duplicated(subset
0 True
1 False
2 True
3 True
4 False
5 False
dtype: bool
=["pid", "artist_name", "track_name"], keep=False) df.duplicated(subset
0 True
1 True
2 True
3 True
4 True
5 False
dtype: bool
drop_duplicates()
The drop_duplicates()
function removes duplicate rows based on specified columns. The behavior is controlled by the keep
parameter.
Parameters:
subset
: Columns to consider when identifying duplicateskeep
:'first'
(default): Keeps the first occurrence, removes the rest'last'
: Keeps the last occurrence, removes the restFalse
: Removes all duplicates, keeping only unique entries
🔢 Example DataFrame
= pd.DataFrame({
df "pid": [1, 1, 2, 2, 2, 3],
"artist_name": ["A", "A", "B", "B", "B", "C"],
"track_name": ["x", "x", "y", "y", "y", "z"]
})
=["pid", "artist_name", "track_name"], keep='first') df.drop_duplicates(subset
pid artist_name track_name
0 1 A x
2 2 B y
5 3 C z
=["pid", "artist_name", "track_name"], keep='last') df.drop_duplicates(subset
pid artist_name track_name
1 1 A x
4 2 B y
5 3 C z
=["pid", "artist_name", "track_name"], keep=False) df.drop_duplicates(subset
pid artist_name track_name
5 3 C z
Part 1. Collecting the IMDb Data using Python Selenium
Question 1
- Go to the following Internet Movie Database (IMDb) webpage for
- Provide your Python Selenium code to scrape the following information about Top 200 Family Movies and TV Shows by Popularity.
- You should create the following variables in the DataFrame:
ranking
(e.g., 1, 2, 3)title
(e.g., Snow White, Freakier Friday)year
(e.g., 2025, 2025-)runtime
(e.g., 1h 56m, 1h 40m)rating
(e.g., PG, TV-G, G)imdb_score
(e.g., 7.6, 6.5)votes
(e.g., 1.4k, 70k)metascore
(e.g., 56, 74)plot
(e.g., A princess joins forces with seven dwarfs to liberate her kingdom from her cruel stepmother the Evil Queen.)
- You should create the following variables in the DataFrame:
- Suppose
ranking
andtitle
are initially combined in a single column calledranking_title
, with values like:- “1. Snow White”
- “2. Freakier Friday”
- You can split this column into
ranking
andtitle
using the following code:
# Split the 'ranking_title' column into two new variables, 'ranking' and 'title'
# by the first occurrence of a period (dot).
# n=1 ensures the split to be limited to the first dot.
# expand=True tells pandas to return the results in a DataFrame
# If expand were False, the result would be a Series of lists.
'ranking', 'title']] = df['ranking_title'].str.split('.', n=1, expand=True)
df[[= df.drop(columns=['ranking_title']) df
- Finally, export the resulting DataFrame to the CSV file.
Click to Check the Answer!
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Tue Apr 15 14:05:39 2025
@author: bchoe
"""
# %%
# =============================================================================
# Libraries
# =============================================================================
import pandas as pd
import numpy as np
import os
import time
import random
# from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium import webdriver
# from selenium.webdriver.chrome.service import Service
# %%
# =============================================================================
# working directory
# =============================================================================
= 'PATHNAME_OF_YOUR_WORKING_DIRECTORY'
wd_path
os.chdir(wd_path)# os.getcwd()
# %%
# =============================================================================
# Selenium driver
# =============================================================================
= webdriver.Chrome()
driver = 'https://www.imdb.com/search/title/?genres=family&sort=popularity,desc&count=200'
url
driver.get(url)5,8))
time.sleep(random.uniform(
# %%
# =============================================================================
# Patterns
# =============================================================================
# <h3 class="ipc-title__text">1. A Minecraft Movie</h3>
# ranking_titles[0].text
# ranking_titles[1].text
# ranking_titles[200].text
# //*[@id="__next"]/main/div[2]/div[3]/section/section/div/section/section/div[2]/div/section/div[2]/div[2]/ul/li[1]/div/div/div/div[1]/div[2]/div[2]/span[1]
# //*[@id="__next"]/main/div[2]/div[3]/section/section/div/section/section/div[2]/div/section/div[2]/div[2]/ul/li[2]/div/div/div/div[1]/div[2]/div[2]/span[1]
# <span class="ipc-rating-star--rating">7.2</span>
# <span class="ipc-rating-star--voteCount"> (<!-- -->149K<!-- -->)</span>
# <div class="ipc-html-content-inner-div" role="presentation">A princess joins forces with seven dwarfs and a group of rebels to liberate her kingdom from her cruel stepmother the Evil Queen.</div>
# plots[0].text
# plots[199].text
# class_imdb_scores = "ipc-rating-star--rating"
# imdb_scores = driver.find_elements(By.CLASS_NAME, class_imdb_scores)
# <span aria-label="IMDb rating: 6.6" class="ipc-rating-star ipc-rating-star--base ipc-rating-star--imdb ratingGroup--imdb-rating" data-testid="ratingGroup--imdb-rating"><svg width="24" height="24" xmlns="http://www.w3.org/2000/svg" class="ipc-icon ipc-icon--star-inline" viewBox="0 0 24 24" fill="currentColor" role="presentation"><path d="M12 20.1l5.82 3.682c1.066.675 2.37-.322 2.09-1.584l-1.543-6.926 5.146-4.667c.94-.85.435-2.465-.799-2.567l-6.773-.602L13.29.89a1.38 1.38 0 0 0-2.581 0l-2.65 6.53-6.774.602C.052 8.126-.453 9.74.486 10.59l5.147 4.666-1.542 6.926c-.28 1.262 1.023 2.26 2.09 1.585L12 20.099z"></path></svg><span class="ipc-rating-star--rating">6.6</span><span class="ipc-rating-star--voteCount"> (<!-- -->100K<!-- -->)</span></span>
# <span class="ipc-rating-star ipc-rating-star--base ipc-rating-star--placeholder ratingGroup--placeholder standalone-star" data-testid="ratingGroup--placeholder" aria-hidden="true"><svg width="24" height="24" xmlns="http://www.w3.org/2000/svg" class="ipc-icon ipc-icon--star-inline" viewBox="0 0 24 24" fill="currentColor" role="presentation"><path d="M12 20.1l5.82 3.682c1.066.675 2.37-.322 2.09-1.584l-1.543-6.926 5.146-4.667c.94-.85.435-2.465-.799-2.567l-6.773-.602L13.29.89a1.38 1.38 0 0 0-2.581 0l-2.65 6.53-6.774.602C.052 8.126-.453 9.74.486 10.59l5.147 4.666-1.542 6.926c-.28 1.262 1.023 2.26 2.09 1.585L12 20.099z"></path></svg></span>
# class_imdb_scores = "ipc-rating-star--base"
# imdb_scores = driver.find_elements(By.CLASS_NAME, class_imdb_scores)
# imdb_scores[0].text
# class_votes = "ipc-rating-star--voteCount"
# votes = driver.find_elements(By.CLASS_NAME, class_votes)
# %%
# =============================================================================
# Loop
# =============================================================================
= "ipc-title__text"
class_ranking_titles = driver.find_elements(By.CLASS_NAME, class_ranking_titles)
ranking_titles
= "ipc-html-content-inner-div"
class_plots = driver.find_elements(By.CLASS_NAME, class_plots)
plots
= pd.DataFrame()
df for item in range(1, len(ranking_titles)-1):
= ranking_titles[item].text
ranking_title
= f'//*[@id="__next"]/main/div[2]/div[3]/section/section/div/section/section/div[2]/div/section/div[2]/div[2]/ul/li[{item}]/div/div/div/div[1]/div[2]/div[2]/span[1]'
xpath_year = driver.find_element(By.XPATH, xpath_year).text
year
= f'//*[@id="__next"]/main/div[2]/div[3]/section/section/div/section/section/div[2]/div/section/div[2]/div[2]/ul/li[{item}]/div/div/div/div[1]/div[2]/div[2]'
xpath_items = driver.find_element(By.XPATH, xpath_items).text
items_txt
# `"TV Series" in items_txt` returns True if "TV Series" is a substring of items_txt; False otherwise.
if "TV Series" in items_txt:
= pd.NA # This is a missing value; Empty string ('') can also be used instead.
runtime
= f'//*[@id="__next"]/main/div[2]/div[3]/section/section/div/section/section/div[2]/div/section/div[2]/div[2]/ul/li[{item}]/div/div/div/div[1]/div[2]/div[2]/span[2]'
xpath_rating = driver.find_element(By.XPATH, xpath_rating).text
rating
= pd.NA
metascore
= "TV Series"
type_movie
elif "TV Movie" in items_txt:
= f'//*[@id="__next"]/main/div[2]/div[3]/section/section/div/section/section/div[2]/div/section/div[2]/div[2]/ul/li[{item}]/div/div/div/div[1]/div[2]/div[2]/span[2]'
xpath_runtime = driver.find_element(By.XPATH, xpath_runtime).text
runtime
= f'//*[@id="__next"]/main/div[2]/div[3]/section/section/div/section/section/div[2]/div/section/div[2]/div[2]/ul/li[{item}]/div/div/div/div[1]/div[2]/div[2]/span[3]'
xpath_rating = driver.find_element(By.XPATH, xpath_rating).text
rating
= pd.NA
metascore
= "TV Movie"
type_movie
else:
try:
= f'//*[@id="__next"]/main/div[2]/div[3]/section/section/div/section/section/div[2]/div/section/div[2]/div[2]/ul/li[{item}]/div/div/div/div[1]/div[2]/div[2]/span[2]'
xpath_runtime = driver.find_element(By.XPATH, xpath_runtime).text
runtime except:
= pd.NA
runtime
try:
= f'//*[@id="__next"]/main/div[2]/div[3]/section/section/div/section/section/div[2]/div/section/div[2]/div[2]/ul/li[{item}]/div/div/div/div[1]/div[2]/div[2]/span[3]'
xpath_rating = driver.find_element(By.XPATH, xpath_rating).text
rating except:
= pd.NA
rating
try:
= f'//*[@id="__next"]/main/div[2]/div[3]/section/section/div/section/section/div[2]/div/section/div[2]/div[2]/ul/li[{item}]/div/div/div/div[1]/div[2]/div[2]/span[4]/span[1]'
xpath_metascore = driver.find_element(By.XPATH, xpath_metascore).text
metascore except:
= pd.NA
metascore
= "Movie"
type_movie
try:
= f'//*[@id="__next"]/main/div[2]/div[3]/section/section/div/section/section/div[2]/div/section/div[2]/div[2]/ul/li[{item}]/div/div/div/div[1]/div[2]/span/div/span/span[1]'
xpath_imdb_score = driver.find_element(By.XPATH, xpath_imdb_score).text
imdb_score except:
= pd.NA
imdb_score
try:
= f'//*[@id="__next"]/main/div[2]/div[3]/section/section/div/section/section/div[2]/div/section/div[2]/div[2]/ul/li[{item}]/div/div/div/div[1]/div[2]/span/div/span/span[2]'
xpath_vote = driver.find_element(By.XPATH, xpath_vote).text
vote except:
= pd.NA
vote
= plots[item-1].text
plot
= [ranking_title, year, runtime, rating, metascore, type_movie, imdb_score, vote, plot]
lst = pd.DataFrame([lst])
obs = pd.concat([df, obs], ignore_index=True)
df
= ['ranking_title', 'year', 'runtime', 'rating', 'metascore', 'type_movie', 'imdb_score', 'vote', 'plot']
df.columns
# %%
# =============================================================================
# Cleaning
# =============================================================================
'ranking', 'title']] = df['ranking_title'].str.split('.', n=1, expand=True)
df[[= df.drop(columns=['ranking_title'])
df 'ranking'] = df['ranking'].astype('int')
df[
'vote'] = df['vote'].str.replace(' (', '')
df['vote'] = df['vote'].str.replace(')', '')
df[
'runtime_h', 'runtime_m']] = df['runtime'].str.split(' ', n=1, expand=True)
df[['runtime_h'] = df['runtime_h'].str.replace('h', '')
df['runtime_m'] = df['runtime_m'].str.replace('m', '')
df[
'runtime_h'] = pd.to_numeric(df['runtime_h'], errors='coerce')
df['runtime_m'] = pd.to_numeric(df['runtime_m'], errors='coerce')
df[
'runtime'] = df['runtime_h'] * 60 + df['runtime_m']
df[= df.drop(columns=['runtime_h', 'runtime_m'])
df
'metascore'] = pd.to_numeric(df['metascore'], errors='coerce') # errors='coerce' to avoid an error
df['imdb_score'] = pd.to_numeric(df['imdb_score'], errors='coerce')
df[
'is_k'] = df['vote'].str.endswith('K')
df['is_m'] = df['vote'].str.endswith('M')
df['vote'] = df['vote'].str.replace('K', '').str.replace('M', '')
df['vote'] = pd.to_numeric(df['vote'], errors='coerce')
df['vote'] = np.where(df['is_k'] == True, df['vote'] * 1000, df['vote'])
df['vote'] = np.where(df['is_m'] == True, df['vote'] * 1000000, df['vote'])
df[= df.drop(columns=['is_k', 'is_m'])
df
'year_start', 'year_end']] = df['year'].str.split('–', n=1, expand=True)
df[['year_start'] = df['year_start'].astype('int')
df['year_end'] = np.where(df['year_end'] == '', 'present', df['year_end'])
df[
# df.columns
= df[['ranking', 'title', 'year', 'year_start', 'year_end', 'runtime', 'rating', 'metascore', 'type_movie', 'imdb_score', 'vote', 'plot']]
df
# Export df as CSV
'data/imdb_family_2025_0415.csv', index = False, encoding = 'utf-8-sig') df.to_csv(
Part 2. Pandas Basic
Below is spotify
DataFrame
that reads the file spotify_all.csv
containing data of Spotify users’ playlist information (Source: Spotify Million Playlist Dataset Challenge).
= pd.read_csv('https://bcdanl.github.io/data/spotify_all.csv') spotify
Variable Description
pid
: playlist ID; unique ID for playlistplaylist_name
: a name of playlistpos
: a position of the track within a playlist (starting from 0)artist_name
: name of the track’s primary artisttrack_name
: name of the trackduration_ms
: duration of the track in millisecondsalbum_name
: name of the track’s album
Definition of a Song
In Part 2, a song is defined as a combination of a
artist_name
value and atrack_name
value.E.g., the following provides the 12 distinct songs with the same
track_name
—I Love You:
Question 2
- Write a Python code to identify the top five songs with the highest frequency of appearances in the
spotify
DataFrame.
Click to Check the Answer!
= (
songs_top_5 'artist_name', 'track_name']]
spotify[[
.value_counts()
.reset_index()5, 'count', keep="all")
.nlargest( )
Question 3
- Write a Python code to create a DataFrame that contains information about how often each song occurs within the
spotify
DataFrame.- In this new DataFrame, each observation should represent a distinct song.
- Then, write a Python code to identify top 5% songs based on their frequency of appearances.
Click to Check the Answer!
= (
songs_freq 'artist_name', 'track_name']]
spotify[[
.value_counts()
.reset_index()
)
= songs_freq['count'] >= songs_freq['count'].quantile(.95)
cond = songs_freq[ cond ] songs_top_5_pct
Question 4
- Write a Python code to list all artists who have more than 50 unique songs in the
sportify
DataFrame.
Click to Check the Answer!
= (
artists_50_songs 'artist_name', 'track_name']]
spotify[[
.value_counts()'artist_name']
.reset_index()[
.value_counts()
.reset_index()'count > 50')
.query( )
Question 5
- Write a Python code to create a DataFrame that identifies all the playlists featuring the song “One Dance” by Drake.
Click to Check the Answer!
= (
playlist_onedance
spotify'artist_name == "Drake" & track_name == "One Dance"')
.query(
)
= (
playlist_songs_w_onedance 'pid'].isin(playlist_onedance['pid']) ]
spotify[ spotify[ )
Question 6
- Write a Python code to identify the longest and shortest duration songs (based on
duration_ms
) for each unique artist.
Click to Check the Answer!
# Shortest song per artist
= (
shortest_songs
spotify"artist_name", "duration_ms"])
.sort_values([=["artist_name"], keep="first")
.drop_duplicates(subset'artist_name', 'track_name', 'duration_ms']]
[[
)
# Longest song per artist
= (
longest_songs
spotify"artist_name", "duration_ms"])
.sort_values([=["artist_name"], keep="last")
.drop_duplicates(subset'artist_name', 'track_name', 'duration_ms']]
[[ )
Question 7
Write a Python code to find the same song(s) appearing more than once in the same playlist.
- Using
drop_duplicates()
with index-based filtering.
Click to Check the Answer!
= spotify.drop_duplicates(subset=["pid", "artist_name", "track_name"], keep=False)
non_dup = ~spotify.index.isin( non_dup.index )
dup_T = spotify[ dup_T ] duplicates
- Using
duplicated()
with standard filtering.- The
subset
andkeep
parameters work similarly to those indrop_duplicates()
- The
Click to Check the Answer!
= spotify.duplicated(subset=["pid", "artist_name", "track_name"], keep = False)
dup_T = (
duplicates
spotify[ dup_T ] )
Question 8
- Write a Python code to filter all songs that appear in more than 100 different playlists.
Click to Check the Answer!
= (
songs_100_plist
spotify=["pid", "artist_name", "track_name"], keep=False)[['artist_name', 'track_name']]
.drop_duplicates(subset
.value_counts()
.reset_index()'count > 100')
.query( )
Part 3. Jupyter Notebook Blogging
= pd.read_csv("https://bcdanl.github.io/data/holiday_movies.csv") holiday_movies
The DataFrame
holiday_movies
comes from the IMDb.The following is the DataFrame,
holiday_movies
.
Variable description
tconst
: alphanumeric unique identifier of the titletitle_type
: the type/format of the title- (movie, video, or tvMovie)
primary_title
: the more popular title / the title used by the filmmakers on promotional materials at the point of releasesimple_title
: the title in lowercase, with punctuation removed, for easier filtering and groupingyear
: the release year of a titleruntime_minutes
: primary runtime of the title, in minutesaverage_rating
: weighted average of all the individual user ratings on IMDbnum_votes
: number of votes the title has received on IMDb (titles with fewer than 10 votes were not included in this dataset)
- The following is another DataFrame
holiday_movie_genres
that is related with the DataFrameholiday_movies
:
= pd.read_csv("https://bcdanl.github.io/data/holiday_movie_genres.csv") holiday_movie_genres
- The DataFrame
holiday_movie_genres
include up to three genres associated with the titles that appear in the DataFrame.
Variable description
tconst
: alphanumeric unique identifier of the titlegenres
: genres associated with the title, one row per genre
- Using the provided DataFrames, write a blog post about Christmas movies in a Jupyter Notebook and publish it on your online blog.
- In your analysis, make use of counting, sorting, indexing, filtering, and joining techniques.