Homework 3

Python selenium & Pandas Basics

Author

Byeong-Hak Choe

Published

March 30, 2025

Modified

April 25, 2025

import pandas as pd

Direction

  • Please submit your Python Script for Part 1 and Part 2 in Homework 3 to Brightspace with the name below:

    • danl_210_hw3_LASTNAME_FIRSTNAME.py
      ( e.g., danl_210_hw3_choe_byeonghak.py )
  • The due is April 9, 2025, 11:59 P.M.

  • Please send Prof. Choe an email (bchoe@geneseo.edu) if you have any questions.




Understanding duplicated() and drop_duplicates() in Pandas

duplicated()

The duplicated() function returns a boolean Series indicating whether each row is a duplicate based on specified columns.

Parameters:

  • subset: Columns to consider when identifying duplicates
  • keep:
    • 'first' (default): Marks duplicates after the first as True
    • 'last': Marks duplicates before the last as True
    • False: Marks all duplicates as True

🔢 Example DataFrame

df = pd.DataFrame({
    "pid": [1, 1, 2, 2, 2, 3],
    "artist_name": ["A", "A", "B", "B", "B", "C"],
    "track_name": ["x", "x", "y", "y", "y", "z"]
})
df.duplicated(subset=["pid", "artist_name", "track_name"], keep='first')
0    False
1     True
2    False
3     True
4     True
5    False
dtype: bool
df.duplicated(subset=["pid", "artist_name", "track_name"], keep='last')
0     True
1    False
2     True
3     True
4    False
5    False
dtype: bool
df.duplicated(subset=["pid", "artist_name", "track_name"], keep=False)
0     True
1     True
2     True
3     True
4     True
5    False
dtype: bool


drop_duplicates()

The drop_duplicates() function removes duplicate rows based on specified columns. The behavior is controlled by the keep parameter.

Parameters:

  • subset: Columns to consider when identifying duplicates
  • keep:
    • 'first' (default): Keeps the first occurrence, removes the rest
    • 'last': Keeps the last occurrence, removes the rest
    • False: Removes all duplicates, keeping only unique entries

🔢 Example DataFrame

df = pd.DataFrame({
    "pid": [1, 1, 2, 2, 2, 3],
    "artist_name": ["A", "A", "B", "B", "B", "C"],
    "track_name": ["x", "x", "y", "y", "y", "z"]
})
df.drop_duplicates(subset=["pid", "artist_name", "track_name"], keep='first')
   pid artist_name track_name
0    1           A          x
2    2           B          y
5    3           C          z
df.drop_duplicates(subset=["pid", "artist_name", "track_name"], keep='last')
   pid artist_name track_name
1    1           A          x
4    2           B          y
5    3           C          z
df.drop_duplicates(subset=["pid", "artist_name", "track_name"], keep=False)
   pid artist_name track_name
5    3           C          z


Part 1. Collecting the IMDb Data using Python Selenium



Question 1

  • Go to the following Internet Movie Database (IMDb) webpage for
  • Provide your Python Selenium code to scrape the following information about Top 200 Family Movies and TV Shows by Popularity.
    • You should create the following variables in the DataFrame:
      • ranking (e.g., 1, 2, 3)
      • title (e.g., Snow White, Freakier Friday)
      • year (e.g., 2025, 2025-)
      • runtime (e.g., 1h 56m, 1h 40m)
      • rating (e.g., PG, TV-G, G)
      • imdb_score (e.g., 7.6, 6.5)
      • votes (e.g., 1.4k, 70k)
      • metascore (e.g., 56, 74)
      • plot (e.g., A princess joins forces with seven dwarfs to liberate her kingdom from her cruel stepmother the Evil Queen.)
  • Suppose ranking and title are initially combined in a single column called ranking_title, with values like:
    • “1. Snow White”
    • “2. Freakier Friday”
  • You can split this column into ranking and title using the following code:
# Split the 'ranking_title' column into two new variables, 'ranking' and 'title'
# by the first occurrence of a period (dot). 

# n=1 ensures the split to be limited to the first dot.
# expand=True tells pandas to return the results in a DataFrame
  # If expand were False, the result would be a Series of lists.

df[['ranking', 'title']] = df['ranking_title'].str.split('.', n=1, expand=True)
df = df.drop(columns=['ranking_title'])
  • Finally, export the resulting DataFrame to the CSV file.
Click to Check the Answer!
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Tue Apr 15 14:05:39 2025

@author: bchoe
"""

# %%
# =============================================================================
# Libraries
# =============================================================================

import pandas as pd
import numpy as np
import os
import time
import random
# from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium import webdriver
# from selenium.webdriver.chrome.service import Service

# %%
# =============================================================================
# working directory
# =============================================================================

wd_path = 'PATHNAME_OF_YOUR_WORKING_DIRECTORY'
os.chdir(wd_path)
# os.getcwd()

# %%
# =============================================================================
# Selenium driver
# =============================================================================

driver = webdriver.Chrome()
url = 'https://www.imdb.com/search/title/?genres=family&sort=popularity,desc&count=200'
driver.get(url)
time.sleep(random.uniform(5,8))

# %%
# =============================================================================
# Patterns
# =============================================================================

# <h3 class="ipc-title__text">1. A Minecraft Movie</h3>
# ranking_titles[0].text
# ranking_titles[1].text
# ranking_titles[200].text

# //*[@id="__next"]/main/div[2]/div[3]/section/section/div/section/section/div[2]/div/section/div[2]/div[2]/ul/li[1]/div/div/div/div[1]/div[2]/div[2]/span[1]
# //*[@id="__next"]/main/div[2]/div[3]/section/section/div/section/section/div[2]/div/section/div[2]/div[2]/ul/li[2]/div/div/div/div[1]/div[2]/div[2]/span[1]

# <span class="ipc-rating-star--rating">7.2</span>
# <span class="ipc-rating-star--voteCount">&nbsp;(<!-- -->149K<!-- -->)</span>

# <div class="ipc-html-content-inner-div" role="presentation">A princess joins forces with seven dwarfs and a group of rebels to liberate her kingdom from her cruel stepmother the Evil Queen.</div>
# plots[0].text
# plots[199].text

# class_imdb_scores = "ipc-rating-star--rating"
# imdb_scores = driver.find_elements(By.CLASS_NAME, class_imdb_scores)
# <span aria-label="IMDb rating: 6.6" class="ipc-rating-star ipc-rating-star--base ipc-rating-star--imdb ratingGroup--imdb-rating" data-testid="ratingGroup--imdb-rating"><svg width="24" height="24" xmlns="http://www.w3.org/2000/svg" class="ipc-icon ipc-icon--star-inline" viewBox="0 0 24 24" fill="currentColor" role="presentation"><path d="M12 20.1l5.82 3.682c1.066.675 2.37-.322 2.09-1.584l-1.543-6.926 5.146-4.667c.94-.85.435-2.465-.799-2.567l-6.773-.602L13.29.89a1.38 1.38 0 0 0-2.581 0l-2.65 6.53-6.774.602C.052 8.126-.453 9.74.486 10.59l5.147 4.666-1.542 6.926c-.28 1.262 1.023 2.26 2.09 1.585L12 20.099z"></path></svg><span class="ipc-rating-star--rating">6.6</span><span class="ipc-rating-star--voteCount">&nbsp;(<!-- -->100K<!-- -->)</span></span>
# <span                               class="ipc-rating-star ipc-rating-star--base ipc-rating-star--placeholder ratingGroup--placeholder standalone-star" data-testid="ratingGroup--placeholder" aria-hidden="true"><svg width="24" height="24" xmlns="http://www.w3.org/2000/svg" class="ipc-icon ipc-icon--star-inline" viewBox="0 0 24 24" fill="currentColor" role="presentation"><path d="M12 20.1l5.82 3.682c1.066.675 2.37-.322 2.09-1.584l-1.543-6.926 5.146-4.667c.94-.85.435-2.465-.799-2.567l-6.773-.602L13.29.89a1.38 1.38 0 0 0-2.581 0l-2.65 6.53-6.774.602C.052 8.126-.453 9.74.486 10.59l5.147 4.666-1.542 6.926c-.28 1.262 1.023 2.26 2.09 1.585L12 20.099z"></path></svg></span>
# class_imdb_scores = "ipc-rating-star--base"
# imdb_scores = driver.find_elements(By.CLASS_NAME, class_imdb_scores)
# imdb_scores[0].text

# class_votes = "ipc-rating-star--voteCount"
# votes = driver.find_elements(By.CLASS_NAME, class_votes)




# %%
# =============================================================================
# Loop
# =============================================================================

class_ranking_titles = "ipc-title__text"
ranking_titles = driver.find_elements(By.CLASS_NAME, class_ranking_titles)

class_plots = "ipc-html-content-inner-div"
plots = driver.find_elements(By.CLASS_NAME, class_plots)

df = pd.DataFrame()
for item in range(1, len(ranking_titles)-1):
    
    ranking_title = ranking_titles[item].text
    
    xpath_year = f'//*[@id="__next"]/main/div[2]/div[3]/section/section/div/section/section/div[2]/div/section/div[2]/div[2]/ul/li[{item}]/div/div/div/div[1]/div[2]/div[2]/span[1]'
    year = driver.find_element(By.XPATH, xpath_year).text
    
    xpath_items = f'//*[@id="__next"]/main/div[2]/div[3]/section/section/div/section/section/div[2]/div/section/div[2]/div[2]/ul/li[{item}]/div/div/div/div[1]/div[2]/div[2]'
    items_txt = driver.find_element(By.XPATH, xpath_items).text
    
    # `"TV Series" in items_txt` returns True if "TV Series" is a substring of items_txt; False otherwise.
    
    if "TV Series" in items_txt:
        runtime = pd.NA # This is a missing value; Empty string ('') can also be used instead.
        
        xpath_rating = f'//*[@id="__next"]/main/div[2]/div[3]/section/section/div/section/section/div[2]/div/section/div[2]/div[2]/ul/li[{item}]/div/div/div/div[1]/div[2]/div[2]/span[2]'
        rating = driver.find_element(By.XPATH, xpath_rating).text
        
        metascore = pd.NA
        
        type_movie = "TV Series"
        
    elif "TV Movie" in items_txt:
        xpath_runtime = f'//*[@id="__next"]/main/div[2]/div[3]/section/section/div/section/section/div[2]/div/section/div[2]/div[2]/ul/li[{item}]/div/div/div/div[1]/div[2]/div[2]/span[2]'
        runtime = driver.find_element(By.XPATH, xpath_runtime).text
        
        xpath_rating = f'//*[@id="__next"]/main/div[2]/div[3]/section/section/div/section/section/div[2]/div/section/div[2]/div[2]/ul/li[{item}]/div/div/div/div[1]/div[2]/div[2]/span[3]'
        rating = driver.find_element(By.XPATH, xpath_rating).text 
        
        metascore = pd.NA
        
        type_movie = "TV Movie"
        
    else:
        
        try:
            xpath_runtime = f'//*[@id="__next"]/main/div[2]/div[3]/section/section/div/section/section/div[2]/div/section/div[2]/div[2]/ul/li[{item}]/div/div/div/div[1]/div[2]/div[2]/span[2]'
            runtime = driver.find_element(By.XPATH, xpath_runtime).text
        except:
            runtime = pd.NA
            
        try:
            xpath_rating = f'//*[@id="__next"]/main/div[2]/div[3]/section/section/div/section/section/div[2]/div/section/div[2]/div[2]/ul/li[{item}]/div/div/div/div[1]/div[2]/div[2]/span[3]'
            rating = driver.find_element(By.XPATH, xpath_rating).text
        except:
            rating = pd.NA
        
        try:
            xpath_metascore = f'//*[@id="__next"]/main/div[2]/div[3]/section/section/div/section/section/div[2]/div/section/div[2]/div[2]/ul/li[{item}]/div/div/div/div[1]/div[2]/div[2]/span[4]/span[1]'
            metascore = driver.find_element(By.XPATH, xpath_metascore).text
        except:
            metascore = pd.NA
        
        type_movie = "Movie"
        
    
    try:
        xpath_imdb_score = f'//*[@id="__next"]/main/div[2]/div[3]/section/section/div/section/section/div[2]/div/section/div[2]/div[2]/ul/li[{item}]/div/div/div/div[1]/div[2]/span/div/span/span[1]'
        imdb_score = driver.find_element(By.XPATH, xpath_imdb_score).text
    except:
        imdb_score = pd.NA
    
    try:
        xpath_vote = f'//*[@id="__next"]/main/div[2]/div[3]/section/section/div/section/section/div[2]/div/section/div[2]/div[2]/ul/li[{item}]/div/div/div/div[1]/div[2]/span/div/span/span[2]'
        vote = driver.find_element(By.XPATH, xpath_vote).text
    except:
        vote = pd.NA

    plot = plots[item-1].text
    
    lst = [ranking_title, year, runtime, rating, metascore, type_movie, imdb_score, vote, plot]
    obs = pd.DataFrame([lst])
    df = pd.concat([df, obs], ignore_index=True)
    
    
df.columns = ['ranking_title', 'year', 'runtime', 'rating', 'metascore', 'type_movie', 'imdb_score', 'vote', 'plot']


# %%
# =============================================================================
# Cleaning
# =============================================================================

df[['ranking', 'title']] = df['ranking_title'].str.split('.', n=1, expand=True)
df = df.drop(columns=['ranking_title'])
df['ranking'] = df['ranking'].astype('int')

df['vote'] = df['vote'].str.replace(' (', '')
df['vote'] = df['vote'].str.replace(')', '')

df[['runtime_h', 'runtime_m']] = df['runtime'].str.split(' ', n=1, expand=True)
df['runtime_h'] = df['runtime_h'].str.replace('h', '')
df['runtime_m'] = df['runtime_m'].str.replace('m', '')

df['runtime_h'] = pd.to_numeric(df['runtime_h'], errors='coerce')
df['runtime_m'] = pd.to_numeric(df['runtime_m'], errors='coerce')

df['runtime'] = df['runtime_h'] * 60 + df['runtime_m']
df = df.drop(columns=['runtime_h', 'runtime_m'])

df['metascore'] = pd.to_numeric(df['metascore'], errors='coerce') # errors='coerce' to avoid an error
df['imdb_score'] = pd.to_numeric(df['imdb_score'], errors='coerce')

df['is_k'] = df['vote'].str.endswith('K')
df['is_m'] = df['vote'].str.endswith('M')
df['vote'] = df['vote'].str.replace('K', '').str.replace('M', '')
df['vote'] = pd.to_numeric(df['vote'], errors='coerce')
df['vote'] = np.where(df['is_k'] == True, df['vote'] * 1000, df['vote'])
df['vote'] = np.where(df['is_m'] == True, df['vote'] * 1000000, df['vote'])
df = df.drop(columns=['is_k', 'is_m'])


df[['year_start', 'year_end']] = df['year'].str.split('–', n=1, expand=True)
df['year_start'] = df['year_start'].astype('int')
df['year_end'] = np.where(df['year_end'] == '', 'present', df['year_end'])

# df.columns
df = df[['ranking', 'title', 'year', 'year_start', 'year_end', 'runtime', 'rating', 'metascore', 'type_movie', 'imdb_score', 'vote', 'plot']]

# Export df as CSV
df.to_csv('data/imdb_family_2025_0415.csv', index = False, encoding = 'utf-8-sig')



Part 2. Pandas Basic



Below is spotify DataFrame that reads the file spotify_all.csv containing data of Spotify users’ playlist information (Source: Spotify Million Playlist Dataset Challenge).

spotify = pd.read_csv('https://bcdanl.github.io/data/spotify_all.csv')


Variable Description

  • pid: playlist ID; unique ID for playlist
  • playlist_name: a name of playlist
  • pos: a position of the track within a playlist (starting from 0)
  • artist_name: name of the track’s primary artist
  • track_name: name of the track
  • duration_ms: duration of the track in milliseconds
  • album_name: name of the track’s album

Definition of a Song

  • In Part 2, a song is defined as a combination of a artist_name value and a track_name value.

  • E.g., the following provides the 12 distinct songs with the same track_nameI Love You:


Question 2

  • Write a Python code to identify the top five songs with the highest frequency of appearances in the spotify DataFrame.
Click to Check the Answer!
songs_top_5 = (
    spotify[['artist_name', 'track_name']]
    .value_counts()
    .reset_index()
    .nlargest(5, 'count', keep="all")
    )


Question 3

  • Write a Python code to create a DataFrame that contains information about how often each song occurs within the spotify DataFrame.
    • In this new DataFrame, each observation should represent a distinct song.
  • Then, write a Python code to identify top 5% songs based on their frequency of appearances.
Click to Check the Answer!
songs_freq = (
    spotify[['artist_name', 'track_name']]
    .value_counts()
    .reset_index()
    )

cond = songs_freq['count'] >= songs_freq['count'].quantile(.95)
songs_top_5_pct = songs_freq[ cond ]


Question 4

  • Write a Python code to list all artists who have more than 50 unique songs in the sportify DataFrame.
Click to Check the Answer!
artists_50_songs = (
    spotify[['artist_name', 'track_name']]
    .value_counts()
    .reset_index()['artist_name']
    .value_counts()
    .reset_index()
    .query('count > 50')
    )


Question 5

  • Write a Python code to create a DataFrame that identifies all the playlists featuring the song “One Dance” by Drake.
Click to Check the Answer!
playlist_onedance = (
    spotify
    .query('artist_name == "Drake" & track_name == "One Dance"')
    )

playlist_songs_w_onedance = (
    spotify[ spotify['pid'].isin(playlist_onedance['pid']) ]
    )


Question 6

  • Write a Python code to identify the longest and shortest duration songs (based on duration_ms) for each unique artist.
Click to Check the Answer!
# Shortest song per artist
shortest_songs = (
    spotify
    .sort_values(["artist_name", "duration_ms"])
    .drop_duplicates(subset=["artist_name"], keep="first")
    [['artist_name', 'track_name', 'duration_ms']]
    )

# Longest song per artist
longest_songs = (
    spotify
    .sort_values(["artist_name", "duration_ms"])
    .drop_duplicates(subset=["artist_name"], keep="last")
    [['artist_name', 'track_name', 'duration_ms']]
)


Question 7

Write a Python code to find the same song(s) appearing more than once in the same playlist.

  • Using drop_duplicates() with index-based filtering.
Click to Check the Answer!
non_dup = spotify.drop_duplicates(subset=["pid", "artist_name", "track_name"], keep=False)
dup_T = ~spotify.index.isin( non_dup.index )
duplicates = spotify[ dup_T ]


  • Using duplicated() with standard filtering.
    • The subset and keep parameters work similarly to those in drop_duplicates()
Click to Check the Answer!
dup_T = spotify.duplicated(subset=["pid", "artist_name", "track_name"], keep = False)
duplicates = (
    spotify[ dup_T ]
    )



Question 8

  • Write a Python code to filter all songs that appear in more than 100 different playlists.
Click to Check the Answer!
songs_100_plist = (
    spotify
    .drop_duplicates(subset=["pid", "artist_name", "track_name"], keep=False)[['artist_name', 'track_name']]
    .value_counts()
    .reset_index()
    .query('count > 100')
    )



Part 3. Jupyter Notebook Blogging



holiday_movies = pd.read_csv("https://bcdanl.github.io/data/holiday_movies.csv")
  • The DataFrame holiday_movies comes from the IMDb.

  • The following is the DataFrame, holiday_movies.

Variable description

  • tconst: alphanumeric unique identifier of the title

  • title_type: the type/format of the title

    • (movie, video, or tvMovie)
  • primary_title: the more popular title / the title used by the filmmakers on promotional materials at the point of release

  • simple_title: the title in lowercase, with punctuation removed, for easier filtering and grouping

  • year: the release year of a title

  • runtime_minutes: primary runtime of the title, in minutes

  • average_rating: weighted average of all the individual user ratings on IMDb

  • num_votes: number of votes the title has received on IMDb (titles with fewer than 10 votes were not included in this dataset)



  • The following is another DataFrame holiday_movie_genres that is related with the DataFrame holiday_movies:
holiday_movie_genres = pd.read_csv("https://bcdanl.github.io/data/holiday_movie_genres.csv")
  • The DataFrame holiday_movie_genres include up to three genres associated with the titles that appear in the DataFrame.

Variable description

  • tconst: alphanumeric unique identifier of the title
  • genres: genres associated with the title, one row per genre


  • Using the provided DataFrames, write a blog post about Christmas movies in a Jupyter Notebook and publish it on your online blog.
    • In your analysis, make use of counting, sorting, indexing, filtering, and joining techniques.


Back to top