Homework 3

Python selenium & Pandas Basics

Author

Byeong-Hak Choe

Published

March 30, 2025

Modified

March 30, 2025

import pandas as pd
from itables import init_notebook_mode, show # for displaying an interactive DataFrame

Direction

Please submit your Python Script for Part 1 and Part 2 in Homework 3 to Brightspace with the name below:
- danl_210_hw3_LASTNAME_FIRSTNAME.py
  ( e.g., danl_210_hw3_choe_byeonghak.py )
The due is April 9, 2025, 10:30 A.M.
Please send Prof. Choe an email (bchoe@geneseo.edu) if you have any questions.

Part 1. Collecting the IMDb Data using Python `Selenium`

Question 1

Go to the following Internet Movie Database (IMDb) webpage for
- Top 200 Family Movies and TV Shows (sorted by Popularity).
Provide your Python Selenium code to scrape the following information about Top 200 Family Movies and TV Shows by Popularity.
- You should create the following variables in the DataFrame:
  - ranking (e.g., 1, 2, 3)
  - title (e.g., Snow White, Freakier Friday)
  - year (e.g., 2025, 2025-)
  - runtime (e.g., 1h 56m, 1h 40m)
  - rating (e.g., PG, TV-G, G)
  - imdb_score (e.g., 7.6, 6.5)
  - votes (e.g., 1.4k, 70k)
  - metascore (e.g., 56, 74)
  - plot (e.g., A princess joins forces with seven dwarfs to liberate her kingdom from her cruel stepmother the Evil Queen.)
Suppose ranking and title are initially combined in a single column called ranking_title, with values like:
- “1. Snow White”
- “2. Freakier Friday”
You can split this column into ranking and title using the following code:

# Split the 'ranking_title' column into two new variables, 'ranking' and 'title'
# by the first occurrence of a period (dot). 

# n=1 ensures the split to be limited to the first dot.
# expand=True tells pandas to return the results in a DataFrame
  # If expand were False, the result would be a Series of lists.

df[['ranking', 'title']] = df['ranking_title'].str.split('.', n=1, expand=True)
df = df.drop(columns=['ranking_title'])

Finally, export the resulting DataFrame to the CSV file.

Answer:

Part 2. Pandas Basic

Below is spotify DataFrame that reads the file spotify_all.csv containing data of Spotify users’ playlist information (Source: Spotify Million Playlist Dataset Challenge).

spotify = pd.read_csv('https://bcdanl.github.io/data/spotify_all.csv')

Variable Description

pid: playlist ID; unique ID for playlist
playlist_name: a name of playlist
pos: a position of the track within a playlist (starting from 0)
artist_name: name of the track’s primary artist
track_name: name of the track
duration_ms: duration of the track in milliseconds
album_name: name of the track’s album

Definition of a Song

In Part 2, a song is defined as a combination of a artist_name value and a track_name value.
E.g., the following provides the 12 distinct songs with the same track_name—I Love You:

Question 2

Write a Python code to identify the top five songs with the highest frequency of appearances in the spotify DataFrame.

Answer:

Question 3

Write a Python code to create a DataFrame that contains information about how often each song occurs within the spotify DataFrame.
- In this new DataFrame, each observation should represent a distinct song.
Then, write a Python code to identify top 5% songs based on their frequency of appearances.

Answer:

Question 4

Write a Python code to list all artists who have more than 50 unique songs in the sportify DataFrame.

Answer:

Question 5

Write a Python code to create a DataFrame that identifies all the playlists featuring the song “One Dance” by Drake.

Answer:

Question 6

Write a Python code to identify the longest and shortest duration songs (based on duration_ms) for each unique artist.

Answer:

Question 7

Write a Python code to find the same song(s) appearing more than once in the same playlist.

Answer:

Question 8

Write a Python code to filter all songs that appear in more than 100 different playlists.

Answer:

Part 3. Jupyter Notebook Blogging

holiday_movies = pd.read_csv("https://bcdanl.github.io/data/holiday_movies.csv")

The DataFrame holiday_movies comes from the IMDb.
The following is the DataFrame, holiday_movies.

Variable description

tconst: alphanumeric unique identifier of the title
title_type: the type/format of the title
- (movie, video, or tvMovie)
primary_title: the more popular title / the title used by the filmmakers on promotional materials at the point of release
simple_title: the title in lowercase, with punctuation removed, for easier filtering and grouping
year: the release year of a title
runtime_minutes: primary runtime of the title, in minutes
average_rating: weighted average of all the individual user ratings on IMDb
num_votes: number of votes the title has received on IMDb (titles with fewer than 10 votes were not included in this dataset)

The following is another DataFrame holiday_movie_genres that is related with the DataFrame holiday_movies:

holiday_movie_genres = pd.read_csv("https://bcdanl.github.io/data/holiday_movies.csv")

The DataFrame holiday_movie_genres include up to three genres associated with the titles that appear in the DataFrame.

Variable description

tconst: alphanumeric unique identifier of the title
genres: genres associated with the title, one row per genre

Using the provided DataFrames, write a blog post about Christmas movies in a Jupyter Notebook and publish it on your online blog.
- In your analysis, make use of counting, sorting, indexing, filtering, and joining techniques.

Direction

Part 1. Collecting the IMDb Data using Python Selenium

Question 1

Part 2. Pandas Basic

Variable Description

Definition of a Song

Question 2

Question 3

Question 4

Question 5

Question 6

Question 7

Question 8

Part 3. Jupyter Notebook Blogging

Variable description

Variable description

Part 1. Collecting the IMDb Data using Python `Selenium`