Homework 3
Python Selenium & Pandas Basics
import pandas as pd
from itables import init_notebook_mode, show # for displaying an interactive DataFrame
Direction
Please submit your Python script for Part 1 and Part 2 of Homework 3 to Brightspace with the name below:
danl_210_hw3_LASTNAME_FIRSTNAME.py
(e.g., danl_210_hw3_choe_byeonghak.py)
The due date is April 9, 2025, at 10:30 A.M.
Please send Prof. Choe an email (bchoe@geneseo.edu) if you have any questions.
Part 1. Collecting the IMDb Data using Python Selenium
Question 1
- Go to the following Internet Movie Database (IMDb) webpage for
- Provide your Python Selenium code to scrape the following information about Top 200 Family Movies and TV Shows by Popularity.
- You should create the following variables in the DataFrame:
  - ranking (e.g., 1, 2, 3)
  - title (e.g., Snow White, Freakier Friday)
  - year (e.g., 2025, 2025-)
  - runtime (e.g., 1h 56m, 1h 40m)
  - rating (e.g., PG, TV-G, G)
  - imdb_score (e.g., 7.6, 6.5)
  - votes (e.g., 1.4k, 70k)
  - metascore (e.g., 56, 74)
  - plot (e.g., A princess joins forces with seven dwarfs to liberate her kingdom from her cruel stepmother the Evil Queen.)
- Suppose ranking and title are initially combined in a single column called ranking_title, with values like:
  - “1. Snow White”
  - “2. Freakier Friday”
- You can split this column into ranking and title using the following code:
# Split the 'ranking_title' column into two new variables, 'ranking' and 'title'
# by the first occurrence of a period (dot).
# n=1 ensures the split to be limited to the first dot.
# expand=True tells pandas to return the results in a DataFrame
# If expand were False, the result would be a Series of lists.
df[['ranking', 'title']] = df['ranking_title'].str.split('.', n=1, expand=True)
df = df.drop(columns=['ranking_title'])
- Finally, export the resulting DataFrame to a CSV file.
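As a minimal sketch of the split and export steps on toy data (the third row, the output file name, and the two cleanup lines are assumptions added for illustration, not requirements of the assignment), the code might look like this:

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the scraped data; the third title is made up to show
# that a dot inside a title is not split.
df = pd.DataFrame({
    'ranking_title': ['1. Snow White', '2. Freakier Friday', "3. Mr. Example's Holiday"]
})

# Split on the first dot only (n=1) and spread the pieces into two columns.
df[['ranking', 'title']] = df['ranking_title'].str.split('.', n=1, expand=True)
df = df.drop(columns=['ranking_title'])

# Optional cleanup (an assumption): numeric rankings, no leading space in titles.
df['ranking'] = df['ranking'].astype(int)
df['title'] = df['title'].str.strip()

# Hypothetical file name; a temporary folder keeps the sketch self-contained.
out_path = os.path.join(tempfile.mkdtemp(), 'imdb_family_top200.csv')

# index=False leaves the row index out of the CSV file.
df.to_csv(out_path, index=False)

# Read the file back to confirm the export.
check = pd.read_csv(out_path)
print(check)
```

Without the str.strip() call, each title would keep the space that follows the dot in ranking_title.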
Answer:
Part 2. Pandas Basics
Below is the spotify DataFrame, which reads the file spotify_all.csv containing Spotify users’ playlist information (Source: Spotify Million Playlist Dataset Challenge).
spotify = pd.read_csv('https://bcdanl.github.io/data/spotify_all.csv')
Variable Description
- pid: playlist ID; unique ID for the playlist
- playlist_name: name of the playlist
- pos: position of the track within a playlist (starting from 0)
- artist_name: name of the track’s primary artist
- track_name: name of the track
- duration_ms: duration of the track in milliseconds
- album_name: name of the track’s album
Definition of a Song
In Part 2, a song is defined as a combination of an artist_name value and a track_name value. For example, the spotify DataFrame contains 12 distinct songs that share the track_name I Love You.
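To illustrate the definition on a toy DataFrame (the rows and artist names below are made up, not taken from spotify_all.csv), tracks that share a track_name but differ in artist_name count as different songs:

```python
import pandas as pd

# Made-up rows: the same track_name recorded by three different artists.
toy = pd.DataFrame({
    'pid': [0, 0, 1, 2],
    'artist_name': ['Artist A', 'Artist B', 'Artist A', 'Artist C'],
    'track_name': ['I Love You', 'I Love You', 'I Love You', 'I Love You'],
})

# Keep the rows with the shared track name.
same_title = toy[toy['track_name'] == 'I Love You']

# A song is an (artist_name, track_name) pair, so drop repeated pairs.
distinct_songs = same_title[['artist_name', 'track_name']].drop_duplicates()

print(distinct_songs)
print(len(distinct_songs), 'distinct songs share this track name')
```

Here 'Artist A' appears in two playlists but still counts as one song, because both rows carry the same artist_name and track_name pair.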
Question 2
- Write Python code to identify the five songs that appear most frequently in the spotify DataFrame.
Answer:
Question 3
- Write Python code to create a DataFrame that contains information about how often each song occurs within the spotify DataFrame.
  - In this new DataFrame, each observation should represent a distinct song.
- Then, write Python code to identify the top 5% of songs based on their frequency of appearances.
Answer:
Question 4
- Write Python code to list all artists who have more than 50 unique songs in the spotify DataFrame.
Answer:
Question 5
- Write Python code to create a DataFrame that identifies all the playlists featuring the song “One Dance” by Drake.
Answer:
Question 6
- Write Python code to identify the longest and shortest songs (based on duration_ms) for each unique artist.
Answer:
Question 7
- Write Python code to find the songs that appear more than once in the same playlist.
Answer:
Question 8
- Write Python code to filter all songs that appear in more than 100 different playlists.
Answer:
Part 3. Jupyter Notebook Blogging
holiday_movies = pd.read_csv("https://bcdanl.github.io/data/holiday_movies.csv")
The holiday_movies DataFrame comes from IMDb.
Variable description
- tconst: alphanumeric unique identifier of the title
- title_type: the type/format of the title
  - (movie, video, or tvMovie)
- primary_title: the more popular title / the title used by the filmmakers on promotional materials at the point of release
- simple_title: the title in lowercase, with punctuation removed, for easier filtering and grouping
- year: the release year of a title
- runtime_minutes: primary runtime of the title, in minutes
- average_rating: weighted average of all the individual user ratings on IMDb
- num_votes: number of votes the title has received on IMDb (titles with fewer than 10 votes were not included in this dataset)
- The following is another DataFrame, holiday_movie_genres, which is related to the holiday_movies DataFrame:
holiday_movie_genres = pd.read_csv("https://bcdanl.github.io/data/holiday_movies.csv")
- The holiday_movie_genres DataFrame includes up to three genres associated with each title that appears in the holiday_movies DataFrame.
Variable description
- tconst: alphanumeric unique identifier of the title
- genres: genres associated with the title, one row per genre
- Using the provided DataFrames, write a blog post about Christmas movies in a Jupyter Notebook and publish it on your online blog.
- In your analysis, make use of counting, sorting, indexing, filtering, and joining techniques.
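As a rough sketch of how joining, counting, filtering, and sorting can fit together, the toy example below links two small DataFrames on tconst. The rows, tconst values, and titles are invented for illustration; only the column names follow the variable descriptions above.

```python
import pandas as pd

# Made-up stand-ins for holiday_movies and holiday_movie_genres.
movies = pd.DataFrame({
    'tconst': ['tt0000001', 'tt0000002'],
    'primary_title': ['Toy Holiday Film A', 'Toy Holiday Film B'],
    'average_rating': [7.1, 6.3],
})
genres = pd.DataFrame({
    'tconst': ['tt0000001', 'tt0000001', 'tt0000002'],
    'genres': ['Comedy', 'Family', 'Drama'],
})

# Joining: one row per (title, genre) pair, matched on the shared key.
merged = movies.merge(genres, on='tconst', how='left')

# Counting: how many titles fall under each genre.
genre_counts = merged['genres'].value_counts()

# Filtering and sorting: comedies, highest rated first.
comedies = (
    merged[merged['genres'] == 'Comedy']
    .sort_values('average_rating', ascending=False)
)

print(genre_counts)
print(comedies[['primary_title', 'average_rating']])
```

A left join keeps every row of movies even when a title has no genre listed, so such titles show up with a missing value in genres instead of disappearing.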