Homework 3

Python selenium & Pandas Basics

Author

Byeong-Hak Choe

Published

March 30, 2025

Modified

March 30, 2025

import pandas as pd
from itables import init_notebook_mode, show # for displaying an interactive DataFrame

Direction

  • Please submit your Python Script for Part 1 and Part 2 in Homework 3 to Brightspace with the name below:

    • danl_210_hw3_LASTNAME_FIRSTNAME.py
      ( e.g., danl_210_hw3_choe_byeonghak.py )
  • The due is April 9, 2025, 10:30 A.M.

  • Please send Prof. Choe an email (bchoe@geneseo.edu) if you have any questions.




Part 1. Collecting the IMDb Data using Python Selenium



Question 1

  • Go to the following Internet Movie Database (IMDb) webpage for
  • Provide your Python Selenium code to scrape the following information about Top 200 Family Movies and TV Shows by Popularity.
    • You should create the following variables in the DataFrame:
      • ranking (e.g., 1, 2, 3)
      • title (e.g., Snow White, Freakier Friday)
      • year (e.g., 2025, 2025-)
      • runtime (e.g., 1h 56m, 1h 40m)
      • rating (e.g., PG, TV-G, G)
      • imdb_score (e.g., 7.6, 6.5)
      • votes (e.g., 1.4k, 70k)
      • metascore (e.g., 56, 74)
      • plot (e.g., A princess joins forces with seven dwarfs to liberate her kingdom from her cruel stepmother the Evil Queen.)
  • Suppose ranking and title are initially combined in a single column called ranking_title, with values like:
    • “1. Snow White”
    • “2. Freakier Friday”
  • You can split this column into ranking and title using the following code:
# Split the 'ranking_title' column into two new variables, 'ranking' and 'title'
# by the first occurrence of a period (dot). 

# n=1 ensures the split to be limited to the first dot.
# expand=True tells pandas to return the results in a DataFrame
  # If expand were False, the result would be a Series of lists.

df[['ranking', 'title']] = df['ranking_title'].str.split('.', n=1, expand=True)
df = df.drop(columns=['ranking_title'])
  • Finally, export the resulting DataFrame to the CSV file.

Answer:



Part 2. Pandas Basic



Below is spotify DataFrame that reads the file spotify_all.csv containing data of Spotify users’ playlist information (Source: Spotify Million Playlist Dataset Challenge).

spotify = pd.read_csv('https://bcdanl.github.io/data/spotify_all.csv')


Variable Description

  • pid: playlist ID; unique ID for playlist
  • playlist_name: a name of playlist
  • pos: a position of the track within a playlist (starting from 0)
  • artist_name: name of the track’s primary artist
  • track_name: name of the track
  • duration_ms: duration of the track in milliseconds
  • album_name: name of the track’s album

Definition of a Song

  • In Part 2, a song is defined as a combination of a artist_name value and a track_name value.

  • E.g., the following provides the 12 distinct songs with the same track_nameI Love You:


Question 2

  • Write a Python code to identify the top five songs with the highest frequency of appearances in the spotify DataFrame.

Answer:


Question 3

  • Write a Python code to create a DataFrame that contains information about how often each song occurs within the spotify DataFrame.
    • In this new DataFrame, each observation should represent a distinct song.
  • Then, write a Python code to identify top 5% songs based on their frequency of appearances.

Answer:


Question 4

  • Write a Python code to list all artists who have more than 50 unique songs in the sportify DataFrame.

Answer:


Question 5

  • Write a Python code to create a DataFrame that identifies all the playlists featuring the song “One Dance” by Drake.

Answer:


Question 6

  • Write a Python code to identify the longest and shortest duration songs (based on duration_ms) for each unique artist.

Answer:


Question 7

  • Write a Python code to find the same song(s) appearing more than once in the same playlist.

Answer:


Question 8

  • Write a Python code to filter all songs that appear in more than 100 different playlists.

Answer:



Part 3. Jupyter Notebook Blogging



holiday_movies = pd.read_csv("https://bcdanl.github.io/data/holiday_movies.csv")
  • The DataFrame holiday_movies comes from the IMDb.

  • The following is the DataFrame, holiday_movies.

Variable description

  • tconst: alphanumeric unique identifier of the title

  • title_type: the type/format of the title

    • (movie, video, or tvMovie)
  • primary_title: the more popular title / the title used by the filmmakers on promotional materials at the point of release

  • simple_title: the title in lowercase, with punctuation removed, for easier filtering and grouping

  • year: the release year of a title

  • runtime_minutes: primary runtime of the title, in minutes

  • average_rating: weighted average of all the individual user ratings on IMDb

  • num_votes: number of votes the title has received on IMDb (titles with fewer than 10 votes were not included in this dataset)



  • The following is another DataFrame holiday_movie_genres that is related with the DataFrame holiday_movies:
holiday_movie_genres = pd.read_csv("https://bcdanl.github.io/data/holiday_movies.csv")
  • The DataFrame holiday_movie_genres include up to three genres associated with the titles that appear in the DataFrame.

Variable description

  • tconst: alphanumeric unique identifier of the title
  • genres: genres associated with the title, one row per genre


  • Using the provided DataFrames, write a blog post about Christmas movies in a Jupyter Notebook and publish it on your online blog.
    • In your analysis, make use of counting, sorting, indexing, filtering, and joining techniques.


Back to top