Homework 2

Scraping Data with Python Selenium

Author

Byeong-Hak Choe

Published

February 20, 2026

Modified

February 25, 2026

📌 Directions

  • Submit your Python script (*.py) to Brightspace using this format:
    • danl-210-hw2-LASTNAME-FIRSTNAME.py
      (e.g., danl-210-hw2-choe-byeonghak.py)
  • Due: March 4, 2026, 10:30 A.M.
  • Questions? Email Prof. Choe:


Setup

import pandas as pd
import os, time, random
from io import StringIO

# Import the necessary modules from the Selenium library
from selenium import webdriver  # Main module to control the browser
from selenium.webdriver.common.by import By  # Helps locate elements on the webpage
from selenium.webdriver.chrome.options import Options  # Allows setting browser options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import TimeoutException
from selenium.common.exceptions import StaleElementReferenceException

# Set the working directory path
wd_path = 'ABSOLUTE_PATHNAME_OF_YOUR_WORKING_DIRECTORY' # e.g., '/Users/bchoe/Documents/DANL-210'
os.chdir(wd_path)  # Change the current working directory to wd_path
os.getcwd()  # Retrieve and return the current working directory

# Create an instance of Chrome options
options = Options()

# Initialize the Chrome WebDriver with the specified options
driver = webdriver.Chrome(options=options)



Part 1. Collecting the IMDb Data using Python Selenium

Write Python code that uses Selenium to collect movie and TV series information from IMDb.

Target page

Go to the IMDb search page for:

Tasks

  1. Use Selenium to open the target page.
  2. For each of the 100 titles listed on the page, scrape the following fields and store them in a single pandas DataFrame:
  • ranking (e.g., 1, 2, 3)
  • title (e.g., How to Train Your Dragon)
  • year (e.g., 2025, 2025-)
  • runtime (e.g., 1h 56m, 1h 40m)
  • rating (e.g., PG, TV-G, G)
  • is_series (e.g., True if TV Series; False otherwise)
  • metascore (e.g., 56, 74)
  • imdb_score (e.g., 7.6, 6.5)
  • votes (e.g., 1.4k, 70k)
  • plot (one-sentence description shown for the title)
  3. Export the final DataFrame to a CSV file (e.g., imdb_family_top100.csv).
Note
  • To handle inconsistencies in the webpage structure (some titles may be missing fields):
    • Use a try-except block to skip missing elements safely (e.g., no metascore, no rating) and fill with pd.NA or ''.
    • Use an if-elif-else chain (with the in keyword) to classify whether an item is a TV Series or a Movie based on the text you scraped. For example, check whether the text contains "TV Series" and set is_series accordingly.
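The two patterns in this note can be sketched as small helper functions. This is only a sketch: the helper names are invented here, and the exact type strings (such as "TV Mini Series") are assumptions — adapt them to the text you actually scrape from the page.

```python
import pandas as pd

def safe_get(fetch, default=pd.NA):
    """Run fetch() and return its result; if any exception is raised
    (e.g., Selenium's NoSuchElementException when a field is missing),
    return the default instead of crashing the loop."""
    try:
        return fetch()
    except Exception:
        return default

def classify_is_series(type_text):
    """Return True if the scraped text marks a TV Series, False otherwise,
    using an if-elif-else chain with the `in` keyword."""
    if "TV Mini Series" in type_text:
        return True
    elif "TV Series" in type_text:
        return True
    else:
        return False
```

Inside your scraping loop you would then call, for example, safe_get(lambda: card.find_element(By.CSS_SELECTOR, "...").text) — the selector there is a placeholder; inspect the page to find the real one.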

Helpful Data-Cleaning Snippet: Splitting ranking_title

Sometimes you may scrape the rank and title together as a single string like:

  • "1. Snow White"
  • "2. Freakier Friday"

If your DataFrame has a combined column named ranking_title, you can split it into ranking and title like this:

# Example: "1. Snow White" -> ranking="1", title="Snow White"
df[["ranking", "title"]] = df["ranking_title"].str.split(".", n=1, expand=True)
df["title"] = df["title"].str.strip()  # remove the leading space left after the split

# Optional: drop the original combined column
df = df.drop(columns=["ranking_title"])

Changing the Order of Columns

In pandas, you can reorder columns by selecting them in the exact order you want.

  • df[[...]] means: "keep only these columns, and arrange them in this order."
  • This is useful right before exporting to CSV so your dataset has a clean, consistent layout.
  • If you include a column name that does not exist (typo, missing column), pandas will raise a KeyError.
# Reorder columns in df (and keep only these columns)
df = df[[
    "ranking", "title", "year", "runtime", "rating",
    "is_series", "metascore", "imdb_score", "votes", "plot"
]]

Output DataFrame

Finally, export your results:

df.to_csv("data/imdb_family_top100.csv", index=False)



Part 2. Collecting NBA Player Salaries (1999–2000 to 2025–2026) with pandas.read_html() + Selenium

Write Python code that uses Selenium (to navigate pages) and pandas.read_html() (to extract the salary tables) to collect NBA player salary data from ESPN.

Target pages

Tasks

  1. Open the ESPN NBA salaries site using Selenium.

  2. Loop over:

    • Years: 2000 through 2026 (these correspond to seasons 1999–2000 through 2025–2026), and
    • Pages: 1 through the last page for that year.
  3. For each year-page combination:

    • Use pd.read_html() to read the salary table into a DataFrame.
    • Clean/rename columns as needed and keep only:
      • ranking
      • name_position
      • team
      • salary
    • Add a season column (e.g., 1999-2000, 2025-2026).
    • Split the name_position column into name and position columns.
  4. Append all pages and years into one DataFrame with exactly these columns:

    • season
    • ranking
    • name
    • position
    • team
    • salary
  5. Finally, export your results:

dfs.to_csv("data/nba_salary.csv", index=False)
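The loop in Tasks 2–4 can be sketched with two small helpers. These are illustrative only: the function names are invented here, and the URL pattern is an assumption based on ESPN's usual page layout — confirm it against your browser's address bar before using it.

```python
from io import StringIO
import pandas as pd

def season_label(year):
    """Map ESPN's year value to a season string, e.g., 2000 -> '1999-2000'."""
    return f"{year - 1}-{year}"

def salary_url(year, page):
    # Assumed URL pattern -- verify it on ESPN's salary pages before relying on it
    return f"https://www.espn.com/nba/salaries/_/year/{year}/page/{page}"

# Inside the loop you would then do something like:
#   driver.get(salary_url(year, page))
#   df_page = pd.read_html(StringIO(driver.page_source))[0]
#   df_page["season"] = season_label(year)
#   frames.append(df_page)
# and, after the loop: dfs = pd.concat(frames, ignore_index=True)
```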

Helpful Parsing Snippet (page count)

ESPN pagination text often looks like:

p_num = "1 of 12"
last_page = int(p_num.split(" of ")[1])  # -> 12

Use this to determine how many pages exist for each year, rather than hard-coding page limits.
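A slightly more defensive version of that parse (a sketch; the fallback value of 1 is an assumption) avoids crashing when the pagination text is missing or formatted unexpectedly:

```python
def last_page_from_text(p_num, default=1):
    """Parse pagination text like '1 of 12' and return 12;
    fall back to `default` if the text does not match that shape."""
    try:
        return int(p_num.split(" of ")[1])
    except (IndexError, ValueError):
        return default
```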

Respectful Scraping: Add Randomized Delays

  • Be polite to ESPN's servers: for each webpage visit, pause with a randomized delay before scraping (or before moving to the next page), for example:
import time
import random

driver.get(url)
time.sleep(random.uniform(3, 5))

How to Add a Column to a DataFrame

In pandas, you can create (or overwrite) a column using square-bracket assignment:

  • df_example["season"] refers to the column named season.
  • Assigning a single value like "1999-2000" will broadcast that value to every row in the DataFrame.
  • If the column already exists, this will replace its values.
df_example = pd.DataFrame({
    "ranking": [1, 2, 3],
    "name_position": ["Stephen Curry, G", "Joel Embiid, C", "Nikola Jokic, C"],
    "team": ["GSW", "PHI", "DEN"],
    "salary": ["$59,606,817", "$55,224,526", "$55,224,526"]
})

df_example["season"] = "2025-2026"   # adds a new column to df

Helpful Data-Cleaning Snippet: Splitting name_position

# Example: "Stephen Curry, G" -> name="Stephen Curry", position="G"
df_example[["name", "position"]] = df_example["name_position"].str.split(", ", n=1, expand=True)

# Optional: drop the original combined column
df_example = df_example.drop(columns=["name_position"])

Replacing Values in a String

When you scrape salary values from a webpage, they often come in a formatted string like:

  • "$51,915,615"

Before you can convert salaries to numbers, you need to remove formatting characters such as:

  • the dollar sign ($)
  • commas (,)

In pandas, .str.replace() applies a text replacement to every value in the column (element-by-element).

# Remove "$" and "," so the values become plain digits like "51915615"
dfs["salary"] = dfs["salary"].str.replace("$", "", regex=False)
dfs["salary"] = dfs["salary"].str.replace(",", "", regex=False)
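After stripping the formatting, the column still holds strings. pd.to_numeric converts it to numbers; errors="coerce" turns any non-numeric leftovers (such as a stray header row scraped into the data) into NaN instead of raising an error. Shown here on a small illustrative DataFrame:

```python
import pandas as pd

demo = pd.DataFrame({"salary": ["$51,915,615", "$1,000,000", "SALARY"]})
demo["salary"] = demo["salary"].str.replace("$", "", regex=False)
demo["salary"] = demo["salary"].str.replace(",", "", regex=False)
# "SALARY" (a stray header row) becomes NaN; real values become numbers
demo["salary"] = pd.to_numeric(demo["salary"], errors="coerce")
```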

Output DataFrame

The dfs.to_csv() call in Task 5 above writes the final DataFrame to data/nba_salary.csv, with one row per player per season.



✅ End of Homework 2
