Classwork 4
Scraping Table Data with Python Selenium: the XPath Approach
Below is the setup for the web-scraping environment with Python Selenium:
import pandas as pd
import numpy as np
import os, time, random
from io import StringIO
# Import the necessary modules from the Selenium library
from selenium import webdriver # Main module to control the browser
from selenium.webdriver.common.by import By # Helps locate elements on the webpage
from selenium.webdriver.chrome.options import Options # Allows setting browser options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import TimeoutException
from selenium.common.exceptions import StaleElementReferenceException
# Set the working directory path
wd_path = 'ABSOLUTE_PATHNAME_OF_YOUR_WORKING_DIRECTORY' # e.g., '/Users/bchoe/Documents/DANL-210'
os.chdir(wd_path) # Change the current working directory to wd_path
os.getcwd() # Retrieve and return the current working directory
# Create an instance of Chrome options
options = Options()
# Initialize the Chrome WebDriver with the specified options
driver = webdriver.Chrome(options=options)

Collecting Rows into a pandas DataFrame
df = pd.DataFrame()
for i in range(1, 6):
    obs_lst = [i, "Ava", "Geneseo", 2025]
    obs = pd.DataFrame([obs_lst])
    df = pd.concat([df, obs], ignore_index=True)
df.columns = ['id', 'name', 'school', 'year']

- Start with an empty DataFrame
  df = pd.DataFrame() creates an empty table (no rows yet).
- Repeat the same process five times
  for i in range(1, 6): runs the indented block 5 times (for i = 1, 2, 3, 4, 5).
- Create one “observation” as a list
  obs_lst = [i, "Ava", "Geneseo", 2025] is a Python list with four values. Think of this list as one row of data you want to add.
- Turn the list into a one-row DataFrame
  obs = pd.DataFrame([obs_lst]) wraps obs_lst in brackets ([obs_lst]) to make it a list of rows (2D), so pandas creates a 1-row DataFrame. Because no column names are given, pandas assigns default column labels: 0, 1, 2, 3.
- Append the new row to df by concatenation
  df = pd.concat([df, obs], ignore_index=True) stacks DataFrames row-wise (adds rows). The first argument [df, obs] for pd.concat(...) is a list of DataFrames. ignore_index=True resets the row index to 0, 1, 2, ... so the final table has clean numbering.
- Assign column names to the DataFrame
  df.columns = ['id', 'name', 'school', 'year'] renames the columns of df. This replaces the default column labels (0, 1, 2, 3) with meaningful names: id for the first column, name for the second, school for the third, and year for the fourth. After this, you can refer to columns using df['id'], df['name'], df['school'], and df['year'].
Result
After the loop finishes, df contains five rows that are identical except for the id column (1 through 5).
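To see the append-and-concat pattern end-to-end, the loop above can be run and the result inspected; only the id column varies across rows:

```python
import pandas as pd

# Rebuild the example from this section and inspect the result
df = pd.DataFrame()
for i in range(1, 6):
    obs = pd.DataFrame([[i, "Ava", "Geneseo", 2025]])  # one-row DataFrame
    df = pd.concat([df, obs], ignore_index=True)       # stack row-wise
df.columns = ['id', 'name', 'school', 'year']

print(df.shape)            # (5, 4): five rows, four columns
print(df['id'].tolist())   # [1, 2, 3, 4, 5]
```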
Question 1.
Task: Using Selenium and a single for loop, scrape the table on the EIA page into a pandas DataFrame.
Hints
- Start by scraping the body rows (tr) for cell values (td).
- You can build XPaths with an f-string inside a loop.
Example idea (XPath f-string):
for i in range(1, 10):
    xpath = f'//*[@id="main-content"]//table/tbody/tr[{i}]/td[1]'
    print(xpath)

Answer:
# TODO: find out the number of rows (<tr>) in the body table (<tbody>)
nrows =

df = pd.DataFrame()
for i in range(1, nrows + 1):
    # TODO: scrape each cell's text in a single row
    mon_yr =
    retail_price =
    refining =
    distribution_marketing =
    taxes =
    crude_oil =

    obs_lst = [mon_yr, retail_price, refining, distribution_marketing, taxes, crude_oil]
    obs = pd.DataFrame([obs_lst])
    df = pd.concat([df, obs], ignore_index=True)

Web-scraping Strategy with XPath
Step 0. Set up your environment
- Import the libraries/modules you need (e.g., Selenium, pandas, time).
- Set your working directory (so your output file saves where you expect).
- Get to the webpage.
Step 1. Find an XPath pattern for the element(s) you want
- Use your browser’s Developer Tools (Inspect) to locate the target element.
- Identify what changes across repeated elements (often an index like [1], [2], [3]).
Step 2. Build a dynamic XPath using an f-string
- Turn the XPath pattern into a template.
- Replace the changing part (usually the index) with {i} so you can loop over it.
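For instance, replacing a hard-coded row index with {i} turns one XPath into a template (the XPath below is a simplified illustration, not the actual EIA page's path):

```python
# Hard-coded XPath for row 1 only:  //table/tbody/tr[1]/td[1]
# Template: the row index becomes {i} inside an f-string
xpaths = []
for i in range(1, 4):                              # i = 1, 2, 3
    xpaths.append(f'//table/tbody/tr[{i}]/td[1]')  # fill in the index
print(xpaths[0])   # //table/tbody/tr[1]/td[1]
```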
Step 3. Write a loop to collect the data. Things to decide before you code:
- How many times to iterate (e.g., number of rows, pages, or cards).
- How to locate elements:
  - find_element() for a single element.
  - find_elements() for a list of elements.
- How to store results:
  - Append each observation (row) to a list of dictionaries.
  - Convert the list into a pandas DataFrame at the end.
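The "list of dictionaries" storage pattern can be sketched like this (the values are placeholders standing in for `find_element(...).text` calls, not real scraped data):

```python
import pandas as pd

records = []                                  # one dict per observation (row)
for i in range(1, 4):
    # In real scraping, these strings would come from find_element(...).text
    records.append({'mon_yr': f'Jan-{20 + i}',
                    'retail_price': f'{3.0 + i:.2f}'})

df = pd.DataFrame(records)                    # convert once, at the end
print(df.columns.tolist())   # ['mon_yr', 'retail_price']
```

Building a plain list and converting once at the end is also a bit faster than calling pd.concat inside the loop, since each concat copies the whole table.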
Step 4. Export the DataFrame to a CSV
- Use df.to_csv("filename.csv", index=False) so the CSV is clean and ready to use.
# %%
# =============================================================================
# Step 0. Set up your environment
# =============================================================================
import pandas as pd
import numpy as np
import os, time, random
from io import StringIO
# Import the necessary modules from the Selenium library
from selenium import webdriver # Main module to control the browser
from selenium.webdriver.common.by import By # Helps locate elements on the webpage
from selenium.webdriver.chrome.options import Options # Allows setting browser options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import TimeoutException
from selenium.common.exceptions import StaleElementReferenceException
# Set the working directory path
wd_path = '/Users/bchoe/Documents/DANL-210' # e.g., '/Users/bchoe/Documents/DANL-210'
os.chdir(wd_path) # Change the current working directory to wd_path
os.getcwd() # Retrieve and return the current working directory
# Create an instance of Chrome options
options = Options()
# Initialize the Chrome WebDriver with the specified options
driver = webdriver.Chrome(options=options)
# %%
# =============================================================================
# Scraping
# =============================================================================
url = 'https://www.eia.gov/petroleum/gasdiesel/gaspump_hist.php'
driver.get(url)
# Step 1. Find an XPath pattern for the element(s) you want
# Mon-yr
# Row 1
# /html/body/div[1]/div[2]/div/div[4]/div/div[1]/div/table/tbody/tr[1]/td[1]
# Row 2
# /html/body/div[1]/div[2]/div/div[4]/div/div[1]/div/table/tbody/tr[2]/td[1]
# retail price
# /html/body/div[1]/div[2]/div/div[4]/div/div[1]/div/table/tbody/tr[1]/td[2]
# /html/body/div[1]/div[2]/div/div[4]/div/div[1]/div/table/tbody/tr[2]/td[2]
# Step 2. Build a dynamic XPath using an f-string
# xpath_mon_yr = f'/html/body/div[1]/div[2]/div/div[4]/div/div[1]/div/table/tbody/tr[{i}]/td[1]'
# xpath_retail_price = f'/html/body/div[1]/div[2]/div/div[4]/div/div[1]/div/table/tbody/tr[{i}]/td[2]'
# xpath_refining = f'/html/body/div[1]/div[2]/div/div[4]/div/div[1]/div/table/tbody/tr[{i}]/td[3]'
# xpath_distribution_marketing = f'/html/body/div[1]/div[2]/div/div[4]/div/div[1]/div/table/tbody/tr[{i}]/td[4]'
# xpath_taxes = f'/html/body/div[1]/div[2]/div/div[4]/div/div[1]/div/table/tbody/tr[{i}]/td[5]'
# xpath_crude_oil = f'/html/body/div[1]/div[2]/div/div[4]/div/div[1]/div/table/tbody/tr[{i}]/td[6]'
# How many times to iterate
# TODO: find out the number of rows (<tr>) in the body table (<tbody>)
table = driver.find_element(By.TAG_NAME, 'tbody')
rows = table.find_elements(By.TAG_NAME, 'tr')
nrows = len(rows)
# Step 3. Write a loop to collect the data
df = pd.DataFrame()
for i in range(1, nrows + 1):
    xpath_mon_yr = f'/html/body/div[1]/div[2]/div/div[4]/div/div[1]/div/table/tbody/tr[{i}]/td[1]'
    xpath_retail_price = f'/html/body/div[1]/div[2]/div/div[4]/div/div[1]/div/table/tbody/tr[{i}]/td[2]'
    xpath_refining = f'/html/body/div[1]/div[2]/div/div[4]/div/div[1]/div/table/tbody/tr[{i}]/td[3]'
    xpath_distribution_marketing = f'/html/body/div[1]/div[2]/div/div[4]/div/div[1]/div/table/tbody/tr[{i}]/td[4]'
    xpath_taxes = f'/html/body/div[1]/div[2]/div/div[4]/div/div[1]/div/table/tbody/tr[{i}]/td[5]'
    xpath_crude_oil = f'/html/body/div[1]/div[2]/div/div[4]/div/div[1]/div/table/tbody/tr[{i}]/td[6]'

    # How to locate elements
    mon_yr = driver.find_element(By.XPATH, xpath_mon_yr).text
    retail_price = driver.find_element(By.XPATH, xpath_retail_price).text
    refining = driver.find_element(By.XPATH, xpath_refining).text
    distribution_marketing = driver.find_element(By.XPATH, xpath_distribution_marketing).text
    taxes = driver.find_element(By.XPATH, xpath_taxes).text
    crude_oil = driver.find_element(By.XPATH, xpath_crude_oil).text

    # How to store results
    obs_lst = [mon_yr, retail_price, refining,
               distribution_marketing, taxes, crude_oil]
    obs = pd.DataFrame([obs_lst])
    df = pd.concat([df, obs], ignore_index=True)
# Step 4. Export the DataFrame to a CSV
# Column names
df.columns = ['mon_yr', 'retail_price', 'refining',
'distribution_marketing', 'taxes', 'crude_oil']
df.to_csv('data/eia_2026_0220.csv', index=False)

Question 2 — Scrape with a nested for loop
Task: Do the same scraping task as Question 1, but this time use a nested for loop:
- Outer loop: iterate over rows (tr)
- Inner loop: iterate over columns (td)
  - In the inner loop, append value to the data list.
# TODO: find out the number of rows (<tr>) and the number of columns (<td>) in each row in the body table (<tbody>)
nrows =
ncols =

df = pd.DataFrame()
for i in range(1, nrows + 1):
    data = []  # creating an empty list for one row
    for j in range(1, ncols + 1):  # Iterate over column positions
        # TODO: scrape each cell's text in a single row
        value =
        # TODO: append value to the data list
    obs = pd.DataFrame([data])
    df = pd.concat([df, obs], ignore_index=True)

Nested for-loop
A nested loop is simply a loop inside another loop.
- The outer loop controls the “big step” (the main repetition).
- The inner loop runs all the way through for each single outer loop step.
So the pattern is:
- Do outer step #1
  - run the inner loop from start to finish
- Do outer step #2
  - run the inner loop from start to finish
- Repeat…
Generic example
for i in range(1, 4):        # outer loop: i = 1, 2, 3
    for j in range(1, 3):    # inner loop: j = 1, 2 (restarts each time)
        print(i, j)

The key idea
If the outer loop runs A times and the inner loop runs B times, the inner block executes A Ă— B times.
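A quick count confirms the A Ă— B rule for the generic example above:

```python
count = 0
for i in range(1, 4):        # outer loop: A = 3 steps
    for j in range(1, 3):    # inner loop: B = 2 steps per outer step
        count += 1           # the inner block runs A * B times in total
print(count)   # 6
```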
Why we use nested loops
Nested loops are useful whenever you’re working with two dimensions, like:
- rows Ă— columns in a table
- pages Ă— items on each page
- years Ă— months
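Here is a minimal rows Ă— columns sketch of the pattern Question 2 asks for, with placeholder strings standing in for scraped cell text:

```python
import pandas as pd

df = pd.DataFrame()
for i in range(1, 4):                     # outer loop: rows
    data = []                             # fresh list for this row
    for j in range(1, 3):                 # inner loop: columns
        data.append(f'r{i}c{j}')          # placeholder for find_element(...).text
    obs = pd.DataFrame([data])            # one-row DataFrame
    df = pd.concat([df, obs], ignore_index=True)
print(df.shape)   # (3, 2)
```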
# TODO: find out the number of rows (<tr>) and the number of columns (<td>) in each row in the body table (<tbody>)
tbody = driver.find_element(By.TAG_NAME, 'tbody')
rows = tbody.find_elements(By.TAG_NAME, 'tr')
nrows = len(rows)

thead = driver.find_element(By.TAG_NAME, 'thead')
cols = thead.find_elements(By.TAG_NAME, 'th')
ncols = len(cols)

df = pd.DataFrame()
for i in range(1, nrows + 1):
    data = []  # creating an empty list for one row
    for j in range(1, ncols + 1):  # Iterate over column positions
        # Scrape each cell's text in a single row
        xpath_value = f'/html/body/div[1]/div[2]/div/div[4]/div/div[1]/div/table/tbody/tr[{i}]/td[{j}]'
        value = driver.find_element(By.XPATH, xpath_value).text
        # Append value to the data list
        data.append(value)
    obs = pd.DataFrame([data])
    df = pd.concat([df, obs], ignore_index=True)
df.columns = ['mon_yr', 'retail_price', 'refining', 'dist_mkt',
'taxes', 'crude_oil']
df.to_csv('data/eia_q2_2026_0223.csv', index=False)

Discussion
Welcome to our Classwork 4 Discussion Board! đź‘‹
This space is designed for you to engage with your classmates about the material covered in Classwork 4.
Whether you are looking to delve deeper into the material, share insights, or ask questions about the content, this is the perfect place for you.
If you have any specific questions for Byeong-Hak (@bcdanl) regarding the Classwork 4 materials or need clarification on any points, don’t hesitate to ask here.
Let’s collaborate and learn from each other!