Classwork 4

Scraping Table Data with Python Selenium Using an XPath Approach

Author

Byeong-Hak Choe

Published

February 18, 2026

Modified

March 2, 2026


Below is the code to set up the web-scraping environment with Python Selenium:

import pandas as pd
import numpy as np
import os, time, random
from io import StringIO

# Import the necessary modules from the Selenium library
from selenium import webdriver  # Main module to control the browser
from selenium.webdriver.common.by import By  # Helps locate elements on the webpage
from selenium.webdriver.chrome.options import Options  # Allows setting browser options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import TimeoutException
from selenium.common.exceptions import StaleElementReferenceException

# Set the working directory path
wd_path = 'ABSOLUTE_PATHNAME_OF_YOUR_WORKING_DIRECTORY' # e.g., '/Users/bchoe/Documents/DANL-210'
os.chdir(wd_path)  # Change the current working directory to wd_path
os.getcwd()  # Retrieve and return the current working directory

# Create an instance of Chrome options
options = Options()

# Initialize the Chrome WebDriver with the specified options
driver = webdriver.Chrome(options=options)


Collecting Rows into a pandas DataFrame

df = pd.DataFrame()

for i in range(1, 6):
    
    obs_lst = [i, "Ava", "Geneseo", 2025]
    
    obs = pd.DataFrame([obs_lst])
    df = pd.concat([df, obs], ignore_index = True)
    
df.columns = ['id', 'name', 'school', 'year']
  1. Start with an empty DataFrame
  • df = pd.DataFrame() creates an empty table (no rows yet).
  2. Repeat the same process five times
  • for i in range(1, 6): runs the indented block 5 times (for i = 1, 2, 3, 4, 5).
  3. Create one “observation” as a list
  • obs_lst = [i, "Ava", "Geneseo", 2025] is a Python list with 4 values.
  • Think of this list as one row of data you want to add.
  4. Turn the list into a one-row DataFrame
  • obs = pd.DataFrame([obs_lst])
  • Wrapping obs_lst with brackets ([obs_lst]) makes it a list of rows (2D), so pandas creates a 1-row DataFrame.
  • Because no column names are given, pandas assigns default column labels: 0, 1, 2, 3.
  5. Append the new row to df by concatenation
  • df = pd.concat([df, obs], ignore_index=True)
  • pd.concat(...) stacks DataFrames row-wise (adds rows).
    • The first argument [df, obs] for pd.concat(...) is a list of DataFrames.
    • ignore_index=True resets the row index to 0, 1, 2, ... so the final table has clean numbering.
  6. Assign column names to the DataFrame
  • df.columns = ['id', 'name', 'school', 'year'] renames the columns of df.
  • This replaces the default column labels (0, 1, 2, 3) with meaningful names:
    • id for the first column
    • name for the second column
    • school for the third column
    • year for the fourth column
  • After this, you can refer to columns using df['id'], df['name'], df['school'], and df['year'].

Result

After the loop finishes, df contains five rows that are identical except for the id column (1 through 5):
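Reproducing the loop above as a self-contained script and printing the result:

```python
import pandas as pd

df = pd.DataFrame()

for i in range(1, 6):
    # One observation: id changes each iteration; the rest stays the same
    obs_lst = [i, "Ava", "Geneseo", 2025]
    obs = pd.DataFrame([obs_lst])
    df = pd.concat([df, obs], ignore_index=True)

df.columns = ['id', 'name', 'school', 'year']
print(df)
```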


Question 1.

Task: Using Selenium and a single for loop, scrape the table on the EIA page into a pandas DataFrame.

Hints

  • Start by scraping the body rows (tr) for cell values (td)
  • You can build XPaths with an f-string inside a loop.

Example idea (XPath f-string):

for i in range(1, 10):
    xpath = f'//*[@id="main-content"]//table/tbody/tr[{i}]/td[1]'
    print(xpath)

Answer:

# TODO: find out the number of rows (<tr>) in the body table (<tbody>)
nrows = 

df = pd.DataFrame()
for i in range(1, nrows + 1):

    # TODO: scrape each cell's text in a single row
    mon_yr = 
    retail_price = 
    refining = 
    distribution_marketing = 
    taxes = 
    crude_oil = 
    
    obs_lst = [mon_yr, retail_price, refining, distribution_marketing, taxes, crude_oil]
    obs = pd.DataFrame([obs_lst])
    
    df = pd.concat([df, obs], ignore_index = True)

Web-scraping Strategy with XPath

Step 0. Set up your environment

  • Import the libraries/modules you need (e.g., Selenium, pandas, time).
  • Set your working directory (so your output file saves where you expect).
  • Navigate to the webpage with driver.get(url).

Step 1. Find an XPath pattern for the element(s) you want

  • Use your browser’s Developer Tools (Inspect) to locate the target element.
  • Identify what changes across repeated elements (often an index like [1], [2], [3]).

Step 2. Build a dynamic XPath using an f-string

  • Turn the XPath pattern into a template.
  • Replace the changing part (usually the index) with {i} so you can loop over it.

Step 3. Write a loop to collect the data. Things to decide before you code:

  • How many times to iterate (e.g., number of rows, pages, or cards).
  • How to locate elements
    • find_element() for a single element.
    • find_elements() for a list of elements.
  • How to store results
    • Append each observation (row) to a list of dictionaries.
    • Convert the list into a pandas DataFrame at the end.

Step 4. Export the DataFrame to a CSV

  • Use df.to_csv("filename.csv", index=False) so the CSV is clean and ready to use.
# %%
# =============================================================================
# Step 0. Set up your environment
# =============================================================================
import pandas as pd
import numpy as np
import os, time, random
from io import StringIO

# Import the necessary modules from the Selenium library
from selenium import webdriver  # Main module to control the browser
from selenium.webdriver.common.by import By  # Helps locate elements on the webpage
from selenium.webdriver.chrome.options import Options  # Allows setting browser options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import TimeoutException
from selenium.common.exceptions import StaleElementReferenceException

# Set the working directory path
wd_path = '/Users/bchoe/Documents/DANL-210'  # absolute pathname of your working directory
os.chdir(wd_path)  # Change the current working directory to wd_path
os.getcwd()  # Retrieve and return the current working directory

# Create an instance of Chrome options
options = Options()

# Initialize the Chrome WebDriver with the specified options
driver = webdriver.Chrome(options=options)


# %%
# =============================================================================
# Scraping
# =============================================================================
url = 'https://www.eia.gov/petroleum/gasdiesel/gaspump_hist.php'

driver.get(url)


# Step 1. Find an XPath pattern for the element(s) you want
# Mon-yr
# Row 1 
# /html/body/div[1]/div[2]/div/div[4]/div/div[1]/div/table/tbody/tr[1]/td[1]
# Row 2
# /html/body/div[1]/div[2]/div/div[4]/div/div[1]/div/table/tbody/tr[2]/td[1]

# retail price
# /html/body/div[1]/div[2]/div/div[4]/div/div[1]/div/table/tbody/tr[1]/td[2]
# /html/body/div[1]/div[2]/div/div[4]/div/div[1]/div/table/tbody/tr[2]/td[2]

# Step 2. Build a dynamic XPath using an f-string
# xpath_mon_yr = f'/html/body/div[1]/div[2]/div/div[4]/div/div[1]/div/table/tbody/tr[{i}]/td[1]'
# xpath_retail_price = f'/html/body/div[1]/div[2]/div/div[4]/div/div[1]/div/table/tbody/tr[{i}]/td[2]'
# xpath_refining = f'/html/body/div[1]/div[2]/div/div[4]/div/div[1]/div/table/tbody/tr[{i}]/td[3]'
# xpath_distribution_marketing = f'/html/body/div[1]/div[2]/div/div[4]/div/div[1]/div/table/tbody/tr[{i}]/td[4]'
# xpath_taxes = f'/html/body/div[1]/div[2]/div/div[4]/div/div[1]/div/table/tbody/tr[{i}]/td[5]'
# xpath_crude_oil = f'/html/body/div[1]/div[2]/div/div[4]/div/div[1]/div/table/tbody/tr[{i}]/td[6]'

# How many times to iterate
# TODO: find out the number of rows (<tr>) in the body table (<tbody>)
table = driver.find_element(By.TAG_NAME, 'tbody')
rows = table.find_elements(By.TAG_NAME, 'tr')
nrows = len(rows)


# Step 3. Write a loop to collect the data
df = pd.DataFrame()

for i in range(1, nrows + 1):

    xpath_mon_yr = f'/html/body/div[1]/div[2]/div/div[4]/div/div[1]/div/table/tbody/tr[{i}]/td[1]'
    xpath_retail_price = f'/html/body/div[1]/div[2]/div/div[4]/div/div[1]/div/table/tbody/tr[{i}]/td[2]'
    xpath_refining = f'/html/body/div[1]/div[2]/div/div[4]/div/div[1]/div/table/tbody/tr[{i}]/td[3]'
    xpath_distribution_marketing = f'/html/body/div[1]/div[2]/div/div[4]/div/div[1]/div/table/tbody/tr[{i}]/td[4]'
    xpath_taxes = f'/html/body/div[1]/div[2]/div/div[4]/div/div[1]/div/table/tbody/tr[{i}]/td[5]'
    xpath_crude_oil = f'/html/body/div[1]/div[2]/div/div[4]/div/div[1]/div/table/tbody/tr[{i}]/td[6]'
    
    # How to locate elements
    mon_yr = driver.find_element(By.XPATH, 
                                 xpath_mon_yr).text
    
    retail_price = driver.find_element(By.XPATH, 
                                       xpath_retail_price).text
    refining = driver.find_element(By.XPATH, 
                                   xpath_refining).text
    distribution_marketing = driver.find_element(By.XPATH, 
                                                 xpath_distribution_marketing).text
    taxes = driver.find_element(By.XPATH, 
                                xpath_taxes).text
    crude_oil = driver.find_element(By.XPATH, 
                                    xpath_crude_oil).text
    
    # How to store results
    obs_lst = [mon_yr, retail_price, refining, 
               distribution_marketing, taxes, crude_oil]
    obs = pd.DataFrame([obs_lst])
    
    df = pd.concat([df, obs], ignore_index = True)

# Step 4. Export the DataFrame to a CSV
# Column names
df.columns = ['mon_yr', 'retail_price', 'refining', 
              'distribution_marketing', 'taxes', 'crude_oil']

df.to_csv('data/eia_2026_0220.csv', index=False)



Question 2 — Scrape with a nested for loop

Task: Do the same scraping task as Question 1, but this time use a nested for loop:

  • Outer loop: iterate over rows (tr)
  • Inner loop: iterate over columns (td)
    • In the inner loop, append value to the data list.
# TODO: find out the number of rows (<tr>) and the number of columns (<td>) in each row in the body table (<tbody>)
nrows = 
ncols = 

df = pd.DataFrame()
for i in range(1, nrows + 1):
  
    data = []    # creating an empty list for one row
    
    for j in range(1, ncols + 1):  # Iterate over column positions
        
        # TODO: scrape each cell's text in a single row
        value =
        
        # TODO: append value to the data list

        
    obs = pd.DataFrame([data])
    df = pd.concat([df, obs], ignore_index=True)

Nested for-loop

A nested loop is simply a loop inside another loop.

  • The outer loop controls the “big step” (the main repetition).
  • The inner loop runs all the way through for each single outer loop step.

So the pattern is:

  1. Do outer step #1
    • run inner loop from start to finish
  2. Do outer step #2
    • run inner loop from start to finish
  3. Repeat…
Generic example
for i in range(1, 4):        # outer loop: i = 1, 2, 3
    for j in range(1, 3):    # inner loop: j = 1, 2 (restarts each time)
        print(i, j)
The key idea

If the outer loop runs A times and the inner loop runs B times, the inner block executes A × B times.
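A quick check of the A × B count, using the same ranges as the generic example above:

```python
count = 0
for i in range(1, 4):      # outer loop runs A = 3 times
    for j in range(1, 3):  # inner loop runs B = 2 times per outer step
        count += 1

print(count)  # 3 × 2 = 6
```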

Why we use nested loops

Nested loops are useful whenever you’re working with two dimensions, like:

  • rows × columns in a table
  • pages × items on each page
  • years × months
# TODO: find out the number of rows (<tr>) and the number of columns (<td>) in each row in the body table (<tbody>)
tbody = driver.find_element(By.TAG_NAME, 'tbody')
rows = tbody.find_elements(By.TAG_NAME, 'tr')
nrows = len(rows)

thead = driver.find_element(By.TAG_NAME, 'thead')
cols = thead.find_elements(By.TAG_NAME, 'th')
ncols = len(cols)

df = pd.DataFrame()
for i in range(1, nrows + 1):
  
    data = []    # creating an empty list for one row
    
    for j in range(1, ncols + 1):  # Iterate over column positions
        
        # TODO: scrape each cell's text in a single row
        xpath_value = f'/html/body/div[1]/div[2]/div/div[4]/div/div[1]/div/table/tbody/tr[{i}]/td[{j}]'
        value = driver.find_element(By.XPATH, xpath_value).text
        
        # TODO: append value to the data list
        data.append(value)
        
    obs = pd.DataFrame([data])
    df = pd.concat([df, obs], ignore_index=True)
    

df.columns = ['mon_yr', 'retail_price', 'refining', 'dist_mkt', 
              'taxes', 'crude_oil']

df.to_csv('data/eia_q2_2026_0223.csv', index = False)



Discussion

Welcome to our Classwork 4 Discussion Board! 👋

This space is designed for you to engage with your classmates about the material covered in Classwork 4.

Whether you are looking to delve deeper into the content, share insights, or have questions about the content, this is the perfect place for you.

If you have any specific questions for Byeong-Hak (@bcdanl) regarding the Classwork 4 materials or need clarification on any points, don’t hesitate to ask here.

All comments will be stored here.

Let’s collaborate and learn from each other!
