Classwork 4
Scraping Data with Python Selenium
The code below sets up the web scraping environment with Python Selenium:
import pandas as pd
import os, time, random
from io import StringIO
# Import the necessary modules from the Selenium library
from selenium import webdriver # Main module to control the browser
from selenium.webdriver.common.by import By # Helps locate elements on the webpage
from selenium.webdriver.chrome.options import Options # Allows setting browser options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import TimeoutException
from selenium.common.exceptions import StaleElementReferenceException
# Set the working directory path
wd_path = 'ABSOLUTE_PATHNAME_OF_YOUR_WORKING_DIRECTORY' # e.g., '/Users/bchoe/Documents/DANL-210'
os.chdir(wd_path) # Change the current working directory to wd_path
os.getcwd() # Retrieve and return the current working directory
# Create an instance of Chrome options
options = Options()
# Initialize the Chrome WebDriver with the specified options
driver = webdriver.Chrome(options=options)

Collecting Rows into a pandas DataFrame
df = pd.DataFrame()
for i in range(1, 6):
    obs_lst = ["Ava", "Geneseo", 2025]
    obs = pd.DataFrame([obs_lst])
    df = pd.concat([df, obs], ignore_index = True)
df.columns = ['name', 'school', 'year']

- Start with an empty DataFrame
  - df = pd.DataFrame() creates an empty table (no rows yet).
- Repeat the same process five times
  - for i in range(1, 6): runs the indented block 5 times (for i = 1, 2, 3, 4, 5).
- Create one “observation” as a list
  - obs_lst = ["Ava", "Geneseo", 2025] is a Python list with 3 values.
  - Think of this list as one row of data you want to add.
- Turn the list into a one-row DataFrame
  - obs = pd.DataFrame([obs_lst])
  - Wrapping obs_lst with brackets ([obs_lst]) makes it a list of rows (2D), so pandas creates a 1-row DataFrame.
  - Because no column names are given, pandas assigns default column labels: 0, 1, 2.
- Append the new row to df by concatenation
  - df = pd.concat([df, obs], ignore_index=True)
  - pd.concat([...]) stacks DataFrames row-wise (adds rows).
  - ignore_index=True resets the row index to 0, 1, 2, ... so the final table has clean numbering.
- Assign column names to the DataFrame
  - df.columns = ['name', 'school', 'year'] renames the columns of df.
  - This replaces the default column labels (0, 1, 2) with meaningful names:
    - name for the first column
    - school for the second column
    - year for the third column
  - After this, you can refer to columns using df['name'], df['school'], and df['year'].
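To see why the extra brackets in pd.DataFrame([obs_lst]) matter, the short sketch below compares the two calls on the same list. The variable names flat and one_row are illustrative only and are not part of the classwork code.

```python
import pandas as pd

obs_lst = ["Ava", "Geneseo", 2025]

# Without the extra brackets, pandas treats each value as its own row
flat = pd.DataFrame(obs_lst)
print(flat.shape)              # (3, 1): three rows, one column

# With the extra brackets, the list becomes a single row with three columns
one_row = pd.DataFrame([obs_lst])
print(one_row.shape)           # (1, 3): one row, three columns
print(list(one_row.columns))   # default labels [0, 1, 2]
```

Only the second form lines up with the three column names assigned at the end of the loop.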
Result
After the loop finishes, df contains five identical rows, one per loop iteration, with row labels 0 through 4.
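The loop above calls pd.concat once per row, and each call copies the rows accumulated so far. For larger tables, a common alternative is to collect the rows in a plain Python list and build the DataFrame once at the end. The sketch below is one such variant, not the approach this classwork asks you to practice; the name rows is an assumption introduced here.

```python
import pandas as pd

rows = []  # plain Python list that will hold one list per row
for i in range(1, 6):
    obs_lst = ["Ava", "Geneseo", 2025]
    rows.append(obs_lst)

# Build the DataFrame once, naming the columns at the same time
df = pd.DataFrame(rows, columns=['name', 'school', 'year'])
print(df.shape)  # (5, 3)
```

Both versions should produce the same table; the difference is only in how many intermediate DataFrames are created along the way.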
Question 1.
Task: Using Selenium and a single for loop, scrape the table on the EIA page into a pandas DataFrame.
Hints
- Start by scraping the body rows (tr) for cell values (td)
- You can build XPaths with an f-string inside a loop.
Example idea (XPath f-string):
for i in range(1, 10):
    xpath = f'//*[@id="main-content"]//table/tbody/tr[{i}]/td[1]'
    print(xpath)

Answer:
# TODO: find out the number of rows (<tr>) in the body table (<tbody>)
nrows =
df = pd.DataFrame()
for i in range(1, nrows + 1):
    # TODO: scrape each cell's text in a single row
    mon_yr =
    retail_price =
    refining =
    distribution_marketing =
    taxes =
    crude_oil =
    obs_lst = [mon_yr, retail_price, refining, distribution_marketing, taxes, crude_oil]
    obs = pd.DataFrame([obs_lst])
    df = pd.concat([df, obs], ignore_index = True)

Question 2. Scrape with a nested for loop
Task: Do the same scraping task as Question 1, but this time use a nested for loop:
- Outer loop: iterate over rows (tr)
- Inner loop: iterate over columns (td)
  - In the inner loop, append value to the data list.
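Before touching the browser, it can help to check that a nested loop visits the cells in the intended order. The sketch below only builds XPath strings, reusing the pattern from the Question 1 hint, and collects them row by row; it does not call Selenium. The values of nrows and ncols are placeholders chosen for illustration, the name all_rows is introduced here, and the XPath pattern may need adjusting for the actual page.

```python
nrows, ncols = 2, 3  # placeholder sizes, not the real table dimensions

all_rows = []
for i in range(1, nrows + 1):        # outer loop: row positions
    data = []                        # fresh list for each row
    for j in range(1, ncols + 1):    # inner loop: column positions
        xpath = f'//*[@id="main-content"]//table/tbody/tr[{i}]/td[{j}]'
        data.append(xpath)
    all_rows.append(data)

for data in all_rows:
    print(data)
```

Each printed list corresponds to one table row, with the td index increasing from left to right.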
Answer:
# TODO: find out the number of rows (<tr>) and the number of columns (<td>) in each row in the body table (<tbody>)
nrows =
ncols =
df = pd.DataFrame()
for i in range(1, nrows + 1):
    data = [] # creating an empty list for one row
    for j in range(1, ncols + 1): # Iterate over column positions
        # TODO: scrape each cell's text in a single row
        value =
        # TODO: append value to the data list
    obs = pd.DataFrame([data])
    df = pd.concat([df, obs], ignore_index=True)

Discussion
Welcome to our Classwork 4 Discussion Board!
This space is designed for you to engage with your classmates about the material covered in Classwork 4.
Whether you are looking to delve deeper into the content, share insights, or have questions about the content, this is the perfect place for you.
If you have any specific questions for Byeong-Hak (@bcdanl) regarding the Classwork 4 materials or need clarification on any points, don't hesitate to ask here.
All comments will be stored here.
Let's collaborate and learn from each other!