Classwork 7

Scraping Stock Data with Python selenium + pd.read_html()

Author

Byeong-Hak Choe

Published

March 4, 2026

Modified

March 6, 2026


Setup

The code below sets up the web scraping environment with Python selenium:

import pandas as pd
import numpy as np
import os, time, random
from io import StringIO

# Import the necessary modules from the Selenium library
from selenium import webdriver  # Main module to control the browser
from selenium.webdriver.common.by import By  # Helps locate elements on the webpage
from selenium.webdriver.chrome.options import Options  # Allows setting browser options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import TimeoutException
from selenium.common.exceptions import StaleElementReferenceException

# Set the working directory path
wd_path = 'ABSOLUTE_PATHNAME_OF_YOUR_WORKING_DIRECTORY' # e.g., '/Users/bchoe/Documents/DANL-210'
os.chdir(wd_path)  # Change the current working directory to wd_path
os.getcwd()  # Retrieve and return the current working directory

# Create an instance of Chrome options
options = Options()

# Initialize the Chrome WebDriver with the specified options
driver = webdriver.Chrome(options=options)


Goal

  • From Yahoo! Finance, collect historical price data for AAPL, MSFT, and NVDA using a for-loop, selenium, and pd.read_html().
    • Combine results into one DataFrame.
    • Add a Symbol column (e.g., AAPL, MSFT, NVDA) so each row keeps its ticker label.
    • Use the date range 2023-01-01 to 2026-03-01.
  • Export the final dataset as: data/stocks.csv

# Example: Yahoo Finance "Historical Data" page (MSFT)
url = "https://finance.yahoo.com/quote/MSFT/history/?p=MSFT&period1=1672531200&period2=1772323200"

  • In Yahoo Finance URLs, period1 and period2 are Unix timestamps (seconds since 1970-01-01):
    • 1672531200 → 2023-01-01
    • 1772323200 → 2026-03-01
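As a sanity check, the two timestamps in the example URL can be reproduced with pandas. This is a small sketch; `to_unix` is a hypothetical helper, not part of the assignment code:

```python
import pandas as pd

# Convert a calendar date to the Unix timestamp Yahoo Finance expects.
# tz="UTC" anchors the date at midnight UTC, matching the URL values.
def to_unix(date_str):
    return int(pd.Timestamp(date_str, tz="UTC").timestamp())

period1 = to_unix("2023-01-01")  # 1672531200
period2 = to_unix("2026-03-01")  # 1772323200
```

Going the other way, `pd.to_datetime(1672531200, unit="s")` recovers the calendar date from a timestamp.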
Warning

Your DataFrame may contain dividend entries, i.e., rows that record a dividend payout rather than daily trading prices.

To separate these dividend entries from the actual stock trading data, we use the str.contains() method:

# Filter rows where the 'Open' column contains the word 'Dividend' (these represent dividend entries)
df_dividend = df_all[df_all['Open'].str.contains('Dividend', na=False)]

# Filter out dividend rows to keep only stock price data
df_stock = df_all[~df_all['Open'].str.contains('Dividend', na=False)]
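To see what the filter does, here is a toy DataFrame with hypothetical values standing in for the scraped table (the real `df_all` comes from `pd.read_html()`):

```python
import pandas as pd

# Hypothetical mini-table: one dividend row mixed in with two price rows
df_all = pd.DataFrame({
    "Date": ["Feb 13, 2025", "Feb 12, 2025", "Feb 11, 2025"],
    "Open": ["241.25", "0.25 Dividend", "228.20"],
})

# Rows whose 'Open' string contains 'Dividend' are dividend entries
df_dividend = df_all[df_all["Open"].str.contains("Dividend", na=False)]

# The ~ operator negates the mask, keeping only the price rows
df_stock = df_all[~df_all["Open"].str.contains("Dividend", na=False)]
```

`na=False` makes missing values count as non-matches, so `NaN` entries do not break the boolean mask.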
# %%
# =============================================================================
# Setup
# =============================================================================
import pandas as pd
import os, time, random
from io import StringIO

# Import the necessary modules from the Selenium library
from selenium import webdriver  # Main module to control the browser
from selenium.webdriver.common.by import By  # Helps locate elements on the webpage
from selenium.webdriver.chrome.options import Options  # Allows setting browser options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import TimeoutException
from selenium.common.exceptions import StaleElementReferenceException

# Set the working directory path
wd_path = '/Users/bchoe/Documents/DANL-210'  # absolute path of your working directory
os.chdir(wd_path)  # Change the current working directory to wd_path
os.getcwd()  # Retrieve and return the current working directory

# Create an instance of Chrome options
options = Options()

# Initialize the Chrome WebDriver with the specified options
driver = webdriver.Chrome(options=options)



# %%
# =============================================================================
# for-loop
# =============================================================================

lst = ['AAPL', 'MSFT', 'NVDA']

# Example: Yahoo Finance "Historical Data" page (MSFT)
# url = "https://finance.yahoo.com/quote/MSFT/history/?p=MSFT&period1=1672531200&period2=1772323200"

dfs = pd.DataFrame()
for company in lst:
    
    url = f"https://finance.yahoo.com/quote/{company}/history/?p={company}&period1=1672531200&period2=1772323200"
    driver.get(url)
    time.sleep(random.uniform(4, 8))
    
    # Extract the <table> HTML element
    table_html = driver.find_element(By.TAG_NAME, 'table').get_attribute("outerHTML")
    
    # Parse the HTML table into a pandas DataFrame
    df = pd.read_html( StringIO(table_html) )[0]
    
    # Add a 'company' column to `df`, assigning the current loop's `company` value:
    df['company'] = company
    
    dfs = pd.concat([dfs, df], ignore_index=True)
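A possible way to finish (a sketch, not the official solution): filter out dividend rows, rename the ticker column to `Symbol` as the goal requires, and export to `data/stocks.csv`. The mini `dfs` below is hypothetical stand-in data so the steps run end to end; in the classwork itself these steps would operate on the `dfs` built by the loop above.

```python
import os
import pandas as pd

# Hypothetical stand-in for the `dfs` produced by the for-loop
dfs = pd.DataFrame({
    "Date": ["Feb 13, 2025", "Feb 12, 2025"],
    "Open": ["241.25", "0.25 Dividend"],
    "company": ["AAPL", "AAPL"],
})

# Keep only price rows, then rename the ticker column to 'Symbol'
df_stock = dfs[~dfs["Open"].astype(str).str.contains("Dividend", na=False)]
df_stock = df_stock.rename(columns={"company": "Symbol"})

# Export the final dataset
os.makedirs("data", exist_ok=True)  # create the folder if it does not exist
df_stock.to_csv("data/stocks.csv", index=False)
```

After scraping, remember to call `driver.quit()` to close the browser session.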



Discussion

Welcome to our Classwork 7 Discussion Board! 👋

This space is designed for you to engage with your classmates about the material covered in Classwork 7.

Whether you are looking to delve deeper into the material, share insights, or ask questions about the content, this is the perfect place for you.

If you have any specific questions for Byeong-Hak (@bcdanl) regarding the Classwork 7 materials or need clarification on any points, don't hesitate to ask here.

All comments will be stored here.

Let's collaborate and learn from each other!
