Data Collection II: Web-Scraping Primer; Scraping Data with Selenium
February 13, 2026
“If you can see things in your web browser, you can scrape them.”
Warning
Legal ≠ ethical. Even if data is “public,” ToS, privacy expectations, and platform blocks still matter.
When we type a URL starting with https://, the URL has several parts:
- https → scheme (protocol)
- example.com → host (domain)
- /products → path
- ?id=...&cat=... → query string ← common in data pages
- #section → fragment ← in-page reference

HTML (HyperText Markup Language) is the markup that defines the structure of a web page (headings, paragraphs, links, tables, etc.).
When you “scrape,” you usually load a page, inspect its HTML structure, and extract the elements that hold the data you need.
If you don’t understand HTML, you can’t reliably target the right data.
Selenium is not “magic”—it automates a browser, but you still need to understand the page’s HTML: its attributes (id / class, href, data-*) and its common tags:
- <h1> ... </h1> — heading
- <p> ... </p> — paragraph
- <a href="..."> ... </a> — link
- <table> ... </table> — table
- <div> ... </div> — block-level container
- <span> ... </span> — inline container

<div> and <span>:
- <div> – block-level container: “Seoul is the capital city of South Korea…”
- <span> – inline container: “My mother has blue eyes…”
selenium
WebDriver is a wire protocol that defines a language-neutral interface for controlling the behavior of web browsers.
The purpose of WebDriver is to control the behavior of web browsers programmatically, allowing automated interactions such as clicking buttons, filling in forms, and navigating between pages.
Selenium WebDriver refers to both the language bindings and the implementations of browser-controlling code.
Install Selenium with pip:

pip install selenium

We import (1) webdriver from selenium and (2) the By and Options classes. webdriver.Chrome() opens a Chrome browser that is being controlled by automated test software, Selenium.

import pandas as pd
import os, time, random
from io import StringIO
# Import the necessary modules from the Selenium library
from selenium import webdriver # Main module to control the browser
from selenium.webdriver.common.by import By # Helps locate elements on the webpage
from selenium.webdriver.chrome.options import Options # Allows setting browser options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import TimeoutException
from selenium.common.exceptions import StaleElementReferenceException
# Create an instance of Chrome options
options = Options()
# Initialize the Chrome WebDriver with the specified options
driver = webdriver.Chrome(options=options)

The get() Method in WebDriver

get(url) from webdriver opens the specified URL in a web browser. When you run webdriver in Google Chrome, you may see the message: “Chrome is being controlled by automated test software.”
form_url = "https://qavbox.github.io/demo/webtable/"
driver.get(form_url)
driver.close()
driver.quit()

close() terminates the current browser window; quit() completely exits the webdriver session, closing all browser windows.

Locating elements with find_element(): in Chrome DevTools’ Elements panel, hover over the DOM structure to locate the desired element. Note the tag (e.g., <input>, <button>, <div>) used for the element and the attributes (e.g., id, class, name) that define it.

find_element() & find_elements()

find_element(By.ID, "id")
find_element(By.CLASS_NAME, "class name")
find_element(By.NAME, "name")
find_element(By.CSS_SELECTOR, "css selector")
find_element(By.TAG_NAME, "tag name")
find_element(By.LINK_TEXT, "link text")
find_element(By.PARTIAL_LINK_TEXT, "partial link text")
find_element(By.XPATH, "xpath")

Selenium provides the find_element() method to locate elements in a page.
To find multiple elements (these methods will return a list):
find_elements()

Examples on the demo page:
- find_element(By.ID, "") & find_elements(By.ID, "") — e.g., the form with id form1
- find_element(By.CLASS_NAME, "") & find_elements(By.CLASS_NAME, "") — e.g., the button with the homebtn class
- find_element(By.NAME, "") & find_elements(By.NAME, "")
- find_element(By.CSS_SELECTOR, "") & find_elements(By.CSS_SELECTOR, "")
- find_element(By.TAG_NAME, "")
- find_element(By.LINK_TEXT, "") & find_element(By.PARTIAL_LINK_TEXT, "")
- find_element(By.XPATH, "…") & find_elements(By.XPATH, "…")
find_element(...) returns one matching element (the first match); find_elements(...) returns a list of all matching elements.

XPath syntax, //tag_name[@attribute='value']:
- // → search anywhere in the document
- tag_name → HTML tag name (input, div, span, table, etc.)
- @attribute → attribute name (id, class, aria-label, role, data-*, etc.)
- 'value' → the attribute’s value (quoted)

When you right-click an element in Chrome DevTools → Copy, you often see two options: Copy XPath (//...) and Copy Full XPath (/html/body/...). In practice: prefer XPath (the shorter one) over Full XPath when possible.
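The //tag[@attribute='value'] idea can be tried without a browser. The sketch below uses made-up HTML and Python’s standard xml.etree module (which supports a small XPath subset), not Selenium itself:

```python
# Trying the //tag[@attribute='value'] pattern on toy HTML (assumed, not a real page).
import xml.etree.ElementTree as ET

html = """
<body>
  <input id="user" name="username" />
  <input id="pass" name="password" />
  <div class="homebtn">Home</div>
</body>
"""

root = ET.fromstring(html)

# analogous to find_element(By.XPATH, "//input[@id='pass']")
pwd = root.find(".//input[@id='pass']")
print(pwd.get("name"))                # password

# analogous to find_elements(By.TAG_NAME, "input")
print(len(root.findall(".//input")))  # 2
```

The same predicate syntax works in Selenium’s By.XPATH locators, where the browser’s full XPath engine is available.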
Suppose there are multiple <table> elements on a page, but the tables have no unique id or class. find_element(By.TAG_NAME, "table") is too vague because it returns only the first table, and attribute-based locators (By.ID, By.CLASS_NAME, etc.) don’t work. In that case, collect all matches with find_elements() (or an XPath) and index into the resulting list.

Let’s do Classwork 4!
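The multiple-tables-without-ids situation can be simulated without a browser. This is a sketch on made-up HTML using the standard library; driver.find_elements(By.TAG_NAME, "table") behaves analogously, returning a list you can index:

```python
# Two <table> elements with no id/class -- toy HTML for illustration only.
import xml.etree.ElementTree as ET

html = """
<body>
  <table><tr><td>first table</td></tr></table>
  <table><tr><td>second table</td></tr></table>
</body>
"""

root = ET.fromstring(html)
tables = root.findall(".//table")   # like driver.find_elements(By.TAG_NAME, "table")
print(len(tables))                  # 2
second = tables[1]                  # index into the list to reach the table you want
print(second.find(".//td").text)    # second table
```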
get_attribute()

get_attribute() extracts an element’s attribute value.

NoSuchElementException and try-except blocks

If find_element() cannot locate an element, Selenium raises NoSuchElementException.
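The shape of the defensive pattern is sketched below. The lookup function is a hypothetical stand-in, but the try/except is exactly what you would wrap around driver.find_element(...):

```python
# Stand-in for selenium.common.exceptions.NoSuchElementException (assumed name here).
class NoSuchElementException(Exception):
    pass

def find_element_stub(present):
    """Hypothetical lookup: raises like find_element() when the element is absent."""
    if not present:
        raise NoSuchElementException("no such element")
    return "<WebElement>"

def safe_find(present):
    try:
        return find_element_stub(present)
    except NoSuchElementException:
        return None  # swallow the error so the scraper keeps running

print(safe_find(True))   # <WebElement>
print(safe_find(False))  # None
```

Returning None (or a sentinel) lets the calling loop skip missing elements instead of crashing mid-scrape.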
try-except can be used to avoid the termination of the Selenium code.

Two waiting strategies:
- time.sleep(random.uniform(a, b)): a small human-like delay between actions/pages — use for politeness (respect servers).
- WebDriverWait() + a condition (presence/clickable) — use for robustness (wait for conditions).
Best practice: Use both—WebDriverWait for robustness, and small randomized sleeps for politeness.
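Conceptually, WebDriverWait is a polling loop: it re-checks a condition until the condition returns something truthy or a timeout expires. A simplified, library-free sketch (not Selenium’s actual implementation):

```python
import time

def wait_until(condition, timeout=10.0, poll=0.5):
    """Poll condition() until it returns a truthy value; raise TimeoutError otherwise."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(poll)  # wait a bit before re-checking
    raise TimeoutError(f"condition not met within {timeout} seconds")

# Demo: a fake "page" whose element appears on the third poll.
state = {"calls": 0}
def element_present():
    state["calls"] += 1
    return "element" if state["calls"] >= 3 else None

print(wait_until(element_present, timeout=5, poll=0.01))  # element
```

Selenium’s version returns the located WebElement, so you can chain the call directly, as in the examples below.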
time.sleep(random.uniform())

import time, random
# Example: polite delay between actions/pages
time.sleep(random.uniform(0.5, 1.5)) # small jitter (adjust as needed)

Why time.sleep() Alone Is Not Robust

import time
url = "https://qavbox.github.io/demo/delay/"
driver.get(url)
driver.find_element(By.XPATH, '//*[@id="one"]/input').click()
time.sleep(2) # blind wait: always 2 seconds
element = driver.find_element(By.XPATH, '//*[@id="two"]')
element.text

time.sleep() is a blind wait: it always pauses for the full duration even if the element appeared sooner, and it still fails if the element takes longer than the fixed delay.
WebDriverWait() + expected_conditions

driver.get("https://qavbox.github.io/demo/delay/")
driver.find_element(By.XPATH, '//*[@id="one"]/input').click()
try:
element = WebDriverWait(driver, 10).until(
EC.presence_of_element_located((By.XPATH, '//*[@id="two"]'))
)
print(element.text)
except TimeoutException:
print("Timed out: element did not appear within 10 seconds.")

WebDriverWait() + expected_conditions: waiting until clickable

btn = WebDriverWait(driver, 10).until(
EC.element_to_be_clickable((By.XPATH, '//*[@id="one"]/input'))
)
btn.click()

pd.read_html() for Table Scraping

Yahoo! Finance has probably renewed its database system, so yfinance does not work now.
Yahoo! Finance uses a web table to display historical data about a company’s stock.
Let’s use Selenium with pd.read_html() to collect stock price data!
pd.read_html() for Yahoo! Finance Data

# Load content page
url = 'https://finance.yahoo.com/quote/MSFT/history/?p=MSFT&period1=1672531200&period2=1772323200'
driver.get(url)
time.sleep(random.uniform(3, 5)) # wait for table to load

The period1 and period2 values in Yahoo! Finance URLs are Unix timestamps (the number of seconds since January 1, 1970), which mark the start and end of the requested date range.
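You can compute these timestamps instead of hard-coding them. A small helper with illustrative dates (note that 2023-01-01 produces 1672531200, the period1 value in the URL above):

```python
# Convert calendar dates (UTC) to the Unix timestamps used in period1/period2.
from datetime import datetime, timezone

def to_unix(year, month, day):
    """Seconds since 1970-01-01 00:00:00 UTC."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())

period1 = to_unix(2023, 1, 1)
period2 = to_unix(2024, 1, 1)
print(period1)  # 1672531200
print(period2)  # 1704067200
```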
get_attribute("outerHTML")

# Extract the <table> HTML element
table_html = driver.find_element(By.TAG_NAME, 'table').get_attribute("outerHTML")
# Parse the HTML table into a pandas DataFrame
df = pd.read_html(StringIO(table_html))[0]

StringIO turns that string into a file-like object, which is what pd.read_html() expects in recent pandas versions (passing a raw HTML string is deprecated).
.get_attribute("outerHTML"): gets the entire HTML from the WebElement.
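The same outerHTML-to-DataFrame pattern can be tried on an inline table without driving a browser. The numbers below are made up, purely for illustration:

```python
# Sketch: feed an HTML table string to pd.read_html via StringIO.
from io import StringIO
import pandas as pd

table_html = """
<table>
  <tr><th>Date</th><th>Close</th></tr>
  <tr><td>2024-01-02</td><td>370.87</td></tr>
  <tr><td>2024-01-03</td><td>367.94</td></tr>
</table>
"""

df = pd.read_html(StringIO(table_html))[0]  # read_html returns a LIST of DataFrames
print(df.shape)          # (2, 2)
print(list(df.columns))  # ['Date', 'Close']
```

The [0] matters: pd.read_html() parses every table in the HTML and returns them as a list, even when there is only one.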