Web Scraping with Python Selenium
March 14, 2025
selenium
WebDriver is a wire protocol that defines a language-neutral interface for controlling the behavior of web browsers.
The purpose of WebDriver is to control the behavior of web browsers programmatically, allowing automated interactions such as navigating to pages, clicking buttons, filling in forms, and extracting content.
Selenium WebDriver refers to both the language bindings and the implementations of browser-controlling code.
Install the `selenium` package with `pip`:

```
pip install selenium
```
webdriver.Chrome()

Import (1) the `webdriver` module from `selenium`, and (2) the `By` and `Options` classes. `webdriver.Chrome()` opens a Chrome browser that is controlled by automated test software (Selenium).

```python
# Import the necessary modules from the Selenium package
from selenium import webdriver # Main module to control the browser
from selenium.webdriver.common.by import By # Helps locate elements on the webpage
from selenium.webdriver.chrome.options import Options # Allows setting browser options
# Create an instance of Chrome options
options = Options()
options.add_argument("window-size=1400,1200") # Set the browser window size to 1400x1200
# Initialize the Chrome WebDriver with the specified options
driver = webdriver.Chrome(options=options)
# Now you can use 'driver' to control the Chrome browser
```

get() Method in WebDriver

The `get(url)` method from `webdriver` opens the specified URL in a web browser. When `webdriver` opens Google Chrome, you may see the message "Chrome is being controlled by automated test software."

```python
form_url = "https://qavbox.github.io/demo/webtable/"
driver.get(form_url)
```
```python
driver.close()  # terminates the current browser window
driver.quit()   # completely exits the webdriver session
```

`close()` terminates the current browser window. `quit()` completely exits the `webdriver` session, closing all browser windows.

find_element()
In the browser DevTools Elements panel, hover over the DOM structure to locate the desired element. `find_element()` requires (1) the HTML tag (e.g., `<input>`, `<button>`, `<div>`) used for the element, and (2) the attributes (e.g., `id`, `class`, `name`) that define the element.

find_element() & find_elements()

Selenium provides the `find_element()` method to locate elements in a page:

```python
find_element(By.ID, "id")
find_element(By.CLASS_NAME, "class name")
find_element(By.NAME, "name")
find_element(By.CSS_SELECTOR, "css selector")
find_element(By.TAG_NAME, "tag name")
find_element(By.LINK_TEXT, "link text")
find_element(By.PARTIAL_LINK_TEXT, "partial link text")
find_element(By.XPATH, "xpath")
```

To find multiple elements, use the corresponding `find_elements()` methods; these return a list of matching elements.
find_element(By.ID, "") & find_elements(By.ID, "")

For example, locate the element with id `form1`, as in the sketch below.
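A minimal sketch on the demo page (only the `form1` id comes from the slides; the printed fields are just for illustration):

```python
element = driver.find_element(By.ID, "form1")    # first element with id="form1"
elements = driver.find_elements(By.ID, "form1")  # list of all matches
print(element.tag_name, len(elements))
```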
find_element(By.CLASS_NAME, "") & find_elements(By.CLASS_NAME, "")

For example, locate the elements with the `homebtn` class, as in the sketch below.
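A similar sketch for the class locator (`homebtn` comes from the demo page; the loop is illustrative):

```python
buttons = driver.find_elements(By.CLASS_NAME, "homebtn")  # all elements with class="homebtn"
for b in buttons:
    print(b.text)
```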
find_element(By.NAME, "")

Locates elements by their `name` attribute.

find_element(By.CSS_SELECTOR, "") & find_elements(By.CSS_SELECTOR, "")

Locate elements using a CSS selector.
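A hedged sketch reusing the id and class from above (`#` selects by id and `.` selects by class in standard CSS syntax):

```python
form = driver.find_element(By.CSS_SELECTOR, "#form1")        # by id
buttons = driver.find_elements(By.CSS_SELECTOR, ".homebtn")  # by class
```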
find_element(By.TAG_NAME, "")

Locates elements by their HTML tag name (e.g., `table`, `a`).

find_element(By.LINK_TEXT, "") & find_element(By.PARTIAL_LINK_TEXT, "")

Locate links by their exact, or partial, visible text.

find_element(By.XPATH, "") & find_elements(By.XPATH, "")

An XPath expression takes the general form `//tag_name[@attribute="value"]`:
- `//` → Selects elements anywhere in the document.
- `tag_name` → HTML tag (`input`, `div`, `span`, etc.).
- `@attribute` → Attribute name (`id`, `class`, `aria-label`, etc.).
- `value` → Attribute’s assigned value.

Web tables often consist of `<tr>` (rows) and `<th>` (headers) without an easily identifiable ID or class. `find_element(By.TAG_NAME, "")` is not reliable because a page contains multiple `<tr>` and `<th>` tags, so XPath helps when the simpler locators (`By.ID`, `By.CLASS_NAME`, etc.) don’t work.
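A hedged sketch of an XPath locator for table rows (it assumes a standard `<table>`/`<tr>` structure on the page):

```python
# Select every row of every table anywhere in the document
rows = driver.find_elements(By.XPATH, "//table//tr")
print(len(rows))
```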
get_attribute()

`get_attribute()` extracts an element’s attribute value.
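For instance, a minimal sketch reading an attribute from the first link on a page (which attributes exist depends on the page):

```python
link = driver.find_element(By.TAG_NAME, "a")
print(link.get_attribute("href"))  # the link's target URL
```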
Let’s do Classwork 9!
NoSuchElementException and WebDriverWait

NoSuchElementException and try-except blocks

```python
from selenium.common.exceptions import NoSuchElementException
try:
    elem = driver.find_element(By.XPATH, "element_xpath")
    elem.click()
except NoSuchElementException:
    pass
```

If the element is not found, Selenium raises a `NoSuchElementException`. A `try-except` block can be used to avoid the termination of the Selenium code.

time.sleep()

```python
import time
# example webpage
url = "https://qavbox.github.io/demo/delay/"
driver.get(url)
driver.find_element(By.XPATH, '//*[@id="one"]/input').click()
time.sleep(5)
element = driver.find_element(By.XPATH, '//*[@id="two"]')
element.text
```

The `time.sleep()` method is an explicit wait whose condition is an exact time period to wait. In general, a more efficient solution than `time.sleep()` is to make the WebDriver wait only as long as required.
implicitly_wait()

```python
driver.find_element(By.XPATH, '//*[@id="oneMore"]/input[1]').click()
driver.implicitly_wait(10) # Wait up to 10 seconds for elements to appear
element2 = driver.find_element(By.ID, 'delay')
element2.text
```

`implicitly_wait()` directs the `webdriver` to wait for a certain amount of time before throwing an exception: the `webdriver` will wait for the element before the exception occurs.

WebDriverWait and EC

`WebDriverWait`, combined with expected conditions (`EC`), waits until a specific condition is met, as in the sketch below:

- `presence_of_element_located`: waits until the element is present in the DOM.
- `visibility_of_element_located`: waits until the element is present and visible on the page.
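A minimal sketch reusing the delay demo page from above (the `delay` id comes from the earlier example; the 10-second timeout is an arbitrary choice):

```python
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the element to become visible
wait = WebDriverWait(driver, 10)
element = wait.until(EC.visibility_of_element_located((By.ID, "delay")))
element.text
```

Unlike `time.sleep()`, the wait returns as soon as the condition is met.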
pd.read_html() for Table Scraping

Yahoo! Finance has probably renewed its database system, so `yfinance` does not work now. Yahoo! Finance uses a web table to display historical data about a company’s stock. Let’s use Selenium with `pd.read_html()` to collect stock price data!
pd.read_html() for Yahoo! Finance Data

```python
import pandas as pd
import os, time, random
from selenium import webdriver
from selenium.webdriver.common.by import By
from io import StringIO
# Set working directory
os.chdir('/Users/bchoe/.../lecture-code/')
# Launch Chrome browser
driver = webdriver.Chrome()
```

`StringIO` turns the HTML string into a file-like object, which is what `pd.read_html()` expects going forward (pandas is deprecating raw-string input).
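A tiny self-contained sketch of the idea (the HTML snippet is made up for illustration):

```python
html = "<table><tr><th>A</th></tr><tr><td>1</td></tr></table>"
demo_df = pd.read_html(StringIO(html))[0]  # wrap the raw string in StringIO first
print(demo_df)
```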
```python
# Load content page
url = 'https://finance.yahoo.com/quote/MSFT/history/?p=MSFT&period1=1672531200&period2=1743379200'
driver.get(url)
time.sleep(random.uniform(3, 5)) # wait for table to load
```

The `period1` and `period2` values in Yahoo! Finance URLs are Unix timestamps (the number of seconds since January 1, 1970).
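A hedged sketch of how such timestamps can be computed (the two dates are the ones implied by the URL above; `calendar.timegm()` interprets them as UTC):

```python
import calendar
from datetime import datetime

period1 = calendar.timegm(datetime(2023, 1, 1).timetuple())   # 1672531200
period2 = calendar.timegm(datetime(2025, 3, 31).timetuple())  # 1743379200
```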
```python
# Extract the <table> HTML element
table_html = driver.find_element(By.TAG_NAME, 'table').get_attribute("outerHTML")
# Parse the HTML table into a pandas DataFrame
df = pd.read_html(StringIO(table_html))[0]
.get_attribute("outerHTML")
: gets the entire HTML from the WebElement.
pd.read_html()
parses HTML tables from a given URL or from raw HTML content, and returns a list of DataFrames.
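A short follow-up sketch (the column names in `df` depend on Yahoo! Finance's current table layout, so inspect them before relying on them; the output file name is hypothetical):

```python
print(df.head())                            # inspect the scraped rows
df.to_csv("msft_history.csv", index=False)  # save for later use
driver.quit()                               # end the webdriver session
```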
Let’s do Classwork 10!