Web Scraping with Python Selenium
March 14, 2025
WebDriver is a wire protocol that defines a language-neutral interface for controlling the behavior of web browsers.
The purpose of WebDriver is to control the behavior of web browsers programmatically, allowing automated interactions such as opening pages, clicking elements, and reading content.
Selenium WebDriver refers to both the language bindings and the implementations of the browser-controlling code.
Install Selenium with pip:
pip install selenium
We will use (1) webdriver from selenium and (2) the By and Options classes.
webdriver.Chrome() opens a Chrome browser that is controlled by automated test software (Selenium).
# Import the necessary modules from the Selenium package
from selenium import webdriver # Main module to control the browser
from selenium.webdriver.common.by import By # Helps locate elements on the webpage
from selenium.webdriver.chrome.options import Options # Allows setting browser options
# Create an instance of Chrome options
options = Options()
options.add_argument("window-size=1400,1200") # Set the browser window size to 1400x1200
# Initialize the Chrome WebDriver with the specified options
driver = webdriver.Chrome(options=options) # Correct implementation
# Now you can use 'driver' to control the Chrome browser
get() Method in WebDriver
get(url) from webdriver opens the specified URL in a web browser.
When running webdriver in Google Chrome, you may see the message: "Chrome is being controlled by automated test software."
form_url = "https://qavbox.github.io/demo/webtable/"
driver.get(form_url)
driver.close()
driver.quit()
close() terminates the current browser window.
quit() completely exits the WebDriver session, closing all browser windows.
find_element()
In the browser's developer tools, open the Elements panel and hover over the DOM structure to locate the desired element.
Note the HTML tag (<input>, <button>, <div>) used for the element.
Note the attributes (id, class, name) that define the element.
find_element() & find_elements()
find_element(By.ID, "id")
find_element(By.CLASS_NAME, "class name")
find_element(By.NAME, "name")
find_element(By.CSS_SELECTOR, "css selector")
find_element(By.TAG_NAME, "tag name")
find_element(By.LINK_TEXT, "link text")
find_element(By.PARTIAL_LINK_TEXT, "partial link text")
find_element(By.XPATH, "xpath")
Selenium provides the find_element() method to locate elements in a page.
To find multiple elements (these methods will return a list):
find_elements()
find_element(By.ID, "") & find_elements(By.ID, ""):
Example: the demo page has an element with the id form1.
find_element(By.CLASS_NAME, "") & find_elements(By.CLASS_NAME, ""):
Example: the demo page has elements with the homebtn class.
find_element(By.NAME, "")
find_element(By.CSS_SELECTOR, "") & find_elements(By.CSS_SELECTOR, ""):
find_element(By.TAG_NAME, "")
find_element(By.LINK_TEXT, "")
find_element(By.PARTIAL_LINK_TEXT, "")
find_element(By.XPATH, "") & find_elements(By.XPATH, ""):
XPath syntax: //tag_name[@attribute="value"]
// → selects elements anywhere in the document.
tag_name → the HTML tag (input, div, span, etc.).
@attribute → the attribute name (id, class, aria-label, etc.).
value → the attribute's assigned value.
Some tables are built from <tr> (rows) and <th> (headers) without an easily identifiable ID or class.
There, find_element(By.TAG_NAME, "") is not reliable because the page contains multiple <tr> and <th> tags, and the simpler locators (By.ID, By.CLASS_NAME, etc.) don't work.
get_attribute()
get_attribute() extracts an element's attribute value.
Let's do Classwork 9!
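The //tag_name[@attribute="value"] pattern can be practiced without a browser. A minimal sketch using the standard library's xml.etree.ElementTree, which supports a subset of XPath, on a made-up static snippet (the markup and ids below are illustrative, not from the demo page); the comments show the equivalent Selenium calls:

```python
import xml.etree.ElementTree as ET

# A static snippet standing in for a loaded page (hypothetical markup).
html = """
<form>
  <div id="one"><input id="search" class="box" value="first"/></div>
  <div id="two"><input class="box" value="second"/></div>
</form>
"""

root = ET.fromstring(html)

# Selenium: driver.find_element(By.XPATH, '//input[@id="search"]')
# ElementTree uses a leading '.' where Selenium uses a bare '//':
elem = root.find(".//input[@id='search']")
print(elem.get("value"))  # → first (the attribute's assigned value)

# Selenium: driver.find_elements(By.XPATH, '//input[@class="box"]')
elems = root.findall(".//input[@class='box']")
print(len(elems))  # → 2 (a list of every match)
```

The same [@attribute='value'] predicate narrows the match when many elements share a tag, which is exactly the situation with repeated <tr> and <th> tags.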
NoSuchElementException and WebDriverWait
NoSuchElementException and try-except blocks
from selenium.common.exceptions import NoSuchElementException
try:
elem = driver.find_element(By.XPATH, "element_xpath")
elem.click()
except NoSuchElementException:
    pass
If the element is not found, Selenium raises a NoSuchElementException.
A try-except block can be used to avoid the termination of the Selenium code.
time.sleep()
import time
# example webpage
url = "https://qavbox.github.io/demo/delay/"
driver.get(url)
driver.find_element(By.XPATH, '//*[@id="one"]/input').click()
time.sleep(5)
element = driver.find_element(By.XPATH, '//*[@id="two"]')
element.text
The time.sleep() method is an explicit wait that pauses execution for an exact, fixed time period.
In general, a more efficient solution than time.sleep() is to make WebDriver wait only as long as required.
implicitly_wait()
driver.find_element(By.XPATH, '//*[@id="oneMore"]/input[1]').click()
driver.implicitly_wait(10) # Wait up to 10 seconds for elements to appear
element2 = driver.find_element(By.ID, 'delay')
element2.text
implicitly_wait() directs the WebDriver to wait up to a certain amount of time before throwing an exception.
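Selenium's WebDriverWait, covered next, goes further: it repeatedly polls for a condition instead of sleeping a fixed time. The polling idea can be sketched without a browser (wait_until and the simulated delay below are illustrative helpers, not Selenium APIs); the comments show the real Selenium calls:

```python
import time

def wait_until(condition, timeout=5.0, poll=0.1):
    """Poll `condition` until it returns a truthy value; raise on timeout.

    Mirrors the idea behind WebDriverWait(driver, timeout).until(...),
    which polls the page (every 0.5 s by default) instead of sleeping.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within timeout")
        time.sleep(poll)

# Simulate an element that "appears" after 0.3 seconds.
appear_at = time.monotonic() + 0.3
found = wait_until(lambda: time.monotonic() >= appear_at, timeout=2.0)

# The real Selenium equivalent would be:
# from selenium.webdriver.support.ui import WebDriverWait
# from selenium.webdriver.support import expected_conditions as EC
# element = WebDriverWait(driver, 10).until(
#     EC.presence_of_element_located((By.ID, "delay")))
```

Unlike time.sleep(5), the loop returns as soon as the condition holds, so no time is wasted once the element is ready.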
The WebDriver will wait for the element to appear before the exception occurs.
WebDriverWait and EC
presence_of_element_located: waits until the element is present in the DOM.
visibility_of_element_located: waits until the element is present in the DOM and visible on the page.
pd.read_html() for Table Scraping
Yahoo! Finance has probably renewed its database system, so that yfinance does not work now.
Yahoo! Finance uses a web table to display historical data about a company's stock.
Let’s use Selenium with pd.read_html() to collect stock price data!
pd.read_html() for Yahoo! Finance Data
import pandas as pd
import os, time, random
from selenium import webdriver
from selenium.webdriver.common.by import By
from io import StringIO
# Set working directory
os.chdir('/Users/bchoe/.../lecture-code/')
# Launch Chrome browser
driver = webdriver.Chrome()
StringIO turns a string into a file-like object, which is what pd.read_html() expects moving forward (newer pandas versions warn when raw HTML strings are passed directly).
# Load content page
url = 'https://finance.yahoo.com/quote/MSFT/history/?p=MSFT&period1=1672531200&period2=1743379200'
driver.get(url)
time.sleep(random.uniform(3, 5)) # wait for table to load
The period1 and period2 values in Yahoo Finance URLs are Unix timestamps (the number of seconds since January 1, 1970).
# Extract the <table> HTML element
table_html = driver.find_element(By.TAG_NAME, 'table').get_attribute("outerHTML")
# Parse the HTML table into a pandas DataFrame
df = pd.read_html(StringIO(table_html))[0]
get_attribute("outerHTML"): gets the entire HTML from the WebElement.
pd.read_html() parses HTML tables from a given URL or from raw HTML content, and returns a list of DataFrames.
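Because the Yahoo page layout can change, the parsing step is easiest to understand on a small static table first. A sketch with made-up data (the table contents below are invented, standing in for the outerHTML string obtained from Selenium):

```python
import pandas as pd
from io import StringIO

# A made-up stand-in for table_html = driver.find_element(...).get_attribute("outerHTML")
table_html = """
<table>
  <tr><th>Date</th><th>Close</th></tr>
  <tr><td>2025-03-13</td><td>425.10</td></tr>
  <tr><td>2025-03-14</td><td>428.55</td></tr>
</table>
"""

# pd.read_html returns a list of DataFrames, one per <table> found,
# so [0] selects the first (and here, only) table.
df = pd.read_html(StringIO(table_html))[0]
print(df.shape)          # (2, 2): two data rows, two columns
print(list(df.columns))  # ['Date', 'Close']: <th> cells become the header
```

The <th> row becomes the DataFrame header automatically, which is why the Yahoo table needs no manual column naming.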
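Rather than hard-coding the Unix timestamps, period1 and period2 can be computed with the standard library. A quick check that the example URL's values correspond to midnight UTC on 2023-01-01 and 2025-03-31 (the helper function name is illustrative):

```python
from datetime import datetime, timezone

def to_unix(year, month, day):
    """Seconds since January 1, 1970 (UTC) for midnight on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())

period1 = to_unix(2023, 1, 1)   # start of the date range
period2 = to_unix(2025, 3, 31)  # end of the date range

url = (f"https://finance.yahoo.com/quote/MSFT/history/"
       f"?p=MSFT&period1={period1}&period2={period2}")
print(period1, period2)  # → 1672531200 1743379200
```

These match the period1/period2 values in the URL used above, so changing the date range is just a matter of changing the arguments.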
Let's do Classwork 10!