Lecture 16

Web scraping with Python Selenium

Byeong-Hak Choe

SUNY Geneseo

March 14, 2025

WebDriver

  • WebDriver is a wire protocol that defines a language-neutral interface for controlling the behavior of web browsers.

  • The purpose of WebDriver is to control the behavior of web browsers programmatically, allowing automated interactions such as:

    • Extracting webpage content
    • Opening a webpage
    • Clicking buttons
    • Filling out forms
    • Running automated tests on web applications
  • Selenium WebDriver refers to both the language bindings and the implementations of browser-controlling code.

Driver

  • Web browsers (e.g., Chrome, Firefox, Edge) do not natively understand Selenium WebDriver commands.
    • To bridge this gap, each browser requires its own WebDriver implementation, called a driver.
  • The driver acts as a middleman, handling communication between Selenium WebDriver and the actual browser.
  • Different browsers have specific drivers:
    • ChromeDriver for Chrome
    • GeckoDriver for Firefox

Interaction Diagram

  • A simplified diagram of how WebDriver interacts with the browser: Selenium WebDriver ⇄ driver ⇄ browser.

  • WebDriver interacts with the browser via the driver in a two-way communication process:
    1. Sends commands (e.g., open a page, click a button) to the browser.
    2. Receives responses from the browser.
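
The two-way exchange above can be sketched as plain data. This is a minimal illustration, loosely following the W3C WebDriver endpoints for "navigate to" and "find element"; the driver address, the session id, and the helper name build_command are assumptions made up for this sketch, and no request is actually sent.

```python
import json

# Hypothetical values, for illustration only
DRIVER_URL = "http://localhost:9515"  # assumed address of a locally running driver
SESSION_ID = "abc123"                 # made-up session id


def build_command(method, path, body=None):
    """Describe one WebDriver command as the HTTP request a client would send."""
    return {
        "method": method,
        "url": f"{DRIVER_URL}/session/{SESSION_ID}{path}",
        "body": json.dumps(body) if body is not None else None,
    }


# 1. Commands sent to the driver: open a page, then look up an element
open_page = build_command("POST", "/url", {"url": "https://qavbox.github.io/demo/webtable/"})
find_form = build_command("POST", "/element", {"using": "css selector", "value": "#form1"})

for command in (open_page, find_form):
    print(command["method"], command["url"], command["body"])

# 2. Responses come back as JSON wrapped in a "value" field (illustrative sample)
sample_response = json.loads('{"value": null}')
print(sample_response["value"])
```

In practice the Selenium language bindings build and send these requests for you, so this layer rarely needs to be touched directly.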

Setting up

  • Install the Chrome or Firefox web browser if you do not have either of them.
    • I will use Chrome.
  • Install Selenium using pip:
    • On the Spyder Console, run the following:
    • pip install selenium
  • “Selenium with Python” is a thorough online reference for the Python bindings.

Setting up - webdriver.Chrome()

  • To begin with, we import (1) webdriver from selenium and (2) the By and Options classes.
    • webdriver.Chrome() opens a Chrome browser window that is controlled by Selenium rather than by the user.
# Import the necessary modules from the Selenium package
from selenium import webdriver  # Main module to control the browser
from selenium.webdriver.common.by import By  # Helps locate elements on the webpage
from selenium.webdriver.chrome.options import Options  # Allows setting browser options

# Create an instance of Chrome options
options = Options()
options.add_argument("window-size=1400,1200")  # Set the browser window size to 1400x1200

# Initialize the Chrome WebDriver with the specified options
driver = webdriver.Chrome(options=options)  # Launches a new Chrome window

# Now you can use 'driver' to control the Chrome browser

get() Method in WebDriver

  • get(url) from webdriver opens the specified URL in a web browser.
  • When using webdriver in Google Chrome, you may see the message:
    • “Chrome is being controlled by automated test software.”
form_url = "https://qavbox.github.io/demo/webtable/"
driver.get(form_url)
driver.close()
driver.quit()
  • close() terminates the current browser window.
  • quit() completely exits the webdriver session, closing every browser window it opened.
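
Because quit() is what releases the browser, it helps to make sure it runs even when a scraping step fails. Here is a minimal sketch using try/finally; the helper name run_and_quit and its two parameters are hypothetical, not part of Selenium.

```python
def run_and_quit(make_driver, task):
    """Start a browser, run task(driver), and always end the session."""
    driver = make_driver()
    try:
        return task(driver)
    finally:
        driver.quit()  # runs even if task() raises


# Intended use with Selenium (not executed here):
#   from selenium import webdriver
#
#   def page_title(driver):
#       driver.get("https://qavbox.github.io/demo/webtable/")
#       return driver.title
#
#   title = run_and_quit(webdriver.Chrome, page_title)
```

Without the finally clause, an exception in the middle of a script can leave a Chrome window and its driver process running in the background.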

Inspecting Web Elements with find_element()

  • Once the Google Chrome window loads with the provided URL, we need to find specific elements to interact with.
    • The easiest way to identify elements is by using Developer Tools to inspect the webpage structure.
  • To inspect an element:
    1. Right-click anywhere on the webpage.
    2. Select the Inspect option from the pop-up menu.
    3. In the Elements panel, hover over the DOM structure to locate the desired element.

Inspecting Web Elements with find_element()

  • When inspecting an element, look for:
    • HTML tag (e.g., <input>, <button>, <div>) used for the element.
    • Attributes (e.g., id, class, name) that define the element.
    • Attribute values that help uniquely identify the element.
    • Page structure to understand how elements are nested within each other.

Locating Web Elements by find_element() & find_elements()

Locating Web Elements by find_element()

  • There are various strategies to locate elements in a page.
find_element(By.ID, "id")
find_element(By.CLASS_NAME, "class name")
find_element(By.NAME, "name")
find_element(By.CSS_SELECTOR, "css selector")
find_element(By.TAG_NAME, "tag name")
find_element(By.LINK_TEXT, "link text")
find_element(By.PARTIAL_LINK_TEXT, "partial link text")
find_element(By.XPATH, "xpath")
  • Selenium provides the find_element() method to locate a single element in a page; it returns the first match.

  • To find multiple elements, use find_elements(), which returns a list of all matches.

find_element(By.ID, "")

  • find_element(By.ID, "") & find_elements(By.ID, ""):
    • Return element(s) that match a given ID attribute value.
  • Example HTML code where an element has an ID attribute form1:
<form id="form1">...</form>
  • Example of locating the form using find_element(By.ID, ""):
form = driver.find_element(By.ID, "form1")
form.text  # Retrieves text content if available

find_element(By.CLASS_NAME, "")

  • find_element(By.CLASS_NAME, "") & find_elements(By.CLASS_NAME, ""):
    • Return element(s) matching a specific class attribute.
  • Example HTML code with a homebtn class:
<div class="homebtn" align="center">...</div>
home_button = driver.find_element(By.CLASS_NAME, "homebtn")
home_button.click()  # Clicks the home button
driver.back()  # Navigates back to the previous page

find_element(By.NAME, "")

  • find_element(By.NAME, "") & find_elements(By.NAME, ""):
    • Return element(s) with a matching name attribute.
  • Example HTML code with a name attribute home:
<input type="button" class="btn" name="home" value="Home" />
home_button2 = driver.find_element(By.NAME, "home")
home_button2.click()
driver.back()

find_element(By.CSS_SELECTOR, "")

  • find_element(By.CSS_SELECTOR, "") & find_elements(By.CSS_SELECTOR, ""):
    • Locate element(s) using a CSS selector.
  • Inspect the webpage using browser Developer Tools.
  • Locate the desired element in the Elements panel.
  • Right-click and select Copy selector.
    • Let’s find the CSS selector for the Home button.
home_button3 = driver.find_element(By.CSS_SELECTOR, "body > div > a > input")
home_button3.click()
driver.back()

find_element(By.TAG_NAME, "")

  • find_element(By.TAG_NAME, "") & find_elements(By.TAG_NAME, ""):
    • Locate element(s) by a specific HTML tag.
table01 = driver.find_element(By.ID, "table01")
thead = table01.find_element(By.TAG_NAME, "thead")
thead.text

find_element(By.XPATH, "")

Understanding XPath

  • find_element(By.XPATH, "") & find_elements(By.XPATH, ""):
    • Return element(s) that match the specified XPath query.
  • XPath is a query language for searching and locating nodes in an XML document.
    • Supported by all major web browsers.
    • Used in Selenium to find elements when ID, name, or class attributes are not available.
    • Powerful for navigating complex HTML structures.

Basic XPath Syntax

//tag_name[@attribute='value']
  • // → Selects elements anywhere in the document.
  • tag_name → HTML tag (input, div, span, etc.).
  • @attribute → Attribute name (id, class, aria-label, etc.).
  • value → Attribute’s assigned value.
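
The pattern above can be tried without launching a browser. This is a minimal sketch using Python's standard-library xml.etree.ElementTree, which supports only a limited subset of XPath and needs a leading . before //. The markup is a made-up fragment modeled on the demo page's Home button.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed fragment for illustration (not the real page source)
page = """
<html><body>
  <div class="homebtn">
    <a href="index.html"><input type="button" class="btn" name="home" value="Home" /></a>
  </div>
  <form id="form1"><input type="text" name="user" /></form>
</body></html>
""".strip()

root = ET.fromstring(page)

# //tag_name[@attribute='value']  ->  any <input> whose name attribute is "home"
matches = root.findall(".//input[@name='home']")

print(len(matches), matches[0].get("value"))
```

With Selenium the same query would be written as driver.find_element(By.XPATH, "//input[@name='home']").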

Absolute vs. Relative XPath

  • Absolute XPath → Starts with a single / and spells out the full path from the root node.
    • Works only as long as the webpage structure does not change.
  • Relative XPath → Starts with // and can begin at any known element in the document.
    • More flexible: it doesn’t break as easily if the structure changes.
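
The difference in robustness can be sketched with two versions of the same made-up page, one of which wraps the table in an extra <div>. The sketch again uses the standard-library xml.etree.ElementTree rather than Selenium. Paths there are written relative to the root element, so ./body/... stands in for the browser's /html/body/.... The markup and the name first_text are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

original = (
    "<html><body><form>"
    "<table id='table02'><tr><td>Tiger Nixon</td></tr></table>"
    "</form></body></html>"
)
# Same content after a hypothetical redesign wraps the table in a <div>
redesigned = (
    "<html><body><form><div>"
    "<table id='table02'><tr><td>Tiger Nixon</td></tr></table>"
    "</div></form></body></html>"
)

ABSOLUTE = "./body/form/table/tr/td"        # every step from the root spelled out
RELATIVE = ".//table[@id='table02']/tr/td"  # anchored on a known attribute


def first_text(markup, path):
    """Return the text of the first node matching path, or None if nothing matches."""
    cell = ET.fromstring(markup).find(path)
    return cell.text if cell is not None else None


for label, markup in (("original", original), ("redesigned", redesigned)):
    print(label, first_text(markup, ABSOLUTE), first_text(markup, RELATIVE))
```

The absolute path stops matching once the extra <div> appears, while the attribute-anchored path still finds the cell.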

Example: Finding a Table Element with XPath

  • XPath can use attributes other than ID, name, or class to locate elements.
  • Suppose we want to retrieve data from a second table on a webpage.
  • The table contains multiple <tr> (rows) and <th> (headers) without an easily identifiable ID or class.
  • find_element(By.TAG_NAME, "") is not reliable due to multiple <tr> and <th> tags.

Extracting XPath from Developer Tools

  • Inspect the webpage using browser Developer Tools.
  • Locate the desired element in the Elements panel.
  • Right-click and select Copy XPath.
  • Example extracted XPaths (Copy XPath gives the first, relative form; Copy full XPath gives the second, absolute form):
//*[@id="table02"]/tbody/tr[1]/td[1]
/html/body/form/fieldset/div/div/table/tbody/tr[1]/td[1]

Finding an Element Using XPath

  • Locate “Tiger Nixon” in the second table:
elt = driver.find_element(By.XPATH, '//*[@id="table02"]/tbody/tr[1]/td[1]')
print(elt.text)  # Output the extracted text

When to Use XPath

  • Use XPath when:
    • The element lacks a unique ID or class.
    • Other locator methods (By.ID, By.CLASS_NAME, etc.) don’t work.
  • Limitations:
    • XPath is less efficient than ID-based locators.
    • Page structure changes may break XPath-based automation.
  • For our tasks, however, XPath remains a reliable and effective method!

Retrieving Attribute Values with get_attribute()

  • get_attribute() extracts an element’s attribute value.
  • Useful for retrieving hidden properties not visible on the page.
<a href="https://www.selenium.dev/">Selenium</a>
<input id="btn" class="btn" type="button" onclick="change_text(this)" value="Delete">
driver.find_element(By.XPATH, '//*[@id="table01"]/tbody/tr[2]/td[3]/a').get_attribute('href')
driver.find_element(By.XPATH, '//*[@id="btn"]').get_attribute('value')
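
When the same attribute is needed from many elements, get_attribute() can be applied to the list returned by find_elements(). A minimal sketch; the helper name attribute_values and the selector in the usage comment are assumptions for illustration.

```python
def attribute_values(elements, name):
    """Return the value of attribute `name` for each element that has one."""
    values = []
    for element in elements:
        value = element.get_attribute(name)  # None when the attribute is absent
        if value is not None:
            values.append(value)
    return values


# Intended use with Selenium (not executed here):
#   links = driver.find_elements(By.CSS_SELECTOR, "#table01 a")
#   hrefs = attribute_values(links, "href")
```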

Web scraping with Python Selenium

Let’s do Classwork 9!

NoSuchElementException and WebDriverWait

NoSuchElementException and try-except blocks

from selenium.common.exceptions import NoSuchElementException
try:
    elem = driver.find_element(By.XPATH, "element_xpath")
    elem.click()
except NoSuchElementException:
    pass
  • When a web element is not found, find_element() raises NoSuchElementException.
    • A try-except block catches the exception so that the Selenium script does not terminate.
  • This is useful when seemingly identical pages differ in their DOM, so an element present on one page is missing on another.
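
An alternative to try-except relies on find_elements(), which returns an empty list rather than raising when nothing matches. A minimal sketch; the helper name find_or_none is hypothetical.

```python
def find_or_none(container, by, value):
    """Return the first matching element, or None when nothing matches."""
    matches = container.find_elements(by, value)  # empty list instead of an exception
    return matches[0] if matches else None


# Intended use with Selenium (not executed here):
#   elem = find_or_none(driver, By.XPATH, "element_xpath")
#   if elem is not None:
#       elem.click()
```

The container can be the driver itself or any element that was located earlier, since both offer find_elements().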

Explicit wait with time.sleep()

import time

# example webpage
url = "https://qavbox.github.io/demo/delay/"
driver.get(url)

driver.find_element(By.XPATH, '//*[@id="one"]/input').click()
time.sleep(5)
element = driver.find_element(By.XPATH, '//*[@id="two"]')
element.text
  • time.sleep() is the crudest form of explicit wait: the condition is simply a fixed time period, so the script always pauses for the full duration.

  • In general, a more efficient solution than time.sleep() is to make WebDriver wait only as long as required.
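
The idea of waiting only as long as required can be sketched as a polling loop that checks a condition at short intervals and returns as soon as it holds. This is a simplified illustration of the concept, not Selenium's implementation; the name wait_until and its parameters are assumptions. Selenium's own WebDriverWait, covered later in this lecture, is the tool to use in practice.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns something truthy or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # stop waiting as soon as the condition holds
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)  # short pause before checking again


# Intended use with Selenium (not executed here):
#   found = wait_until(lambda: driver.find_elements(By.XPATH, '//*[@id="two"]'), timeout=5)
#   print(found[0].text)
```

Unlike time.sleep(5), this returns after roughly one polling interval if the element shows up quickly, and only spends the full timeout when it never appears.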

Implicit wait with implicitly_wait()

driver.find_element(By.XPATH, '//*[@id="oneMore"]/input[1]').click()
driver.implicitly_wait(10)  # Wait up to 10 seconds for elements to appear
element2 = driver.find_element(By.ID, 'delay')
element2.text
  • An implicit wait, set with implicitly_wait(), tells the webdriver to keep looking for an element for up to the given number of seconds before raising an exception.
    • Applies globally for the lifetime of the driver session.
  • Once set, every find_element() call waits up to that long for its element to appear.

Explicit wait with WebDriverWait and EC

  • An explicit wait allows you to wait for a specific condition before continuing.
  • Uses:
    • Wait for elements to appear, be visible, or be clickable.
    • More flexible and precise than implicit waits.
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
  • To wait for presence_of_element_located:
element = (
  WebDriverWait(driver, 20)  # wait up to 20 seconds for the condition to hold
  .until(
    EC.presence_of_element_located(
      (By.XPATH, "element_xpath")
    )
  )
)
  • To wait for visibility_of_element_located:
element = (
  WebDriverWait(driver, 20)
  .until(
    EC.visibility_of_element_located(
      (By.CSS_SELECTOR, "element_css")
    )
  )
)
  • To wait for element_to_be_clickable:
element = (
  WebDriverWait(driver, 20)
  .until(
    EC.element_to_be_clickable(
      (By.LINK_TEXT, "element_link_text")
    )
  )
)

Web scraping with Python Selenium

Let’s do Classwork 10!