Scrapping WebMD Data with Python `selenium`

Final Exam, DANL 210-01, Spring 2025

Author

Byeong-Hak Choe

Published

May 13, 2025

Modified

March 1, 2026

📌 Directions

This is an exam on a paper, so minor coding errors are expected. My main focus is on your approach to each question — the logic, algorithms, and syntax you use. Nearly perfect code will be rewarded with bonus credit.

Data Collection with Selenium (Points: 24)

You will be collecting data from WebMD.com on internal medicine doctors located in Rochester, NY:

Screenshots

Start of the first page

End of the first page

Start of the second page

End of the second page

DOM Structure

Each doctor’s information is contained within a WebElement with class = “card-content”:

The site contains 35 pages, with each page accessible via a URL of the form:

https://doctor.webmd.com/providers/specialty/internal-medicine/new-york/rochester?pagenumber=1
https://doctor.webmd.com/providers/specialty/internal-medicine/new-york/rochester?pagenumber=2
…
https://doctor.webmd.com/providers/specialty/internal-medicine/new-york/rochester?pagenumber=35

Each page has 68 WebElements with class = “card-content”, except page 35, which may have fewer.
- Across all pages (1-35), the first eight of those 68 WebElements on each page are featured results and should be excluded from scraping.
- The remaining WebElements represent a standard listing and should be included in your data collection.
For each WebElement with class = “card-content” in the standard listing, these child elements are present:
- class = “prov-name” (Doctor’s name)
- class = “prov-specialty” (Specialty)
- class = “webmd-rate.on-desktop” (Number of ratings)
Some WebElements with class = “card-content” in the standard listing might also include:
- class = “addr-text-dist”: Parent WebElements with class = “addr-text-dist” include:
  - class = “addr-text” (Address)
- class = “avg-ratings” (Average rating)
- class = “prov-award” (Awards)
- class = “prov-experience” (Years of experience)
However, these optional child WebElements (class values with addr-text, avg-ratings, prov-experience, prov-award) are not present in some WebElements with class = “card-content”

`WebElement` Examples

The WebElement of class = “prov-name” is selected within one WebElement with class = “card-content”:

The WebElement of class = “addr-text” is selected within one WebElement with class = “card-content”:

The WebElement of class = “webmd-rate” is selected within one WebElement with class = “card-content”:

The WebElement of class = “avg-ratings” is selected within one WebElement with class = “card-content”:

The WebElement of class = “prov-award” is selected within one WebElement with class = “card-content”:

The WebElement of class = “prov-experience” is selected within one WebElement with class = “card-content”:

Task:

Loop through pages 1 to 35 by constructing the URL with an f-string using the format:
- f"{base_url}?pagenumber={page}"

base_url = 'https://doctor.webmd.com/providers/specialty/internal-medicine/new-york/rochester'

# Example URLs
url_1 = f"{base_url}?pagenumber=1"
url_2 = f"{base_url}?pagenumber=2"
...
url_35 = f"{base_url}?pagenumber=35"

On each page:
- Locate all elements with class=“card-content”.
- Skip the first 16 elements (0-15).
- Collect information about the providers who are not featured.
- For each remaining element, extract:
  - name (class=“prov-name”)
  - specialty (class=“prov-specialty”)
  - address (class=“addr-text”)
  - rated_number (class=“webmd-rate–number”)
  - avg_rating (class=“avg-ratings”, if present)
  - award (class=“prov-award”, if present)
  - experience (class=“prov-experience”, if present)
Concatenate each doctor’s data as a row in a pandas DataFrame
After loading each page, pause execution for a random 5–8 second interval.

Your Task: Complete the Script Below

Answer:

import pandas as pd
import os, time, random
from io import StringIO

# Import the necessary modules from the Selenium library
from selenium import webdriver  # Main module to control the browser
from selenium.webdriver.common.by import By  # Helps locate elements on the webpage
from selenium.webdriver.chrome.options import Options  # Allows setting browser options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import TimeoutException
from selenium.common.exceptions import StaleElementReferenceException

# Set the working directory path
wd_path = 'ABSOLUTE_PATHNAME_OF_YOUR_WORKING_DIRECTORY' # e.g., '/Users/bchoe/Documents/DANL-210'
os.chdir(wd_path)  # Change the current working directory to wd_path
os.getcwd()  # Retrieve and return the current working directory

# Create an instance of Chrome options
options = Options()
options.add_argument('--disable-blink-features=AutomationControlled')  # Prevent detection of automation by disabling blink features
options.page_load_strategy = 'eager'  # Load only essential content first, skipping non-critical resources

# Initialize the Chrome WebDriver with the specified options
driver = webdriver.Chrome(options=options)

_______________PROVIDE_YOUR_CODE_FROM_HERE_______________