Lecture 14

Spyder IDE; Scrapping Web-tables with pd.read_html()

Byeong-Hak Choe

bchoe@geneseo.edu

SUNY Geneseo

March 10, 2025

Data Collection

Learning Objectives

Scrapping web tables with pd.read_html()
Scrapping web data with selenium
Collecting web data with application programming interfaces (APIs)

Spyder IDE

Script Editor

From Script Editor (red box), we can create, open and edit files.

Console Pane

From Console Pane (blue box), we can interact directly with the Python interpreter, and type commands where Python will immediately execute them.

Variable Explorer

From Variable Explorer (yellow box), we can see the values of variables, data frames, and other objects that are currently stored in memory.

Data Containers in Variable Explorer

If we doucle-click the objects such as list and DataFrame objects, we can see what data are contained in such objects.

Keyboard Shortcuts

General shortcuts
- Undo: Ctrl + z (command + z for Mac users)
- Redo: Ctrl + Shift + z (command + shift + z for Mac users)
- Selection: Ctrl + Shift + Arrow (⬆️ ➡️ ⬇️ ⬅️)
- Page Up/Down: Fn + ⬆️ / ⬇️
Default shortcuts
- Comment (#): Ctrl + 1 (command + 1 for Mac users)
- Block-comment: Ctrl + 4 (command + 4 for Mac users)
- Run selection (or a current line): F9
- Run cell: Ctrl + Enter (# %% defines a cell)

Comments, Code Cells, and Keyboard Shortcuts

# %%
# =============================================================================
# SECTION TITLE
# =============================================================================
a = 1

The # mark is Spyder’s comment character.
It is recommended to use a coding block (defined by # %%) with block commenting (Ctrl/command + 4) for separating code sections.
To set your keyboard shortcuts,
- Preferences > Keyboard Shortcuts > Search “run” and/or “comment”
- Set the shortcuts for (1) run selection; (2) run cell; (3) toggle comment; and (4) blockcomment
- I use command + return for running a current line (selection)

Scrapping Tables with `pd.read_html()`

Let’s scrap the two tables in the following webpage:
- National Park Visitation Sets New Record as Economic Engines

import pandas as pd
url = "https://www.nps.gov/orgs/1207/national-park-visitation-sets-new-record-as-economic-engines.htm"

tables = pd.read_html(url)
len(tables)

read_html() read HTML tables into a list of DataFrame objects.

Refining Web-Scraped Data

Setting Column Names & Resetting Index

How do we clean tables after extracting them from a webpage?
- How can we set the first row of a DataFrame as its column names?
- How can we remove the first row and reset the index?

df_0 = tables[0]
df_0.columns = df_0.iloc[0]  # Set the first row as column names
df_0 = df_0[1:].reset_index(drop=True)  # Remove the first row & reset index

Why reset_index(drop=True)?
- Removes the old index and prevents it from being added as a new column.
- Ensures a clean, sequential index (starting from 0) for the modified DataFrame.

Finding the Absolute Path of a File/Folder

Windows 11

Step 1: Navigate to your folder using File Explorer.
Step 2: Right-click the desired file or folder.
Step 3: Click Copy as path.
Step 4: Paste the path into your Python script (Ctrl + V).
Step 5: Adjust backslashes in the path:
- Option 1: Replace backslashes (\) with forward slashes (/).
- Option 2: Replace single backslashes (\) with double backslashes (\\).

Mac

Step 1: Navigate to your folder using Finder.
Step 2: Select the file or folder by clicking on it.
Step 3: Press Option + Command + C.
Step 4: Paste the path into your Python script (Command + V).

Exporting a `DataFrame` as a CSV File with `to_csv()`

To export DataFrame as a CSV file, we use the to_csv() method.
- Before exporting, you can set the working directory to organize and manage the location of the CSV file.
- This helps in keeping your data analysis and exports well-organized.

# Import the os module to interact with the operating system
import os  

# Set the working directory path
wd_path = 'PATH_TO_YOUR_DATA_FOLDER' # Do not choose your personal website folder
os.chdir(wd_path)  # Change the current working directory to wd_path
os.getcwd()  # Retrieve and return the current working directory

# index=False to not write the row index in the CSV output
tables[0].to_csv('table.csv', index =False)

Scrapping Tables with `pd.read_html()`

Let’s do Classwork 8!

Lecture 14

Data Collection

Learning Objectives

Spyder IDE

Script Editor

Console Pane

Variable Explorer

Data Containers in Variable Explorer

Keyboard Shortcuts

Comments, Code Cells, and Keyboard Shortcuts

Scrapping Tables with pd.read_html()

Scrapping Tables with pd.read_html()

Refining Web-Scraped Data

Setting Column Names & Resetting Index

Finding the Absolute Path of a File/Folder

Windows 11

Mac

Exporting a DataFrame as a CSV File with to_csv()

Scrapping Tables with pd.read_html()

Scrapping Tables with `pd.read_html()`

Scrapping Tables with `pd.read_html()`

Exporting a `DataFrame` as a CSV File with `to_csv()`

Scrapping Tables with `pd.read_html()`