
Spyder IDE; Scraping Web Tables with pd.read_html()
March 10, 2025
pd.read_html(), selenium



For list and DataFrame objects, we can see what data are contained in such objects.
Toggle comment (#): Ctrl + 1 (command + 1 for Mac users)
# %% defines a cell.
pd.read_html()
import pandas as pd
url = "https://www.nps.gov/orgs/1207/national-park-visitation-sets-new-record-as-economic-engines.htm"
tables = pd.read_html(url)
len(tables)
read_html() reads HTML tables into a list of DataFrame objects.
df_0 = tables[0]
df_0.columns = df_0.iloc[0] # Set the first row as column names
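df_0 = df_0[1:].reset_index(drop=True) # Remove the first row & reset index
Why reset_index(drop=True)? After the first row is removed, the remaining rows keep their old labels, which start at 1. reset_index(drop=True) renumbers the rows from 0 and discards the old labels instead of keeping them as a new column.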
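The header promotion and index reset can be tried without a network connection. The following is a minimal sketch, assuming a small hand-built DataFrame (the names raw and kept_index and the park values are hypothetical) that stands in for tables[0] when the header text ends up in the first data row.

```python
import pandas as pd

# Hypothetical stand-in for tables[0]: the header text sits in row 0.
raw = pd.DataFrame([
    ["Park", "Visits"],
    ["Park A", 100],
    ["Park B", 250],
])

df = raw.copy()
df.columns = df.iloc[0]             # promote the first row to column names
kept_index = df[1:]                 # header row dropped; labels still start at 1
df = df[1:].reset_index(drop=True)  # relabel rows from 0, discard old labels

print(list(kept_index.index))  # [1, 2]
print(list(df.index))          # [0, 1]
print(list(df.columns))        # ['Park', 'Visits']
```

Calling reset_index() without drop=True would also renumber the rows, but it would keep the old labels as an extra column, which is usually not wanted here.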
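When writing a Windows folder path in Python, either replace the backslashes (\) with forward slashes (/), or replace the backslashes (\) with double backslashes (\\).
Saving a DataFrame as a CSV File with to_csv()
To save a DataFrame as a CSV file, we use the to_csv() method.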
# Import the os module to interact with the operating system
import os
# Set the working directory path
wd_path = 'PATH_TO_YOUR_DATA_FOLDER' # Do not choose your personal website folder
os.chdir(wd_path) # Change the current working directory to wd_path
os.getcwd() # Retrieve and return the current working directory
# index=False to not write the row index in the CSV output
tables[0].to_csv('table.csv', index=False)
pd.read_html()
Let’s do Classwork 8!
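The path spellings and the index=False option used with to_csv() above can be checked in a small sketch. It assumes a hypothetical Windows folder (C:/Users/me/data) and a hand-built DataFrame in place of tables[0], and it writes to in-memory buffers rather than to a file so that nothing on disk is touched.

```python
import io
from pathlib import PureWindowsPath

import pandas as pd

# Two spellings of the same hypothetical Windows folder.
forward = "C:/Users/me/data"
doubled = "C:\\Users\\me\\data"
same_folder = PureWindowsPath(forward) == PureWindowsPath(doubled)
print(same_folder)  # True

# Hypothetical stand-in for tables[0].
df = pd.DataFrame({"Park": ["Park A", "Park B"], "Visits": [100, 250]})

with_index = io.StringIO()
df.to_csv(with_index)                   # default: row labels become a first column
without_index = io.StringIO()
df.to_csv(without_index, index=False)   # same option as in the call above

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]
print(header_with)     # ,Park,Visits
print(header_without)  # Park,Visits
```

With the default setting, the CSV gains an unnamed first column holding the row labels, which is why index=False is a common choice when the index carries no information.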
Comments, Code Cells, and Keyboard Shortcuts
The # mark is Spyder’s comment character. It is recommended to use a coding block (defined by # %%) with block commenting (Ctrl/command + 4) for separating code sections. To set your keyboard shortcuts,