Spyder IDE; Scraping Web-tables with pd.read_html()
March 10, 2025
pd.read_html()
selenium
For list and DataFrame objects, we can see what data are contained in such objects.
Toggle comment (#): Ctrl + 1 (command + 1 for Mac users)
# %% defines a cell.
pd.read_html()
import pandas as pd
url = "https://www.nps.gov/orgs/1207/national-park-visitation-sets-new-record-as-economic-engines.htm"
tables = pd.read_html(url)
len(tables)
read_html() reads HTML tables into a list of DataFrame objects.
df_0 = tables[0]
df_0.columns = df_0.iloc[0] # Set the first row as column names
df_0 = df_0[1:].reset_index(drop=True) # Remove the first row & reset index
What does reset_index(drop=True) do?
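One way to see what reset_index(drop=True) does is to compare the row labels before and after the call. The sketch below uses a small hand-made DataFrame (illustrative values, not the NPS table) that mimics a scraped table whose first row holds the column names.

```python
import pandas as pd

# Illustrative data standing in for a scraped table: row 0 holds the column names
raw = pd.DataFrame([["Park", "State"],
                    ["Zion", "UT"],
                    ["Acadia", "ME"]])

raw.columns = raw.iloc[0]   # Set the first row as column names
sliced = raw[1:]            # Remove the first row; row labels still start at 1

kept = sliced.reset_index()               # old labels are saved in a new 'index' column
dropped = sliced.reset_index(drop=True)   # old labels are discarded

print(list(sliced.index))    # [1, 2]
print(list(dropped.index))   # [0, 1]
print(list(kept.columns))    # ['index', 'Park', 'State']
```

With drop=True the rows are simply renumbered from 0, which is usually what we want after removing a header row.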
When typing a folder path, replace backslashes (\) with forward slashes (/),
or replace backslashes (\) with double backslashes (\\).
Saving a DataFrame as a CSV File with to_csv()
To save a DataFrame as a CSV file, we use the to_csv() method.
# Import the os module to interact with the operating system
import os
# Set the working directory path
wd_path = 'PATH_TO_YOUR_DATA_FOLDER' # Do not choose your personal website folder
os.chdir(wd_path) # Change the current working directory to wd_path
os.getcwd() # Retrieve and return the current working directory
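The two ways of writing a Windows folder path can be compared without touching the file system. This is a minimal sketch; the folder name below is hypothetical and only stands in for your own data folder.

```python
# Hypothetical Windows folder, written in the two recommended ways
with_forward = 'C:/Users/me/data'       # forward slashes
with_double = 'C:\\Users\\me\\data'     # double backslashes

# Each '\\' in the source code is stored as a single backslash
print(with_double)

# Swapping the backslashes for forward slashes gives the first form
print(with_double.replace('\\', '/') == with_forward)   # True
```

A single backslash starts an escape sequence inside an ordinary Python string, which is why a path pasted directly from Windows Explorer usually needs one of these two adjustments.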
# index=False to not write the row index in the CSV output
tables[0].to_csv('table.csv', index=False)
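To check what index=False changes, the sketch below writes a small illustrative DataFrame twice and reads back the first line of each file. The file names and data are assumptions for the example, and a temporary folder is used only so that it does not depend on a particular working directory.

```python
import os
import tempfile

import pandas as pd

# Illustrative data, not the scraped table
df = pd.DataFrame({"park": ["Zion", "Acadia"], "state": ["UT", "ME"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index = os.path.join(tmp_dir, "with_index.csv")
    no_index = os.path.join(tmp_dir, "no_index.csv")

    df.to_csv(with_index)              # default: row index is written as the first column
    df.to_csv(no_index, index=False)   # row index is left out

    with open(with_index) as f:
        header_with_index = f.readline().strip()
    with open(no_index) as f:
        header_no_index = f.readline().strip()

    # Reading the index-free file back returns the original columns
    df_back = pd.read_csv(no_index)

print(header_with_index)   # ,park,state
print(header_no_index)     # park,state
```

The leading comma in the first header is the unnamed index column, which would come back as an extra column when the file is read again.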
pd.read_html()
Let’s do Classwork 8!
Comments, Code Cells, and Keyboard Shortcuts
The # mark is Spyder’s comment character.
It is recommended to use a code cell (defined by # %%) with block commenting (Ctrl/command + 4) for separating code sections.
To set your keyboard shortcuts,