Data Collection I: DataFrame; Spyder IDE; Scraping Web-tables with pd.read_html()
February 9, 2026
Series and DataFrame
Series: A one-dimensional object containing a sequence of values (like a list).
DataFrame: A two-dimensional table made of multiple Series columns sharing a common index.
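As a minimal sketch of the two definitions above, the following builds one Series and one DataFrame by hand. The names and values (`ages`, `students`, and the student data) are made up for illustration.

```python
import pandas as pd

# A Series: one-dimensional values with an index (hypothetical example data).
ages = pd.Series([20, 22, 21], name="Age")

# A DataFrame: several columns (each a Series) sharing one common index.
students = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cho"],
    "Age": [20, 22, 21],
    "Major": ["Economics", "Data Analytics", "Biology"],
})

# Selecting a single column of a DataFrame gives back a Series.
age_column = students["Age"]

print(type(ages).__name__)        # Series
print(type(age_column).__name__)  # Series
print(students.shape)             # (3, 3)
```

The column pulled out of `students` carries the same index as the DataFrame itself, which is what "sharing a common index" means in practice.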
Rows (observations) in a DataFrame represent individual units or entities for which data is collected.
Columns (variables) in a DataFrame represent attributes or characteristics measured across multiple observations.
Examples of variables:
Name, Age, Grade, Major
EmployeeID, Name, Age, Department
CustomerID, Name, Age, Income, HousingType
Note
In a DataFrame, a variable is a column of data.
Tidy DataFrame
A DataFrame is tidy if it follows three rules:
Each variable has its own column.
Each observation has its own row.
Each value has its own cell.
A tidy DataFrame keeps your data organized, making it easier to understand, analyze, and share in any data analysis.
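As a rough illustration of what "tidy" means in practice, the sketch below starts from a table whose year values are spread across column names and reshapes it so that each variable is a column and each observation is a row. The data are invented, and `melt()` is just one way to do the reshaping. It is used here only for illustration.

```python
import pandas as pd

# Untidy (wide) layout: the variable "Year" is hidden in the column names.
wide = pd.DataFrame({
    "Name": ["Ana", "Ben"],
    "2024": [3.4, 3.1],
    "2025": [3.6, 3.3],
})

# Tidy (long) layout: one column per variable, one row per observation.
tidy = wide.melt(id_vars="Name", var_name="Year", value_name="GPA")

print(tidy)
print(tidy.shape)  # (4, 3)
```

In the tidy version, each row is a single (Name, Year) observation, so filtering or summarizing by year needs no special handling of column names.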
A Python script (*.py) is a plain-text file that contains Python code you can run from your computer (or an IDE like Spyder).
For list and DataFrame objects, we can see what data are contained in such objects.
Comment (#): Ctrl + 1 (command + 1 for Mac users)
Code cell: # %% defines a cell
pd.read_html()
import pandas as pd
url = "https://www.nps.gov/orgs/1207/national-park-visitation-sets-new-record-as-economic-engines.htm"
tables = pd.read_html(url)
len(tables)
df_0 = tables[0]
pd.read_html() reads HTML tables into a list of DataFrame objects.
df_0.columns = df_0.iloc[0] # Set the first row as column names
df_0 = df_0.iloc[1:] # Keeps rows from position 1 onward
What is DataFrame.iloc[]?
DataFrame.iloc[...] is integer-location indexing:
df_0.iloc[0] returns the first row (position 0) as a Series.
The dot operator (DataFrame.) is used for an attribute or a method on objects.
A method (DataFrame.METHOD()) is a function that we can call on a DataFrame to perform operations, modify data, or derive insights.
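The header-promotion steps and `.iloc[]` can be tried without a network connection. The sketch below uses a small hand-built DataFrame (`raw`, with made-up park names and counts) as a stand-in for `tables[0]`, assuming the scraped table arrives with its header text in the first row.

```python
import pandas as pd

# Hypothetical stand-in for tables[0]: the header text sits in row 0.
raw = pd.DataFrame([
    ["Park", "Visits"],
    ["Park A", "100"],
    ["Park B", "250"],
])

df = raw.copy()
df.columns = df.iloc[0]   # first row (position 0) becomes the column names
df = df.iloc[1:]          # keep rows from position 1 onward

first_row = df.iloc[0]    # position 0 of the trimmed table, returned as a Series

print(list(df.columns))   # ['Park', 'Visits']
print(first_row["Park"])  # Park A
print(first_row.name)     # 1
```

The last line shows why this is called integer-location indexing: `.iloc[0]` picks the row in position 0 even though that row still carries the index label 1 from the original table.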
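Example of a method: df.info()
An attribute (DataFrame.ATTRIBUTE) is a property that provides information about the DataFrame's structure or content without modifying it.
Example of an attribute: df.columns
Every DataFrame object has a .info() method that provides a summary of a DataFrame:
Variable names (.columns)
Number of rows and columns (.shape)
Number of non-missing values in each variable (.count())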
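To see how attributes and methods differ in use, here is a short sketch on an invented DataFrame with one missing value. Attributes are written without parentheses, while methods are called with them.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing Age value.
df = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cho"],
    "Age": [20, np.nan, 21],
})

print(df.columns)  # attribute: the variable names
print(df.shape)    # attribute: (number of rows, number of columns)
print(df.count())  # method: non-missing values in each column
df.info()          # method: prints a summary that combines the pieces above
```

Here `df.count()` reports 3 for Name but 2 for Age, because the missing value is not counted.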
Missing values are shown as NaN.
An absolute pathname tells the computer the exact location of a file, starting from the very top folder of your computer.
In Python, you can see the working directory (the folder where Python is currently "looking" for files) by running os.getcwd() in the Console.
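A small sketch of how the working directory relates to pathnames: a pathname that does not start from the top folder (a relative pathname) is resolved against the working directory. The `data/custdata_rev.csv` location is only an assumed layout, and the file does not need to exist for this to run.

```python
import os

cwd = os.getcwd()                 # the current working directory
print(cwd)
print(os.path.isabs(cwd))         # True: the working directory is an absolute pathname

# A pathname relative to the working directory (assumed folder layout).
relative = os.path.join("data", "custdata_rev.csv")
print(os.path.isabs(relative))    # False

# Python resolves the relative pathname against the working directory.
absolute = os.path.abspath(relative)
print(absolute)
```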
Examples of an absolute pathname for custdata_rev.csv:
Mac: /Users/user/documents/data/custdata_rev.csv
Windows: C:\\Users\\user\\Documents\\data\\custdata_rev.csv
The Windows pathname uses double backslashes (\\) because a single backslash (\) is treated as a special character in Python.
Pathname for custdata_rev.csv:
/Users/user/documents/data/custdata_rev.csv
Replace backslashes (\) with forward slashes (/), or replace single backslashes (\) with double backslashes (\\).
Exporting DataFrame as a CSV File with to_csv()
To export a DataFrame as a CSV file, we use the to_csv() method.
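The sketch below shows one possible round trip with `to_csv()`: write a small, made-up DataFrame into a `data` folder and read it back. It uses a temporary folder in place of a real working directory so that nothing is left on disk. `pd.read_csv()` is used here only to check the result.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"Park": ["Park A", "Park B"], "Visits": [100, 250]})

with tempfile.TemporaryDirectory() as wd:
    data_dir = os.path.join(wd, "data")
    os.makedirs(data_dir, exist_ok=True)  # to_csv() does not create missing folders

    csv_path = os.path.join(data_dir, "table.csv")
    df.to_csv(csv_path, index=False)      # index=False leaves the row index out

    with open(csv_path) as f:
        header = f.readline().strip()     # first line of the written file
    df_back = pd.read_csv(csv_path)

print(header)         # Park,Visits
print(df_back.shape)  # (2, 2)
```

Because `index=False` was used, the first line of the file holds only the two column names, with no extra unnamed index column.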
Keep a data directory within your working directory (WD). This helps in keeping your data analysis and exports well-organized.
# Import the os module to interact with the operating system
import os
# Set the working directory path
wd_path = 'ABSOLUTE_PATHNAME_OF_YOUR_WORKING_DIRECTORY' # e.g., '/Users/bchoe/Documents/DANL-210'
os.chdir(wd_path) # Change the current working directory to wd_path
os.getcwd() # Retrieve and return the current working directory
# index=False to not write the row index in the CSV output
df_0.to_csv('data/table.csv', index=False)
pd.read_html()
Let's do Classwork 3!
💬 Comments, Code Cells, and Keyboard Shortcuts
The # mark is Spyder's comment character.
It is recommended to use a coding block (defined by # %%) with block commenting (Ctrl/command + 4) for separating code sections.
To set your keyboard shortcuts,