Classwork 3
Scraping Web-tables with pd.read_html()
For Classwork 3, import the os and pandas libraries:
import os
import pandas as pd
Question 1. Load a CSV file into pandas
In this question, you will download a CSV file, put it into a data folder, and then load it into Python using pandas.
Step 1) Download the file
- Go to Brightspace → Course Files.
- Download custdata_rev.csv to your computer.
Step 2) Create a data folder
In your course project folder (your working directory), create a folder named:
data
Example:
- If your course folder is DANL-210 and this folder is in your Documents folder, then you should have:
  - Mac: '/Users/YOUR_USERNAME/Documents/DANL-210/data'
  - Windows: 'C:\\Users\\YOUR_USERNAME\\Documents\\DANL-210\\data'
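The folder can also be created from Python instead of by hand. The sketch below is only an illustration: a temporary directory stands in for your DANL-210 folder so the snippet runs anywhere, and project_dir is a placeholder name you would replace with your own path.

```python
import os
import tempfile

# Hypothetical stand-in for your DANL-210 folder; replace with your own path.
project_dir = tempfile.mkdtemp()

# os.path.join uses the right separator for the operating system,
# so the same line works on both Mac and Windows.
data_dir = os.path.join(project_dir, "data")

# exist_ok=True makes this safe to re-run if the folder already exists.
os.makedirs(data_dir, exist_ok=True)

print(os.path.isdir(data_dir))
```

Building the path with os.path.join avoids typing the Mac or Windows separators yourself.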
Step 3) Move the CSV file into data
Move custdata_rev.csv into the data folder so it looks like:
- Mac: DANL-210/data/custdata_rev.csv
- Windows: DANL-210\\data\\custdata_rev.csv
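A quick way to confirm the file landed in the right place is to ask Python whether it exists. This is a minimal sketch: csv_is_in_place is a hypothetical helper written for this example, and a temporary folder with an empty placeholder file stands in for your real project.

```python
import os
import tempfile

def csv_is_in_place(project_dir, filename="custdata_rev.csv"):
    """Return True if <project_dir>/data/<filename> exists as a file."""
    return os.path.isfile(os.path.join(project_dir, "data", filename))

# Demo with a temporary folder standing in for DANL-210.
project_dir = tempfile.mkdtemp()
before = csv_is_in_place(project_dir)  # nothing has been moved yet

# Create data/ and an empty placeholder file to imitate the move.
os.makedirs(os.path.join(project_dir, "data"))
open(os.path.join(project_dir, "data", "custdata_rev.csv"), "w").close()
after = csv_is_in_place(project_dir)

print(before, after)
```

With your own project, you would pass the path to your DANL-210 folder instead of the temporary one.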
Step 4) Set your working directory in Python
Replace the path below with the path to your project folder (the folder that contains data).
import os
import pandas as pd
# Replace this with YOUR project folder path (the folder that contains the "data" folder)
wd_path = "ABSOLUTE_PATH_TO_YOUR_WORKING_DIRECTORY"
os.chdir(wd_path) # set working directory
os.getcwd() # check current working directory
Step 5) Read the CSV in two ways
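A relative path is resolved against the current working directory, so the two approaches below point to the same file once Step 4 has been run. The sketch here uses only os and does not need the CSV to exist; path_resolved is a name introduced for illustration.

```python
import os

path_relative = "data/custdata_rev.csv"

# os.path.abspath joins the relative path onto the current working directory.
path_resolved = os.path.abspath(path_relative)

print(os.getcwd())       # the folder that relative paths start from
print(path_resolved)     # the same file, written as an absolute path
print(os.path.isabs(path_relative), os.path.isabs(path_resolved))
```

If the printed absolute path is not inside your project folder, the working directory was not set as intended in Step 4.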
(a) Using a relative path
If your working directory is set to the project folder, you can load the file like this:
path_relative = "data/custdata_rev.csv"
# Read the CSV file into a pandas DataFrame.
# pd.read_csv(...) loads the file and creates a table-like object (a DataFrame) in Python.
# After this line runs, df_rel will contain all rows and columns from the CSV.
df_rel = pd.read_csv(path_relative)
(b) Using an absolute path
You can also load the file using the full path to the file:
path_absolute = "ABSOLUTE_PATHNAME_OF_custdata_rev.csv"
df_abs = pd.read_csv(path_absolute)
# Set the working directory path
wd_path = 'YOUR_ABSOLUTE_PATHNAME_FOR_DANL-210' # e.g., '/Users/bchoe/Documents/DANL-210'
os.chdir(wd_path) # Change the current working directory to wd_path
os.getcwd() # Retrieve and return the current working directory
path_relative = "data/custdata_rev.csv"
df_rel = pd.read_csv(path_relative)
path_absolute = "YOUR_ABSOLUTE_PATHNAME_FOR_DANL-210/data/custdata_rev.csv"
df_abs = pd.read_csv(path_absolute)
Question 2. Scraping a Web-table with pd.read_html()
- Store the table in the following webpage as a pandas DataFrame:
url_eia = "https://www.eia.gov/petroleum/gasdiesel/gaspump_hist.php"
- Export the DataFrame as a CSV file.
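It helps to know that pd.read_html() hands back a list of DataFrames, one per table it finds, even when the page has only one table. The offline sketch below uses a made-up HTML string rather than the EIA page, and assumes an HTML parser that pandas can use (such as lxml) is installed.

```python
from io import StringIO

import pandas as pd

# Made-up HTML with a single table; the values are for illustration only.
html = """
<table>
  <tr><th>Year</th><th>Price</th></tr>
  <tr><td>2020</td><td>2.17</td></tr>
  <tr><td>2021</td><td>3.01</td></tr>
</table>
"""

# read_html parses every <table> it finds and returns them in a list.
tables = pd.read_html(StringIO(html))
print(type(tables), len(tables))

# Pick the table you want by position.
df = tables[0]
print(df)
```

This is why a single table still has to be selected with an index such as [0] before it can be used as a DataFrame.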
url_eia = "https://www.eia.gov/petroleum/gasdiesel/gaspump_hist.php"
df_eia = pd.read_html(url_eia)
df_eia = df_eia[0]
df_eia.to_csv('data/eia_table.csv', index=False)
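The index=False argument in the to_csv call above keeps the DataFrame's row labels (0, 1, 2, ...) out of the file. A small sketch with a made-up two-row DataFrame, written to a temporary folder so nothing in your project is touched:

```python
import os
import tempfile

import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({"year": [2020, 2021], "price": [2.17, 3.01]})

out_dir = tempfile.mkdtemp()
with_index = os.path.join(out_dir, "with_index.csv")
without_index = os.path.join(out_dir, "without_index.csv")

df.to_csv(with_index)                  # default: row labels become an extra column
df.to_csv(without_index, index=False)  # only the actual columns are written

cols_with = pd.read_csv(with_index).columns.tolist()
cols_without = pd.read_csv(without_index).columns.tolist()
print(cols_with)
print(cols_without)
```

Reading the first file back shows an extra unnamed column, which is usually not wanted in an exported table.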
# Storing 'eia_table_new.csv' directly in the working directory.
df_eia.to_csv('eia_table_new.csv', index=False)
# Storing 'eia_table_abs.csv' using the absolute pathname
df_eia.to_csv('/Users/bchoe/Documents/DANL-210/eia_table_abs.csv',
              index=False)
Question 3. Scraping Multiple Web-tables with pd.read_html()
- Store each table in the following webpage as a pandas DataFrame:
url_geneseo = "https://www.geneseo.edu/business/student%20outcomes"
- Export the DataFrame as a CSV file.
url_sob = 'https://www.geneseo.edu/business/student-outcomes/'
df_list = pd.read_html(url_sob)
df_sob_0 = df_list[0]
df_sob_1 = df_list[1]
df_sob_2 = df_list[2]
df_sob_3 = df_list[3]
df_sob_4 = df_list[4]
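The assignments above pull each table out of the list by position. Since the question also asks for a CSV export, one option is to loop over the list and save each table in turn. This is a sketch under assumptions: a made-up list of two small DataFrames stands in for df_list, the file names sob_table_0.csv and sob_table_1.csv are hypothetical, and a temporary folder stands in for data.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the list returned by pd.read_html(url_sob); values are made up.
df_list = [
    pd.DataFrame({"a": [1, 2]}),
    pd.DataFrame({"b": [3, 4, 5]}),
]

out_dir = tempfile.mkdtemp()  # in the classwork this would be the "data" folder

saved_paths = []
for i, df in enumerate(df_list):
    # Hypothetical file name that records the table's position in the list.
    path = os.path.join(out_dir, f"sob_table_{i}.csv")
    df.to_csv(path, index=False)
    saved_paths.append(path)

print(sorted(os.listdir(out_dir)))
```

The loop works for any number of tables, so it does not need to change if the page gains or loses a table.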
df_sob_1.columns = df_sob_1.iloc[0] + df_sob_1.iloc[1]
df_sob_1 = df_sob_1.iloc[2:]
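The two lines above deal with a table whose header is spread over its first two rows: adding the rows joins the text cell by cell, and iloc[2:] then drops those two rows from the data. A toy illustration with made-up values (not the actual Geneseo table):

```python
import pandas as pd

# Made-up table whose header is split across the first two rows.
raw = pd.DataFrame([
    ["Program", "Percent (%)"],
    ["", "2015-16"],
    ["Accounting", "90"],
    ["Economics", "85"],
])

# Adding two rows of strings concatenates them column by column.
raw.columns = raw.iloc[0] + raw.iloc[1]

# Drop the two header rows and renumber the remaining rows from 0.
clean = raw.iloc[2:].reset_index(drop=True)

print(clean.columns.tolist())
print(clean)
```

The reset_index(drop=True) step is an addition in this sketch; without it, the remaining rows keep their original labels starting at 2.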
df_sob_1.columns = ['Program', 'Percent (%)2015-16',
                    'Percent (%)2016-17', 'Percent (%)2017-18',
                    'Percent (%)2018-19', 'Percent (%)2019-20',
                    'Percent (%)5-Year % Change']
Discussion
Welcome to our Classwork 3 Discussion Board!
This space is designed for you to engage with your classmates about the material covered in Classwork 3.
Whether you want to delve deeper into the content, share insights, or ask questions, this is the perfect place for you.
If you have any specific questions for Byeong-Hak (@bcdanl) regarding the Classwork 3 materials or need clarification on any points, don't hesitate to ask here.
All comments will be stored here.
Let's collaborate and learn from each other!