Homework 4
Python selenium
, APIs, & Pandas Basics
Descriptive Statistics
The following provides the descriptive statistics for each part of the Homework 4:
Direction
Please submit your Python Script for Part 1 and Part 2 in Homework 4 to Brightspace with the name below:
danl_210_hw4_LASTNAME_FIRSTNAME.py
( e.g.,danl_210_hw4_choe_byeonghak.py
)
The due is April 18, 2025, 10:30 A.M.
Please send Prof. Choe an email (bchoe@geneseo.edu) if you have any questions.
Part 1. Data Collection
Question 1
Go to the following webpage: https://www.nyc.gov/site/finance/property/property-annualized-sales-update.page
Provide a Python Selenium code to download all the MS Excel files in the webpage above.
- Please make sure that you do not download any PDF files.
Click to Check the Answer!
# %%
# =============================================================================
# Setup
# =============================================================================
# import pandas as pd
# import numpy as np
# import os
# from selenium.webdriver.chrome.service import Service
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import time
import random
# Optional: Set your desired download directory
# Default download directory is usually "Download"
= '/Users/bchoe/My Drive/suny-geneseo/spring2025/lecture-code/data'
download_dir
= Options()
chrome_options "prefs", {
chrome_options.add_experimental_option("download.default_directory": download_dir,
"download.prompt_for_download": False,
"download.directory_upgrade": True,
"safebrowsing.enabled": True # Allow downloads without warnings
})
= webdriver.Chrome(options=chrome_options)
driver = 'https://www.nyc.gov/site/finance/property/property-annualized-sales-update.page'
url
driver.get(url)
# %%
# =============================================================================
# Downloading only Excel files (.xlsx or .xls)
# =============================================================================
= driver.find_elements(By.PARTIAL_LINK_TEXT, 'Download')
excel_files_1 = driver.find_elements(By.PARTIAL_LINK_TEXT, 'M')
excel_files_2 = excel_files_1 + excel_files_2
excel_files
for item in excel_files:
if ".x" in item.get_attribute('href'):
item.click().5, 1.25)) time.sleep(random.uniform(
Question 2
The series_housing_price
DataFrame contains FRED series id
s for the Home Price Index across 16 metropolitan statistical areas, each categorized into three price tiers: low, middle, and high.
= 'https://bcdanl.github.io/data/fred_api_series_housing_price.csv'
url = pd.read_csv(url) series_housing_price
For more information regarding the index, please visit Standard & Poor’s. There is more information about home price sales pairs in the Methodology section. Copyright, 2016, Standard & Poor’s Financial Services LLC.
Provide a Python code (excluding your FRED API key) to construct the following DataFrame:
Note that the
id
variable in theseries_housing_price
DataFrame corresponds to theseries_id
key used in the parameter dictionary (param_dicts
) used for making requests to the FRED API.To iterate over the
id
values in theseries_housing_price
DataFrame, you can convert the variable to a list using.tolist()
:
for val in series_housing_price['id'].tolist():
# do something with val
- Below replaces ‘Home Price Index (’ with ’’ (empty string) in the
title
column:
'title'] = df_all['title'].str.replace('Home Price Index (', '') df_all[
- The
title
column can split into thetier
andcity
variables using the following code:
# Split the 'title' column into two new variables, 'ranking' and 'title'
# by the first occurrence of " Tier) for ".
# expand=True tells pandas to return the results in a DataFrame
# If expand were False, the result would be a Series of lists.
# Use \) to escape the closing parenthesis, which has special meaning in regex
'tier', 'city']] = df_all['title'].str.split(' Tier\) for ', expand=True) df_all[[
- The DataFrame is sorted first by
state
, then bycity
within eachstate
, then bydate
in descending order within eachcity
, and finally bytier
within eachcity
-date
group.- To sort observations in a custom order (like Low, Middle, High), you can convert the
tier
variable to a categorical variable with an explicit order, and then sort.
- To sort observations in a custom order (like Low, Middle, High), you can convert the
= pd.DataFrame({
df 'tier': ['Middle', 'High', 'Low', 'Middle', 'Low'],
'city': ['NY', 'LA', 'CHI', 'SF', 'SEA']
})
= ['Low', 'Middle', 'High']
tier_order 'tier'] = pd.Categorical(df['tier'], categories=tier_order, ordered=True)
df[
= df.sort_values('tier') df_sorted
Making requests to the FRED API
Click to Check the Answer!
import requests
import pandas as pd
= 'https://bcdanl.github.io/data/fred_api_series_housing_price.csv'
url = pd.read_csv(url)
series_housing_price
= 'YOUR_API_KEY'
api_key = "series/observations"
endpoint = "https://api.stlouisfed.org/"
url
= pd.DataFrame()
df_all for val in series_housing_price['id'].tolist():
= {
params 'api_key': api_key,
'file_type': 'json',
'series_id': f'{val}'
}= requests.get(url + "fred/" + endpoint,
response = params)
params
# Convert JSON response to Python dictionary.
= response.json()
fred
# Extract the "observations" list element.
= pd.DataFrame( fred['observations'] )
df = df[['date', 'value']]
df 'id'] = val
df[= pd.concat([df_all, df], ignore_index=True) df_all
Cleaning raw data
Click to Check the Answer!
# Join
= df_all.merge(df_housing, on = 'id', how = 'left')
df_all
# String operations
'title'] = df_all['title'].str.replace('Home Price Index (', '')
df_all['tier', 'city']] = df_all['title'].str.split(' Tier\) for ', expand=True)
df_all[['city', 'state']] = df_all['city'].str.split(', ', expand=True)
df_all[[
# Selecting variables
= df_all[['city', 'state', 'tier', 'date', 'value']]
df_all
# Renaming
= df_all.rename( columns = { "value": "home_price_index" } )
df_all
# Converting Data Types
= df_all.astype({'value': 'float',
df_all 'date': 'datetime64[ns]'})
= ['Low', 'Middle', 'High'] # your desired order
tier_order 'tier'] = pd.Categorical(df_all['tier'], categories=tier_order, ordered=True)
df_all[
# Sorting
= df_all.sort_values(['state', 'city', 'date', 'tier'],
df_all = [True, True, False, True])
ascending
# Exporting DataFrame as a CSV file
"PATHNAME_OF_df_all.csv",
df_all.to_csv(=False) index
Part 2. Pandas Basics
import pandas as pd
import numpy as np
The following describes the context of DataFrame in Part 2.
Open Payments, which is managed by the Centers for Medicare & Medicaid Services (CMS), is a national disclosure program created by the Affordable Care Act (ACA). The program promotes transparency and accountability by helping consumers understand the financial relationships between pharmaceutical and medical device industries, physicians, and non-physician practitioners (NPP). Non-physician practitioners (NPPs) collectively refers to physician assistants, nurse practitioners, clinical nurse specialists, certified registered nurse anesthetists/anesthesiologists and certified nurse-midwives provider types.
These financial relationships may include consulting fees, research grants, travel reimbursements, and payments made from the industry to medical practitioners. It is important to note that financial ties between the health care industry and health care providers do not necessarily indicate an improper relationship.
- Let us consider the CMS’ Open Payments data for selected cities in New York state in year 2022, the
cms_ny_2022
DataFrame:
= pd.read_csv('https://bcdanl.github.io/data/cms-2022-cities-all.zip') cms_ny_2022
Variable Description
variable | type | description |
---|---|---|
Record_ID |
numeric | Unique identifier for a transaction between a manufacturer and a physician or an NPP |
Month |
numeric | Month |
Day |
numeric | Year |
Amount_of_Payment |
numeric | US dollar amount of payment to a physician or NPP |
Product |
chracter | An indicator showing if the product is a Drug , Device , Biological , or Medical Supply . |
Product_Manufacturer |
chracter | Manufacturer making the payment to a physician/NPP |
NPI |
numeric | National Provider Identifier, a unique identification number for a physician/NPP receiving the payment |
Physician_or_NPP |
chracter | An indicator showing if the recipient of the payment is a physician or non-physician practitioner (NPP) |
First_Name |
chracter | First name of physician/NPP in New York state |
Last_Name |
chracter | Last name of physician/NPP in New York state |
City |
chracter | City of the physician/NPP in New York state |
Type |
chracter | Type of physician/NPP (e.g., Medical Doctor , Doctor of Optometry , Nurse Practitioner , Physician Assistant , and more) |
Taxonomy |
chracter | Standardized taxonomy of medical service provider (e.g., Allopathic & Osteopathic Physicians , Chiropractic Providers , Dental Providers , Physician Assistants & Advanced Practice Nursing Providers , and more) |
Specialty |
chracter | A specialty chosen within the standardized taxonomy (e.g., Psychiatry & Neurology , Physician Assistant , Chiropractor , Dentist , Family Medicine , Neurological Surgery , and more) |
Specialty_Detail |
chracter | Detail in the specialty if any (e.g., Family , Gastroenterology , Cardiovascular Disease , Neurology , Psychiatry , and more) |
- Note that each physician/NPP corresponds to one single
Specialty_Detail
value.
Question 3
- Write a Python code to convert the string type
Physician_or_NPP
variable into the category type variable.
Click to Check the Answer!
'Physician_or_NPP'] = cms_ny_2022['Physician_or_NPP'].astype('category') cms_ny_2022[
Question 4
- Write a Python code to replace the NaN value of the
Specialty_Detail
variable with the value of theSpecialty
variable.- For example, the following shows that
- All NaN values in the
Specialty_Detail
variable, where theSpecialty
is Physician Assistant, are replaced with Physician Assistant.
- All NaN values in the
- For example, the following shows that
np.where()
Click to Check the Answer!
= cms_ny_2022[['Specialty', 'Specialty_Detail']]
q4 'Specialty_Detail'] = np.where(q4['Specialty_Detail'].isna(),
q4['Specialty'],
q4['Specialty_Detail']
q4[ )
fillna()
Click to Check the Answer!
= cms_ny_2022[['Specialty', 'Specialty_Detail']]
q4 'Specialty_Detail']=q4['Specialty_Detail'].fillna(q4['Specialty']) q4[
Question 5
Write a Python code to count the number of transactions in which the payment amount (i.e., Amount_of_Payment
) falls in the top 10% of all payments in the DataFrame.
Click to Check the Answer!
= cms_ny_2022['Amount_of_Payment'].quantile(.9)
top10pct 'Amount_of_Payment'] > top10pct].shape[0] cms_ny_2022[ cms_ny_2022[
Question 6
Write a Python code to identify the city or cities with the highest number of non-missing values in the Specialty_Detail
variable of the cms_ny_2022
DataFrame.
Click to Check the Answer!
(~cms_ny_2022['Specialty_Detail'].isna() ]['City']
cms_ny_2022[
.value_counts()
.reset_index()1, 'count', keep="all")
.nlargest( )
Question 7
- Write a Python code to count the number of occurrences of each unique value in
Specialty_Detail
for each unique value inTaxonomy
.- For example, the following displays the number of occurrences of each unique value in
Specialty_Detail
for the Dental ProvidersTaxonomy
value.
- For example, the following displays the number of occurrences of each unique value in
Click to Check the Answer!
('Taxonomy', 'Specialty_Detail']]
cms_ny_2022[[
.value_counts()
.reset_index()'Taxonomy', 'Specialty_Detail'])
.sort_values([ )
Question 8
- Write a Python code to count the number of unique
Specialty_Detail
values for the value “Dental Providers” inTaxonomy
.
Click to Check the Answer!
('Taxonomy'] == "Dental Providers"]['Specialty_Detail']
cms_ny_2022[cms_ny_2022[
.nunique() )
Click to Check the Answer!
('Taxonomy'] == "Dental Providers"]['Specialty_Detail']
cms_ny_2022[cms_ny_2022[
.value_counts()0]
.shape[ )
Question 9
- Write a Python code (1) to count the number of missing values and (2) to calculate mean, standard deviation, minimum, first quartile, median, third quartile, and maximum of the
Amount_of_Payment
variable for the value Dermatology in theSpecialty
variable.
Click to Check the Answer!
('Specialty'] == "Dermatology"][['Amount_of_Payment']]
cms_ny_2022[cms_ny_2022[
.info() )
Click to Check the Answer!
= cms_ny_2022[cms_ny_2022['Specialty'] == "Dermatology"]['Amount_of_Payment'].isna()
x sum(x)
Click to Check the Answer!
('Specialty'] == "Dermatology"]['Amount_of_Payment']
cms_ny_2022[cms_ny_2022[
.describe() )
Question 10
- Write a Python code to replicate the following DataFrame, which calculates (1) the number of physicians in each city, (2) the number of NPPs in each city, (3) the number of physicians and NPPs in each city, and (4) the percentage of physicians and NPPs in each city, as shown below.
- Ensure that the sum of the percentages of physicians and NPPs within each city is 100.
- Ensure that the observations are sorted by
Total_Count
andCity
in descending and ascending orders, respectively.
Using join
Click to Check the Answer!
= (
df 'City', 'Physician_or_NPP', 'NPI']]
cms_ny_2022[['City', 'Physician_or_NPP']]
.drop_duplicates()[[
.value_counts()
.reset_index()
)
= (
df_1 'City', 'Physician_or_NPP', 'NPI']]
cms_ny_2022[['City']
.drop_duplicates()[
.value_counts()
.reset_index()= {'count' : 'Total_Count'})
.rename( columns
)
= (
df
df= 'City', how = 'left')
.merge(df_1, on 'Total_Count', 'Physician_or_NPP'], ascending = [False, True])
.sort_values([
)
'Percent'] = 100 * df['count'] / df['Total_Count'] df[
Using for-loop with dictionary container
Click to Check the Answer!
0]
cms_ny_2022.shape['NPI'].nunique()
cms_ny_2022[
= (
df 'City', 'Physician_or_NPP', 'NPI']]
cms_ny_2022[['City', 'Physician_or_NPP']]
.drop_duplicates()[[
.value_counts()
.reset_index()
)
= {}
d for city in cms_ny_2022['City'].unique():
= df[ df['City'] == city ]['count'].sum()
val = val
d[ city ]
= pd.DataFrame([d]).T.reset_index()
df_tot = ['City', 'Total_Count']
df_tot.columns
= df.merge(df_tot, on='City', how = 'left')
df 'Percent'] = 100 * df['count'] / df['Total_Count']
df[= df.sort_values(['Total_Count', 'City'], ascending = [False, True]) df
Group operation with groupby()
Click to Check the Answer!
= (
df_grouped 'City', 'Physician_or_NPP', 'NPI']]
cms_ny_2022[[
.drop_duplicates()'City', 'Physician_or_NPP'])
.groupby([# number of observations in each group
.size() ='count')
.reset_index(name
)
'Total_Count'] = df_grouped.groupby('City')['count'].transform('sum')
df_grouped['Percent'] = (100 * df_grouped['count'] / df_grouped['Total_Count'])
df_grouped[= df_grouped.sort_values(['Total_Count', 'City'], ascending = [False, True]) df_grouped
Group operation with assign()
and lambda
Click to Check the Answer!
= (
df_grouped 'City', 'Physician_or_NPP', 'NPI']]
cms_ny_2022[[
.drop_duplicates()'City', 'Physician_or_NPP'])
.groupby([
.size()= 'count')
.reset_index(name
.assign(= lambda df: df.groupby('City')['count'].transform('sum'),
Total_Count = lambda df: 100 * df['count'] / df['Total_Count']
Percent
)'Total_Count', 'City'], ascending=[False, True])
.sort_values([ )
Question 11
- Write a Python code to calculate:
- The total
Amount_of_Payment
theProduct_Manufacturer
AbbVie Inc. paid to physicians and non-physician practitioners (NPPs) inCities
of Rochester, Buffalo, and Syracuse. - How many times the
Product_Manufacturer
AbbVie Inc. paid to physicians and non-physician practitioners (NPPs) in inCities
of Rochester, Buffalo, and Syracuse.
- The total
Click to Check the Answer!
(
cms_ny_2022.query('Product_Manufacturer == "AbbVie Inc." & City.isin(["Rochester", "Buffalo", "Syracuse"])'
'Amount_of_Payment']
)[sum()
. )
Click to Check the Answer!
(
cms_ny_2022.query('Product_Manufacturer == "AbbVie Inc." & City.isin(["Rochester", "Buffalo", "Syracuse"])'
)0]
.shape[ )
Question 12
- Consider the following two DataFrames,
cms_ny_2022_npi
andcms_ny_2022_records
:
= pd.read_csv('https://bcdanl.github.io/data/cms-2022-cities-npi.csv') cms_ny_2022_npi
cms_ny_2022_npi
: This DataFrame is created fromcms_ny_2022
by selecting only the columnsNPI
,Physician_or_NPP
,First_Name
,Last_Name
,City
,Type
,Taxonomy
,Specialty
, andSpecialty_Detail.
- In addition, it contains only unique observations (i.e., there are no duplicate rows).
= pd.read_csv('https://bcdanl.github.io/data/cms-2022-cities-records.csv') cms_ny_2022_records
cms_ny_2022_records
: This DataFrame is derived fromcms_ny_2022
by including every observation but only retaining the columnsRecord_ID
,Month
,Day
,Amount_of_Payment
,Product
,Product_Manufacturer
, andNPI
.Write a Python code to create the
cms_ny_2022
DataFrame by using thecms_ny_2022_npi
andcms_ny_2022_records
DataFrames.
Below is the many-to-many join. Therefore, the cms_ny_2022_new
DataFrame has more observations than cms_ny_2022
.
Click to Check the Answer!
= cms_ny_2022_records.merge(cms_ny_2022_npi, on = 'NPI', how = 'left') cms_ny_2022_new
This happens because NPI
variable in the cms_ny_2022_npi
DataFrame has duplicate values.
Click to Check the Answer!
# Dividing `cms_ny_2022_npi` into the two DataFrames
= cms_ny_2022_npi[cms_ny_2022_npi['NPI'].duplicated(keep=False)].sort_values('NPI')
cms_ny_2022_npi_dup = cms_ny_2022_npi[~cms_ny_2022_npi['NPI'].duplicated(keep=False)]
cms_ny_2022_npi_unique
# Creating a DataFrame with only non-duplicate values in NPI variable:
= cms_ny_2022_npi_dup.drop_duplicates(subset=['NPI'], keep='first')
cms_ny_2022_npi_dup_to_unique = pd.concat([cms_ny_2022_npi_unique, cms_ny_2022_npi_dup_to_unique])
cms_ny_2022_npi_unique_all
# Below can match the number of observations in cms_ny_2022,
# but may not match values in non-NPI variables in cms_ny_2022_npi
= cms_ny_2022_records.merge(cms_ny_2022_npi_unique_all, on = 'NPI', how = 'left')
cms_ny_2022_new2
# To replicate cms_ny_2022, we need to create a new key variable that uniquely identifies observations in `cms_ny_2022_npi`.
Question 13
- Write a Python code to create the following DataFrame,
City_Physician_or_NPP
, which counts the number of physicians and the non-physician practitioners (NPPs) in each city.
Below is correct:
Click to Check the Answer!
= (
City_Physician_or_NPP 'City', 'Physician_or_NPP', 'NPI']]
cms_ny_2022[['City', 'Physician_or_NPP']]
.drop_duplicates()[[
.value_counts()
.reset_index()= 'City',
.pivot(index = 'Physician_or_NPP',
columns = 'count'
values
)
.reset_index() )
Below is incorrect:
Click to Check the Answer!
= (
City_Physician_or_NPP 'City', 'Physician_or_NPP']]
cms_ny_2022[[
.value_counts()
.reset_index()= 'City',
.pivot(index = 'Physician_or_NPP',
columns = 'count'
values
)
.reset_index() )