Provide a Python Selenium code to download all the MS Excel files in the webpage above.
Please make sure that you do not download any PDF files.
Click to Check the Answer!
# %%# =============================================================================# Setup# =============================================================================# import pandas as pd# import numpy as np# import os# from selenium.webdriver.chrome.service import Servicefrom selenium import webdriverfrom selenium.webdriver.chrome.options import Optionsfrom selenium.webdriver.common.by import Byimport timeimport random# Optional: Set your desired download directory# Default download directory is usually "Download"download_dir ='/Users/bchoe/My Drive/suny-geneseo/spring2025/lecture-code/data'chrome_options = Options()chrome_options.add_experimental_option("prefs", {"download.default_directory": download_dir,"download.prompt_for_download": False,"download.directory_upgrade": True,"safebrowsing.enabled": True# Allow downloads without warnings})driver = webdriver.Chrome(options=chrome_options)url ='https://www.nyc.gov/site/finance/property/property-annualized-sales-update.page'driver.get(url)# %%# =============================================================================# Downloading only Excel files (.xlsx or .xls)# =============================================================================excel_files_1 = driver.find_elements(By.PARTIAL_LINK_TEXT, 'Download')excel_files_2 = driver.find_elements(By.PARTIAL_LINK_TEXT, 'M')excel_files = excel_files_1 + excel_files_2for item in excel_files:if".x"in item.get_attribute('href'): item.click() time.sleep(random.uniform(.5, 1.25))
Question 2
The series_housing_price DataFrame contains FRED series ids for the Home Price Index across 16 metropolitan statistical areas, each categorized into three price tiers: low, middle, and high.
For more information regarding the index, please visit Standard & Poor’s. There is more information about home price sales pairs in the Methodology section. Copyright, 2016, Standard & Poor’s Financial Services LLC.
Provide a Python code (excluding your FRED API key) to construct the following DataFrame:
Note that the id variable in the series_housing_price DataFrame corresponds to the series_id key used in the parameter dictionary (param_dicts) used for making requests to the FRED API.
To iterate over the id values in the series_housing_price DataFrame, you can convert the variable to a list using .tolist():
for val in series_housing_price['id'].tolist():# do something with val
Below replaces ‘Home Price Index (’ with ’’ (empty string) in the title column:
df_all['title'] = df_all['title'].str.replace('Home Price Index (', '')
The title column can split into the tier and city variables using the following code:
# Split the 'title' column into two new variables, 'ranking' and 'title'# by the first occurrence of " Tier) for ". # expand=True tells pandas to return the results in a DataFrame# If expand were False, the result would be a Series of lists.# Use \) to escape the closing parenthesis, which has special meaning in regexdf_all[['tier', 'city']] = df_all['title'].str.split(' Tier\) for ', expand=True)
The DataFrame is sorted first by state, then by city within each state, then by date in descending order within each city, and finally by tier within each city-date group.
To sort observations in a custom order (like Low, Middle, High), you can convert the tier variable to a categorical variable with an explicit order, and then sort.
The following describes the context of DataFrame in Part 2.
Open Payments, which is managed by the Centers for Medicare & Medicaid Services (CMS), is a national disclosure program created by the Affordable Care Act (ACA). The program promotes transparency and accountability by helping consumers understand the financial relationships between pharmaceutical and medical device industries, physicians, and non-physician practitioners (NPP). Non-physician practitioners (NPPs) collectively refers to physician assistants, nurse practitioners, clinical nurse specialists, certified registered nurse anesthetists/anesthesiologists and certified nurse-midwives provider types.
These financial relationships may include consulting fees, research grants, travel reimbursements, and payments made from the industry to medical practitioners. It is important to note that financial ties between the health care industry and health care providers do not necessarily indicate an improper relationship.
Let us consider the CMS’ Open Payments data for selected cities in New York state in year 2022, the cms_ny_2022 DataFrame:
Unique identifier for a transaction between a manufacturer and a physician or an NPP
Month
numeric
Month
Day
numeric
Year
Amount_of_Payment
numeric
US dollar amount of payment to a physician or NPP
Product
chracter
An indicator showing if the product is a Drug, Device, Biological, or Medical Supply.
Product_Manufacturer
chracter
Manufacturer making the payment to a physician/NPP
NPI
numeric
National Provider Identifier, a unique identification number for a physician/NPP receiving the payment
Physician_or_NPP
chracter
An indicator showing if the recipient of the payment is a physician or non-physician practitioner (NPP)
First_Name
chracter
First name of physician/NPP in New York state
Last_Name
chracter
Last name of physician/NPP in New York state
City
chracter
City of the physician/NPP in New York state
Type
chracter
Type of physician/NPP (e.g., Medical Doctor, Doctor of Optometry, Nurse Practitioner, Physician Assistant, and more)
Taxonomy
chracter
Standardized taxonomy of medical service provider (e.g., Allopathic & Osteopathic Physicians, Chiropractic Providers, Dental Providers, Physician Assistants & Advanced Practice Nursing Providers, and more)
Specialty
chracter
A specialty chosen within the standardized taxonomy (e.g., Psychiatry & Neurology, Physician Assistant, Chiropractor, Dentist, Family Medicine, Neurological Surgery, and more)
Specialty_Detail
chracter
Detail in the specialty if any (e.g., Family, Gastroenterology, Cardiovascular Disease, Neurology, Psychiatry, and more)
Note that each physician/NPP corresponds to one single Specialty_Detail value.
Question 3
Write a Python code to convert the string type Physician_or_NPP variable into the category type variable.
Write a Python code to count the number of transactions in which the payment amount (i.e., Amount_of_Payment) falls in the top 10% of all payments in the DataFrame.
Write a Python code to identify the city or cities with the highest number of non-missing values in the Specialty_Detail variable of the cms_ny_2022 DataFrame.
Write a Python code (1) to count the number of missing values and (2) to calculate mean, standard deviation, minimum, first quartile, median, third quartile, and maximum of the Amount_of_Payment variable for the value Dermatology in the Specialty variable.
Write a Python code to replicate the following DataFrame, which calculates (1) the number of physicians in each city, (2) the number of NPPs in each city, (3) the number of physicians and NPPs in each city, and (4) the percentage of physicians and NPPs in each city, as shown below.
Ensure that the sum of the percentages of physicians and NPPs within each city is 100.
Ensure that the observations are sorted by Total_Count and City in descending and ascending orders, respectively.
Using for-loop with dictionary container
Click to Check the Answer!
cms_ny_2022.shape[0]cms_ny_2022['NPI'].nunique()df = ( cms_ny_2022[['City', 'Physician_or_NPP', 'NPI']] .drop_duplicates()[['City', 'Physician_or_NPP']] .value_counts() .reset_index() )d = {}for city in cms_ny_2022['City'].unique(): val = df[ df['City'] == city ]['count'].sum() d[ city ] = valdf_tot = pd.DataFrame([d]).T.reset_index()df_tot.columns = ['City', 'Total_Count']df = df.merge(df_tot, on='City', how ='left')df['Percent'] =100* df['count'] / df['Total_Count']df = df.sort_values(['Total_Count', 'City'], ascending = [False, True])
Group operation with groupby()
Click to Check the Answer!
df_grouped = ( cms_ny_2022[['City', 'Physician_or_NPP', 'NPI']] .drop_duplicates() .groupby(['City', 'Physician_or_NPP']) .size() # number of observations in each group .reset_index(name='count') )df_grouped['Total_Count'] = df_grouped.groupby('City')['count'].transform('sum')df_grouped['Percent'] = (100* df_grouped['count'] / df_grouped['Total_Count'])df_grouped = df_grouped.sort_values(['Total_Count', 'City'], ascending = [False, True])
The total Amount_of_Payment the Product_ManufacturerAbbVie Inc. paid to physicians and non-physician practitioners (NPPs) in Cities of Rochester, Buffalo, and Syracuse.
How many times the Product_ManufacturerAbbVie Inc. paid to physicians and non-physician practitioners (NPPs) in in Cities of Rochester, Buffalo, and Syracuse.
cms_ny_2022_npi: This DataFrame is created from cms_ny_2022 by selecting only the columns NPI, Physician_or_NPP, First_Name, Last_Name, City, Type, Taxonomy, Specialty, and Specialty_Detail.
In addition, it contains only unique observations (i.e., there are no duplicate rows).
cms_ny_2022_records: This DataFrame is derived from cms_ny_2022 by including every observation but only retaining the columns Record_ID, Month, Day, Amount_of_Payment, Product, Product_Manufacturer, and NPI.
Write a Python code to create the cms_ny_2022 DataFrame by using the cms_ny_2022_npi and cms_ny_2022_records DataFrames.
Below is the many-to-many join. Therefore, the cms_ny_2022_new DataFrame has more observations than cms_ny_2022.
Click to Check the Answer!
cms_ny_2022_new = cms_ny_2022_records.merge(cms_ny_2022_npi, on ='NPI', how ='left')
This happens because NPI variable in the cms_ny_2022_npi DataFrame has duplicate values.
Click to Check the Answer!
# Dividing `cms_ny_2022_npi` into the two DataFramescms_ny_2022_npi_dup = cms_ny_2022_npi[cms_ny_2022_npi['NPI'].duplicated(keep=False)].sort_values('NPI')cms_ny_2022_npi_unique = cms_ny_2022_npi[~cms_ny_2022_npi['NPI'].duplicated(keep=False)]# Creating a DataFrame with only non-duplicate values in NPI variable:cms_ny_2022_npi_dup_to_unique = cms_ny_2022_npi_dup.drop_duplicates(subset=['NPI'], keep='first')cms_ny_2022_npi_unique_all = pd.concat([cms_ny_2022_npi_unique, cms_ny_2022_npi_dup_to_unique])# Below can match the number of observations in cms_ny_2022, # but may not match values in non-NPI variables in cms_ny_2022_npicms_ny_2022_new2 = cms_ny_2022_records.merge(cms_ny_2022_npi_unique_all, on ='NPI', how ='left')# To replicate cms_ny_2022, we need to create a new key variable that uniquely identifies observations in `cms_ny_2022_npi`.
Question 13
Write a Python code to create the following DataFrame, City_Physician_or_NPP, which counts the number of physicians and the non-physician practitioners (NPPs) in each city.