Data Collection with APIs
April 11, 2025
# %% defines a coding cell (shortcut: Ctrl/Cmd + 1).
Use full-screen mode for your Spyder IDE.
Use a three-finger trackpad gesture to move across screens, or Cmd+Tab to switch between the Chrome web browser and the Spyder IDE.
Clients are the typical web user’s internet-connected devices (e.g., a computer connected to Wi-Fi) and web-accessing software available on those devices (e.g., Firefox, Chrome).
Servers are computers that store webpages, sites, or apps.
When a client device wants to access a webpage, a copy of the webpage is downloaded from the server onto the client machine to be displayed in the user’s web browser.
Hypertext Transfer Protocol (HTTP) is a language for clients and servers to speak to each other.
Hypertext Transfer Protocol Secure (HTTPS) is an encrypted version of HTTP that provides secure communication between them.
When we type a web address with “https://” into our browser, the browser (client) sends an encrypted request to the server, and the server returns the page over the same secure connection.
The requests library is the de facto standard for making HTTP requests in Python.
The most common HTTP method is GET (i.e., asking a server to retrieve information). Other common methods include POST, PUT, and DELETE.
requests provides a matching function for each method: requests.get(), requests.post(), requests.put(), requests.delete(), etc.
requests.get() sends a GET request to retrieve data from a specified URL:
import requests
url = 'https://www.example.com/.....'
param_dicts = {.....} # Optional, but we often need this
response = requests.get(url, params=param_dicts)
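Since param_dicts shapes the final request URL, it can help to see the query string that requests builds from it. Below is a minimal offline sketch: nothing is sent, and the URL and parameter names are made up for illustration.

```python
import requests

# Hypothetical URL and parameters, just to illustrate encoding
url = 'https://www.example.com/search'
param_dicts = {'q': 'open data', 'page': 2}

# Prepare (but do not send) the request to inspect the final URL
prepared = requests.Request('GET', url, params=param_dicts).prepare()
print(prepared.url)  # https://www.example.com/search?q=open+data&page=2
```

Note how the space in 'open data' is percent-encoded for the URL automatically.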
url: API endpoint where data is requested.
param_dicts: Dictionary of query parameters sent with the request.
response.status_code: Returns the HTTP status code (e.g., 200 for success).
response.json(): Converts JSON-format data into a Python dictionary.
response.text: Decoded text version of the response content.
Check status_code before processing the response.
Use response.json() for easier handling of JSON data.
NYC Open Data (https://opendata.cityofnewyork.us) is free public data published by NYC agencies and other partners.
Many metropolitan cities have Open Data websites, too.
import requests
import pandas as pd
endpoint = 'https://data.cityofnewyork.us/resource/ic3t-wcy2.json' ## API endpoint
response = requests.get(endpoint)
content = response.json() # to convert JSON response data to a dictionary
df = pd.DataFrame(content)
The requests.get() method sends a GET request to the specified URL.
response.json() automatically converts the JSON response data into a Python dictionary.
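The habit of checking status_code before calling response.json() can be captured in a small helper. This is an offline sketch: FakeResponse is a made-up stand-in for requests.Response, used only so the pattern runs without a network call.

```python
import json

class FakeResponse:
    """Made-up stand-in for requests.Response (offline illustration only)."""
    def __init__(self, status_code, body):
        self.status_code = status_code
        self._body = body

    def json(self):
        return json.loads(self._body)

def handle_response(response):
    # Check the status code before touching the body
    if response.status_code == 200:
        return response.json()  # parse JSON into Python objects
    raise RuntimeError(f"Request failed with status {response.status_code}")

data = handle_response(FakeResponse(200, '{"borough": "MANHATTAN"}'))
print(data)  # {'borough': 'MANHATTAN'}
```

In real code, the response object would come from requests.get(); the check pattern stays the same.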
Most API interfaces will only let you access and download data after you have registered an API key with them.
Let’s download economic data from the FRED https://fred.stlouisfed.org using its API.
You need to create an account https://fredaccount.stlouisfed.org/login/ to get an API key for your FRED account.
As with all APIs, a good place to start is the FRED API developer docs https://fred.stlouisfed.org/docs/api/fred/.
We are interested in series/observations: https://fred.stlouisfed.org/docs/api/fred/series_observations.html
The parameters that we will use are api_key
, file_type
, and series_id
.
Replace “YOUR_API_KEY” with your actual API key in the following web address: https://api.stlouisfed.org/fred/series/observations?series_id=GNPCA&api_key=YOUR_API_KEY&file_type=json
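That full URL is exactly what requests assembles from a parameter dictionary. As an offline check (the request is prepared but never sent, and YOUR_API_KEY is a placeholder):

```python
import requests

param_dicts = {
    'series_id': 'GNPCA',
    'api_key': 'YOUR_API_KEY',  # placeholder, not a real key
    'file_type': 'json',
}

# Prepare (but do not send) the request to inspect the final URL
prepared = requests.Request(
    'GET',
    'https://api.stlouisfed.org/fred/series/observations',
    params=param_dicts,
).prepare()
print(prepared.url)
# https://api.stlouisfed.org/fred/series/observations?series_id=GNPCA&api_key=YOUR_API_KEY&file_type=json
```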
Import the requests, json, and pandas libraries.
requests comes with a variety of features that allow us to interact more flexibly and securely with web APIs.
import requests # to handle API requests
import json # to parse JSON response data
import pandas as pd
param_dicts = {
'api_key': 'YOUR_FRED_API_KEY', ## Change to your own key
'file_type': 'json',
'series_id': 'GDPC1' ## ID for US real GDP
}
url = "https://api.stlouisfed.org/"
endpoint = "series/observations"
api_endpoint = url + "fred/" + endpoint # string concatenation
response = requests.get(api_endpoint, params = param_dicts)
response.json(): Converts the JSON response into a Python dictionary object.
Note that the data values in the response content are string-type.
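Once the request succeeds, the observations can go into a DataFrame. Per the FRED docs, the JSON body holds an "observations" list of records; the sketch below uses made-up sample records in that shape, and converts the string-type values to proper numeric and datetime types.

```python
import pandas as pd

# Made-up sample shaped like a FRED series/observations JSON response
content = {
    "observations": [
        {"date": "2024-01-01", "value": "22960.6"},
        {"date": "2024-04-01", "value": "23223.9"},
    ]
}

df = pd.DataFrame(content["observations"])
df["value"] = pd.to_numeric(df["value"], errors="coerce")  # strings -> floats
df["date"] = pd.to_datetime(df["date"])                    # strings -> dates
print(df.dtypes)
```

With a real response, you would replace the sample dictionary with content = response.json().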
Let’s do Classwork 11!
pynytimes
While the NYTimes Developer Portal provides API documentation, it is time-consuming to go through it.
There is an unofficial Python library called pynytimes
that provides a more user-friendly interface for working with the NYTimes API.
To get started, check out Introduction to pynytimes.
Most industry-scale websites display data from their database servers.
Sometimes, it is possible to find their hidden APIs to retrieve data!
Examples:
Look for a json-type response that seems to have the data.
Our course covers only the very basics of the requests methods.
For those who are interested in Python’s requests library, I recommend the following references:
Your Hugging Face API token looks like hf_xxx... . Keep your API token private.
# Text you'd like to summarize
text_to_summarize = """
Summarize the following movie plot in a sentence.
Four misfits are suddenly pulled through a mysterious portal into a bizarre cubic wonderland that thrives on imagination.
To get back home they'll have to master this world while embarking on a quest with an unexpected expert crafter.
"""
import requests
# Your Hugging Face token (replace with your real token)
HF_API_TOKEN = "hf_your_real_token_here"
# The summarization model's API URL
API_URL = "https://api-inference.huggingface.co/models/facebook/bart-large-cnn"
# Add your token to the request headers
headers = {"Authorization": f"Bearer {HF_API_TOKEN}"}
payload = {"inputs": text_to_summarize}
headers: The "Authorization": "Bearer ..." header lets Hugging Face know you have permission to use their API.
payload: The {"inputs": text_to_summarize} dictionary carries the text you want summarized.
requests.post(): Sends your text to the model's endpoint.
# Send POST request
response = requests.post(API_URL, headers=headers, json=payload)
# Check response and extract the summary
if response.status_code == 200:
output = response.json()
summary = output[0]["summary_text"]
print("\nSummary from Hugging Face Inference API:")
print(summary)
else:
print(f"\nAPI request failed with status code {response.status_code}:")
print(response.text)