Big Data and the Modern Data Infrastructure
October 15, 2025
Nominal Data: Categorical data where the categories have no meaningful order or ranking.
Examples:
ID | Animal |
---|---|
1 | Dog |
2 | Cat |
3 | Bird |
Ordinal Data: Categorical data where the categories have a meaningful order or ranking.
Order Matters: Categories can be ranked or ordered, but the differences between categories are not necessarily uniform.
Examples:
ID | Education Level |
---|---|
1 | Bachelor’s |
2 | Master’s |
3 | PhD |
Interval Data: Numeric data where the differences between values are meaningful, but there is no true zero point.
Meaningful Intervals: The difference between values is consistent.
No True Zero: Zero does not indicate the absence of the quantity.
Examples:
ID | Temperature (°F) |
---|---|
1 | 70 |
2 | 80 |
3 | 90 |
Ratio Data: Numeric data with a true zero point, allowing for a full range of mathematical operations.
Meaningful Ratios: Comparisons like twice as much or half as much are valid.
True Zero: Zero indicates the absence of the quantity.
Examples:
ID | Height (cm) | Weight (kg) |
---|---|---|
1 | 160 | 55 |
2 | 175 | 70 |
3 | 170 | 65 |
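The four data types can be sketched in base R. This is a minimal illustration, not course code: the variable names are hypothetical, and the values are taken from the example tables.

```r
# Nominal: unordered labels; only equality checks make sense
animal <- factor(c("Dog", "Cat", "Bird"))

# Ordinal: an ordered factor, so ranking comparisons are allowed
edu_levels <- c("Bachelor's", "Master's", "PhD")
education <- factor(edu_levels, levels = edu_levels, ordered = TRUE)
phd_outranks_bachelors <- education[3] > education[1]

# Interval: differences are meaningful, but ratios depend on the unit
temp_f <- c(70, 80, 90)
temp_c <- (temp_f - 32) * 5 / 9
ratio_f <- temp_f[3] / temp_f[1]  # ratio computed in Fahrenheit
ratio_c <- temp_c[3] / temp_c[1]  # a different ratio in Celsius

# Ratio: a true zero makes ratios independent of the unit
weight_kg <- c(55, 70, 65)
weight_lb <- weight_kg * 2.20462
ratio_kg <- weight_kg[2] / weight_kg[1]
ratio_lb <- weight_lb[2] / weight_lb[1]  # same ratio as in kilograms
```

The Fahrenheit and Celsius ratios disagree, which is why "90°F is 1.29 times as hot as 70°F" is not a valid statement, while the weight ratio is the same in kilograms and pounds.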
Try it out → Classwork 7: Taxonomy of Data
Note
ETL Step | Meaning in Data Workflow | Example in a Database Context |
---|---|---|
Extract | Retrieve raw data from an external source | Importing a CSV, Google Sheet, app response, or web data into a temporary table |
Transform | Clean, reshape, and structure the data | Filtering rows, selecting fields, and joining tables |
Load | Store the cleaned data for analysis | Writing the final structured table into a database as the analysis-ready dataset |
Extract
Goal: Collect raw data from external sources (Google Sheets, CSV/Excel files, survey tools, web exports, or app submissions).
Quick Validation Check:
Storage at this stage:
Transform
Goal: Convert raw data into a clean, consistent, analysis-ready format
In DANL 101 using R (tidyverse):
filter() → keep valid observations
select() → keep relevant variables
*_join() → combine multiple tables/data.frames
Storage at this stage:
tidyverse acting as the query engine
Load
Goal: Store the clean, final dataset as the official analysis table
Storage at this stage:
In DANL 101 using R (tidyverse):
The data.frame functions as our primary analysis dataset, used for:
Try it out → Classwork 8: Databases — Social Media Analytics.
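The three ETL steps can be sketched end to end in R. This is a minimal sketch under stated assumptions: the social media data, the column names, and the temporary CSV file that stands in for a database table are all hypothetical.

```r
library(dplyr)
library(readr)

# Extract: read raw data (literal CSV text stands in for an external source)
raw_posts <- read_csv(I("post_id,user_id,likes,platform
1,10,25,instagram
2,11,NA,tiktok
3,10,40,instagram
4,12,5,youtube
"), show_col_types = FALSE)

raw_users <- read_csv(I("user_id,user_name
10,ana
11,ben
"), show_col_types = FALSE)

# Transform: keep valid rows, keep relevant columns, combine tables
clean_posts <- raw_posts %>%
  filter(!is.na(likes)) %>%              # drop posts with a missing like count
  select(post_id, user_id, likes) %>%    # keep only the variables we need
  left_join(raw_users, by = "user_id")   # attach user names where they exist

# Load: store the cleaned table (a temporary CSV stands in for a database)
out_path <- tempfile(fileext = ".csv")
write_csv(clean_posts, out_path)
analysis_table <- read_csv(out_path, show_col_types = FALSE)
```

In a setup with a real database, the Load step would write to a database table instead of a file; the Extract and Transform steps would stay the same.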
left_join()
library(tidyverse)  # provides read_csv() and the join functions

# Read the three related tables from the course website
tab_project <- read_csv("https://bcdanl.github.io/data/rdb-project_table.csv")
tab_department <- read_csv("https://bcdanl.github.io/data/rdb-department_table.csv")
tab_manager <- read_csv("https://bcdanl.github.io/data/rdb-manager_table.csv")
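Each dataset is stored as a data.frame (or table).
We use functions (filter(), select(), left_join()) to retrieve and combine data.
Tables are matched on a key column (key).
Each table also has its own value column (val_x, val_y).
left_join() keeps all rows from x and adds matching information from y.
left_join() is the most commonly used join.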
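A small example makes the behavior concrete. This is a minimal sketch: the two toy tables x and y are made up for illustration and reuse the column names key, val_x, and val_y.

```r
library(dplyr)

# Two toy tables that share a key column
x <- data.frame(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- data.frame(key = c(1, 2, 4), val_y = c("y1", "y2", "y4"))

# Keep every row of x; add val_y where the key matches
joined <- left_join(x, y, by = "key")
```

Here key 3 has no match in y, so its val_y is NA, and key 4 exists only in y, so it does not appear in the result.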
A left join preserves every row of your main table (x) and simply attaches extra information (y) when it exists.
Big data and analytics are likely to be significant components of future careers across various fields.
Big data refers to enormous and complex data collections that traditional data management tools can’t handle effectively.
Five key characteristics of big data (5 V’s): Volume, Velocity, Variety, Veracity, and Value.
Name | Symbol | Value (bytes) |
---|---|---|
Kilobyte | kB | 10³ |
Megabyte | MB | 10⁶ |
Gigabyte | GB | 10⁹ |
Terabyte | TB | 10¹² |
Petabyte | PB | 10¹⁵ |
Exabyte | EB | 10¹⁸ |
Zettabyte | ZB | 10²¹ |
Yottabyte | YB | 10²⁴ |
Brontobyte* | BB | 10²⁷ |
Gegobyte* | GeB | 10³⁰ |
Note: Brontobyte and Gegobyte (marked *) are informal names, not standard units. The SI prefixes adopted in 2022 for 10²⁷ and 10³⁰ are ronna (R) and quetta (Q).
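The decimal units in the table can be used for quick size arithmetic in R. This is a minimal sketch: the names byte_units and convert_bytes are hypothetical helpers, and the example sizes are made up.

```r
# Decimal byte units from the table (standard units only)
byte_units <- c(kB = 1e3, MB = 1e6, GB = 1e9, TB = 1e12,
                PB = 1e15, EB = 1e18, ZB = 1e21, YB = 1e24)

# Express a size given in bytes in the requested unit
convert_bytes <- function(bytes, unit) {
  bytes / byte_units[[unit]]
}

# A 5,000,000,000,000-byte archive expressed in terabytes
archive_tb <- convert_bytes(5e12, "TB")

# Number of 1 TB drives needed to hold 1 zettabyte
drives_per_zb <- byte_units[["ZB"]] / byte_units[["TB"]]
```

One zettabyte corresponds to a billion 1 TB drives, which gives a sense of the scale behind the volume characteristic of big data.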
Increase in size of the global datasphere
Data Source | Description | Link |
---|---|---|
Bureau of Labor Statistics (BLS) | Provides access to data on inflation and prices, wages and benefits, employment, spending and time use, productivity, and workplace injuries | BLS |
FRED (Federal Reserve Economic Data) | Provides access to a vast collection of U.S. economic data, including interest rates, GDP, inflation, employment, and more | FRED |
Yahoo Finance | Provides comprehensive financial news, data, and analysis, including stock quotes, market data, and financial reports | Yahoo Finance |
IMF (International Monetary Fund) | Provides access to a range of economic data and reports on countries’ economies | IMF Data |
World Bank Open Data | Free and open access to global development data, including world development indicators | World Bank Open Data |
OECD Data | Provides access to economic, environmental, and social data and indicators from OECD member countries | OECD Data |
Data Source | Description | Link |
---|---|---|
Data.gov | Portal providing access to over 186,000 government data sets, related to topics such as agriculture, education, health, and public safety | Data.gov |
CIA World Factbook | Portal to information on the economy, government, history, infrastructure, military, and population of 267 countries and other world entities | CIA World Factbook |
U.S. Census Bureau | Portal to a huge variety of government statistics and data relating to the U.S. economy and its population | U.S. Census Bureau |
European Union Open Data Portal | Provides access to public data from EU institutions | EU Open Data Portal |
New York City Open Data | Provides access to datasets from New York City, covering a wide range of topics such as public safety, transportation, and health | NYC Open Data |
Los Angeles Open Data | Portal for accessing public data from the City of Los Angeles, including transportation, public safety, and city services | LA Open Data |
Chicago Data Portal | Offers access to datasets from the City of Chicago, including crime data, transportation, and health statistics | Chicago Data Portal |
Data Source | Description | Link |
---|---|---|
Healthdata.gov | Portal to 125 years of U.S. health care data, including national health care expenditures, claim-level Medicare data, and other topics | Healthdata.gov |
World Health Organization (WHO) | Portal to data and statistics on global health issues | WHO Data |
National Centers for Environmental Information (NOAA) | Portal for accessing a variety of climate and weather data sets | NCEI |
NOAA National Weather Service | Provides weather, water, and climate data, forecasts and warnings | NOAA NWS |
FAO (Food and Agriculture Organization) | Provides access to data on food and agriculture, including data on production, trade, food security, and sustainability | FAOSTAT |
Pew Research Center Internet & Technology | Portal to research on U.S. politics, media and news, social trends, religion, Internet and technology, science, Hispanic trends, and global topics | Pew Research |
Data for Good from Facebook | Provides access to anonymized data from Facebook to help non-profits and research communities with insights on crises, health, and well-being | Facebook Data for Good |
Data for Good from Canada | Provides open access to datasets that address pressing social challenges across Canada | Data for Good Canada |
Data Source | Description | Link |
---|---|---|
Amazon Web Services (AWS) public data sets | Portal to a huge repository of public data, including climate data, the million song dataset, and data from the 1000 Genomes project | AWS Datasets |
Gapminder | Portal to data from the World Health Organization and World Bank on economic, medical, and social issues | Gapminder |
Google Dataset Search | Helps find datasets stored across the web | Google Dataset Search |
Kaggle Datasets | A community-driven platform with datasets from various fields, useful for machine learning and data science projects | Kaggle Datasets |
UCI Machine Learning Repository | A collection of databases, domain theories, and datasets used for machine learning research | UCI ML Repository |
United Nations Data | Provides access to global statistical data compiled by the United Nations | UN Data |
Humanitarian Data Exchange (HDX) | Provides humanitarian data from the United Nations, NGOs, and other organizations | HDX |
Democratizing Data from data.org | A platform providing access to high-impact datasets, tools, and resources aimed at solving critical global challenges | Democratizing Data |
Justia Federal District Court Opinions and Orders database | A free searchable database of full-text opinions and orders from civil cases heard in U.S. Federal District Courts | Justia |
Characteristic | Description |
---|---|
Large | Holds billions of records and petabytes of data |
Multiple Sources | Data comes from many internal and external sources via the ETL process |
Historical | Typically contains data spanning 5 years or more |
Cross-Organizational Access and Analysis | Data accessed and analyzed by users across the organization to support multiple business processes and decision-making |
Supports Various Analyses and Reporting | Enables drill-down analysis, metric development, and trend identification |
Feature | Data Mart | Data Warehouse |
---|---|---|
Scope | Specific department or business area | Entire enterprise |
Data Volume | Smaller | Larger |
Complexity | Less complex | More complex |
Implementation Time | Shorter | Longer |
Cost | Lower | Higher |
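The difference in scope can be illustrated with a toy example in R. This is a sketch that assumes a data mart derived from the warehouse; the table, department names, and amounts are hypothetical.

```r
library(dplyr)

# A tiny stand-in for an enterprise-wide warehouse table
warehouse_spending <- data.frame(
  department = c("Marketing", "Finance", "Marketing", "HR"),
  year       = c(2023, 2024, 2024, 2024),
  spend      = c(120, 300, 150, 80)
)

# A data mart: only the rows and columns one department needs
marketing_mart <- warehouse_spending %>%
  filter(department == "Marketing") %>%
  select(year, spend)
```

The mart is smaller and simpler than the warehouse because it serves a single business area.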
Feature | Data Lake | Data Warehouse |
---|---|---|
Data Processing | Schema-on-read (processed when accessed) | Schema-on-write (processed before storage) |
Data State | Raw, unprocessed | Cleaned, transformed |
Data Types | All data types | Primarily structured |
Flexibility | High | Moderate |
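The contrast between schema-on-write and schema-on-read can be sketched in base R. This is a simplified illustration, not how a real warehouse or lake is implemented: the function names and the raw records are hypothetical.

```r
# Raw records as they arrive: all text, with one malformed amount
raw <- data.frame(
  id     = c("1", "2", "3"),
  amount = c("10.5", "n/a", "7"),
  stringsAsFactors = FALSE
)

# Schema-on-write (warehouse): enforce types and drop bad rows before storing
store_with_schema <- function(df) {
  amount <- suppressWarnings(as.numeric(df$amount))
  valid <- !is.na(amount)
  data.frame(id = as.integer(df$id[valid]), amount = amount[valid])
}
warehouse <- store_with_schema(raw)

# Schema-on-read (lake): store the raw records untouched...
lake <- raw

# ...and apply the schema only when the data is accessed
read_with_schema <- function(df) {
  data.frame(
    id     = as.integer(df$id),
    amount = suppressWarnings(as.numeric(df$amount))
  )
}
lake_view <- read_with_schema(lake)
```

The warehouse holds only the two clean rows, while the lake keeps all three raw records and surfaces the malformed value as NA at read time.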