Big Data and the Modern Data Infrastructure
October 15, 2025
Nominal Data: Categorical data where the categories have no inherent order or ranking.
No Order: Categories are labels only and cannot be meaningfully ranked.
Examples:
| ID | Animal |
|---|---|
| 1 | Dog |
| 2 | Cat |
| 3 | Bird |
Ordinal Data: Categorical data where the categories have a meaningful order or ranking.
Order Matters: Categories can be ranked or ordered, but the differences between categories are not necessarily uniform.
Examples:
| ID | Education Level |
|---|---|
| 1 | Bachelor’s |
| 2 | Master’s |
| 3 | PhD |
Interval Data: Numeric data where the differences between values are meaningful, but there is no true zero point.
Meaningful Intervals: The difference between values is consistent.
No True Zero: Zero does not indicate the absence of the quantity.
Examples:
| ID | Temperature (°F) |
|---|---|
| 1 | 70 |
| 2 | 80 |
| 3 | 90 |
Ratio Data: Numeric data with a true zero point, allowing for a full range of mathematical operations.
Meaningful Ratios: Comparisons like twice as much or half as much are valid.
True Zero: Zero indicates the absence of the quantity.
Examples:
| ID | Height (cm) | Weight (kg) |
|---|---|---|
| 1 | 160 | 55 |
| 2 | 175 | 70 |
| 3 | 170 | 65 |
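To make the distinction concrete, here is a minimal sketch in R showing how nominal and ordinal variables can be encoded as unordered and ordered factors (the values mirror the example tables above):

```r
# Nominal: labels with no inherent order
animal <- factor(c("Dog", "Cat", "Bird"))

# Ordinal: labels with a meaningful ranking, encoded as an ordered factor
education <- factor(
  c("Bachelor's", "Master's", "PhD"),
  levels  = c("Bachelor's", "Master's", "PhD"),
  ordered = TRUE
)

education[1] < education[2]   # TRUE: Bachelor's ranks below Master's
# animal[1] < animal[2] would return NA with a warning:
# '<' is not meaningful for unordered factors
```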
Try it out → Classwork 7: Taxonomy of Data
| ETL Step | Meaning in Data Workflow | Example in a Database Context |
|---|---|---|
| Extract | Retrieve raw data from an external source | Importing a CSV, Google Sheet, app response, or web data into a temporary table |
| Transform | Clean, reshape, and structure the data | Filtering rows, selecting fields, and joining tables |
| Load | Store the cleaned data for analysis | Writing the final structured table into a database as the analysis-ready dataset |
Extract
Goal: Collect raw data from external sources (Google Sheets, CSV/Excel files, survey tools, web exports, or app submissions).
Quick validation check: confirm that the import succeeded and the raw data looks as expected.
Storage at this stage: a temporary, raw table or data.frame, imported as-is.
Transform
Goal: Convert raw data into a clean, consistent, analysis-ready format.
In DANL 101 using R (tidyverse):
- filter() → keep valid observations
- select() → keep relevant variables
- *_join() → combine multiple tables/data.frames
Storage at this stage: a data.frame in R, with the tidyverse acting as the query engine.
Load
Goal: Store the clean, final dataset as the official analysis table.
Storage at this stage: the final, analysis-ready table.
In DANL 101 using R (tidyverse): the final data.frame functions as our primary analysis dataset, used for all the analysis that follows.
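Putting the three steps together, a minimal end-to-end ETL sketch in R (the URL, column names, and filter condition are hypothetical):

```r
library(tidyverse)

# Extract: import raw data from an external source (hypothetical URL)
raw_survey <- read_csv("https://example.com/raw-survey.csv")

# Transform: keep valid observations and relevant variables
clean_survey <- raw_survey |>
  filter(!is.na(age)) |>    # drop observations with a missing age (hypothetical column)
  select(id, age, income)   # keep only the variables needed for analysis

# Load: store the cleaned result as the official analysis table
analysis_tbl <- clean_survey
write_csv(analysis_tbl, "analysis-survey.csv")
```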
Try it out → Classwork 8: Databases — Social Media Analytics.
left_join()
```r
library(tidyverse)

tab_project <-
  read_csv("https://bcdanl.github.io/data/rdb-project_table.csv")

tab_department <-
  read_csv("https://bcdanl.github.io/data/rdb-department_table.csv")

tab_manager <-
  read_csv("https://bcdanl.github.io/data/rdb-manager_table.csv")
```

In R, each imported dataset is a data.frame (or table).
We use tidyverse functions (e.g., filter(), select(), left_join()) to retrieve and combine data.
Tables are linked through a shared identifier variable (a key).
When joined tables share non-key column names, the result distinguishes them with suffixes (e.g., val_x, val_y).
left_join(x, y) keeps all rows from x and adds matching information from y.
left_join() is the most commonly used join: it keeps everything from the left table (x) and simply attaches extra information from the right table (y) when it exists.
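To see the join in action with the tables loaded above, a minimal sketch (the key column name department_id is an assumption, not confirmed from the files):

```r
# Keep every row of tab_project; attach department details where the
# (hypothetical) department_id key matches, and NA where it does not
tab_combined <- tab_project |>
  left_join(tab_department, by = "department_id")
```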
Big data and analytics are key components shaping the future across industries.
Big Data refers to enormous, complex datasets that traditional tools can’t efficiently manage.
Characterized by the Five V’s: Volume, Velocity, Variety, Veracity, and Value.
| Unit | Symbol | Value |
|---|---|---|
| Kilobyte | kB | 10³ |
| Megabyte | MB | 10⁶ |
| Gigabyte | GB | 10⁹ |
| Terabyte | TB | 10¹² |
| Petabyte | PB | 10¹⁵ |
| Exabyte | EB | 10¹⁸ |
| Zettabyte | ZB | 10²¹ |
| Yottabyte | YB | 10²⁴ |
| Brontobyte* | BB | 10²⁷ |
| Gegobyte* | GeB | 10³⁰ |
*Less commonly used or proposed extensions.
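As a quick sanity check on these powers of ten, two conversions in R (the 2.5-petabyte-per-hour rate anticipates the example at the end of this section):

```r
# How many terabytes fit in one petabyte?
1e15 / 1e12                   # 1,000 TB per PB

# A 2.5 PB-per-hour stream expressed in gigabytes per second
(2.5 * 1e15 / 1e9) / 3600     # ~694 GB per second
```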
Figure: Growth of the Global Datasphere.
Structured data is organized into predefined, named fields (e.g., name, age, income).
Characteristics of a Data Warehouse:
| Characteristic | Description |
|---|---|
| Large | Stores billions of records and petabytes of data |
| Multiple Sources | Integrates internal and external data via ETL |
| Historical | Often includes 5+ years of archived data |
| Cross-Organizational | Accessible across departments for data-driven strategy |
| Supports Analysis & Reporting | Enables drill-downs and trend detection |
| Schema-Based | Data fits a predefined structure before being stored for efficient querying and analysis |
Example of a predefined schema: store_id (integer), sales (numeric).
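A minimal sketch of enforcing such a schema when loading data in R (the file name sales.csv is hypothetical):

```r
library(tidyverse)

# Declare the expected column types up front, mirroring a predefined schema
sales_tbl <- read_csv(
  "sales.csv",
  col_types = cols(
    store_id = col_integer(),   # store identifier must be an integer
    sales    = col_double()     # sales amount must be numeric
  )
)
```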
Walmart was an early adopter of data-driven supply chain optimization.
Collects transaction data from 11,000+ stores and 25,000 suppliers
Uses real-time analytics to optimize pricing, inventory, and customer experience
In 1992, launched the first commercial data warehouse to exceed 1 TB
In 2025, processes data at a rate of 2.5 petabytes per hour