Lecture 7

Big Data and the Modern Data Infrastructure

Byeong-Hak Choe

SUNY Geneseo

October 15, 2025

A Simple Taxonomy of Data

Structured Data vs. Unstructured Data

  • Data comes in various formats.
    • Structured data: Has a predefined format, fits into traditional databases.
    • Unstructured data: Not organized in a predefined manner, comes from sources like documents, social media, emails, photos, videos, etc.

Data Types Overview

  • Categorical Data: Data that can be divided into distinct categories based on some qualitative attribute.
    • Nominal Data
    • Ordinal Data
  • Numeric Data: Data that represents measurable quantities and can be subjected to mathematical operations.
    • Interval Data
    • Ratio Data

Categorical Data - Nominal

ID Animal
1 Dog
2 Cat
3 Bird
  • Nominal Data: Categorical data where the categories have no intrinsic order or ranking.
  • No Order: Categories are simply different; there is no logical sequence.
  • Examples:
    • Colors: Red, Blue, Green
    • Types of Animals: Dog, Cat, Bird

Categorical Data - Ordinal

ID Education Level
1 Bachelor’s
2 Master’s
3 PhD
  • Ordinal Data: Categorical data where the categories have a meaningful order or ranking.

  • Order Matters: Categories can be ranked or ordered, but the differences between categories are not necessarily uniform.

  • Examples:

    • Education Levels: High School, Bachelor’s, Master’s, PhD
    • Customer Satisfaction: Poor, Fair, Good, Excellent
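In R (the tool used in this course), ordinal data can be represented as an ordered factor: the levels record the ranking, so comparisons respect it. A minimal sketch, using the customer-satisfaction categories above:

```r
# Ordinal data as an ordered factor: levels encode the ranking
satisfaction <- factor(
  c("Good", "Poor", "Excellent", "Fair"),
  levels  = c("Poor", "Fair", "Good", "Excellent"),
  ordered = TRUE
)

satisfaction > "Fair"   # comparisons respect the ordering
# [1]  TRUE FALSE  TRUE FALSE
```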

Numeric Data - Interval

ID Temperature (°F)
1 70
2 80
3 90
  • Interval Data: Numeric data where the differences between values are meaningful, but there is no true zero point.

  • Meaningful Intervals: The difference between values is consistent.

  • No True Zero: Zero does not indicate the absence of the quantity.

  • Examples:

    • Temperature (°F): Zero degrees does not mean no temperature.
    • Time of Day in a 12-Hour Clock: Differences are meaningful, but there is no absolute zero.

Numeric Data - Ratio

ID Height (cm) Weight (kg)
1 160 55
2 175 70
3 170 65
  • Ratio Data: Numeric data with a true zero point, allowing for a full range of mathematical operations.

  • Meaningful Ratios: Comparisons like “twice as much” or “half as much” are valid.

  • True Zero: Zero indicates the absence of the quantity.

  • Examples:

    • Height in Centimeters: Zero means no height.
    • Weight in Kilograms: Zero means no weight.

Classwork: Taxonomy of Data

Try it out: Classwork 7: Taxonomy of Data

Databases

What Is a Database?

  • A database (DB) is a structured collection of data stored electronically.
  • A Database Management System (DBMS) is software that helps us:
    • Store data (safely and efficiently)
    • Query data (like filter(), select())
    • Update data while keeping everything consistent and valid
  • Examples of DBMS:
    • PostgreSQL / MySQL (used by websites and companies)
    • Google BigQuery, Snowflake (large-scale analysis on cloud system)
    • Excel / Google Sheets — basic storage only, not a full DBMS

Note

  • SQL stands for Structured Query Language, and a query simply means asking the data for something — such as filtering rows, selecting columns, or combining information.
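For example, the filter-rows-and-select-columns idea can be written as a single SQL query; the table and column names below are hypothetical, made up for illustration:

```sql
-- A hypothetical query against a table of social media activity
SELECT student_id, platform        -- choose columns, like select()
FROM social_activity               -- a hypothetical table name
WHERE minutes_per_day > 60;        -- choose rows, like filter()
```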

ETL: 📥 Extract ➜ 🔧 Transform ➜ 💾 Load

  • ETL is the data preparation workflow used in analytics.
ETL Step Meaning in Data Workflow Example in a Database Context
Extract Retrieve raw data from an external source Importing a CSV, Google Sheet, app response, or web data into a temporary table
Transform Clean, reshape, and structure the data Filtering rows, selecting fields, and joining tables
Load Store the cleaned data for analysis Writing the final structured table into a database as the analysis-ready dataset
  • It makes raw data usable by making it clean, consistent, and connected before analysis begins.
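The three steps above can be sketched in R with the tidyverse, as used in this course. This is a minimal sketch: the raw table is created inline (rather than extracted from a real source) so the example is self-contained, and the output file name is made up for illustration.

```r
library(dplyr)
library(readr)

# Extract: in practice, read_csv() on a file or URL;
# a tiny raw table is created inline to keep the sketch self-contained
raw <- data.frame(
  id     = c(1, 2, 3),
  amount = c(100, NA, 250),
  note   = c("ok", "missing value", "ok")
)

# Transform: keep valid observations and relevant variables
clean <- raw |>
  filter(!is.na(amount)) |>
  select(id, amount)

# Load: write the analysis-ready table to its final destination
write_csv(clean, "clean_table.csv")
```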

📥 ETL — Extract Stage

  • Goal: Collect raw data from external sources
    (Google Sheets, CSV/Excel files, survey tools, web exports, or app submissions).

  • Quick Validation Check:

    • ✅ Column names match expected schema
    • ✅ Numeric fields contain numbers only (no symbols/text)
  • Storage at this stage:

    • Raw data is stored in a temporary DB area (like Google Sheets or CSV)
    • ⚠️ This is not yet the official DB — just a collection point

🔧 ETL — Transform Stage

  • Goal: Convert raw data into a clean, consistent, analysis-ready format

  • In DANL 101 using R (tidyverse):

    • filter() → keep valid observations
    • select() → keep relevant variables
    • *_join() → combine multiple tables/data.frames
  • Storage at this stage:

    • Cleaning happens in memory, not yet in the final database
    • The Google Sheet + R behave like a database + DBMS system, with tidyverse acting as the query engine

💾 ETL — Load Stage

  • Goal: Store the clean, final dataset as the official analysis table

  • Storage at this stage:

    • The cleaned data is written to a database area (this becomes the official dataset for analysis)
  • In DANL 101 using R (tidyverse):

    • The final data.frame functions as our primary analysis dataset, used for:
      • 📊 Summaries — descriptive statistics and numeric insights
      • 📈 Visualizations — plots, charts, and dashboard elements
      • 🎯 Storytelling & Analysis — interpreting and communicating insights
  • Try it out: Classwork 8: Databases — Social Media Analytics.

Relational Data Thinking

  • During the Transform step in ETL, our data becomes:
    • Clean, structured in tables, and organized with a shared key column
  • At this point, we start thinking like database analysts:
    • Each table holds one type of data
    • A key column links tables together
  • In real-world analytics, data rarely lives in a single big file.
    • Example: one table stores student social media activity
    • Another table stores platform reference information
  • To analyze properly, we connect these tables using a key.
    • In R tidyverse: left_join()
    • In database language: a join operation

Relational Databases

  • When multiple tables are linked by keys, this structure is called a relational database

tab_project <- read_csv("https://bcdanl.github.io/data/rdb-project_table.csv")
tab_department <- read_csv("https://bcdanl.github.io/data/rdb-department_table.csv")
tab_manager <- read_csv("https://bcdanl.github.io/data/rdb-manager_table.csv")
  • A relational database organizes data into multiple related tables, called relations.
  • Each table stores data about one type of entity (e.g., projects, departments, managers).

Relational Database Characteristics

  1. Data is stored in a data.frame (or table).
  2. Each row = an observation (or record).
  3. Each column = a variable (or attribute).
  4. Each table has a key — a column that uniquely identifies each row.
  5. Keys allow us to link tables together.
  6. We use queries (like filter(), select(), left_join()) to retrieve and combine data.

Relational Tables and Keys

x <- data.frame(
    key = c(1, 2, 3),
    val_x = c('x1', 'x2', 'x3')
)
y <- data.frame(
    key = c(1, 2, 4),
    val_y = c('y1', 'y2', 'y3')
)
  • The colored column represents the “key” variable (key).
  • The grey column represents the “value” variable (val_x, val_y).

Joining Tables with left_join()

x |>
  left_join(y)
  • A left join keeps all rows from x and adds matching information from y.
  • Among the different join types, left_join() is the most commonly used join.
    • It does not lose information from your main data.frame (x) and simply attaches extra information (y) when it exists.
  • Try it out: Classwork 9: Thinking Relationally: Joining Tables.
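With the x and y tables defined earlier, the result can be sketched as follows; supplying by = "key" makes the matching column explicit (and silences dplyr's joining message):

```r
library(dplyr)

x <- data.frame(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- data.frame(key = c(1, 2, 4), val_y = c("y1", "y2", "y3"))

# Keep every row of x; attach val_y where the key matches
joined <- x |> left_join(y, by = "key")
joined
#   key val_x val_y
# 1   1    x1    y1
# 2   2    x2    y2
# 3   3    x3  <NA>
```

Key 3 has no match in y, so val_y is NA for that row; key 4 appears only in y, so it is dropped, because a left join keeps only the rows of x.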

Big Data

Big Data

  • Big data and analytics are likely to be significant components of future careers across various fields.

  • Big data refers to enormous and complex data collections that traditional data management tools can’t handle effectively.

  • Five key characteristics of big data (5 V’s):

    1. Volume
    2. Velocity
    3. Value
    4. Veracity
    5. Variety

Five V’s - 1. Volume

  • In 2017, the digital universe contained an estimated 16.1 zettabytes of data.
  • Expected to grow to 163 zettabytes by 2025.
  • Much new data will come from embedded systems in smart devices.

Five V’s - 1. Volume

Name Symbol Value
Kilobyte kB 10³
Megabyte MB 10⁶
Gigabyte GB 10⁹
Terabyte TB 10¹²
Petabyte PB 10¹⁵
Exabyte EB 10¹⁸
Zettabyte ZB 10²¹
Yottabyte YB 10²⁴
Brontobyte* BB 10²⁷
Gegobyte* GeB 10³⁰

Note: The asterisks (*) mark Brontobyte and Gegobyte as informal, proposed units that extend beyond the standardized system of byte prefixes.
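Each named unit in the table is 10³ times the previous one, so conversions are just powers of ten. For instance, the 16.1-zettabyte estimate for 2017 works out as:

```r
zettabyte <- 10^21                    # bytes in one zettabyte
datasphere_2017 <- 16.1 * zettabyte   # the 2017 estimate, in bytes
datasphere_2017 / 10^12               # the same quantity in terabytes
# [1] 1.61e+10
```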

(Figure: Increase in size of the global datasphere)

Five V’s - 2. Velocity

  • Refers to the rate at which new data is generated.
  • Estimated at about 0.33 zettabytes each day (roughly 120 zettabytes annually).
  • 90% of the world’s data was generated in just the past two years.

Five V’s - 3. Value

  • Refers to the worth of data in decision-making.
  • Emphasizes the need to quickly identify and process relevant data.
  • Users may be able to find more patterns and interesting anomalies from “big” data than from “small” data.

Five V’s - 4. Veracity

  • Measures the quality of the data.
  • Considers accuracy, completeness, and currency of data.
  • Determines if the data can be trusted for good decision-making.

Five V’s - 5. Variety

  • Refers to the wide range of data formats and sources: structured, semi-structured, and unstructured data such as text, images, video, and sensor streams.

Free Sources of Useful (Big) Data

Economics/Finance

Data Source Description URL
Bureau of Labor Statistics (BLS) Provides access to data on inflation and prices, wages and benefits, employment, spending and time use, productivity, and workplace injuries BLS
FRED (Federal Reserve Economic Data) Provides access to a vast collection of U.S. economic data, including interest rates, GDP, inflation, employment, and more FRED
Yahoo Finance Provides comprehensive financial news, data, and analysis, including stock quotes, market data, and financial reports Yahoo Finance
IMF (International Monetary Fund) Provides access to a range of economic data and reports on countries’ economies IMF Data
World Bank Open Data Free and open access to global development data, including world development indicators World Bank Open Data
OECD Data Provides access to economic, environmental, and social data and indicators from OECD member countries OECD Data

Free Sources of Useful (Big) Data

Government/Public Data

Data Source Description URL
Data.gov Portal providing access to over 186,000 government data sets, related to topics such as agriculture, education, health, and public safety Data.gov
CIA World Factbook Portal to information on the economy, government, history, infrastructure, military, and population of 267 countries CIA World Factbook
U.S. Census Bureau Portal to a huge variety of government statistics and data relating to the U.S. economy and its population U.S. Census Bureau
European Union Open Data Portal Provides access to public data from EU institutions EU Open Data Portal
New York City Open Data Provides access to datasets from New York City, covering a wide range of topics such as public safety, transportation, and health NYC Open Data
Los Angeles Open Data Portal for accessing public data from the City of Los Angeles, including transportation, public safety, and city services LA Open Data
Chicago Data Portal Offers access to datasets from the City of Chicago, including crime data, transportation, and health statistics Chicago Data Portal

Free Sources of Useful (Big) Data

Health, Climate/Environment, and Social Data

Data Source Description URL
Healthdata.gov Portal to 125 years of U.S. health care data, including national health care expenditures, claim-level Medicare data, and other topics Healthdata.gov
World Health Organization (WHO) Portal to data and statistics on global health issues WHO Data
National Centers for Environmental Information (NOAA) Portal for accessing a variety of climate and weather data sets NCEI
NOAA National Weather Service Provides weather, water, and climate data, forecasts and warnings NOAA NWS
FAO (Food and Agriculture Organization) Provides access to data on food and agriculture, including data on production, trade, food security, and sustainability FAOSTAT
Pew Research Center Internet & Technology Portal to research on U.S. politics, media and news, social trends, religion, Internet and technology, science, Hispanic, and global topics Pew Research
Data for Good from Facebook Provides access to anonymized data from Facebook to help non-profits and research communities with insights on crises, health, and well-being Facebook Data for Good
Data for Good from Canada Provides open access to datasets that address pressing social challenges across Canada Data for Good Canada

Free Sources of Useful (Big) Data

General Data Repositories

Data Source Description URL
Amazon Web Services (AWS) public data sets Portal to a huge repository of public data, including climate data, the million song dataset, and data from the 1000 Genomes project AWS Datasets
Gapminder Portal to data from the World Health Organization and World Bank on economic, medical, and social issues Gapminder
Google Dataset Search Helps find datasets stored across the web Google Dataset Search
Kaggle Datasets A community-driven platform with datasets from various fields, useful for machine learning and data science projects Kaggle Datasets
UCI Machine Learning Repository A collection of databases, domain theories, and datasets used for machine learning research UCI ML Repository
United Nations Data Provides access to global statistical data compiled by the United Nations UN Data
Humanitarian Data Exchange (HDX) Provides humanitarian data from the United Nations, NGOs, and other organizations HDX
Democratizing Data from data.org A platform providing access to high-impact datasets, tools, and resources aimed at solving critical global challenges Democratizing Data
Justia Federal District Court Opinions and Orders database A free searchable database of full-text opinions and orders from civil cases heard in U.S. Federal District Courts Justia

Technologies Used to Manage and Process Big Data

Technologies Used to Manage and Process Big Data

  • Definition of Big Data
    • Data sets so large and complex that traditional data management tools are inadequate.
  • The Need for Advanced Technologies
    • Emerging technologies are essential to manage, process, and analyze big data effectively.
  • Overview of Topics
    • Data Warehouses
    • Data Marts
    • Data Lakes

Challenges with Traditional Data Management

  • Limitations
    • Traditional software and hardware can’t handle the volume, velocity, and variety of big data.
  • Impact
    • Inability to store massive data volumes efficiently.
    • Difficulty in processing and analyzing data in a timely manner.
  • Solution
    • Adoption of new technologies specifically designed for big data management.

Data Warehouses

  • Definition
    • A large database that holds business information from many sources across the enterprise.
  • Purpose
    • Supports decision-making processes by providing a comprehensive view of the organization’s data.

Data Warehouses Characteristics

Characteristic Description
Large Holds billions of records and petabytes of data
Multiple Sources Data comes from many internal and external sources via the ETL process
Historical Typically contains data spanning 5 years or more
Cross-Organizational Access and Analysis Data accessed and analyzed by users across the organization to support multiple business processes and decision-making
Supports Various Analyses and Reporting Enables drill-down analysis, metric development, trend identification

Data Warehouse Architecture

  • Data Sources
    • Internal Systems: Online transaction processing systems, customer relationship management, enterprise resource planning.
    • External Systems: Social media, government databases, etc.
  • ETL Process
    • Extract, Transform, Load (to be discussed in detail).
  • Data Storage
    • Centralized repository optimized for query and analysis.
  • Data Access
    • Used by various departments for reporting, analysis, and decision-making.

Examples of Data Warehouse Usage

  • Walmart
    • Early adopter; used data warehouse to gain a competitive advantage in supply chain management.
    • Held transaction data from over 11,000 stores and 25,000 suppliers.
    • First commercial data warehouse to reach 1 terabyte in 1992.
    • Collects over 2.5 petabytes of data per hour as of 2024.

Examples of Data Warehouse Usage

  • WHOOP Wearable Device
    • Collects massive biometric data from athletes.
    • Data warehouse stores sensor data collected 100 times per second.
    • Provides insights on strain, recovery, and sleep to optimize performance.

Examples of Data Warehouse Usage

  • American Airlines
    • Flight attendants access customer data to enhance in-flight service.
    • Ability to resolve issues by offering free miles or travel vouchers based on customer history.

Data Quality in Data Warehouses

  • Challenges
    • Data Inconsistencies: Duplicate or missing data leading to incorrect analyses.
    • Dirty Data: Inaccurate, incomplete, or outdated information.
    • Misleading statistics: “Garbage in, garbage out”
  • Importance
    • Ensuring high data quality is critical to avoid misleading conclusions.
  • Solution
    • Implement robust ETL processes to cleanse and standardize data.

Data Marts

  • Subset of Data Warehouse
    • Focused on a specific business area or department.
  • Purpose
    • Provides relevant data to a specific group without the complexity of the entire data warehouse.
  • Advantages
    • Cost-Effective: Less expensive to implement and maintain.
    • Faster Access: Optimized for departmental needs.
    • Simplified Analysis: Easier to query and analyze due to reduced data volume.

Data Mart Use Cases

  • Departmental Focus
    • Finance: Financial reporting and analysis.
    • Marketing: Customer segmentation and campaign analysis.
    • Inventory: Stock levels and supply chain management.
  • Small to Medium-Sized Businesses
    • An affordable alternative to a full-scale data warehouse.

Data Marts vs. Data Warehouses

Feature Data Mart Data Warehouse
Scope Specific department or business area Entire enterprise
Data Volume Smaller Larger
Complexity Less complex More complex
Implementation Time Shorter Longer
Cost Lower Higher

Data Lakes

  • “Store Everything” Approach
    • Stores all data in its raw, unaltered form.
  • Purpose
    • Provides a centralized repository for all data, accommodating future analytical needs.
  • Characteristics
    • Data Variety: Includes structured, semi-structured, and unstructured data.
    • Flexibility: Data is available for any type of analysis at any time.

Data Lakes vs. Data Warehouses

Feature Data Lake Data Warehouse
Data Processing Schema-on-read (processed when accessed) Schema-on-write (processed before storage)
Data State Raw, unprocessed Cleaned, transformed
Data Types All data types Primarily structured
Flexibility High Moderate

Data Lakes - Case Study: Bechtel Corporation

  • About Bechtel
    • Global engineering, construction, and project management company.
  • Implementation
    • Built a 5-petabyte data lake consolidating years of project data.

Data Lakes - Case Study: Bechtel Corporation

  • Benefits
    • Historical Insights: Access to data from hundreds of projects worldwide.
    • Improved Forecasting: Better predictions of project outcomes.
    • Cost Reduction: Identifying inefficiencies to cut costs.
    • Competitive Advantage: Enhanced ability to win new contracts through data-driven insights.