Lecture 13

Data Preparation and Management

Byeong-Hak Choe

SUNY Geneseo

September 27, 2024

Technologies Used to Manage and Process Big Data

Hadoop

Introduction to Hadoop

  • Definition
    • An open-source software framework for storing and processing large data sets.
  • Components
    • Hadoop Distributed File System (HDFS): Distributed data storage.
    • MapReduce: Data processing model.

  • Purpose
    • Enables distributed processing of large data sets across clusters of computers.

Hadoop Architecture - HDFS

  • HDFS
    • Divides data into blocks and distributes them across different servers for processing.
    • Provides a highly redundant computing environment
      • Allows the application to keep running even if individual servers fail.
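The block-splitting and replication ideas above can be sketched in a few lines of plain Python. This is a toy analogy, not HDFS itself: the 8-byte block size and the `node1`…`node4` server names are invented for illustration (real HDFS defaults to 128 MB blocks and a replication factor of 3).

```python
# Toy sketch of HDFS-style block splitting and placement (not HDFS itself).
BLOCK_SIZE = 8           # bytes per block (toy value; HDFS default is 128 MB)
REPLICATION = 3          # copies kept of each block (HDFS default is 3)
SERVERS = ["node1", "node2", "node3", "node4"]   # hypothetical cluster nodes

def split_into_blocks(data: bytes, size: int = BLOCK_SIZE):
    """Divide the data into fixed-size blocks."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def place_blocks(blocks, servers=SERVERS, replication=REPLICATION):
    """Assign each block to `replication` distinct servers, round-robin."""
    placement = {}
    for block_id in range(len(blocks)):
        start = block_id % len(servers)
        placement[block_id] = [servers[(start + r) % len(servers)]
                               for r in range(replication)]
    return placement

data = b"customer order records ..."
blocks = split_into_blocks(data)        # 4 blocks of at most 8 bytes
placement = place_blocks(blocks)
# If any one server fails, every block still has two live replicas.
```

Because every block lives on three distinct servers, losing a single server never makes any block unreachable, which is the redundancy the slide describes.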

Hadoop Architecture - MapReduce

  • MapReduce
    • Map Phase: Filters and sorts data.
      • e.g., Grouping customer orders by product ID, so that each group contains the orders for a single product.
    • Reduce Phase: Summarizes and aggregates results.
      • e.g., Counting the number of orders within each group, thereby determining the frequency of each product ID.
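The product-ID example above can be sketched in a few lines of Python. This is a single-machine analogy of the map, shuffle/sort, and reduce steps, not Hadoop's actual (Java) MapReduce API; the order records are made up.

```python
# Toy map/reduce mirroring the slide's example: map emits (product_id, 1),
# the shuffle groups pairs by key, and reduce sums each group's counts.
from collections import defaultdict

orders = [
    {"order_id": 1, "product_id": "A"},
    {"order_id": 2, "product_id": "B"},
    {"order_id": 3, "product_id": "A"},
    {"order_id": 4, "product_id": "A"},
]

# Map phase: emit one (key, value) pair per order.
mapped = [(o["product_id"], 1) for o in orders]

# Shuffle/sort: group values by key (Hadoop does this between the phases).
groups = defaultdict(list)
for key, value in sorted(mapped):
    groups[key].append(value)

# Reduce phase: aggregate each group into a single count.
counts = {key: sum(values) for key, values in groups.items()}
# counts == {"A": 3, "B": 1}
```

In real Hadoop the mapped pairs for different keys land on different reducer machines, but the logic per key is exactly this sum.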

How Hadoop Works

  1. Data Distribution
    • Large data sets are split into smaller blocks.
  2. Data Storage
    • Blocks are stored across multiple servers in the cluster.
  3. Processing with MapReduce
    • Map Tasks: Executed on servers where data resides, minimizing data movement.
    • Reduce Tasks: Combine results from map tasks to produce final output.
  4. Fault Tolerance
    • Data replication ensures processing continues even if servers fail.
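The four steps above can be run in miniature on one machine. This sketch is an analogy, not Hadoop: the three `node` blocks and the sample text are invented, and real map tasks run as separate processes on the servers holding each block.

```python
# Sketch of the workflow: split data into blocks, "store" each block on a
# server, run a map task per block where the data lives, then reduce the
# partial results into one final answer.
from collections import Counter

text = "hadoop splits data hadoop stores blocks hadoop maps blocks"

# Steps 1-2. Distribution + storage: deal the words out across 3 servers.
words = text.split()
servers = {f"node{i}": words[i::3] for i in range(3)}

# Step 3a. Map tasks: each server counts only its own local block,
# so no raw data moves across the network.
partial_counts = [Counter(block) for block in servers.values()]

# Step 3b. Reduce task: merge the small per-server partial counts.
total = sum(partial_counts, Counter())
```

Only the compact partial counts travel to the reducer, which is why running map tasks where the data resides minimizes data movement (step 3 above).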

Extending Hadoop for Real-Time Processing

  • Limitation of Hadoop
    • Hadoop was originally designed for batch processing.
      • Batch Processing: Data or tasks are collected over a period of time and then processed all at once, typically at scheduled times or during periods of low activity.
      • Results come after the entire dataset is analyzed.
  • Real-Time Processing Limitation:
    • Hadoop cannot natively process real-time streaming data (e.g., stock prices flowing into stock exchanges, live sensor data).
  • Extending Hadoop’s Capabilities
    • Both Apache Storm and Apache Spark can run on top of Hadoop clusters, utilizing HDFS for storage.

Apache Storm and Apache Spark

Apache Storm

  • Functionality:
    • Processes real-time data streams.
    • Handles unbounded streams of data reliably and efficiently.
  • Use Cases:
    • Real-time analytics
    • Online machine learning
    • Continuous computation
    • Real-time data integration
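Conceptually, Storm wires stream sources ("spouts") to processing steps ("bolts") and handles each tuple as it arrives. The sketch below is a pure-Python analogy of that per-event style, not Storm's actual API; the ticker symbols and prices are made up.

```python
# Per-event stream processing: update running state on every arriving
# tuple instead of waiting for a complete batch.
def price_stream():
    """Stand-in "spout": yields (symbol, price) ticks as they 'arrive'."""
    ticks = [("AAPL", 150.0), ("MSFT", 300.0), ("AAPL", 151.0)]
    yield from ticks  # a real stream would be unbounded and never end

latest = {}                       # running state, updated per tuple
for symbol, price in price_stream():
    latest[symbol] = price        # stand-in "bolt": track the latest price
# latest == {"AAPL": 151.0, "MSFT": 300.0}
```

The key contrast with batch processing: results (here, the latest price per symbol) are current after every event, not only after a scheduled batch run completes.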

Apache Spark

  • Functionality:
    • Provides in-memory computations for increased speed.
    • Supports both batch and streaming data processing through Spark Streaming.
  • Use Cases:
    • Interactive queries for quick, on-the-fly data analysis
    • Machine learning
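One way to see why in-memory computation speeds up interactive queries: repeated queries can reuse a cached dataset instead of rereading it from disk each time. The sketch below is plain Python, not Spark's API; `load_dataset` is a hypothetical stand-in for an expensive HDFS read.

```python
# Contrast recomputing a dataset on every query with caching it in memory.
reads_from_disk = 0

def load_dataset():
    """Stand-in for an expensive read from distributed storage."""
    global reads_from_disk
    reads_from_disk += 1
    return list(range(10))

cached = None
def query(cache: bool):
    global cached
    if cache:
        if cached is None:
            cached = load_dataset()   # first query pays the read cost
        data = cached                 # later queries hit memory
    else:
        data = load_dataset()         # every query rereads from disk
    return sum(data)

for _ in range(3):
    query(cache=True)
# reads_from_disk == 1: three queries, one disk read
```

Spark applies this idea to distributed datasets: once an intermediate result is held in cluster memory, interactive queries and iterative machine-learning passes over the same data avoid repeated disk I/O.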

Medscape: Real-Time Medical News for Healthcare Professionals

  • A medical news app for smartphones and tablets designed to keep healthcare professionals informed.
    • Provides up-to-date medical news and expert perspectives.

  • Real-Time Updates:
    • Uses Apache Storm to process about 500 million tweets per day.
    • Automatic Twitter feed integration helps users track important medical trends shared by physicians and medical commentators.
