
Chapter 2: Data Centers, Cloud & Distributed Processing

Advanced Infrastructure for Big Data Systems

🏢 Data Centers
☁️ Cloud Computing
🔗 Distributed Systems
🐘 Hadoop Ecosystem

🎯 Learning Objectives

🏗️

Data Center Architecture

Master physical infrastructure: servers, cooling, power, networking, and performance optimization

☁️

Cloud Computing Models

Explore IaaS, PaaS, SaaS and cloud-native architectures for Big Data

🔗

Distributed Systems

Design sharding strategies, apply consistent hashing, and handle distributed-system challenges

⚙️

MapReduce & Hadoop

Implement parallel processing algorithms and explore the Hadoop ecosystem

🕐

Advanced Concurrency

Understand MVCC and Vector Clocks for distributed consistency

🎯

Practical Implementation

Design real-world solutions with hands-on exercises and case studies

🏢 Data Center Architecture & Infrastructure

🏗️ What is a Data Center?

A data center is a centralized facility housing computing infrastructure including servers, storage systems, networking equipment, and critical supporting infrastructure like power distribution, cooling systems, and security controls.

💻

Computing

Process applications & workloads

💾

Storage

Store & manage data assets

🌐

Networking

Connect resources internally & externally

Support Services

Power, cooling, security, monitoring

🔧 Physical Infrastructure Components

🖥️ Server Racks & Hardware Infrastructure

📏 Standard Specifications
  • 42U Racks: Standard height (1U = 1.75 inches)
  • Server Types: Blade, rack-mount, high-density
  • Power Density: Up to 25kW per rack
  • Cooling Design: Hot/cold aisle configuration
🔌 Infrastructure Management
  • Cable Management: Structured power & networking
  • Rack Organization: Optimized airflow patterns
  • Asset Tracking: RFID and barcode systems
  • Remote Access: IPMI and out-of-band management

Power Infrastructure & Energy Efficiency

📊 Power Usage Effectiveness (PUE)
PUE = Total Facility Energy / IT Equipment Energy
Ideal PUE: 1.0 (100% efficiency) | Industry Average: 1.4-1.8 | Best Practice: 1.2-1.3
🔋 Power Distribution
  • PDUs: Intelligent power management
  • UPS Systems: Backup during outages
  • Generators: Long-term power protection
  • A/B Feed: Redundant power paths
📈 Efficiency Metrics
  • Load Utilization: 80-90% optimal
  • Power Factor: >0.9 for efficiency
  • Harmonic Distortion: <5% THD
  • Monitoring: Real-time power analytics
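The PUE formula above is simple enough to sanity-check in code (a quick sketch; the kW figures below are illustrative, not measurements):

```python
def pue(total_facility_kw, it_equipment_kw):
    """Power Usage Effectiveness = total facility energy / IT equipment energy."""
    return total_facility_kw / it_equipment_kw

# Illustrative figures: 1,400 kW total draw, 1,000 kW reaching IT equipment
print(pue(1400, 1000))   # → 1.4  (industry average)
print(pue(1250, 1000))   # → 1.25 (best-practice range)
```

A PUE of 1.4 means that for every watt doing useful computation, another 0.4 W goes to cooling, power conversion, and lighting.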

❄️ Advanced Cooling Technologies

🏭
Data Center Cooling Architecture
Hot Aisle/Cold Aisle Configuration with CRAC Units
🌡️ Traditional Cooling
  • CRAC Units: Computer Room Air Conditioning
  • Raised Floor: Underfloor air distribution
  • Hot/Cold Aisles: Airflow optimization
  • Temperature: 68-77°F (20-25°C)
💧 Liquid Cooling
  • Direct-to-Chip: CPU/GPU cooling blocks
  • Immersion Cooling: Servers in dielectric fluid
  • Efficiency: 40-50% energy reduction
  • Density: Supports high-performance computing
🌿 Green Technologies
  • Free Cooling: Outside air when possible
  • Heat Recovery: Capture waste heat
  • Variable Speed: Adaptive fan controls
  • AI Optimization: ML-driven efficiency

💾 Storage Infrastructure Evolution

📀 Traditional Storage
  • DAS: Direct Attached Storage
  • NAS: Network Attached Storage (file-level)
  • SAN: Storage Area Networks (block-level)
  • Use Case: Structured data, databases
☁️ Software-Defined Storage
  • Virtualization: Abstract physical storage
  • Scalability: Scale-out architecture
  • Flexibility: Policy-driven management
  • Examples: VMware vSAN, Nutanix
🚀 Modern Storage
  • NVMe: High-performance SSDs
  • Object Storage: Web-scale data lakes
  • Hybrid: SSD + HDD optimization
  • Cloud Integration: Seamless hybrid storage

🌐 Network Infrastructure & Architecture

🔀 Network Switches

  • Access (ToR): Server connectivity within racks
  • Distribution: Aggregate multiple access switches
  • Core: High-speed backbone (100Gbps+)
  • SDN: Software-defined networking control
Modern: 400Gbps+ spine-leaf architecture

🛡️ Security Infrastructure

  • Firewalls: Network access control & inspection
  • IDS/IPS: Intrusion detection & prevention
  • DDoS Protection: Attack mitigation systems
  • Micro-segmentation: Zero-trust networking
AI-powered threat detection & response

⚖️ Load Balancers

  • Layer 4: Transport-layer traffic distribution
  • Layer 7: Application-aware intelligence
  • Global LB: Multi-site traffic management
  • Health Checks: Automatic failover
Auto-scaling with cloud integration

📐 Network Architecture Evolution

🏛️ Three-Tier (Traditional)
  • Core Layer: High-speed backbone
  • Aggregation: Service integration
  • Access: Server connectivity
  • Limitations: Oversubscription, complexity
🍃 Spine-Leaf (Modern)
  • Spine: Connect to all leaf switches
  • Leaf: Server and ToR connections
  • Benefits: Predictable latency, easy scaling
  • Use: Cloud and Big Data architectures

☁️ Cloud Computing for Big Data

🌍 Cloud Deployment Models

🌐

Public Cloud

Multi-tenant, internet-accessible, pay-per-use model

✅ Benefits:
  • Cost-effective scaling
  • Rapid deployment
  • Managed services
  • Global availability
⚠️ Challenges:
  • Data sovereignty
  • Security concerns
  • Vendor lock-in
  • Compliance complexity
🏢

Private Cloud

Single-tenant, on-premises or hosted infrastructure

✅ Benefits:
  • Enhanced security
  • Regulatory compliance
  • Full control
  • Predictable performance
⚠️ Challenges:
  • Higher costs
  • Maintenance overhead
  • Scaling limitations
  • Requires expertise
🔄

Hybrid Cloud

Combination of public and private cloud resources

🎯 Use Cases:
  • Burst computing
  • Data sovereignty
  • Disaster recovery
  • Cost optimization
🛠️ Technologies:
  • Cloud bursting
  • Data replication
  • Unified management
  • Identity federation

Multi-Cloud

Multiple cloud providers for different services

✅ Benefits:
  • Vendor diversification
  • Best-of-breed services
  • Cost optimization
  • Risk mitigation
⚠️ Challenges:
  • Increased complexity
  • Integration challenges
  • Data consistency
  • Management overhead

🏗️ Cloud Service Models (SPI Stack)

1 Infrastructure as a Service

Virtual computing, storage, and networking resources

You Manage:
Applications, Data, Middleware, OS, Runtime
Provider Manages:
Infrastructure, Virtualization, Servers, Storage
Examples:
  • 🟠 AWS EC2, EBS, VPC
  • 🔵 Azure Virtual Machines, Storage
  • 🔴 Google Compute Engine

2 Platform as a Service

Development frameworks, middleware, databases

You Manage:
Applications and Data
Provider Manages:
Runtime, Middleware, OS, Infrastructure
Examples:
  • 🟠 AWS Elastic Beanstalk
  • 🔵 Azure App Service
  • 🔴 Google App Engine

3 Software as a Service

Complete, ready-to-use applications

You Manage:
Only your data and user access
Provider Manages:
Everything else - Complete application stack
Examples:
  • 💼 Salesforce CRM
  • 📊 Microsoft 365
  • 🎓 Google Workspace

🌍 Major Cloud Providers for Big Data

Service Category | 🟠 Amazon AWS | 🔵 Microsoft Azure | 🔴 Google Cloud
📦 Storage | S3, EBS, Glacier | Blob Storage, Data Lake Storage | Cloud Storage, Filestore
⚙️ Big Data Processing | EMR, Glue, Lambda | HDInsight, Databricks | Dataproc, Dataflow
📊 Data Warehousing | Redshift, Athena | Synapse Analytics | BigQuery
🤖 ML/AI Services | SageMaker, Bedrock | Azure Machine Learning, Cognitive Services | Vertex AI, AutoML
🌊 Stream Processing | Kinesis, MSK | Stream Analytics, Event Hubs | Pub/Sub, Dataflow
🗄️ NoSQL Databases | DynamoDB, DocumentDB | Cosmos DB | Firestore, Bigtable

🔗 Distributed Systems: Sharding & Data Partitioning

❓ Why Do We Need Sharding?

The Problem: As data grows beyond what a single database can handle efficiently, we need to distribute data across multiple servers. Sharding is the partitioning of data based on a key, enabling horizontal scaling.

📈 Scale Beyond Single Server

Single databases hit CPU, memory, and disk I/O limits. Sharding distributes load across many servers.

⚡ Parallel Query Execution

Queries can execute in parallel on multiple shards, improving throughput and reducing latency.

💾 Optimize Data Locality

Data placement affects network traffic. Sharding enables co-locating related data for efficiency.

🎯 Sharding Strategies

1 Range-Based Sharding

Data is partitioned based on ranges of a key. For example, users with IDs 1-1000 go to Shard A, 1001-2000 to Shard B, etc.

Visual Example:
📦 Shard A
Users
1-1000
📦 Shard B
Users
1001-2000
📦 Shard C
Users
2001-3000
✅ Advantages
  • Simple to implement
  • Range queries efficient
  • Easy to rebalance
  • Predictable placement
❌ Disadvantages
  • Hot spots possible
  • Uneven distribution
  • Manual rebalancing
  • Adding shards difficult
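The range table above maps naturally onto a sorted list of boundaries, searched with binary search (a sketch; the shard names and ranges mirror the visual example):

```python
import bisect

# Upper bound of each range, matched to shards A, B, C
BOUNDS = [1000, 2000, 3000]
SHARDS = ["Shard A", "Shard B", "Shard C"]

def shard_for(user_id):
    """Binary-search the range boundaries for the owning shard."""
    i = bisect.bisect_left(BOUNDS, user_id)
    if i >= len(SHARDS):
        raise ValueError(f"user_id {user_id} is outside the configured ranges")
    return SHARDS[i]

print(shard_for(500))    # → Shard A
print(shard_for(1001))   # → Shard B
print(shard_for(3000))   # → Shard C
```

Range queries stay efficient because consecutive IDs land on the same shard, but a popular ID range turns one shard into a hot spot, exactly the disadvantage listed above.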

2 Hash-Based Sharding

A hash function maps each key to a shard. Formula: shard_id = hash(key) % num_shards

Algorithm Example:
# Given 4 shards and user_id = 2547
shard_id = hash(2547) % 4
         = 847392 % 4        # illustrative hash value
         = 0  → Shard 0

# A well-mixed hash function gives an even key distribution
# Common choices: MurmurHash, CityHash, SHA-1
✅ Advantages
  • Uniform distribution
  • Minimal rebalancing
  • Fast shard lookup
  • Good for load balance
❌ Disadvantages
  • Range queries inefficient
  • Reshuffling on scale
  • No geographic locality
  • Adding shards requires rehashing
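The formula above can be made runnable with a deterministic hash (a sketch; CRC32 stands in for MurmurHash/CityHash, since Python's built-in `hash()` is randomly salted per process):

```python
import zlib

def shard_for(key, num_shards=4):
    """shard_id = hash(key) % num_shards, using CRC32 so the
    mapping is stable across runs and machines."""
    return zlib.crc32(str(key).encode("utf-8")) % num_shards

# Every key deterministically lands on one of the 4 shards
assignments = [shard_for(user_id) for user_id in range(10_000)]
print({s: assignments.count(s) for s in range(4)})  # roughly even spread
```

Note the reshuffling problem: if `num_shards` changes from 4 to 5, almost every key's `hash(key) % num_shards` result changes, which is what consistent hashing (next) is designed to avoid.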

3 Consistent Hashing (Advanced)

Solves the reshuffling problem. Both keys and nodes are placed on a hash ring. Only nearby nodes are affected when nodes join/leave.

🔄 Consistent Hash Ring Visualization

Keys map to nearest node clockwise on the ring

✅ Advantages
  • Minimal reshuffling
  • Only K/N keys move
  • Scales efficiently
  • Used in production
❌ Disadvantages
  • Complex implementation
  • Hot spots possible
  • Uneven distribution
  • Virtual nodes overhead
📚 Real-World Usage
  • Amazon DynamoDB: Consistent hashing for automatic data distribution
  • Apache Cassandra: Consistent hash with virtual nodes (vnodes) for load balancing
  • Redis Cluster: Hash slots (16,384) mapped to nodes
  • Memcached: Client-side consistent hashing for cache distribution
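A minimal ring can be sketched in a few lines (illustrative, not a production implementation; MD5 stands in for the ring hash, and virtual nodes smooth out the uneven-distribution problem noted above):

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes     # virtual nodes per physical node
        self.ring = {}           # ring position -> physical node
        self.sorted_keys = []    # sorted ring positions for binary search
        for node in nodes:
            self.add_node(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self.vnodes):
            pos = self._hash(f"{node}#{i}")
            self.ring[pos] = node
            bisect.insort(self.sorted_keys, pos)

    def remove_node(self, node):
        for i in range(self.vnodes):
            pos = self._hash(f"{node}#{i}")
            del self.ring[pos]
            self.sorted_keys.remove(pos)

    def get_node(self, key):
        """A key maps to the nearest node clockwise on the ring."""
        if not self.sorted_keys:
            return None
        pos = bisect.bisect(self.sorted_keys, self._hash(key))
        if pos == len(self.sorted_keys):
            pos = 0              # wrap around the ring
        return self.ring[self.sorted_keys[pos]]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
owner = ring.get_node("user:2547")
ring.remove_node(owner)            # only this node's keys move
print(ring.get_node("user:2547"))  # now served by a neighbouring node
```

Removing a node relocates only the keys that node owned (roughly K/N of them); keys owned by the surviving nodes keep their placement, which is the whole point of the ring.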

🔄 MapReduce: Distributed Batch Processing

🎯 The MapReduce Paradigm

MapReduce is a programming model for processing large datasets in parallel on a distributed cluster. Inspired by functional programming's map and reduce operations.

Key Idea: Divide work into independent map tasks → collect intermediate results → combine via reduce tasks

📊 MapReduce Execution Pipeline

1️⃣

Input Split

Data divided into chunks (HDFS block size)

2️⃣

Map Phase

Process records → emit key-value pairs

3️⃣

Shuffle & Sort

Group values by key, sort for reducers

4️⃣

Reduce Phase

Aggregate values → emit final output

📝 Complete Example: Word Count

Input Data

File 1: "Hello World Hello"
File 2: "Hello Hadoop"
File 3: "World"

Map Phase Output

File 1 Mapper 0: (Hello, 1), (World, 1), (Hello, 1)
File 2 Mapper 1: (Hello, 1), (Hadoop, 1)
File 3 Mapper 2: (World, 1)

After Shuffle & Sort

Reducer 0: (Hadoop, [1])
Reducer 1: (Hello, [1, 1, 1])
Reducer 2: (World, [1, 1])

Reduce Phase Output

(Hadoop, 1)
(Hello, 3)
(World, 2)

💻 Python Implementation

from collections import defaultdict

def mapper(key, value):
    """Emit (word, 1) for each word in the line"""
    for word in value.split():
        yield (word, 1)

def reducer(key, values):
    """Sum counts for each word"""
    yield (key, sum(values))

# Simulated execution: map → shuffle & sort → reduce
docs = {"doc1": "hello world hello", "doc2": "hello hadoop"}
groups = defaultdict(list)
for doc_id, line in docs.items():
    for word, count in mapper(doc_id, line):
        groups[word].append(count)        # shuffle: group values by key

output = [kv for word in sorted(groups)   # sort keys, as the framework does
             for kv in reducer(word, groups[word])]
# output: [("hadoop", 1), ("hello", 3), ("world", 1)]

⚡ Strengths

  • Automatic parallelization - Framework handles distribution
  • Fault tolerance - Failed tasks automatically rerun
  • Data locality - Computation moves to data
  • Scalability - Linear scaling with cluster size
  • Simple programming model - Easy to understand

⚠️ Limitations

  • Batch only - No real-time processing
  • High disk I/O - Intermediate results on disk
  • Iterative workloads - Inefficient for ML
  • Startup overhead - Job initialization cost
  • Complex joins - Multi-way joins challenging

🕐 Advanced Concurrency: MVCC & Vector Clocks

📚 MVCC: Multi-Version Concurrency Control

Problem: How do we allow readers and writers to work simultaneously without locking conflicts?
Solution: Maintain multiple versions of each record; readers see consistent snapshots.

🔄 MVCC Timeline Example

T1 (Write v0): User Age = 25 (Version 0)

T2 (Read from T1): Reader sees v0 → Age = 25

T3 (Write v1): User Age = 30 (Version 1 created)

T4 (Read from T3): Reader sees v1 → Age = 30

T5 (Garbage Collection): v0 deleted (no active readers)

🔍 How It Works
  • Each write creates new version
  • Readers see snapshot at timestamp
  • No locks on read operations
  • Writers still use locks (minimal)
  • Garbage collection cleans old versions
⚙️ Implementation Details
  • Version numbers: Incremented per transaction
  • Read timestamps: Record snapshot seen
  • Undo/Redo logs: For crash recovery
  • Compaction: Merge versions periodically
  • Used by: PostgreSQL, MySQL InnoDB
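The timeline above can be sketched as a tiny versioned store (illustrative only; real engines add transaction IDs, visibility rules, and garbage collection):

```python
class MVCCStore:
    """Minimal MVCC sketch: every write appends a new version, and
    readers see the latest version at or before their snapshot."""

    def __init__(self):
        self.versions = {}   # key -> list of (commit_ts, value), append-only
        self.ts = 0          # logical commit timestamp

    def write(self, key, value):
        self.ts += 1                     # writers serialize on the counter
        self.versions.setdefault(key, []).append((self.ts, value))
        return self.ts

    def read(self, key, snapshot_ts):
        """Lock-free read: scan back for the newest visible version."""
        for commit_ts, value in reversed(self.versions.get(key, [])):
            if commit_ts <= snapshot_ts:
                return value
        return None                      # key did not exist at that snapshot

# Replaying the timeline above
store = MVCCStore()
t1 = store.write("age", 25)    # T1: version 0
t3 = store.write("age", 30)    # T3: version 1; v0 kept for older readers
print(store.read("age", t1))   # → 25 (reader pinned to the T1 snapshot)
print(store.read("age", t3))   # → 30
```

The key property: the second write never blocks the first reader, because the reader's snapshot timestamp keeps selecting version 0 until the reader finishes.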

🕒 Vector Clocks: Ordering Distributed Events

Problem: In distributed systems, there's no global clock. How do we determine if Event A happened before Event B?
Solution: Use vector clocks to track causality between events.

📐 Vector Clock Rules

1️⃣ Initialize

Each process starts with vector [0, 0, ..., 0]

2️⃣ Local Event

Process increments its own clock component

3️⃣ Send Message

Include vector clock with message

4️⃣ Receive Message

Take element-wise maximum, then increment own

Example: 3 Processes

Process A, B, C initially: [0,0,0]

Event A1: Local event at A
  A's clock: [1,0,0]

Event B1: Local event at B
  B's clock: [0,1,0]

Message: A sends to B with [1,0,0]
  B receives, updates: max([0,1,0], [1,0,0]) = [1,1,0]
  B increments own: [1,2,0]

Event C1: Local event at C
  C's clock: [0,0,1]

Message: B sends to C with [1,2,0]
  C receives: max([0,0,1], [1,2,0]) = [1,2,1]
  C increments own: [1,2,2]

Causality: A1 → B1 → C1 (shown by vector progression)
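The four rules can be coded directly; the sketch below replays the three-process example step by step (the `happened_before` helper tests causality between two clocks):

```python
class VectorClock:
    """One clock per process; index pid is this process's component."""

    def __init__(self, n, pid):
        self.clock = [0] * n
        self.pid = pid

    def local_event(self):
        self.clock[self.pid] += 1

    def send(self):
        """Attach a copy of the current clock to an outgoing message."""
        return list(self.clock)

    def receive(self, msg_clock):
        """Element-wise maximum, then increment own component."""
        self.clock = [max(a, b) for a, b in zip(self.clock, msg_clock)]
        self.clock[self.pid] += 1

def happened_before(u, v):
    """u → v iff u <= v component-wise and u != v (otherwise concurrent)."""
    return all(a <= b for a, b in zip(u, v)) and u != v

# Replaying the three-process example
a, b, c = VectorClock(3, 0), VectorClock(3, 1), VectorClock(3, 2)
a.local_event()        # A: [1,0,0]
b.local_event()        # B: [0,1,0]
b.receive(a.send())    # B: max([0,1,0],[1,0,0]) then +1 → [1,2,0]
c.local_event()        # C: [0,0,1]
c.receive(b.send())    # C: max([0,0,1],[1,2,0]) then +1 → [1,2,2]
print(c.clock)         # → [1, 2, 2]
```

Note that A1 ([1,0,0]) and B1 ([0,1,0]) compare as concurrent: neither clock dominates the other, which no single scalar timestamp could express.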
      
✅ Applications
  • Distributed version control
  • Conflict detection (Cassandra)
  • Causal consistency guarantees
  • Eventual consistency tracking
  • Message ordering verification
⚠️ Limitations
  • Memory grows with N processes
  • Network overhead increases
  • Doesn't detect all concurrency
  • Partial ordering (not total)
  • Lamport clocks simpler alternative

🐘 Hadoop Ecosystem: A Complete Big Data Platform

🏗️ Hadoop Architecture Overview

Hadoop is an open-source framework for reliable, scalable, distributed computing. It follows the shared-nothing architecture where each node is independent.

📚 Hadoop Ecosystem Components

📦

HDFS

Hadoop Distributed File System

  • Block-based storage (128MB/256MB)
  • NameNode (metadata) + DataNodes (data)
  • Default replication factor: 3
  • Rack-aware placement policy
  • Write-once, append semantics
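Block size and replication translate directly into capacity math (a sketch assuming the 128 MB default block size and replication factor 3):

```python
import math

def hdfs_footprint(file_mb, block_mb=128, replication=3):
    """Blocks and raw storage needed to hold one file in HDFS.
    The last block may be partial; HDFS stores only actual bytes."""
    blocks = math.ceil(file_mb / block_mb)
    return blocks, blocks * replication, file_mb * replication

blocks, replicas, raw_mb = hdfs_footprint(1024)   # a 1 GB file
print(blocks)    # → 8 blocks
print(replicas)  # → 24 block replicas cluster-wide
print(raw_mb)    # → 3072 MB of raw capacity consumed
```

This is why HDFS capacity planning multiplies usable data by the replication factor: a 1 GB file occupies 3 GB of raw disk across the cluster.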
⚙️

YARN

Yet Another Resource Negotiator

  • Resource Manager (global scheduler)
  • Node Manager (per-node agent)
  • Application Master (per app)
  • Container-based execution model
  • Supports multiple frameworks
🔄

MapReduce v2

Distributed batch processing

  • Splits, map, shuffle, reduce
  • Combiner for local aggregation
  • Speculative execution
  • Task failure recovery
  • Counter and metrics tracking
🐝

Hive

SQL query engine for Hadoop

  • HiveQL (SQL-like language)
  • Compiles to MapReduce/Spark jobs
  • Tables with schemas (partitioned)
  • Metadata stored in metastore
  • ETL workloads
📄

HBase

NoSQL wide-column database

  • Built on HDFS
  • Column-oriented storage
  • Fast random reads by row key
  • Sorted row key range scans
  • Real-time random access

Spark

In-memory distributed computing

  • RDD (Resilient Distributed Datasets)
  • Up to 100x faster than MapReduce for in-memory workloads
  • Batch, stream, ML, SQL unified
  • In-memory caching
  • Multiple language support
🌊

Flume

Log aggregation & streaming

  • Source → Channel → Sink pipeline
  • Reliable delivery
  • Multiple sources/sinks
  • HDFS, Kafka integration
  • Log collection at scale

🏗️ Typical Hadoop Cluster Architecture

Master Nodes (Control)
NameNode
HDFS Metadata
ResourceManager
YARN Scheduler
Secondary NameNode
Metadata Checkpointing
Worker Nodes (Execution) × N
DataNode
Data Storage
NodeManager
Container Mgmt
Containers
Task Execution

🎓 Chapter 2 Complete!

You've mastered:
✅ Data Center architecture & infrastructure
✅ Cloud computing models (IaaS, PaaS, SaaS)
✅ Distributed systems (sharding, consistent hashing)
✅ MapReduce programming for massive-scale processing
✅ MVCC & Vector Clocks for consistency
✅ Complete Hadoop ecosystem architecture

📚 Next Chapter: NoSQL Databases - Scaling Beyond Relational Systems

Learn about Key-Value stores, Column families, Document databases, and Graph databases!

Last modified: Tuesday, 21 October 2025, 9:37 PM