🎯 Learning Objectives
Data Center Architecture
Master physical infrastructure: servers, cooling, power, networking, and performance optimization
Cloud Computing Models
Explore IaaS, PaaS, SaaS and cloud-native architectures for Big Data
Distributed Systems
Design sharding strategies, consistent hashing, and handle distributed challenges
MapReduce & Hadoop
Implement parallel processing algorithms and explore the Hadoop ecosystem
Advanced Concurrency
Understand MVCC and Vector Clocks for distributed consistency
Practical Implementation
Design real-world solutions with hands-on exercises and case studies
🏢 Data Center Architecture & Infrastructure
🏗️ What is a Data Center?
A data center is a centralized facility housing computing infrastructure including servers, storage systems, networking equipment, and critical supporting infrastructure like power distribution, cooling systems, and security controls.
Computing
Process applications & workloads
Storage
Store & manage data assets
Networking
Connect resources internally & externally
Support Services
Power, cooling, security, monitoring
🔧 Physical Infrastructure Components
🖥️ Server Racks & Hardware Infrastructure
📏 Standard Specifications
- 42U Racks: Standard height (1U = 1.75 inches)
- Server Types: Blade, rack-mount, high-density
- Power Density: Up to 25kW per rack
- Cooling Design: Hot/cold aisle configuration
🔌 Infrastructure Management
- Cable Management: Structured power & networking
- Rack Organization: Optimized airflow patterns
- Asset Tracking: RFID and barcode systems
- Remote Access: IPMI and out-of-band management
⚡ Power Infrastructure & Energy Efficiency
📊 Power Usage Effectiveness (PUE)
PUE = Total Facility Energy ÷ IT Equipment Energy. An ideal PUE is 1.0 (all power reaches IT equipment); typical enterprise facilities run around 1.5, while modern hyperscale data centers approach 1.1.
🔋 Power Distribution
- PDUs: Intelligent power management
- UPS Systems: Backup during outages
- Generators: Long-term power protection
- A/B Feed: Redundant power paths
📈 Efficiency Metrics
- Load Utilization: 80-90% optimal
- Power Factor: >0.9 for efficiency
- Harmonic Distortion: <5% THD
- Monitoring: Real-time power analytics
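The PUE metric above is a simple ratio; a minimal sketch in Python (the 1200 kW / 800 kW facility figures are hypothetical, chosen only to illustrate a PUE of 1.5):

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power Usage Effectiveness: total facility power / IT power (ideal = 1.0)."""
    if it_equipment_kw <= 0:
        raise ValueError("IT load must be positive")
    return total_facility_kw / it_equipment_kw

# Hypothetical facility: 1200 kW total draw, 800 kW reaching IT equipment
print(round(pue(1200, 800), 2))  # → 1.5
```

A PUE of 1.5 means that for every watt delivered to servers, half a watt goes to cooling, power conversion, and other overhead.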
❄️ Advanced Cooling Technologies
🌡️ Traditional Cooling
- CRAC Units: Computer Room Air Conditioning
- Raised Floor: Underfloor air distribution
- Hot/Cold Aisles: Airflow optimization
- Temperature: 68-77°F (20-25°C)
💧 Liquid Cooling
- Direct-to-Chip: CPU/GPU cooling blocks
- Immersion Cooling: Servers in dielectric fluid
- Efficiency: 40-50% energy reduction
- Density: Supports high-performance computing
🌿 Green Technologies
- Free Cooling: Outside air when possible
- Heat Recovery: Capture waste heat
- Variable Speed: Adaptive fan controls
- AI Optimization: ML-driven efficiency
💾 Storage Infrastructure Evolution
📀 Traditional Storage
- DAS: Direct Attached Storage
- NAS: Network Attached Storage (file-level)
- SAN: Storage Area Networks (block-level)
- Use Case: Structured data, databases
☁️ Software-Defined Storage
- Virtualization: Abstract physical storage
- Scalability: Scale-out architecture
- Flexibility: Policy-driven management
- Examples: VMware vSAN, Nutanix
🚀 Modern Storage
- NVMe: High-performance SSDs
- Object Storage: Web-scale data lakes
- Hybrid: SSD + HDD optimization
- Cloud Integration: Seamless hybrid storage
🌐 Network Infrastructure & Architecture
🔀 Network Switches
- Access (ToR): Server connectivity within racks
- Distribution: Aggregate multiple access switches
- Core: High-speed backbone (100Gbps+)
- SDN: Software-defined networking control
🛡️ Security Infrastructure
- Firewalls: Network access control & inspection
- IDS/IPS: Intrusion detection & prevention
- DDoS Protection: Attack mitigation systems
- Micro-segmentation: Zero-trust networking
⚖️ Load Balancers
- Layer 4: Transport-layer traffic distribution
- Layer 7: Application-aware intelligence
- Global LB: Multi-site traffic management
- Health Checks: Automatic failover
📐 Network Architecture Evolution
🏛️ Three-Tier (Traditional)
- Core Layer: High-speed backbone
- Aggregation: Service integration
- Access: Server connectivity
- Limitations: Oversubscription, complexity
🍃 Spine-Leaf (Modern)
- Spine: Connect to all leaf switches
- Leaf: Server and ToR connections
- Benefits: Predictable latency, easy scaling
- Use: Cloud and Big Data architectures
☁️ Cloud Computing for Big Data
🌍 Cloud Deployment Models
Public Cloud
Multi-tenant, internet-accessible, pay-per-use model
✅ Benefits:
- Cost-effective scaling
- Rapid deployment
- Managed services
- Global availability
⚠️ Challenges:
- Data sovereignty
- Security concerns
- Vendor lock-in
- Compliance complexity
Private Cloud
Single-tenant, on-premises or hosted infrastructure
✅ Benefits:
- Enhanced security
- Regulatory compliance
- Full control
- Predictable performance
⚠️ Challenges:
- Higher costs
- Maintenance overhead
- Scaling limitations
- Requires expertise
Hybrid Cloud
Combination of public and private cloud resources
🎯 Use Cases:
- Burst computing
- Data sovereignty
- Disaster recovery
- Cost optimization
🛠️ Technologies:
- Cloud bursting
- Data replication
- Unified management
- Identity federation
Multi-Cloud
Multiple cloud providers for different services
✅ Benefits:
- Vendor diversification
- Best-of-breed services
- Cost optimization
- Risk mitigation
⚠️ Challenges:
- Increased complexity
- Integration challenges
- Data consistency
- Management overhead
🏗️ Cloud Service Models (SPI Stack)
1 Infrastructure as a Service
Virtual computing, storage, and networking resources
You Manage: OS, runtime, middleware, applications, data
Provider Manages: Virtualization, servers, storage, networking
Examples:
- 🟠 AWS EC2, EBS, VPC
- 🔵 Azure Virtual Machines, Storage
- 🔴 Google Compute Engine
2 Platform as a Service
Development frameworks, middleware, databases
You Manage: Applications, data
Provider Manages: Runtime, middleware, OS, virtualization, servers, storage, networking
Examples:
- 🟠 AWS Elastic Beanstalk
- 🔵 Azure App Service
- 🔴 Google App Engine
3 Software as a Service
Complete, ready-to-use applications
You Manage: Your data and user access
Provider Manages: The entire stack, from the application down to the infrastructure
Examples:
- 💼 Salesforce CRM
- 📊 Microsoft 365
- 🎓 Google Workspace
🌍 Major Cloud Providers for Big Data
- 🟠 AWS: EMR, Redshift, Kinesis, S3-based data lakes
- 🔵 Azure: HDInsight, Synapse Analytics, Data Lake Storage
- 🔴 Google Cloud: Dataproc, BigQuery, Dataflow
🔗 Distributed Systems: Sharding & Data Partitioning
❓ Why Do We Need Sharding?
The Problem: As data grows beyond what a single database can handle efficiently, we need to distribute data across multiple servers. Sharding is the partitioning of data based on a key, enabling horizontal scaling.
📈 Scale Beyond Single Server
Single databases hit CPU, memory, and disk I/O limits. Sharding distributes load across many servers.
⚡ Parallel Query Execution
Queries can execute in parallel on multiple shards, improving throughput and reducing latency.
💾 Optimize Data Locality
Data placement affects network traffic. Sharding enables co-locating related data for efficiency.
🎯 Sharding Strategies
1 Range-Based Sharding
Data is partitioned based on ranges of a key. For example, users with IDs 1-1000 go to Shard A, 1001-2000 to Shard B, etc.
Visual Example:
1-1000
1001-2000
2001-3000
✅ Advantages
- Simple to implement
- Range queries efficient
- Easy to split ranges
- Predictable placement
❌ Disadvantages
- Hot spots possible
- Uneven distribution
- Manual rebalancing
- Adding shards difficult
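The range lookup described above can be implemented with a sorted list of range upper bounds and a binary search; a minimal sketch mirroring the 1-1000 / 1001-2000 / 2001-3000 example (shard numbering starting at 0 is an assumption):

```python
import bisect

# Upper bounds of each shard's key range, mirroring the example above:
# IDs 1-1000 → shard 0, 1001-2000 → shard 1, 2001-3000 → shard 2
RANGE_BOUNDS = [1000, 2000, 3000]

def range_shard(user_id: int) -> int:
    """Return the index of the shard whose range contains user_id."""
    idx = bisect.bisect_left(RANGE_BOUNDS, user_id)
    if idx == len(RANGE_BOUNDS):
        raise KeyError(f"user_id {user_id} is outside all shard ranges")
    return idx

print(range_shard(500))   # → 0
print(range_shard(1001))  # → 1
print(range_shard(3000))  # → 2
```

The binary search makes lookups O(log S) in the number of shards, and range queries stay efficient because adjacent keys land on the same shard.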
2 Hash-Based Sharding
A hash function maps each key to a shard. Formula: shard_id = hash(key) % num_shards
Algorithm Example:
# Given 4 shards and user_id = 2547
shard_id = hash(2547) % 4
= 847392 % 4
= 0 → Shard 0
# A good hash function spreads keys evenly
# Common choices: MurmurHash, CityHash, SHA-1
✅ Advantages
- Uniform distribution
- No per-key lookup table needed
- Fast shard lookup
- Good for load balance
❌ Disadvantages
- Range queries inefficient
- Reshuffling on scale
- No geographic locality
- Adding shards requires rehashing
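The modulo scheme and its reshuffling disadvantage can both be demonstrated in a few lines. A sketch (MD5 stands in for MurmurHash here only because it ships with Python's standard library; Python's built-in hash() is not stable across processes):

```python
import hashlib

def hash_shard(key, num_shards: int) -> int:
    """Map a key to a shard: shard_id = hash(key) % num_shards."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % num_shards

# The same key always routes to the same shard
assert hash_shard(2547, 4) == hash_shard(2547, 4)

# But changing num_shards remaps most keys: the reshuffling problem
moved = sum(hash_shard(k, 4) != hash_shard(k, 5) for k in range(10_000))
print(f"{moved / 10_000:.0%} of keys move when going from 4 to 5 shards")
```

With modulo placement, roughly (S-1)/S of all keys change shards when the shard count changes, which is exactly the problem consistent hashing addresses next.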
3 Consistent Hashing (Advanced)
Solves the reshuffling problem. Both keys and nodes are placed on a hash ring. Only nearby nodes are affected when nodes join/leave.
🔄 Consistent Hash Ring Visualization
Keys map to nearest node clockwise on the ring
✅ Advantages
- Minimal reshuffling
- Only K/N keys move
- Scales efficiently
- Used in production
❌ Disadvantages
- Complex implementation
- Hot spots possible
- Uneven distribution
- Virtual nodes overhead
📚 Real-World Usage
- Amazon DynamoDB: Consistent hashing for automatic data distribution
- Apache Cassandra: Consistent hash with virtual nodes (vnodes) for load balancing
- Redis Cluster: Hash slots (16,384) mapped to nodes
- Memcached: Client-side consistent hashing for cache distribution
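The ring with virtual nodes used by the systems above can be sketched in a short Python class. This is a minimal illustration, not production code: real systems like Cassandra add replication, gossip membership, and tuned hash functions (MD5 and the node names below are illustrative choices):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent hash ring with virtual nodes (vnodes)."""

    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (hash_position, node)
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node: str):
        # Each physical node occupies many positions for smoother balance
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node: str):
        self.ring = [(h, n) for h, n in self.ring if n != node]

    def get_node(self, key: str) -> str:
        """First node clockwise from the key's position on the ring."""
        h = self._hash(key)
        idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
before = {k: ring.get_node(f"user:{k}") for k in range(1000)}
ring.add_node("node-d")
after = {k: ring.get_node(f"user:{k}") for k in range(1000)}
moved = sum(before[k] != after[k] for k in before)
print(f"only {moved / 1000:.0%} of keys moved after adding a node")
```

Adding a fourth node moves only about K/N of the keys (roughly a quarter here), versus the ~80% a plain modulo scheme would reshuffle.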
🔄 MapReduce: Distributed Batch Processing
🎯 The MapReduce Paradigm
MapReduce is a programming model for processing large datasets in parallel on a distributed cluster. Inspired by functional programming's map and reduce operations.
Key Idea: Divide work into independent map tasks → collect intermediate results → combine via reduce tasks
📊 MapReduce Execution Pipeline
Input Split
Data divided into chunks (HDFS block size)
Map Phase
Process records → emit key-value pairs
Shuffle & Sort
Group values by key, sort for reducers
Reduce Phase
Aggregate values → emit final output
📝 Complete Example: Word Count
Input Data
File 1: "Hello World Hello"
File 2: "Hello Hadoop"
File 3: "World"
Map Phase Output
Mapper 1 (File 1): (Hello, 1), (World, 1), (Hello, 1)
Mapper 2 (File 2): (Hello, 1), (Hadoop, 1)
Mapper 3 (File 3): (World, 1)
After Shuffle & Sort
Reducer 1: (Hadoop, [1]), (Hello, [1, 1, 1])
Reducer 2: (World, [1, 1])
Reduce Phase Output
(Hadoop, 1)
(Hello, 3)
(World, 2)
💻 Python Implementation
def mapper(key, value):
    """Emit (word, 1) for each word in the line"""
    for word in value.split():
        yield (word, 1)

def reducer(key, values):
    """Sum counts for each word"""
    total = sum(values)
    yield (key, total)

# Execution
list(mapper("doc1", "hello world hello"))
# Output: [("hello", 1), ("world", 1), ("hello", 1)]

list(reducer("hello", [1, 1, 1]))
# Output: [("hello", 3)]
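The four pipeline stages can be glued together in a single-process simulation. A sketch only: a real framework distributes these steps across a cluster, but the data flow is the same:

```python
from collections import defaultdict

def mapper(value):
    """Map: emit (word, 1) for each word."""
    for word in value.split():
        yield (word, 1)

def reducer(key, values):
    """Reduce: sum the counts for one key."""
    return (key, sum(values))

def run_mapreduce(documents):
    """Toy single-process simulation of map → shuffle & sort → reduce."""
    # Map phase: emit intermediate key-value pairs from every document
    intermediate = [pair for doc in documents for pair in mapper(doc)]
    # Shuffle & sort: group all values by key
    groups = defaultdict(list)
    for key, val in intermediate:
        groups[key].append(val)
    # Reduce phase: aggregate each group, keys in sorted order
    return dict(reducer(k, v) for k, v in sorted(groups.items()))

docs = ["Hello World Hello", "Hello Hadoop", "World"]
print(run_mapreduce(docs))
# → {'Hadoop': 1, 'Hello': 3, 'World': 2}
```

The defaultdict grouping plays the role of the shuffle: in a real cluster this step moves data over the network so that all values for one key reach the same reducer.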
⚡ Strengths
- Automatic parallelization - Framework handles distribution
- Fault tolerance - Failed tasks automatically rerun
- Data locality - Computation moves to data
- Scalability - Linear scaling with cluster size
- Simple programming model - Easy to understand
⚠️ Limitations
- Batch only - No real-time processing
- High disk I/O - Intermediate results on disk
- Iterative workloads - Inefficient for ML
- Startup overhead - Job initialization cost
- Complex joins - Multi-way joins challenging
🕐 Advanced Concurrency: MVCC & Vector Clocks
📚 MVCC: Multi-Version Concurrency Control
Problem: How do we allow readers and writers to work simultaneously without locking conflicts?
Solution: Maintain multiple versions of each record; readers see consistent snapshots.
🔄 MVCC Timeline Example
T1 (Write v0): User Age = 25 (Version 0)
T2 (Read from T1): Reader sees v0 → Age = 25
T3 (Write v1): User Age = 30 (Version 1 created)
T4 (Read from T3): Reader sees v1 → Age = 30
T5 (Garbage Collection): v0 deleted (no active readers)
🔍 How It Works
- Each write creates new version
- Readers see snapshot at timestamp
- No locks on read operations
- Writers still use locks (minimal)
- Garbage collection cleans old versions
⚙️ Implementation Details
- Version numbers: Incremented per transaction
- Read timestamps: Record snapshot seen
- Undo/Redo logs: For crash recovery
- Compaction: Merge versions periodically
- Used by: PostgreSQL, MySQL InnoDB
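The timeline above can be modeled with a toy versioned store. A minimal sketch, assuming a single global logical clock and in-memory version lists (real engines like PostgreSQL use transaction IDs and on-disk tuples):

```python
import itertools

class MVCCStore:
    """Toy MVCC key-value store: writes append versions, readers see
    the newest version at or before their snapshot timestamp."""

    def __init__(self):
        self._clock = itertools.count(1)   # global logical timestamp
        self._versions = {}                # key -> list of (commit_ts, value)

    def write(self, key, value):
        ts = next(self._clock)
        self._versions.setdefault(key, []).append((ts, value))
        return ts

    def snapshot(self):
        """A reader's snapshot timestamp: everything committed so far."""
        return next(self._clock)

    def read(self, key, snapshot_ts):
        """Lock-free read: newest version visible at snapshot_ts."""
        visible = [v for ts, v in self._versions.get(key, []) if ts <= snapshot_ts]
        return visible[-1] if visible else None

store = MVCCStore()
store.write("age", 25)          # version v0 committed
snap = store.snapshot()         # reader takes a snapshot
store.write("age", 30)          # version v1 committed later
print(store.read("age", snap))              # → 25 (old snapshot still sees v0)
print(store.read("age", store.snapshot()))  # → 30 (fresh snapshot sees v1)
```

Note how the old reader keeps seeing age = 25 even after the write of 30, exactly as in the T1-T4 timeline; garbage collection would later drop v0 once no snapshot can reach it.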
🕒 Vector Clocks: Ordering Distributed Events
Problem: In distributed systems, there's no global clock. How do we determine if Event A happened before Event B?
Solution: Use vector clocks to track causality between events.
📐 Vector Clock Rules
1️⃣ Initialize
Each process starts with vector [0, 0, ..., 0]
2️⃣ Local Event
Process increments its own clock component
3️⃣ Send Message
Include vector clock with message
4️⃣ Receive Message
Take element-wise maximum, then increment own
Example: 3 Processes
Process A, B, C initially: [0,0,0]
Event A1: Local event at A
A's clock: [1,0,0]
Event B1: Local event at B
B's clock: [0,1,0]
Message: A sends to B with [1,0,0]
B receives, updates: max([0,1,0], [1,0,0]) = [1,1,0]
B increments own: [1,2,0]
Event C1: Local event at C
C's clock: [0,0,1]
Message: B sends to C with [1,2,0]
C receives: max([0,0,1], [1,2,0]) = [1,2,1]
C increments own: [1,2,2]
Causality: A1 → B's receive → C's receive (shown by the vector progression). Note that A1 ([1,0,0]) and B1 ([0,1,0]) are concurrent: neither vector is less than or equal to the other component-wise.
✅ Applications
- Distributed version control
- Conflict detection (Cassandra)
- Causal consistency guarantees
- Eventual consistency tracking
- Message ordering verification
⚠️ Limitations
- Memory grows with N processes
- Network overhead increases
- Doesn't detect all concurrency
- Partial ordering (not total)
- Lamport clocks simpler alternative
🐘 Hadoop Ecosystem: A Complete Big Data Platform
🏗️ Hadoop Architecture Overview
Hadoop is an open-source framework for reliable, scalable, distributed computing. It follows the shared-nothing architecture where each node is independent.
📚 Hadoop Ecosystem Components
HDFS
Hadoop Distributed File System
- Block-based storage (128MB/256MB)
- NameNode (metadata) + DataNodes (data)
- Default replication factor: 3
- Rack-aware placement policy
- Write-once, append semantics
YARN
Yet Another Resource Negotiator
- Resource Manager (global scheduler)
- Node Manager (per-node agent)
- Application Master (per app)
- Container-based execution model
- Supports multiple frameworks
MapReduce v2
Distributed batch processing
- Splits, map, shuffle, reduce
- Combiner for local aggregation
- Speculative execution
- Task failure recovery
- Counter and metrics tracking
Hive
SQL query engine for Hadoop
- HiveQL (SQL-like language)
- Compiles to MapReduce/Spark jobs
- Tables with schemas (partitioned)
- Metadata stored in metastore
- ETL workloads
HBase
NoSQL wide-column database
- Built on HDFS
- Column-oriented storage
- Fast row-key lookups via sorted regions
- Sorted row key range scans
- Real-time random access
Spark
In-memory distributed computing
- RDD (Resilient Distributed Datasets)
- Up to 100x faster than MapReduce for in-memory workloads
- Batch, stream, ML, SQL unified
- In-memory caching
- Multiple language support
Flume
Log aggregation & streaming
- Source → Channel → Sink pipeline
- Reliable delivery
- Multiple sources/sinks
- HDFS, Kafka integration
- Log collection at scale
🏗️ Typical Hadoop Cluster Architecture
🎓 Chapter 2 Complete!
You've mastered:
✅ Data Center architecture & infrastructure
✅ Cloud computing models (IaaS, PaaS, SaaS)
✅ Distributed systems (sharding, consistent hashing)
✅ MapReduce programming for massive-scale processing
✅ MVCC & Vector Clocks for consistency
✅ Complete Hadoop ecosystem architecture
📚 Next Chapter: NoSQL Databases - Scaling Beyond Relational Systems
Learn about Key-Value stores, Column families, Document databases, and Graph databases!