
Chapter 2: Data Centers, Cloud & Distributed Processing

Advanced Infrastructure for Big Data Systems

🏢 Data Centers
☁️ Cloud Computing
🔗 Distributed Systems
🐘 Hadoop Ecosystem

🎯 Learning Objectives

🏗️

Data Center Architecture

Master physical infrastructure: servers, cooling, power, networking, and performance optimization

☁️

Cloud Computing Models

Explore IaaS, PaaS, SaaS and cloud-native architectures for Big Data

🔗

Distributed Systems

Design sharding strategies, apply consistent hashing, and handle distributed-system challenges

⚙️

MapReduce & Hadoop

Implement parallel processing algorithms and explore the Hadoop ecosystem

🕐

Advanced Concurrency

Understand MVCC and Vector Clocks for distributed consistency

🎯

Practical Implementation

Design real-world solutions with hands-on exercises and case studies

🏢 Data Center Architecture & Infrastructure

🏗️ What is a Data Center?

A data center is a centralized facility housing computing infrastructure including servers, storage systems, networking equipment, and critical supporting infrastructure like power distribution, cooling systems, and security controls.

💻

Computing

Process applications & workloads

💾

Storage

Store & manage data assets

🌐

Networking

Connect resources internally & externally

Support Services

Power, cooling, security, monitoring

🔧 Physical Infrastructure Components

🖥️ Server Racks & Hardware Infrastructure

📏 Standard Specifications
  • 42U Racks: Standard height (1U = 1.75 inches)
  • Server Types: Blade, rack-mount, high-density
  • Power Density: Up to 25kW per rack
  • Cooling Design: Hot/cold aisle configuration
🔌 Infrastructure Management
  • Cable Management: Structured power & networking
  • Rack Organization: Optimized airflow patterns
  • Asset Tracking: RFID and barcode systems
  • Remote Access: IPMI and out-of-band management

Power Infrastructure & Energy Efficiency

📊 Power Usage Effectiveness (PUE)
PUE = Total Facility Energy / IT Equipment Energy
Ideal PUE: 1.0 (100% efficiency) | Industry Average: 1.4-1.8 | Best Practice: 1.2-1.3
🔋 Power Distribution
  • PDUs: Intelligent power management
  • UPS Systems: Backup during outages
  • Generators: Long-term power protection
  • A/B Feed: Redundant power paths
📈 Efficiency Metrics
  • Load Utilization: 80-90% optimal
  • Power Factor: >0.9 for efficiency
  • Harmonic Distortion: <5% THD
  • Monitoring: Real-time power analytics
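The PUE formula above is simple enough to sanity-check in code (a quick sketch; the kW figures below are illustrative, not measurements):

```python
def pue(total_facility_kw, it_equipment_kw):
    """Power Usage Effectiveness = total facility energy / IT equipment energy."""
    return total_facility_kw / it_equipment_kw

# Illustrative figures: 1,400 kW total draw, 1,000 kW reaching IT equipment
print(pue(1400, 1000))   # → 1.4  (industry average)
print(pue(1250, 1000))   # → 1.25 (best-practice range)
```

A PUE of 1.4 means that for every watt doing useful computation, another 0.4 W goes to cooling, power conversion, and lighting.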

❄️ Advanced Cooling Technologies

🏭
Data Center Cooling Architecture
Hot Aisle/Cold Aisle Configuration with CRAC Units
🌡️ Traditional Cooling
  • CRAC Units: Computer Room Air Conditioning
  • Raised Floor: Underfloor air distribution
  • Hot/Cold Aisles: Airflow optimization
  • Temperature: 68-77°F (20-25°C)
💧 Liquid Cooling
  • Direct-to-Chip: CPU/GPU cooling blocks
  • Immersion Cooling: Servers in dielectric fluid
  • Efficiency: 40-50% energy reduction
  • Density: Supports high-performance computing
🌿 Green Technologies
  • Free Cooling: Outside air when possible
  • Heat Recovery: Capture waste heat
  • Variable Speed: Adaptive fan controls
  • AI Optimization: ML-driven efficiency

💾 Storage Infrastructure Evolution

📀 Traditional Storage
  • DAS: Direct Attached Storage
  • NAS: Network Attached Storage (file-level)
  • SAN: Storage Area Networks (block-level)
  • Use Case: Structured data, databases
☁️ Software-Defined Storage
  • Virtualization: Abstract physical storage
  • Scalability: Scale-out architecture
  • Flexibility: Policy-driven management
  • Examples: VMware vSAN, Nutanix
🚀 Modern Storage
  • NVMe: High-performance SSDs
  • Object Storage: Web-scale data lakes
  • Hybrid: SSD + HDD optimization
  • Cloud Integration: Seamless hybrid storage

🌐 Network Infrastructure & Architecture

🔀 Network Switches

  • Access (ToR): Server connectivity within racks
  • Distribution: Aggregate multiple access switches
  • Core: High-speed backbone (100Gbps+)
  • SDN: Software-defined networking control
Modern: 400Gbps+ spine-leaf architecture

🛡️ Security Infrastructure

  • Firewalls: Network access control & inspection
  • IDS/IPS: Intrusion detection & prevention
  • DDoS Protection: Attack mitigation systems
  • Micro-segmentation: Zero-trust networking
AI-powered threat detection & response

⚖️ Load Balancers

  • Layer 4: Transport-layer traffic distribution
  • Layer 7: Application-aware intelligence
  • Global LB: Multi-site traffic management
  • Health Checks: Automatic failover
Auto-scaling with cloud integration

📐 Network Architecture Evolution

🏛️ Three-Tier (Traditional)
  • Core Layer: High-speed backbone
  • Aggregation: Service integration
  • Access: Server connectivity
  • Limitations: Oversubscription, complexity
🍃 Spine-Leaf (Modern)
  • Spine: Connect to all leaf switches
  • Leaf: Server and ToR connections
  • Benefits: Predictable latency, easy scaling
  • Use: Cloud and Big Data architectures

☁️ Cloud Computing for Big Data

🌍 Cloud Deployment Models

🌐

Public Cloud

Multi-tenant, internet-accessible, pay-per-use model

✅ Benefits:
  • Cost-effective scaling
  • Rapid deployment
  • Managed services
  • Global availability
⚠️ Challenges:
  • Data sovereignty
  • Security concerns
  • Vendor lock-in
  • Compliance complexity
🏢

Private Cloud

Single-tenant, on-premises or hosted infrastructure

✅ Benefits:
  • Enhanced security
  • Regulatory compliance
  • Full control
  • Predictable performance
⚠️ Challenges:
  • Higher costs
  • Maintenance overhead
  • Scaling limitations
  • Requires expertise
🔄

Hybrid Cloud

Combination of public and private cloud resources

🎯 Use Cases:
  • Burst computing
  • Data sovereignty
  • Disaster recovery
  • Cost optimization
🛠️ Technologies:
  • Cloud bursting
  • Data replication
  • Unified management
  • Identity federation

Multi-Cloud

Multiple cloud providers for different services

✅ Benefits:
  • Vendor diversification
  • Best-of-breed services
  • Cost optimization
  • Risk mitigation
⚠️ Challenges:
  • Increased complexity
  • Integration challenges
  • Data consistency
  • Management overhead

🏗️ Cloud Service Models (SPI Stack)

1 Infrastructure as a Service

Virtual computing, storage, and networking resources

You Manage:
Applications, Data, Middleware, OS, Runtime
Provider Manages:
Infrastructure, Virtualization, Servers, Storage
Examples:
  • 🟠 AWS EC2, EBS, VPC
  • 🔵 Azure Virtual Machines, Storage
  • 🔴 Google Compute Engine

2 Platform as a Service

Development frameworks, middleware, databases

You Manage:
Applications and Data
Provider Manages:
Runtime, Middleware, OS, Infrastructure
Examples:
  • 🟠 AWS Elastic Beanstalk
  • 🔵 Azure App Service
  • 🔴 Google App Engine

3 Software as a Service

Complete, ready-to-use applications

You Manage:
Only your data and user access
Provider Manages:
Everything else - Complete application stack
Examples:
  • 💼 Salesforce CRM
  • 📊 Microsoft 365
  • 🎓 Google Workspace

🌍 Major Cloud Providers for Big Data

Service Category | 🟠 Amazon AWS | 🔵 Microsoft Azure | 🔴 Google Cloud
📦 Storage | S3, EBS, Glacier | Blob Storage, Data Lake Storage | Cloud Storage, Filestore
⚙️ Big Data Processing | EMR, Glue, Lambda | HDInsight, Databricks | Dataproc, Dataflow
📊 Data Warehousing | Redshift, Athena | Synapse Analytics | BigQuery
🤖 ML/AI Services | SageMaker, Bedrock | Azure Machine Learning, Cognitive Services | Vertex AI, AutoML
🌊 Stream Processing | Kinesis, MSK | Stream Analytics, Event Hubs | Pub/Sub, Dataflow
🗄️ NoSQL Databases | DynamoDB, DocumentDB | Cosmos DB | Firestore, Bigtable

🔗 Distributed Systems: Sharding & Data Partitioning

❓ Why Do We Need Sharding?

The Problem: As data grows beyond what a single database can handle efficiently, we need to distribute data across multiple servers. Sharding is the partitioning of data based on a key, enabling horizontal scaling.

📈 Scale Beyond Single Server

Single databases hit CPU, memory, and disk I/O limits. Sharding distributes load across many servers.

⚡ Parallel Query Execution

Queries can execute in parallel on multiple shards, improving throughput and reducing latency.

💾 Optimize Data Locality

Data placement affects network traffic. Sharding enables co-locating related data for efficiency.

🎯 Sharding Strategies

1 Range-Based Sharding

Data is partitioned based on ranges of a key. For example, users with IDs 1-1000 go to Shard A, 1001-2000 to Shard B, etc.

Visual Example:
📦 Shard A
Users
1-1000
📦 Shard B
Users
1001-2000
📦 Shard C
Users
2001-3000
✅ Advantages
  • Simple to implement
  • Range queries efficient
  • Easy to rebalance
  • Predictable placement
❌ Disadvantages
  • Hot spots possible
  • Uneven distribution
  • Manual rebalancing
  • Adding shards difficult
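The range table above maps naturally onto a sorted list of boundaries, searched with binary search (a sketch; the shard names and ranges mirror the visual example):

```python
import bisect

# Upper bound of each range, matched to shards A, B, C
BOUNDS = [1000, 2000, 3000]
SHARDS = ["Shard A", "Shard B", "Shard C"]

def shard_for(user_id):
    """Binary-search the range boundaries for the owning shard."""
    i = bisect.bisect_left(BOUNDS, user_id)
    if i >= len(SHARDS):
        raise ValueError(f"user_id {user_id} is outside the configured ranges")
    return SHARDS[i]

print(shard_for(500))    # → Shard A
print(shard_for(1001))   # → Shard B
print(shard_for(3000))   # → Shard C
```

Range queries stay efficient because consecutive IDs land on the same shard, but a popular ID range turns one shard into a hot spot, exactly the disadvantage listed above.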

2 Hash-Based Sharding

A hash function maps each key to a shard. Formula: shard_id = hash(key) % num_shards

Algorithm Example:
# Given 4 shards and user_id = 2547
shard_id = hash(2547) % 4
         = 847392 % 4        # illustrative hash value
         = 0  → Shard 0

# A well-mixed hash function gives an even key distribution
# Common choices: MurmurHash, CityHash, SHA-1
✅ Advantages
  • Uniform distribution
  • Minimal rebalancing
  • Fast shard lookup
  • Good for load balance
❌ Disadvantages
  • Range queries inefficient
  • Reshuffling on scale
  • No geographic locality
  • Adding shards requires rehashing
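The formula above can be made runnable with a deterministic hash (a sketch; CRC32 stands in for MurmurHash/CityHash, since Python's built-in `hash()` is randomly salted per process):

```python
import zlib

def shard_for(key, num_shards=4):
    """shard_id = hash(key) % num_shards, using CRC32 so the
    mapping is stable across runs and machines."""
    return zlib.crc32(str(key).encode("utf-8")) % num_shards

# Every key deterministically lands on one of the 4 shards
assignments = [shard_for(user_id) for user_id in range(10_000)]
print({s: assignments.count(s) for s in range(4)})  # roughly even spread
```

Note the reshuffling problem: if `num_shards` changes from 4 to 5, almost every key's `hash(key) % num_shards` result changes, which is what consistent hashing (next) is designed to avoid.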

3 Consistent Hashing (Advanced)

Solves the reshuffling problem. Both keys and nodes are placed on a hash ring. Only nearby nodes are affected when nodes join/leave.

🔄 Consistent Hash Ring Visualization

Keys map to nearest node clockwise on the ring

✅ Advantages
  • Minimal reshuffling
  • Only K/N keys move
  • Scales efficiently
  • Used in production
❌ Disadvantages
  • Complex implementation
  • Hot spots possible
  • Uneven distribution
  • Virtual nodes overhead
📚 Real-World Usage
  • Amazon DynamoDB: Consistent hashing for automatic data distribution
  • Apache Cassandra: Consistent hash with virtual nodes (vnodes) for load balancing
  • Redis Cluster: Hash slots (16,384) mapped to nodes
  • Memcached: Client-side consistent hashing for cache distribution
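A minimal ring can be sketched in a few lines (illustrative, not a production implementation; MD5 stands in for the ring hash, and virtual nodes smooth out the uneven-distribution problem noted above):

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes     # virtual nodes per physical node
        self.ring = {}           # ring position -> physical node
        self.sorted_keys = []    # sorted ring positions for binary search
        for node in nodes:
            self.add_node(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self.vnodes):
            pos = self._hash(f"{node}#{i}")
            self.ring[pos] = node
            bisect.insort(self.sorted_keys, pos)

    def remove_node(self, node):
        for i in range(self.vnodes):
            pos = self._hash(f"{node}#{i}")
            del self.ring[pos]
            self.sorted_keys.remove(pos)

    def get_node(self, key):
        """A key maps to the nearest node clockwise on the ring."""
        if not self.sorted_keys:
            return None
        pos = bisect.bisect(self.sorted_keys, self._hash(key))
        if pos == len(self.sorted_keys):
            pos = 0              # wrap around the ring
        return self.ring[self.sorted_keys[pos]]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
owner = ring.get_node("user:2547")
ring.remove_node(owner)            # only this node's keys move
print(ring.get_node("user:2547"))  # now served by a neighbouring node
```

Removing a node relocates only the keys that node owned (roughly K/N of them); keys owned by the surviving nodes keep their placement, which is the whole point of the ring.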

🔄 MapReduce: Distributed Batch Processing

🎯 The MapReduce Paradigm

MapReduce is a programming model for processing large datasets in parallel on a distributed cluster. Inspired by functional programming's map and reduce operations.

Key Idea: Divide work into independent map tasks → collect intermediate results → combine via reduce tasks

📊 MapReduce Execution Pipeline

1️⃣

Input Split

Data divided into chunks (HDFS block size)

2️⃣

Map Phase

Process records → emit key-value pairs

3️⃣

Shuffle & Sort

Group values by key, sort for reducers

4️⃣

Reduce Phase

Aggregate values → emit final output

📝 Complete Example: Word Count

Input Data

File 1: "Hello World Hello"
File 2: "Hello Hadoop"
File 3: "World"

Map Phase Output

File 1 Mapper 0: (Hello, 1), (World, 1), (Hello, 1)
File 2 Mapper 1: (Hello, 1), (Hadoop, 1)
File 3 Mapper 2: (World, 1)

After Shuffle & Sort

Reducer 0: (Hadoop, [1])
Reducer 1: (Hello, [1, 1, 1])
Reducer 2: (World, [1, 1])

Reduce Phase Output

(Hadoop, 1)
(Hello, 3)
(World, 2)

💻 Python Implementation

from collections import defaultdict

def mapper(key, value):
    """Emit (word, 1) for each word in the line"""
    for word in value.split():
        yield (word, 1)

def reducer(key, values):
    """Sum counts for each word"""
    yield (key, sum(values))

# Simulated execution: map → shuffle & sort → reduce
docs = {"doc1": "hello world hello", "doc2": "hello hadoop"}
groups = defaultdict(list)
for doc_id, line in docs.items():
    for word, count in mapper(doc_id, line):
        groups[word].append(count)        # shuffle: group values by key

output = [kv for word in sorted(groups)   # sort keys, as the framework does
             for kv in reducer(word, groups[word])]
# output: [("hadoop", 1), ("hello", 3), ("world", 1)]

⚡ Strengths

  • Automatic parallelization - Framework handles distribution
  • Fault tolerance - Failed tasks automatically rerun
  • Data locality - Computation moves to data
  • Scalability - Linear scaling with cluster size
  • Simple programming model - Easy to understand

⚠️ Limitations

  • Batch only - No real-time processing
  • High disk I/O - Intermediate results on disk
  • Iterative workloads - Inefficient for ML
  • Startup overhead - Job initialization cost
  • Complex joins - Multi-way joins challenging

🕐 Advanced Concurrency: MVCC & Vector Clocks

📚 MVCC: Multi-Version Concurrency Control

Problem: How do we allow readers and writers to work simultaneously without locking conflicts?
Solution: Maintain multiple versions of each record; readers see consistent snapshots.

🔄 MVCC Timeline Example

T1 (Write v0): User Age = 25 (Version 0)

T2 (Read from T1): Reader sees v0 → Age = 25

T3 (Write v1): User Age = 30 (Version 1 created)

T4 (Read from T3): Reader sees v1 → Age = 30

T5 (Garbage Collection): v0 deleted (no active readers)

🔍 How It Works
  • Each write creates new version
  • Readers see snapshot at timestamp
  • No locks on read operations
  • Writers still use locks (minimal)
  • Garbage collection cleans old versions
⚙️ Implementation Details
  • Version numbers: Incremented per transaction
  • Read timestamps: Record snapshot seen
  • Undo/Redo logs: For crash recovery
  • Compaction: Merge versions periodically
  • Used by: PostgreSQL, MySQL InnoDB
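The timeline above can be sketched as a tiny versioned store (illustrative only; real engines add transaction IDs, visibility rules, and garbage collection):

```python
class MVCCStore:
    """Minimal MVCC sketch: every write appends a new version, and
    readers see the latest version at or before their snapshot."""

    def __init__(self):
        self.versions = {}   # key -> list of (commit_ts, value), append-only
        self.ts = 0          # logical commit timestamp

    def write(self, key, value):
        self.ts += 1                     # writers serialize on the counter
        self.versions.setdefault(key, []).append((self.ts, value))
        return self.ts

    def read(self, key, snapshot_ts):
        """Lock-free read: scan back for the newest visible version."""
        for commit_ts, value in reversed(self.versions.get(key, [])):
            if commit_ts <= snapshot_ts:
                return value
        return None                      # key did not exist at that snapshot

# Replaying the timeline above
store = MVCCStore()
t1 = store.write("age", 25)    # T1: version 0
t3 = store.write("age", 30)    # T3: version 1; v0 kept for older readers
print(store.read("age", t1))   # → 25 (reader pinned to the T1 snapshot)
print(store.read("age", t3))   # → 30
```

The key property: the second write never blocks the first reader, because the reader's snapshot timestamp keeps selecting version 0 until the reader finishes.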

🕒 Vector Clocks: Ordering Distributed Events

Problem: In distributed systems, there's no global clock. How do we determine if Event A happened before Event B?
Solution: Use vector clocks to track causality between events.

📐 Vector Clock Rules

1️⃣ Initialize

Each process starts with vector [0, 0, ..., 0]

2️⃣ Local Event

Process increments its own clock component

3️⃣ Send Message

Include vector clock with message

4️⃣ Receive Message

Take element-wise maximum, then increment own

Example: 3 Processes

Process A, B, C initially: [0,0,0]

Event A1: Local event at A
  A's clock: [1,0,0]

Event B1: Local event at B
  B's clock: [0,1,0]

Message: A sends to B with [1,0,0]
  B receives, updates: max([0,1,0], [1,0,0]) = [1,1,0]
  B increments own: [1,2,0]

Event C1: Local event at C
  C's clock: [0,0,1]

Message: B sends to C with [1,2,0]
  C receives: max([0,0,1], [1,2,0]) = [1,2,1]
  C increments own: [1,2,2]

Causality: A1 → B1 → C1 (shown by vector progression)
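The four rules can be coded directly; the sketch below replays the three-process example step by step (the `happened_before` helper tests causality between two clocks):

```python
class VectorClock:
    """One clock per process; index pid is this process's component."""

    def __init__(self, n, pid):
        self.clock = [0] * n
        self.pid = pid

    def local_event(self):
        self.clock[self.pid] += 1

    def send(self):
        """Attach a copy of the current clock to an outgoing message."""
        return list(self.clock)

    def receive(self, msg_clock):
        """Element-wise maximum, then increment own component."""
        self.clock = [max(a, b) for a, b in zip(self.clock, msg_clock)]
        self.clock[self.pid] += 1

def happened_before(u, v):
    """u → v iff u <= v component-wise and u != v (otherwise concurrent)."""
    return all(a <= b for a, b in zip(u, v)) and u != v

# Replaying the three-process example
a, b, c = VectorClock(3, 0), VectorClock(3, 1), VectorClock(3, 2)
a.local_event()        # A: [1,0,0]
b.local_event()        # B: [0,1,0]
b.receive(a.send())    # B: max([0,1,0],[1,0,0]) then +1 → [1,2,0]
c.local_event()        # C: [0,0,1]
c.receive(b.send())    # C: max([0,0,1],[1,2,0]) then +1 → [1,2,2]
print(c.clock)         # → [1, 2, 2]
```

Note that A1 ([1,0,0]) and B1 ([0,1,0]) compare as concurrent: neither clock dominates the other, which no single scalar timestamp could express.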
      
✅ Applications
  • Distributed version control
  • Conflict detection (Cassandra)
  • Causal consistency guarantees
  • Eventual consistency tracking
  • Message ordering verification
⚠️ Limitations
  • Memory grows with N processes
  • Network overhead increases
  • Doesn't detect all concurrency
  • Partial ordering (not total)
  • Lamport clocks simpler alternative

🐘 Hadoop Ecosystem: A Complete Big Data Platform

🏗️ Hadoop Architecture Overview

Hadoop is an open-source framework for reliable, scalable, distributed computing. It follows the shared-nothing architecture where each node is independent.

📚 Hadoop Ecosystem Components

📦

HDFS

Hadoop Distributed File System

  • Block-based storage (128MB/256MB)
  • NameNode (metadata) + DataNodes (data)
  • Default replication factor: 3
  • Rack-aware placement policy
  • Write-once, append semantics
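Block size and replication translate directly into capacity math (a sketch assuming the 128 MB default block size and replication factor 3):

```python
import math

def hdfs_footprint(file_mb, block_mb=128, replication=3):
    """Blocks and raw storage needed to hold one file in HDFS.
    The last block may be partial; HDFS stores only actual bytes."""
    blocks = math.ceil(file_mb / block_mb)
    return blocks, blocks * replication, file_mb * replication

blocks, replicas, raw_mb = hdfs_footprint(1024)   # a 1 GB file
print(blocks)    # → 8 blocks
print(replicas)  # → 24 block replicas cluster-wide
print(raw_mb)    # → 3072 MB of raw capacity consumed
```

This is why HDFS capacity planning multiplies usable data by the replication factor: a 1 GB file occupies 3 GB of raw disk across the cluster.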
⚙️

YARN

Yet Another Resource Negotiator

  • Resource Manager (global scheduler)
  • Node Manager (per-node agent)
  • Application Master (per app)
  • Container-based execution model
  • Supports multiple frameworks
🔄

MapReduce v2

Distributed batch processing

  • Splits, map, shuffle, reduce
  • Combiner for local aggregation
  • Speculative execution
  • Task failure recovery
  • Counter and metrics tracking
🐝

Hive

SQL query engine for Hadoop

  • HiveQL (SQL-like language)
  • Compiles to MapReduce/Spark jobs
  • Tables with schemas (partitioned)
  • Metadata stored in metastore
  • ETL workloads
📄

HBase

NoSQL wide-column database

  • Built on HDFS
  • Column-oriented storage
  • Fast random reads by row key
  • Sorted row key range scans
  • Real-time random access

Spark

In-memory distributed computing

  • RDD (Resilient Distributed Datasets)
  • Up to 100x faster than MapReduce for in-memory workloads
  • Batch, stream, ML, SQL unified
  • In-memory caching
  • Multiple language support
🌊

Flume

Log aggregation & streaming

  • Source → Channel → Sink pipeline
  • Reliable delivery
  • Multiple sources/sinks
  • HDFS, Kafka integration
  • Log collection at scale

🏗️ Typical Hadoop Cluster Architecture

Master Nodes (Control)
NameNode
HDFS Metadata
ResourceManager
YARN Scheduler
Secondary NameNode
Metadata Checkpointing
Worker Nodes (Execution) × N
DataNode
Data Storage
NodeManager
Container Mgmt
Containers
Task Execution

🎓 Chapter 2 Complete!

You've mastered:
✅ Data Center architecture & infrastructure
✅ Cloud computing models (IaaS, PaaS, SaaS)
✅ Distributed systems (sharding, consistent hashing)
✅ MapReduce programming for massive-scale processing
✅ MVCC & Vector Clocks for consistency
✅ Complete Hadoop ecosystem architecture

📚 Next Chapter: NoSQL Databases - Scaling Beyond Relational Systems

Learn about Key-Value stores, Column families, Document databases, and Graph databases!

Last modified: Tuesday, 21 October 2025, 9:37 PM