CHAPTER 3 - CONTINUATION

Security, Integrity 

& Modern Processing

Green IT • Data Security • Spark & Flink • Big Data Research


📚 Course Structure - 6 Sections

🌱

Section 1

Green IT & Energy Efficiency

  • PUE metrics & optimization
  • Sustainable data storage
  • Advanced cooling solutions
🔐

Section 2

NoSQL Data Security

  • Encryption strategies
  • Distributed authentication
  • Threat models & defense

Section 3

Data Integrity Assurance

  • Cryptographic hashing
  • Blockchain for audit
  • Consistency verification
👤

Section 4

GDPR & Data Privacy

  • Anonymization techniques
  • Right to be forgotten
  • Compliance frameworks

Section 5

Spark & Flink Advanced

  • Stream processing security
  • Distributed optimization
  • Real-time analytics
🔬

Section 6

Research & Innovation

  • Active research areas
  • Open challenges
  • Future directions

🌱 Section 1: Green IT & Sustainable Large-Scale Storage

📊 Context: The Energy Crisis in Data Centers

3%

Of global electricity consumed by data centers

600M

Metric tons CO₂ annually from tech industry

+25%

Annual growth in data storage demands

📈 PUE: The Key Efficiency Metric

PUE = Total Facility Power / IT Equipment Power
Interpretation Guide
  • PUE ≤ 1.2 — Excellent (Google, Meta 2023)
  • PUE 1.3-1.5 — Very Good (modern data centers)
  • PUE > 1.8 — Needs improvement (legacy systems)
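The formula and interpretation guide above can be captured in a short helper, written here in Python for a runnable illustration. The "Acceptable" band for 1.5–1.8 is an assumption filling the gap the guide leaves between "Very Good" and "Needs improvement":

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power Usage Effectiveness = total facility power / IT equipment power."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_facility_kw / it_equipment_kw

def rate(p: float) -> str:
    """Map a PUE value onto the interpretation guide above."""
    if p <= 1.2:
        return "Excellent"
    if p <= 1.5:
        return "Very Good"
    if p <= 1.8:
        return "Acceptable"   # assumed label for the band the guide leaves open
    return "Needs improvement"

# A facility drawing 1200 kW total for 1000 kW of IT load:
print(rate(pue(1200, 1000)))
```

A PUE of exactly 1.0 would mean every watt goes to IT equipment; the gap above 1.0 is cooling, power distribution, and lighting overhead.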

🔧 Green Technologies for NoSQL Storage

❄️ Liquid Cooling
  • Reduces PUE by 30-40%
  • Direct-to-chip cooling
  • Heat recovery systems
  • Precision fluid circulation
🌊 Free Cooling
  • Outside air when temperature permits
  • Usable roughly 10 months/year in northern Europe
  • Reduces cooling costs
  • Strategic geographic placement
🔋 Renewable Energy
  • Google: 100% renewable
  • Solar + wind farms
  • Geothermal in Iceland
  • Hydro in Scandinavia
🤖 AI Optimization
  • DeepMind: −40% cooling energy
  • Load prediction ML
  • Dynamic adjustment
  • Predictive maintenance
💾 Efficient Replication
  • Optimized replication factor
  • Rack-aware placement
  • Data deduplication
  • Transparent compression
📍 Data Locality
  • Edge computing proximity
  • CDN distribution
  • Regional optimization
  • Low-cost regions

🔐 Section 2: Security in NoSQL Systems

⚠️ NoSQL-Specific Threat Models

1. NoSQL Injection
// Vulnerable:
db.users.find({email: userInput})

// Attack:
userInput = {$ne: null}

Operator injection bypasses the filter: {$ne: null} matches every document
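A minimal defense is to validate input types before they can reach the query engine, sketched here in Python (the `sanitize` helper and its scalar whitelist are illustrative, not a library API):

```python
# Only plain scalars are allowed as query values; dicts and lists could
# smuggle in query operators such as {"$ne": None}.
SCALARS = (str, int, float, bool, type(None))

def sanitize(value):
    """Reject any non-scalar query value before it reaches the database."""
    if not isinstance(value, SCALARS):
        raise ValueError(f"rejected non-scalar query value: {value!r}")
    return value

# A find() built from user input would then be written as:
#   db.users.find({"email": sanitize(user_input)})
safe = sanitize("alice@example.com")   # scalar: passes through unchanged
try:
    sanitize({"$ne": None})            # operator-injection attempt
except ValueError as e:
    print("blocked:", e)
```

Driver-level schema validation or parameterized query builders achieve the same effect more systematically; the point is that raw user objects must never be spliced into a filter.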

2. Unauthorized Access
  • No authentication by default
  • Weak credentials exposure
  • Unprotected network access
  • Cross-tenant data leakage
3. Unencrypted Data
  • Data at rest in plaintext
  • HTTP instead of HTTPS transit
  • Unencrypted replication
  • Unprotected backups
4. Misconfiguration
  • Binding to 0.0.0.0 exposes the service to the internet
  • Default admin credentials
  • TLS not enabled
  • Unprotected shards/nodes

🛡️ Comprehensive Security Strategies

Authentication Layer

LDAP/Kerberos

Centralized Active Directory, enterprise SSO

OAuth/SAML

Cloud integration, federation support

Multi-Factor Authentication

2FA/MFA with TOTP codes, certificates, biometrics

Encryption Strategy

At Rest

AES-256 with KMS key management

In Transit

TLS 1.3 with X.509 certificates

Replication

Encrypted replication channels with authentication

Section 3: Data Integrity in Distributed Systems

🔍 Distributed Integrity Challenges

Network Partitions

Inconsistent nodes during network failures → data divergence

Replication Lag

Replicas not in sync → stale reads possible

Node Failures

Shard crash → data loss if replication factor = 1

🛠️ Integrity Assurance Techniques

🔗 Cryptographic Hashing
SHA-256

Checksum each document, detect modifications

Merkle Trees

Hash tree for dataset integrity verification
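Both techniques can be sketched in a few lines of Python with the standard `hashlib` module; the odd-leaf duplication rule below is one common Merkle convention, not the only one:

```python
import hashlib

def sha256(data: bytes) -> bytes:
    """Checksum a single document: any modification changes the digest."""
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    """Pairwise-hash leaf checksums up to a single root; comparing two
    roots verifies an entire dataset with one 32-byte value."""
    level = [sha256(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:                 # duplicate last node on odd levels
            level.append(level[-1])
        level = [sha256(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

docs = [b'{"id": 1}', b'{"id": 2}', b'{"id": 3}']
root = merkle_root(docs)
tampered = merkle_root([b'{"id": 1}', b'{"id": 2}', b'{"id": 9}'])
assert root != tampered                    # any change surfaces at the root
```

This is why replicas can compare a single root hash instead of shipping whole datasets to detect divergence (Cassandra's anti-entropy repair uses Merkle trees this way).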

⛓️ Blockchain Audit
Append-Only Logs

Immutable transaction history, complete auditability

Hash Chaining

Each block hashes previous block
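A hash-chained append-only log is easy to sketch with the standard library; the `append_block`/`verify` helpers below are illustrative names, not a real blockchain API:

```python
import hashlib
import json

def append_block(log, record):
    """Each entry stores the hash of the previous entry, so rewriting
    history invalidates every later hash in the chain."""
    prev = log[-1]["hash"] if log else "0" * 64
    body = json.dumps({"record": record, "prev": prev}, sort_keys=True)
    log.append({"record": record, "prev": prev,
                "hash": hashlib.sha256(body.encode()).hexdigest()})

def verify(log):
    """Re-derive every hash from the genesis value; any edit breaks the chain."""
    prev = "0" * 64
    for entry in log:
        body = json.dumps({"record": entry["record"], "prev": prev},
                          sort_keys=True)
        if entry["prev"] != prev or \
           hashlib.sha256(body.encode()).hexdigest() != entry["hash"]:
            return False
        prev = entry["hash"]
    return True

log = []
append_block(log, {"op": "insert", "id": 1})
append_block(log, {"op": "delete", "id": 1})
assert verify(log)
log[0]["record"]["op"] = "update"      # tamper with history
assert not verify(log)                 # detected: hash no longer matches
```

Note what this gives and doesn't give: tampering is detectable, but without distribution or consensus nothing prevents an attacker from rewriting the whole chain; production audit systems anchor the head hash externally.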

✍️ Digital Signatures
HMAC

Message Authentication Code with shared secret

RSA Signatures

Non-repudiation and source verification
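Python's standard `hmac` module implements the HMAC pattern directly; the shared secret below is a placeholder for a key obtained from a real key manager:

```python
import hashlib
import hmac

SECRET = b"shared-secret"   # illustrative only; fetch from a KMS in practice

def sign(message: bytes) -> str:
    """Produce a message authentication tag using the shared secret."""
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()

def verify(message: bytes, tag: str) -> bool:
    """compare_digest is constant-time, avoiding timing side channels."""
    return hmac.compare_digest(sign(message), tag)

tag = sign(b'{"user": "alice"}')
assert verify(b'{"user": "alice"}', tag)       # authentic message
assert not verify(b'{"user": "mallory"}', tag) # altered message rejected
```

HMAC proves integrity and origin to anyone holding the secret, but not non-repudiation: either party could have produced the tag. That is where the RSA signatures above come in.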

👤 Section 4: GDPR & Personal Data Confidentiality

📋 GDPR in Big Data Context

🌍 Global Impact
  • GDPR (EU): fines up to €20M or 4% of global annual revenue
  • CCPA (California): up to $2,500-7,500 per violation
  • LGPD (Brazil): Similar to GDPR
  • PIPL (China): 50M yuan fines
⚖️ Core Principles
  • Lawfulness of processing
  • Data minimization
  • Purpose limitation
  • Accuracy & integrity

🔒 Privacy Protection Techniques for NoSQL

🎭 Anonymization
K-Anonymity

Each record indistinguishable from at least k-1 others

L-Diversity

Sensitive attributes have diverse values

T-Closeness

Distribution indistinguishable from overall
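Checking k-anonymity reduces to counting groups of quasi-identifier values. A minimal sketch in Python (the helper is illustrative, and it assumes generalization such as zip-code masking has already been applied):

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Every combination of quasi-identifier values must appear at least
    k times, i.e. each record hides among at least k-1 others."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers)
                     for r in records)
    return all(count >= k for count in groups.values())

rows = [
    {"zip": "750**", "age": "20-30", "disease": "flu"},
    {"zip": "750**", "age": "20-30", "disease": "cold"},
    {"zip": "130**", "age": "40-50", "disease": "flu"},
]
assert is_k_anonymous(rows, ["zip", "age"], 1)
assert not is_k_anonymous(rows, ["zip", "age"], 2)   # the 130** row is unique
```

L-diversity then adds a check that each group contains diverse sensitive values: the first group above passes (flu, cold), so k-anonymity alone is not the whole story.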

🔐 Pseudonymization
Hashed IDs

Replace PII with keyed, salted hashes (hard to brute-force back to the original)

Tokenization

Replace with random token, keep mapping secure

Format-Preserving

Keep data format for analytics
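Tokenization can be sketched as a vault that maps values to random tokens; the `TokenVault` class below is hypothetical, and real deployments keep the mapping in a separate, access-controlled store:

```python
import secrets

class TokenVault:
    """Replace a PII value with a random token. Tokens alone reveal
    nothing; only the vault's mapping can reverse them."""
    def __init__(self):
        self._forward = {}   # PII value -> token
        self._reverse = {}   # token -> PII value

    def tokenize(self, value: str) -> str:
        if value not in self._forward:
            token = secrets.token_hex(16)          # cryptographically random
            self._forward[value] = token
            self._reverse[token] = value
        return self._forward[value]

    def detokenize(self, token: str) -> str:
        return self._reverse[token]

vault = TokenVault()
t = vault.tokenize("alice@example.com")
assert t != "alice@example.com"
assert vault.detokenize(t) == "alice@example.com"
assert vault.tokenize("alice@example.com") == t    # stable mapping
```

The stable mapping is what keeps analytics possible: joins and group-bys on the token work exactly as on the original value, while the plaintext never leaves the vault.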

🗑️ Right to be Forgotten
Data Deletion

Complete removal from all systems

Backup Handling

Redact from archived backups

Cascade Delete

Remove linked data automatically

⚠️ GDPR Challenges in NoSQL Systems

Challenge | Problem | Solution
Distributed Data | Data lives in multiple shards/replicas | Cascade delete across the cluster
Immutable Logs | Append-only logs preserve old data | Redaction, not deletion
Backups | Data persists in historical snapshots | Purge old backups, anonymize new ones
Cache Layers | Data cached in Redis, Memcached | TTL policy, manual purge

Section 5: Apache Spark & Flink - Modern Large-Scale Processing

🚀 Evolution: From Batch to Real-Time

The Processing Timeline

1
Hadoop MapReduce

2006-2013
Batch only
Disk-based

2
Apache Spark

2014+
100x faster
In-memory

3
Apache Flink

2015+
Native streaming
Low latency

4
Modern Era

2020+
Unified
Real-time ML

🔥 Apache Spark: Complete Architecture

🏗️ Spark Cluster Architecture

[Diagram] Driver Program (SparkContext, DAG Scheduler) → Cluster Manager (YARN/Mesos/K8s) → Worker Nodes 1…N, each running an Executor with multiple Tasks
Driver Program
  • Runs main() function
  • Creates SparkContext
  • Builds DAG of operations
  • Schedules tasks
Cluster Manager
  • Allocates resources
  • YARN, Mesos, K8s
  • Manages worker nodes
  • Handles failures
Executors & Tasks
  • Execute actual work
  • Store RDD partitions
  • Return results to driver
  • JVM processes

💎 RDD: Resilient Distributed Dataset

What is an RDD?

Resilient Distributed Dataset - The fundamental data abstraction in Spark. An immutable, distributed collection of objects that can be processed in parallel.

🔄
Resilient

Fault-tolerant, auto-recovery

🌐
Distributed

Partitioned across cluster

📦
Dataset

Collection of data

RDD Operations: Transformations vs Actions
🔄 Transformations (Lazy)

Create new RDD from existing one. Not executed until action called.

val rdd = sc.textFile("file.txt")
val filtered = rdd.filter(line => line.contains("error"))
val mapped = filtered.map(line => line.length)
// Nothing executed yet!

Common transformations:

  • map(), filter(), flatMap()
  • groupByKey(), reduceByKey()
  • join(), union(), distinct()
⚡ Actions (Eager)

Trigger computation and return results to driver or write to storage.

val count = rdd.count() // Executes!
val first = rdd.first() // Executes!
rdd.saveAsTextFile("output") // Executes!
rdd.collect() // Brings all data to driver

Common actions:

  • count(), collect(), take(n)
  • reduce(), fold(), aggregate()
  • saveAsTextFile(), foreach()

📊 Spark SQL & DataFrames

From RDD to DataFrame: The Evolution
Feature | RDD | DataFrame | Dataset
Type Safety | Compile-time | Runtime | Compile-time
Schema | No schema | Has schema | Has schema
Optimization | Manual | Catalyst optimizer | Catalyst + Tungsten
API | Functional | SQL + functional | SQL + functional
Performance | Good | Better (2-5x) | Best (type-safe)
📝 DataFrame Code Examples
// Create DataFrame from JSON
val df = spark.read.json("users.json")

// Show schema
df.printSchema()
// root
//  |-- name: string
//  |-- age: long
//  |-- city: string

// Select & filter using DataFrame API
df.select("name", "age")
  .filter(df("age") > 25)
  .show()

// Using SQL syntax
df.createOrReplaceTempView("users")
spark.sql("""
  SELECT city, COUNT(*) as count, AVG(age) as avg_age
  FROM users
  WHERE age > 25
  GROUP BY city
  ORDER BY count DESC
""").show()

// Join with another DataFrame
val orders = spark.read.json("orders.json")
df.join(orders, df("id") === orders("user_id"))
  .groupBy("name")
  .agg(sum("amount").alias("total_spent"))
  .show()
      

🔗 Spark Integration with NoSQL Databases

🍃 MongoDB + Spark
// Read from MongoDB
val df = spark.read
  .format("mongodb")
  .option("uri", 
    "mongodb://host:27017/db.collection")
  .load()

// Process with Spark
val result = df
  .filter($"age" > 25)
  .groupBy($"city")
  .count()

// Write back to MongoDB
result.write
  .format("mongodb")
  .mode("append")
  .option("uri", 
    "mongodb://host:27017/db.results")
  .save()
        

Use Cases: Aggregation pipelines, ETL, real-time analytics on MongoDB data

💎 Cassandra + Spark
// Read from Cassandra
val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map(
    "table" -> "users",
    "keyspace" -> "analytics"
  ))
  .load()

// Filter pushed down to Cassandra
val filtered = df
  .filter($"country" === "USA")
  .select("name", "age")

filtered.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map(
    "table" -> "processed_users",
    "keyspace" -> "results"
  ))
  .save()
        

Use Cases: Time-series analysis, batch processing, data migration

🤖 Spark MLlib: Machine Learning at Scale

Complete ML Pipeline Example
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{VectorAssembler, StandardScaler}
import org.apache.spark.ml.Pipeline

// Load data from NoSQL
val data = spark.read
  .format("mongodb")
  .option("uri", "mongodb://localhost/ml.training")
  .load()

// Feature engineering
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "income", "credit_score"))
  .setOutputCol("features")

val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")

// Train model
val lr = new LogisticRegression()
  .setFeaturesCol("scaledFeatures")
  .setLabelCol("default")
  .setMaxIter(100)

// Create pipeline
val pipeline = new Pipeline()
  .setStages(Array(assembler, scaler, lr))

val model = pipeline.fit(data)

// Make predictions and save to NoSQL
val predictions = model.transform(testData)

predictions.write
  .format("mongodb")
  .option("uri", "mongodb://localhost/ml.predictions")
  .save()
      

🚀 Spark Performance Optimization

🎯 Partitioning Strategy
  • repartition(): Increase/decrease partitions
  • coalesce(): Reduce partitions efficiently
  • partitionBy(): Custom partitioning logic
  • Optimal: 2-4 partitions per core
💾 Caching & Persistence
  • cache(): Store in memory (default)
  • persist(): Custom storage level
  • MEMORY_AND_DISK: Spill to disk if needed
  • Cache reused DataFrames only
⚡ Broadcast Variables
  • Efficiently share large read-only data
  • Broadcast small lookup tables
  • Avoid shuffling large data
  • Example: broadcast joins
🔀 Shuffle Optimization
  • Minimize shuffle operations
  • Use reduceByKey over groupByKey
  • Configure shuffle partitions
  • spark.sql.shuffle.partitions=200
🎛️ Memory Configuration
  • spark.executor.memory
  • spark.driver.memory
  • spark.memory.fraction=0.6
  • Monitor GC overhead
📊 Catalyst Optimizer
  • Automatic query optimization
  • Predicate pushdown
  • Column pruning
  • Constant folding

🔐 Security in Apache Spark

Authentication & Authorization
Kerberos Integration

Enterprise authentication via Kerberos tickets

SASL Support

Secure communication between components

ACL Management

Fine-grained access control on data & operations

Encryption
TLS/SSL

Network encryption for all communications

At-Rest Encryption

Encrypt data stored on disk and in memory

Credential Management

Secure storage via Hadoop Credential Provider

📡 Spark Streaming: Real-Time Processing

The DStream Abstraction

DStream (Discretized Stream): A continuous stream of RDDs. Spark Streaming divides incoming stream into batches, treating each batch as an RDD.

Live Data Stream → Batches (e.g., 1 sec) → RDDs → Transformations → Output

📥 Data Sources for Streaming
Kafka

Fault-tolerant message queue, exactly-once semantics

val kafkaStream = KafkaUtils
.createDirectStream(ssc, ...)
.map(record => record._2)
TCP Socket

Simple socket connection for streaming

val socketStream = ssc
.socketTextStream("localhost", 9999)
HDFS/S3

Monitor directory for new files

val fileStream = ssc
.textFileStream("hdfs://path")
Kinesis / Pulsar

AWS managed or Apache messaging

val kinesisStream =
KinesisUtils.createStream(...)
📝 Complete Streaming Example
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka.KafkaUtils

// Create Spark Streaming context
val ssc = new StreamingContext(sparkConf, Seconds(2))

// Create Kafka stream
val kafkaStream = KafkaUtils
  .createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc,
    Map("bootstrap.servers" -> "localhost:9092"),
    Set("events")
  )

// Parse JSON and filter (parse failures are silently dropped)
val events = kafkaStream
  .map(_._2)                              // keep the message value only
  .flatMap(JSON.parseFull(_))             // Option[Any]: unparsable records removed
  .map(_.asInstanceOf[Map[String, Any]])
  .filter(event => event("eventType") == "click")

// Window operations (count clicks per 10 seconds)
val clickCounts = events
  .map(event => ("clicks", 1))
  .reduceByKeyAndWindow(
    (a, b) => a + b,
    Seconds(10),  // window size
    Seconds(2)   // slide interval
  )

// Write to MongoDB every batch
clickCounts.foreachRDD { rdd =>
  rdd.toDF("metric", "count")
    .write
    .format("mongodb")
    .mode("append")
    .option("uri", "mongodb://localhost/analytics.metrics")
    .save()
}

// Start streaming
ssc.start()
ssc.awaitTermination()
    
⚙️ Key Streaming Operations
Operation | Purpose | Example
map() | Transform each element | dstream.map(x => x * 2)
window() | Operate on sliding windows | dstream.window(Seconds(30), Seconds(10))
updateStateByKey() | Maintain state across batches | Track running total per key
join() | Combine with another stream | stream1.join(stream2)
foreachRDD() | Execute code per batch | Save to DB, API calls

🌊 Apache Flink: Native Stream Processing

Why Flink? The Streaming-First Platform

Spark Streaming: Micro-batching (DStreams as sequence of RDDs)
Apache Flink: True streaming (continuous processing of events)

Flink processes data as it arrives, event-by-event, with sub-second latency

🏗️ Flink Architecture
[Diagram] JobManager (Master: Dispatcher, ResourceManager, TaskManager coordination) ↔ TaskManagers 1…2, each with Slots running Tasks; data streams flow between slots
JobManager
  • Coordinates execution
  • Manages slots
  • Handles failover
  • Master process
TaskManager
  • Executes tasks
  • Manages slots
  • Memory management
  • Worker processes
Slots
  • Parallel subtasks
  • Resource units
  • Isolated JVM
  • Task deployment

📊 Flink DataStream API

Complete Flink Streaming Example
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer

// Set up streaming environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(4)

// Create Kafka source
val kafkaConsumer = new FlinkKafkaConsumer[String](
  "events",
  new SimpleStringSchema(),
  kafkaProperties
)

val stream = env
  .addSource(kafkaConsumer)
  .map(event => JSON.parseFull(event).get  // assumes well-formed JSON
    .asInstanceOf[Map[String, Any]])
  .filter(event => event("type") == "purchase")

// Time-based windowing (tumbling window, 10 seconds)
val windowed = stream
  .map(event => (
    event("product_id").asInstanceOf[String],
    event("amount").asInstanceOf[Double]
  ))
  .keyBy(0)
  .timeWindow(Time.seconds(10))
  .sum(1)

// Side output for late data (watermark)
val lateOutputTag = new OutputTag[String]("late")

// Session windowing (events grouped if < 5 sec apart)
val sessions = stream
  .keyBy(event => event("user_id").asInstanceOf[String])
  .window(EventTimeSessionWindows.withGap(Time.seconds(5)))
  .apply(new WindowFunction[...])

// Write to multiple sinks
stream.addSink(new MongoDBSink("mongodb://localhost/raw_events"))

windowed.addSink(new ElasticsearchSink("elasticsearch://localhost/aggregates"))

// Execute
env.execute("Flink Streaming Job")
    
🎯 Key Flink Concepts
Event Time vs Processing Time
Event Time

When event occurred (in event data)

Processing Time

When system processes it (current clock)

Watermarks

Temporal progress indicator, handles late data

Windowing in Flink
Tumbling Windows

Fixed non-overlapping intervals (10 sec)

Sliding Windows

Overlapping windows (10 sec size, 5 sec slide)

Session Windows

Events grouped by inactivity gap
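Window assignment itself is simple arithmetic, independent of Flink's API. A sketch of tumbling and sliding assignment (helper names are illustrative):

```python
def tumbling_window(ts: float, size: float) -> tuple:
    """Assign a timestamp to its fixed, non-overlapping window."""
    start = (ts // size) * size
    return (start, start + size)

def sliding_windows(ts: float, size: float, slide: float) -> list:
    """A timestamp belongs to every overlapping window that covers it
    (size // slide windows when slide divides size)."""
    first = ((ts - size) // slide + 1) * slide
    return [(s, s + size)
            for s in (first + i * slide for i in range(int(size // slide)))
            if s <= ts < s + size]

# An event at t=12s with 10s windows:
assert tumbling_window(12.0, 10.0) == (10.0, 20.0)           # exactly one window
assert sliding_windows(12.0, 10.0, 5.0) == [(5.0, 15.0),     # two overlapping
                                            (10.0, 20.0)]    # windows cover it
```

Session windows cannot be computed per-event like this: their boundaries depend on the gap to neighboring events, which is why they need the stateful `withGap` machinery shown in the Flink example above.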

⚡ Spark Streaming vs Flink: Detailed Comparison

Aspect | Apache Spark | Apache Flink
Architecture | Micro-batching (DStreams) | True streaming (event-by-event)
Latency | Seconds (batch interval) | Sub-second (milliseconds)
Event Time | Supported (Spark 2.0+) | Native first-class citizen
Exactly-Once | With checkpointing | Guaranteed (distributed snapshots)
State Management | Limited (updateStateByKey) | Sophisticated (state backends)
Batch Integration | Native (same RDD/DataFrame) | Via separate batch API
SQL Support | Yes (Spark SQL) | Via Flink SQL
ML Integration | MLlib streaming | Via external systems
Maturity | Very mature (2013+) | Growing (2015+)

🎯 When to Choose What?

✅ Choose Spark Streaming When:
  • Need unified batch + streaming
  • Machine learning important
  • Latency > 1 second acceptable
  • Team knows Spark well
  • SQL querying needed
  • Complex batch transformations
✅ Choose Flink When:
  • Sub-second latency critical
  • Complex event processing
  • Sophisticated state management
  • Event time semantics crucial
  • Dedicated streaming job
  • Complex windowing logic

🚀 Deployment Strategies & Integration with NoSQL

Lambda Architecture: Batch + Real-Time
Data Source

├─→ Speed Layer (Flink/Spark Streaming) → Real-time View (Redis/MongoDB)
├─→ Batch Layer (Spark Batch) → Batch View (HBase/Cassandra)

Serving Layer (Merge results from both)
Example: Real-Time Analytics Pipeline
// Kafka → Flink → Redis (real-time)
env.addSource(KafkaConsumer())
  .map(event => aggregate(event))
  .addSink(RedisSink("localhost:6379"))

// HDFS → Spark → Cassandra (batch)
spark.read.parquet("hdfs://data/raw")
  .groupBy("metric")
  .agg(sum("value"))
  .write
  .format("org.apache.spark.sql.cassandra")
  .save()

// Query layer merges both
SELECT * FROM redis_metrics
UNION ALL
SELECT * FROM cassandra_aggregates
ORDER BY timestamp DESC
      

🔬 Section 6: Research Frontiers & Future Directions

🎯 Priority Research Areas 2024-2025

1. 🤖 AI-Powered Optimization
  • ML for query optimization
  • Self-tuning databases
  • Adaptive indexing strategies
  • ML-based sharding policies
2. 📊 Hybrid Architectures
  • OLTP + OLAP fusion
  • HTAP systems
  • Real-time analytics
  • Lambda/Kappa architectures
3. 🔐 End-to-End Encryption
  • Homomorphic encryption
  • Query on encrypted data
  • Zero-knowledge proofs
  • Searchable encryption
4. 🌍 Edge & Fog Computing
  • Distributed processing near data
  • IoT database systems
  • Lightweight embedded NoSQL
  • Bandwidth optimization
5. ⛓️ Blockchain Integration
  • Smart contracts for data
  • Decentralized databases
  • Immutable audit trails
  • Distributed consensus
6. 🌿 Green Data Systems
  • Carbon-neutral computing
  • Energy-aware storage
  • Hardware optimization
  • Renewable-powered DCs

❓ Open Challenges in Big Data

⚙️ Technical Challenges
Consistency vs Scalability

CAP theorem limits - find optimal tradeoffs

Operational Complexity

Managing 1000+ nodes with automation

Query Optimization

Distributed query planning at scale

🛡️ Security & Privacy
Data Privacy at Scale

Anonymization without utility loss

Secure Computation

Processing without decryption

Compliance Automation

Automated GDPR/CCPA compliance

🔮 Emerging Trends & Technologies

Trend | Description | Timeline
Vector Databases | Native support for embeddings & similarity search | 2024-2025
AI-Optimized Hardware | Specialized accelerators for ML workloads | 2025+
Serverless Data | Managed, auto-scaling, no infrastructure to run | 2025+
Quantum-Safe Encryption | Post-quantum cryptography for data security | 2026+

🎓 Course Finished!
