CHAPTER 3 - CONTINUATION

Security, Integrity 

& Modern Processing

Green IT • Data Security • Spark & Flink • Big Data Research


📚 Course Structure - 6 Sections

🌱

Section 1

Green IT & Energy Efficiency

  • PUE metrics & optimization
  • Sustainable data storage
  • Advanced cooling solutions
🔐

Section 2

NoSQL Data Security

  • Encryption strategies
  • Distributed authentication
  • Threat models & defense

Section 3

Data Integrity Assurance

  • Cryptographic hashing
  • Blockchain for audit
  • Consistency verification
👤

Section 4

GDPR & Data Privacy

  • Anonymization techniques
  • Right to be forgotten
  • Compliance frameworks

Section 5

Spark & Flink Advanced

  • Stream processing security
  • Distributed optimization
  • Real-time analytics
🔬

Section 6

Research & Innovation

  • Active research areas
  • Open challenges
  • Future directions

🌱 Section 1: Green IT & Sustainable Large-Scale Storage

📊 Context: The Energy Crisis in Data Centers

3%

Of global electricity consumed by data centers

600M

Metric tons CO₂ annually from tech industry

+25%

Annual growth in data storage demands

📈 PUE: The Key Efficiency Metric

PUE = Total Facility Power / IT Equipment Power
Interpretation Guide
  • PUE ≤ 1.2 — Excellent (Google, Meta 2023)
  • PUE 1.3-1.5 — Very Good (modern data centers)
  • PUE > 1.8 — Needs improvement (legacy systems)
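The formula and interpretation guide above can be captured in a short helper, written here in Python for a runnable illustration. The "Acceptable" band for 1.5–1.8 is an assumption filling the gap the guide leaves between "Very Good" and "Needs improvement":

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power Usage Effectiveness = total facility power / IT equipment power."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_facility_kw / it_equipment_kw

def rate(p: float) -> str:
    """Map a PUE value onto the interpretation guide above."""
    if p <= 1.2:
        return "Excellent"
    if p <= 1.5:
        return "Very Good"
    if p <= 1.8:
        return "Acceptable"   # assumed label for the band the guide leaves open
    return "Needs improvement"

# A facility drawing 1200 kW total for 1000 kW of IT load:
print(rate(pue(1200, 1000)))
```

A PUE of exactly 1.0 would mean every watt goes to IT equipment; the gap above 1.0 is cooling, power distribution, and lighting overhead.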

🔧 Green Technologies for NoSQL Storage

❄️ Liquid Cooling
  • Reduces PUE by 30-40%
  • Direct-to-chip cooling
  • Heat recovery systems
  • Precision fluid circulation
🌊 Free Cooling
  • Outside air when temperature permits
  • Usable roughly 10 months/year in northern Europe
  • Reduces cooling costs
  • Strategic geographic placement
🔋 Renewable Energy
  • Google: 100% renewable
  • Solar + wind farms
  • Geothermal in Iceland
  • Hydro in Scandinavia
🤖 AI Optimization
  • DeepMind: −40% cooling energy
  • Load prediction ML
  • Dynamic adjustment
  • Predictive maintenance
💾 Efficient Replication
  • Optimized replication factor
  • Rack-aware placement
  • Data deduplication
  • Transparent compression
📍 Data Locality
  • Edge computing proximity
  • CDN distribution
  • Regional optimization
  • Low-cost regions

🔐 Section 2: Security in NoSQL Systems

⚠️ NoSQL-Specific Threat Models

1. NoSQL Injection
// Vulnerable:
db.users.find({email: userInput})

// Attack:
userInput = {$ne: null}

Operator injection bypasses the filter: {$ne: null} matches every document
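A minimal defense is to validate input types before they can reach the query engine, sketched here in Python (the `sanitize` helper and its scalar whitelist are illustrative, not a library API):

```python
# Only plain scalars are allowed as query values; dicts and lists could
# smuggle in query operators such as {"$ne": None}.
SCALARS = (str, int, float, bool, type(None))

def sanitize(value):
    """Reject any non-scalar query value before it reaches the database."""
    if not isinstance(value, SCALARS):
        raise ValueError(f"rejected non-scalar query value: {value!r}")
    return value

# A find() built from user input would then be written as:
#   db.users.find({"email": sanitize(user_input)})
safe = sanitize("alice@example.com")   # scalar: passes through unchanged
try:
    sanitize({"$ne": None})            # operator-injection attempt
except ValueError as e:
    print("blocked:", e)
```

Driver-level schema validation or parameterized query builders achieve the same effect more systematically; the point is that raw user objects must never be spliced into a filter.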

2. Unauthorized Access
  • No authentication by default
  • Weak credentials exposure
  • Unprotected network access
  • Cross-tenant data leakage
3. Unencrypted Data
  • Data at rest in plaintext
  • HTTP instead of HTTPS transit
  • Unencrypted replication
  • Unprotected backups
4. Misconfiguration
  • Binding to 0.0.0.0 exposes the service to the internet
  • Default admin credentials
  • TLS not enabled
  • Unprotected shards/nodes

🛡️ Comprehensive Security Strategies

Authentication Layer

LDAP/Kerberos

Centralized Active Directory, enterprise SSO

OAuth/SAML

Cloud integration, federation support

Multi-Factor Authentication

2FA/MFA with TOTP codes, certificates, biometrics

Encryption Strategy

At Rest

AES-256 with KMS key management

In Transit

TLS 1.3 with X.509 certificates

Replication

Encrypted replication channels with authentication

Section 3: Data Integrity in Distributed Systems

🔍 Distributed Integrity Challenges

Network Partitions

Inconsistent nodes during network failures → data divergence

Replication Lag

Replicas not in sync → stale reads possible

Node Failures

Shard crash → data loss if replication factor = 1

🛠️ Integrity Assurance Techniques

🔗 Cryptographic Hashing
SHA-256

Checksum each document, detect modifications

Merkle Trees

Hash tree for dataset integrity verification
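Both techniques can be sketched in a few lines of Python with the standard `hashlib` module; the odd-leaf duplication rule below is one common Merkle convention, not the only one:

```python
import hashlib

def sha256(data: bytes) -> bytes:
    """Checksum a single document: any modification changes the digest."""
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    """Pairwise-hash leaf checksums up to a single root; comparing two
    roots verifies an entire dataset with one 32-byte value."""
    level = [sha256(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:                 # duplicate last node on odd levels
            level.append(level[-1])
        level = [sha256(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

docs = [b'{"id": 1}', b'{"id": 2}', b'{"id": 3}']
root = merkle_root(docs)
tampered = merkle_root([b'{"id": 1}', b'{"id": 2}', b'{"id": 9}'])
assert root != tampered                    # any change surfaces at the root
```

This is why replicas can compare a single root hash instead of shipping whole datasets to detect divergence (Cassandra's anti-entropy repair uses Merkle trees this way).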

⛓️ Blockchain Audit
Append-Only Logs

Immutable transaction history, complete auditability

Hash Chaining

Each block hashes previous block
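A hash-chained append-only log is easy to sketch with the standard library; the `append_block`/`verify` helpers below are illustrative names, not a real blockchain API:

```python
import hashlib
import json

def append_block(log, record):
    """Each entry stores the hash of the previous entry, so rewriting
    history invalidates every later hash in the chain."""
    prev = log[-1]["hash"] if log else "0" * 64
    body = json.dumps({"record": record, "prev": prev}, sort_keys=True)
    log.append({"record": record, "prev": prev,
                "hash": hashlib.sha256(body.encode()).hexdigest()})

def verify(log):
    """Re-derive every hash from the genesis value; any edit breaks the chain."""
    prev = "0" * 64
    for entry in log:
        body = json.dumps({"record": entry["record"], "prev": prev},
                          sort_keys=True)
        if entry["prev"] != prev or \
           hashlib.sha256(body.encode()).hexdigest() != entry["hash"]:
            return False
        prev = entry["hash"]
    return True

log = []
append_block(log, {"op": "insert", "id": 1})
append_block(log, {"op": "delete", "id": 1})
assert verify(log)
log[0]["record"]["op"] = "update"      # tamper with history
assert not verify(log)                 # detected: hash no longer matches
```

Note what this gives and doesn't give: tampering is detectable, but without distribution or consensus nothing prevents an attacker from rewriting the whole chain; production audit systems anchor the head hash externally.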

✍️ Digital Signatures
HMAC

Message Authentication Code with shared secret

RSA Signatures

Non-repudiation and source verification
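Python's standard `hmac` module implements the HMAC pattern directly; the shared secret below is a placeholder for a key obtained from a real key manager:

```python
import hashlib
import hmac

SECRET = b"shared-secret"   # illustrative only; fetch from a KMS in practice

def sign(message: bytes) -> str:
    """Produce a message authentication tag using the shared secret."""
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()

def verify(message: bytes, tag: str) -> bool:
    """compare_digest is constant-time, avoiding timing side channels."""
    return hmac.compare_digest(sign(message), tag)

tag = sign(b'{"user": "alice"}')
assert verify(b'{"user": "alice"}', tag)       # authentic message
assert not verify(b'{"user": "mallory"}', tag) # altered message rejected
```

HMAC proves integrity and origin to anyone holding the secret, but not non-repudiation: either party could have produced the tag. That is where the RSA signatures above come in.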

👤 Section 4: GDPR & Personal Data Confidentiality

📋 GDPR in Big Data Context

🌍 Global Impact
  • GDPR (EU): fines up to €20M or 4% of global annual revenue
  • CCPA (California): up to $2,500-7,500 per violation
  • LGPD (Brazil): Similar to GDPR
  • PIPL (China): 50M yuan fines
⚖️ Core Principles
  • Lawfulness of processing
  • Data minimization
  • Purpose limitation
  • Accuracy & integrity

🔒 Privacy Protection Techniques for NoSQL

🎭 Anonymization
K-Anonymity

Each record indistinguishable from at least k-1 others

L-Diversity

Sensitive attributes have diverse values

T-Closeness

Distribution indistinguishable from overall
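Checking k-anonymity reduces to counting groups of quasi-identifier values. A minimal sketch in Python (the helper is illustrative, and it assumes generalization such as zip-code masking has already been applied):

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Every combination of quasi-identifier values must appear at least
    k times, i.e. each record hides among at least k-1 others."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers)
                     for r in records)
    return all(count >= k for count in groups.values())

rows = [
    {"zip": "750**", "age": "20-30", "disease": "flu"},
    {"zip": "750**", "age": "20-30", "disease": "cold"},
    {"zip": "130**", "age": "40-50", "disease": "flu"},
]
assert is_k_anonymous(rows, ["zip", "age"], 1)
assert not is_k_anonymous(rows, ["zip", "age"], 2)   # the 130** row is unique
```

L-diversity then adds a check that each group contains diverse sensitive values: the first group above passes (flu, cold), so k-anonymity alone is not the whole story.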

🔐 Pseudonymization
Hashed IDs

Replace PII with keyed, salted hashes (hard to brute-force back to the original)

Tokenization

Replace with random token, keep mapping secure

Format-Preserving

Keep data format for analytics
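Tokenization can be sketched as a vault that maps values to random tokens; the `TokenVault` class below is hypothetical, and real deployments keep the mapping in a separate, access-controlled store:

```python
import secrets

class TokenVault:
    """Replace a PII value with a random token. Tokens alone reveal
    nothing; only the vault's mapping can reverse them."""
    def __init__(self):
        self._forward = {}   # PII value -> token
        self._reverse = {}   # token -> PII value

    def tokenize(self, value: str) -> str:
        if value not in self._forward:
            token = secrets.token_hex(16)          # cryptographically random
            self._forward[value] = token
            self._reverse[token] = value
        return self._forward[value]

    def detokenize(self, token: str) -> str:
        return self._reverse[token]

vault = TokenVault()
t = vault.tokenize("alice@example.com")
assert t != "alice@example.com"
assert vault.detokenize(t) == "alice@example.com"
assert vault.tokenize("alice@example.com") == t    # stable mapping
```

The stable mapping is what keeps analytics possible: joins and group-bys on the token work exactly as on the original value, while the plaintext never leaves the vault.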

🗑️ Right to be Forgotten
Data Deletion

Complete removal from all systems

Backup Handling

Redact from archived backups

Cascade Delete

Remove linked data automatically

⚠️ GDPR Challenges in NoSQL Systems

Challenge | Problem | Solution
Distributed Data | Data lives in multiple shards/replicas | Cascade delete across the cluster
Immutable Logs | Append-only logs preserve old data | Redaction, not deletion
Backups | Data persists in historical snapshots | Purge old backups, anonymize new ones
Cache Layers | Data cached in Redis, Memcached | TTL policy, manual purge

Section 5: Apache Spark & Flink - Modern Large-Scale Processing

🚀 Evolution: From Batch to Real-Time

The Processing Timeline

1
Hadoop MapReduce

2006-2013
Batch only
Disk-based

2
Apache Spark

2014+
100x faster
In-memory

3
Apache Flink

2015+
Native streaming
Low latency

4
Modern Era

2020+
Unified
Real-time ML

🔥 Apache Spark: Complete Architecture

🏗️ Spark Cluster Architecture

[Diagram] Driver Program (SparkContext, DAG Scheduler) → Cluster Manager (YARN/Mesos/K8s) → Worker Nodes 1…N, each running an Executor with multiple Tasks
Driver Program
  • Runs main() function
  • Creates SparkContext
  • Builds DAG of operations
  • Schedules tasks
Cluster Manager
  • Allocates resources
  • YARN, Mesos, K8s
  • Manages worker nodes
  • Handles failures
Executors & Tasks
  • Execute actual work
  • Store RDD partitions
  • Return results to driver
  • JVM processes

💎 RDD: Resilient Distributed Dataset

What is an RDD?

Resilient Distributed Dataset - The fundamental data abstraction in Spark. An immutable, distributed collection of objects that can be processed in parallel.

🔄
Resilient

Fault-tolerant, auto-recovery

🌐
Distributed

Partitioned across cluster

📦
Dataset

Collection of data

RDD Operations: Transformations vs Actions
🔄 Transformations (Lazy)

Create new RDD from existing one. Not executed until action called.

val rdd = sc.textFile("file.txt")
val filtered = rdd.filter(line => line.contains("error"))
val mapped = filtered.map(line => line.length)
// Nothing executed yet!

Common transformations:

  • map(), filter(), flatMap()
  • groupByKey(), reduceByKey()
  • join(), union(), distinct()
⚡ Actions (Eager)

Trigger computation and return results to driver or write to storage.

val count = rdd.count() // Executes!
val first = rdd.first() // Executes!
rdd.saveAsTextFile("output") // Executes!
rdd.collect() // Brings all data to driver

Common actions:

  • count(), collect(), take(n)
  • reduce(), fold(), aggregate()
  • saveAsTextFile(), foreach()

📊 Spark SQL & DataFrames

From RDD to DataFrame: The Evolution
Feature | RDD | DataFrame | Dataset
Type Safety | Compile-time | Runtime | Compile-time
Schema | No schema | Has schema | Has schema
Optimization | Manual | Catalyst optimizer | Catalyst + Tungsten
API | Functional | SQL + functional | SQL + functional
Performance | Good | Better (2-5x) | Best (type-safe)
📝 DataFrame Code Examples
// Create DataFrame from JSON
val df = spark.read.json("users.json")

// Show schema
df.printSchema()
// root
//  |-- name: string
//  |-- age: long
//  |-- city: string

// Select & filter using DataFrame API
df.select("name", "age")
  .filter(df("age") > 25)
  .show()

// Using SQL syntax
df.createOrReplaceTempView("users")
spark.sql("""
  SELECT city, COUNT(*) as count, AVG(age) as avg_age
  FROM users
  WHERE age > 25
  GROUP BY city
  ORDER BY count DESC
""").show()

// Join with another DataFrame
val orders = spark.read.json("orders.json")
df.join(orders, df("id") === orders("user_id"))
  .groupBy("name")
  .agg(sum("amount").alias("total_spent"))
  .show()
      

🔗 Spark Integration with NoSQL Databases

🍃 MongoDB + Spark
// Read from MongoDB
val df = spark.read
  .format("mongodb")
  .option("uri", 
    "mongodb://host:27017/db.collection")
  .load()

// Process with Spark
val result = df
  .filter($"age" > 25)
  .groupBy($"city")
  .count()

// Write back to MongoDB
result.write
  .format("mongodb")
  .mode("append")
  .option("uri", 
    "mongodb://host:27017/db.results")
  .save()
        

Use Cases: Aggregation pipelines, ETL, real-time analytics on MongoDB data

💎 Cassandra + Spark
// Read from Cassandra
val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map(
    "table" -> "users",
    "keyspace" -> "analytics"
  ))
  .load()

// Filter pushed down to Cassandra
val filtered = df
  .filter($"country" === "USA")
  .select("name", "age")

filtered.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map(
    "table" -> "processed_users",
    "keyspace" -> "results"
  ))
  .save()
        

Use Cases: Time-series analysis, batch processing, data migration

🤖 Spark MLlib: Machine Learning at Scale

Complete ML Pipeline Example
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{VectorAssembler, StandardScaler}
import org.apache.spark.ml.Pipeline

// Load data from NoSQL
val data = spark.read
  .format("mongodb")
  .option("uri", "mongodb://localhost/ml.training")
  .load()

// Feature engineering
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "income", "credit_score"))
  .setOutputCol("features")

val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")

// Train model
val lr = new LogisticRegression()
  .setFeaturesCol("scaledFeatures")
  .setLabelCol("default")
  .setMaxIter(100)

// Create pipeline
val pipeline = new Pipeline()
  .setStages(Array(assembler, scaler, lr))

val model = pipeline.fit(data)

// Make predictions and save to NoSQL
val predictions = model.transform(testData)

predictions.write
  .format("mongodb")
  .option("uri", "mongodb://localhost/ml.predictions")
  .save()
      

🚀 Spark Performance Optimization

🎯 Partitioning Strategy
  • repartition(): Increase/decrease partitions
  • coalesce(): Reduce partitions efficiently
  • partitionBy(): Custom partitioning logic
  • Optimal: 2-4 partitions per core
💾 Caching & Persistence
  • cache(): Store in memory (default)
  • persist(): Custom storage level
  • MEMORY_AND_DISK: Spill to disk if needed
  • Cache reused DataFrames only
⚡ Broadcast Variables
  • Efficiently share large read-only data
  • Broadcast small lookup tables
  • Avoid shuffling large data
  • Example: broadcast joins
🔀 Shuffle Optimization
  • Minimize shuffle operations
  • Use reduceByKey over groupByKey
  • Configure shuffle partitions
  • spark.sql.shuffle.partitions=200
🎛️ Memory Configuration
  • spark.executor.memory
  • spark.driver.memory
  • spark.memory.fraction=0.6
  • Monitor GC overhead
📊 Catalyst Optimizer
  • Automatic query optimization
  • Predicate pushdown
  • Column pruning
  • Constant folding

🔐 Security in Apache Spark

Authentication & Authorization
Kerberos Integration

Enterprise authentication via Kerberos tickets

SASL Support

Secure communication between components

ACL Management

Fine-grained access control on data & operations

Encryption
TLS/SSL

Network encryption for all communications

At-Rest Encryption

Encrypt data stored on disk and in memory

Credential Management

Secure storage via Hadoop Credential Provider

📡 Spark Streaming: Real-Time Processing

The DStream Abstraction

DStream (Discretized Stream): A continuous stream of RDDs. Spark Streaming divides incoming stream into batches, treating each batch as an RDD.

Live Data Stream → Batches (e.g., 1 sec) → RDDs → Transformations → Output

📥 Data Sources for Streaming
Kafka

Fault-tolerant message queue, exactly-once semantics

val kafkaStream = KafkaUtils
.createDirectStream(ssc, ...)
.map(record => record._2)
TCP Socket

Simple socket connection for streaming

val socketStream = ssc
.socketTextStream("localhost", 9999)
HDFS/S3

Monitor directory for new files

val fileStream = ssc
.textFileStream("hdfs://path")
Kinesis / Pulsar

AWS managed or Apache messaging

val kinesisStream =
KinesisUtils.createStream(...)
📝 Complete Streaming Example
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka.KafkaUtils

// Create Spark Streaming context
val ssc = new StreamingContext(sparkConf, Seconds(2))

// Create Kafka stream
val kafkaStream = KafkaUtils
  .createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc,
    Map("bootstrap.servers" -> "localhost:9092"),
    Set("events")
  )

// Parse JSON and filter (parse failures are silently dropped)
val events = kafkaStream
  .map(_._2)                              // keep the message value only
  .flatMap(JSON.parseFull(_))             // Option[Any]: unparsable records removed
  .map(_.asInstanceOf[Map[String, Any]])
  .filter(event => event("eventType") == "click")

// Window operations (count clicks per 10 seconds)
val clickCounts = events
  .map(event => ("clicks", 1))
  .reduceByKeyAndWindow(
    (a, b) => a + b,
    Seconds(10),  // window size
    Seconds(2)   // slide interval
  )

// Write to MongoDB every batch
clickCounts.foreachRDD { rdd =>
  rdd.toDF("metric", "count")
    .write
    .format("mongodb")
    .mode("append")
    .option("uri", "mongodb://localhost/analytics.metrics")
    .save()
}

// Start streaming
ssc.start()
ssc.awaitTermination()
    
⚙️ Key Streaming Operations
Operation | Purpose | Example
map() | Transform each element | dstream.map(x => x * 2)
window() | Operate on sliding windows | dstream.window(Seconds(30), Seconds(10))
updateStateByKey() | Maintain state across batches | Track running total per key
join() | Combine with another stream | stream1.join(stream2)
foreachRDD() | Execute code per batch | Save to DB, API calls

🌊 Apache Flink: Native Stream Processing

Why Flink? The Streaming-First Platform

Spark Streaming: Micro-batching (DStreams as sequence of RDDs)
Apache Flink: True streaming (continuous processing of events)

Flink processes data as it arrives, event-by-event, with sub-second latency

🏗️ Flink Architecture
[Diagram] JobManager (Master: Dispatcher, ResourceManager, TaskManager coordination) ↔ TaskManagers 1…2, each with Slots running Tasks; data streams flow between slots
JobManager
  • Coordinates execution
  • Manages slots
  • Handles failover
  • Master process
TaskManager
  • Executes tasks
  • Manages slots
  • Memory management
  • Worker processes
Slots
  • Parallel subtasks
  • Resource units
  • Isolated JVM
  • Task deployment

📊 Flink DataStream API

Complete Flink Streaming Example
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer

// Set up streaming environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(4)

// Create Kafka source
val kafkaConsumer = new FlinkKafkaConsumer[String](
  "events",
  new SimpleStringSchema(),
  kafkaProperties
)

val stream = env
  .addSource(kafkaConsumer)
  .map(event => JSON.parseFull(event).get  // assumes well-formed JSON
    .asInstanceOf[Map[String, Any]])
  .filter(event => event("type") == "purchase")

// Time-based windowing (tumbling window, 10 seconds)
val windowed = stream
  .map(event => (
    event("product_id").asInstanceOf[String],
    event("amount").asInstanceOf[Double]
  ))
  .keyBy(0)
  .timeWindow(Time.seconds(10))
  .sum(1)

// Side output for late data (watermark)
val lateOutputTag = new OutputTag[String]("late")

// Session windowing (events grouped if < 5 sec apart)
val sessions = stream
  .keyBy(event => event("user_id").asInstanceOf[String])
  .window(EventTimeSessionWindows.withGap(Time.seconds(5)))
  .apply(new WindowFunction[...])

// Write to multiple sinks
stream.addSink(new MongoDBSink("mongodb://localhost/raw_events"))

windowed.addSink(new ElasticsearchSink("elasticsearch://localhost/aggregates"))

// Execute
env.execute("Flink Streaming Job")
    
🎯 Key Flink Concepts
Event Time vs Processing Time
Event Time

When event occurred (in event data)

Processing Time

When system processes it (current clock)

Watermarks

Temporal progress indicator, handles late data

Windowing in Flink
Tumbling Windows

Fixed non-overlapping intervals (10 sec)

Sliding Windows

Overlapping windows (10 sec size, 5 sec slide)

Session Windows

Events grouped by inactivity gap
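Window assignment itself is simple arithmetic, independent of Flink's API. A sketch of tumbling and sliding assignment (helper names are illustrative):

```python
def tumbling_window(ts: float, size: float) -> tuple:
    """Assign a timestamp to its fixed, non-overlapping window."""
    start = (ts // size) * size
    return (start, start + size)

def sliding_windows(ts: float, size: float, slide: float) -> list:
    """A timestamp belongs to every overlapping window that covers it
    (size // slide windows when slide divides size)."""
    first = ((ts - size) // slide + 1) * slide
    return [(s, s + size)
            for s in (first + i * slide for i in range(int(size // slide)))
            if s <= ts < s + size]

# An event at t=12s with 10s windows:
assert tumbling_window(12.0, 10.0) == (10.0, 20.0)           # exactly one window
assert sliding_windows(12.0, 10.0, 5.0) == [(5.0, 15.0),     # two overlapping
                                            (10.0, 20.0)]    # windows cover it
```

Session windows cannot be computed per-event like this: their boundaries depend on the gap to neighboring events, which is why they need the stateful `withGap` machinery shown in the Flink example above.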

⚡ Spark Streaming vs Flink: Detailed Comparison

Aspect | Apache Spark | Apache Flink
Architecture | Micro-batching (DStreams) | True streaming (event-by-event)
Latency | Seconds (batch interval) | Sub-second (milliseconds)
Event Time | Supported (Spark 2.0+) | Native first-class citizen
Exactly-Once | With checkpointing | Guaranteed (distributed snapshots)
State Management | Limited (updateStateByKey) | Sophisticated (state backends)
Batch Integration | Native (same RDD/DataFrame) | Via separate batch API
SQL Support | Yes (Spark SQL) | Via Flink SQL
ML Integration | MLlib streaming | Via external systems
Maturity | Very mature (2013+) | Growing (2015+)

🎯 When to Choose What?

✅ Choose Spark Streaming When:
  • Need unified batch + streaming
  • Machine learning important
  • Latency > 1 second acceptable
  • Team knows Spark well
  • SQL querying needed
  • Complex batch transformations
✅ Choose Flink When:
  • Sub-second latency critical
  • Complex event processing
  • Sophisticated state management
  • Event time semantics crucial
  • Dedicated streaming job
  • Complex windowing logic

🚀 Deployment Strategies & Integration with NoSQL

Lambda Architecture: Batch + Real-Time
Data Source

├─→ Speed Layer (Flink/Spark Streaming) → Real-time View (Redis/MongoDB)
├─→ Batch Layer (Spark Batch) → Batch View (HBase/Cassandra)

Serving Layer (Merge results from both)
Example: Real-Time Analytics Pipeline
// Kafka → Flink → Redis (real-time)
env.addSource(KafkaConsumer())
  .map(event => aggregate(event))
  .addSink(RedisSink("localhost:6379"))

// HDFS → Spark → Cassandra (batch)
spark.read.parquet("hdfs://data/raw")
  .groupBy("metric")
  .agg(sum("value"))
  .write
  .format("org.apache.spark.sql.cassandra")
  .save()

// Query layer merges both
SELECT * FROM redis_metrics
UNION ALL
SELECT * FROM cassandra_aggregates
ORDER BY timestamp DESC
      

🔬 Section 6: Research Frontiers & Future Directions

🎯 Priority Research Areas 2024-2025

1. 🤖 AI-Powered Optimization
  • ML for query optimization
  • Self-tuning databases
  • Adaptive indexing strategies
  • ML-based sharding policies
2. 📊 Hybrid Architectures
  • OLTP + OLAP fusion
  • HTAP systems
  • Real-time analytics
  • Lambda/Kappa architectures
3. 🔐 End-to-End Encryption
  • Homomorphic encryption
  • Query on encrypted data
  • Zero-knowledge proofs
  • Searchable encryption
4. 🌍 Edge & Fog Computing
  • Distributed processing near data
  • IoT database systems
  • Lightweight embedded NoSQL
  • Bandwidth optimization
5. ⛓️ Blockchain Integration
  • Smart contracts for data
  • Decentralized databases
  • Immutable audit trails
  • Distributed consensus
6. 🌿 Green Data Systems
  • Carbon-neutral computing
  • Energy-aware storage
  • Hardware optimization
  • Renewable-powered DCs

❓ Open Challenges in Big Data

⚙️ Technical Challenges
Consistency vs Scalability

CAP theorem limits - find optimal tradeoffs

Operational Complexity

Managing 1000+ nodes with automation

Query Optimization

Distributed query planning at scale

🛡️ Security & Privacy
Data Privacy at Scale

Anonymization without utility loss

Secure Computation

Processing without decryption

Compliance Automation

Automated GDPR/CCPA compliance

🔮 Emerging Trends & Technologies

Trend | Description | Timeline
Vector Databases | Native support for embeddings & similarity search | 2024-2025
AI-Optimized Hardware | Specialized accelerators for ML workloads | 2025+
Serverless Data | Managed, auto-scaling, no infrastructure to run | 2025+
Quantum-Safe Encryption | Post-quantum cryptography for data security | 2026+

🎓 Course Finished!
