NoSQL Databases

From SQL Limitations to Big Data Excellence

📚 Comprehensive
🎨 Highly Visual
💡 Practical Examples
⚙️ Deep Technical

📘

Part 1

Foundations & History

  • SQL limitations exposed
  • Birth of NoSQL movement
  • Evolution and timeline
⚙️

Part 2

Core Concepts

  • CAP theorem deep dive
  • ACID vs BASE
  • Consistency models
🗂️

Part 3

Four Database Families

  • Key-Value stores
  • Document databases
  • Column-family stores
🕸️

Part 4

Graph Databases

  • Neo4j deep dive
  • Graph theory basics
  • Real-world graphs

Part 1: Foundations & The NoSQL Revolution

1️⃣ 1.1: The SQL Database Era (1970s-2000s)

📊 What Was the SQL World Like?

📋
1970

SQL Invented

💾
1980s-90s

SQL Dominance

🚀
2000s

Internet Scale

🔥
2007-2008

NoSQL Emerges

🌐
2010+

Modern Era

✅ Why SQL Was Perfect for 50 Years
🔐
ACID Transactions

Guaranteed consistency, perfect for banking and accounting

🧮
Complex Queries

SQL's power for multi-table analysis and reporting

📐
Structured Data

Fixed schema ensures data consistency and integrity

🏭
Enterprise Grade

Mature tools, wide adoption, standardized

2️⃣ 1.2: When SQL Started Breaking (The Problem)

📈 Problem 1: SCALE

The Numbers Grew
  • 2000: Google processes billions of pages
  • 2006: Facebook hits 100M users
  • 2010: Twitter handles millions of tweets/day
  • 2020: Billions of users globally

💥 SQL's Response: "Buy bigger servers" → Vertical scaling only → Exponential costs

Example: A large SQL database might cost $50K for a server. Double the data? Now you need a $100K+ server - and eventually you hit the hard limits of a single machine.

🔧 Problem 2: FLEXIBILITY

Requirements Changed Fast
  • Need to add new fields? ALTER TABLE locks the DB
  • Schema changes = downtime = lost revenue
  • Different users need different fields
  • Startup agility killed by rigid schemas

💥 SQL's Response: "Plan better" → Slow to change → Can't innovate

Example: Twitter wants to add emoji support. Add new column? 3-hour downtime. Millions of users can't tweet. Stock price drops.

🚨 The "Perfect Storm" of Problems

💾
Storage Explosion

Data grows beyond what a single server can store and serve. Need distributed storage across many machines.

Performance Degradation

Complex JOINs on massive tables = queries taking minutes. Users leave.

🎨
Schema Rigidity

Startups can't pivot. Every change = database migration = expensive, time-consuming.

💸
Cost Explosion

Vertical scaling becomes prohibitively expensive. $100K+ per year for database servers.

3️⃣ 1.3: The Birth of NoSQL (2007-2009)

🔥 The NoSQL Catalyst Events

🏠 Google BigTable (2006)

Google publishes paper on their distributed database handling petabytes of data. Revolutionizes thinking about databases at scale.

Impact: Shows scale is possible without traditional SQL

🛍️ Amazon Dynamo (2007)

Amazon publishes its "Dynamo" paper - a highly available, scalable distributed key-value store built on consistent hashing, vector clocks, and quorum replication. (DynamoDB, AWS's managed service, followed in 2012.)

Impact: Practical key-value store design patterns

🍃 MongoDB Launched (2009)

First popular document-oriented database. JSON-like documents attract developers coming from dynamic languages. Easy to learn.

Impact: Developers love it. NoSQL becomes mainstream

🔗 Apache Cassandra / HBase (2008-2009)

Open-source implementations of distributed databases: Cassandra from Facebook (combining Dynamo's distribution with BigTable's data model), HBase modeled on BigTable. They democratized web-scale technology.

Impact: Everyone can build at scale now

💡 Why These Solutions Won

Extreme Performance

Optimized for specific access patterns. Sacrifice generality for blazing speed on common operations.

📈
Horizontal Scalability

Add more cheap servers. Distribute data across cluster. Linear cost scaling instead of exponential.

🔧
Flexible Schema

No schema enforcement. Applications define structure dynamically. Perfect for agile development.

💰
Cost Effective

Run on cheap commodity hardware. Open source options free. No licensing fees to pay.

Part 2: Core Concepts - Understanding the Trade-offs

⚖️ 2.1: The CAP Theorem - The Fundamental Trade-off

🎯 Eric Brewer's CAP Theorem (2000)

"In any distributed system, you can guarantee only 2 of these 3 properties:"

[Diagram: the CAP triangle - Consistency, Availability, Partition Tolerance - pick 2 of 3]
C Consistency

Definition: All nodes see the same data at the same time. No stale reads.

  • Every read returns latest write
  • ACID database like PostgreSQL
  • Strict synchronization
  • Can be slow when replicas far apart
A Availability

Definition: System is always up and responding. Every request gets a response.

  • 99.99% uptime guarantee
  • Always returns data (even if stale)
  • No timeouts or errors
  • Prioritizes responsiveness
P Partition Tolerance

Definition: System tolerates network failures. Works even if nodes can't communicate.

  • Survives network splits
  • Distributed across regions
  • No single point of failure
  • Modern internet requires this

⚠️ Modern Truth: You MUST have Partition Tolerance in distributed systems (networks fail). So the choice is really Consistency OR Availability

🔍 CAP Theorem in Real Systems

🔒 CP Systems (Consistency + Partition)

Trade-off: Can't guarantee availability during network partition

Examples
  • 🔷 PostgreSQL + Replication
  • 🔷 MongoDB with strong consistency
  • 🔷 Traditional SQL databases
  • 🔷 HBase (configured for consistency)
Use When:

✅ Data accuracy is critical (banking, inventory)
✅ Network is reliable (single datacenter)
✅ Downtime acceptable during a partition

⚡ AP Systems (Availability + Partition)

Trade-off: Can't guarantee consistency during network partition

Examples
  • 🔶 Cassandra
  • 🔶 DynamoDB
  • 🔶 Redis (eventually consistent)
  • 🔶 Riak, Memcached
Use When:

✅ Uptime is critical (social media)
✅ Can tolerate stale data temporarily
✅ Global distribution required
✅ Downtime = lost revenue
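The trade-off can be seen in a toy simulation: two single-node "replicas", one CP-style and one AP-style, reacting to a network partition (an illustrative sketch, not any real database's protocol):

```python
class Replica:
    """Toy replica: 'cp' mode rejects writes during a partition, 'ap' accepts them."""
    def __init__(self, mode):
        self.mode = mode          # 'cp' or 'ap'
        self.data = {}
        self.partitioned = False  # True = cannot reach its peer

    def write(self, key, value):
        if self.partitioned and self.mode == "cp":
            return False          # CP: refuse rather than risk inconsistency
        self.data[key] = value    # AP: accept; replicas may now diverge
        return True

# Healthy cluster: both modes accept writes.
cp, ap = Replica("cp"), Replica("ap")
assert cp.write("x", 1) and ap.write("x", 1)

# Network partition: CP sacrifices availability, AP sacrifices consistency.
cp.partitioned = ap.partitioned = True
print(cp.write("x", 2))  # False -> write rejected (unavailable but consistent)
print(ap.write("x", 2))  # True  -> write accepted (available but divergent)
```

The same key insight as the tables above: once the partition exists, a node can either answer (and risk being wrong) or refuse (and be unavailable).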

🧪 2.2: ACID vs BASE - Consistency Models

Two Philosophies of Data Integrity

🔐 ACID Properties (Traditional Databases)
Property Meaning Example
A
Atomicity
All or nothing. Either all operations complete or none do. Transfer $100: subtract from Account A, add to Account B. Both happen or neither.
C
Consistency
Data moves from one valid state to another. Invariants maintained. Total money in system always same before & after transfer
I
Isolation
Concurrent transactions don't see partial results of other transactions. Reading Account A during transfer sees either $100 or $0, never $50
D
Durability
Once committed, data stays committed even if server crashes. Server crashes after "COMMIT" - data is still there on restart
⚡ BASE Properties (Modern Databases)
Property Meaning Example
BA
Basically Available
System responds even during partial failures. Tries to serve requests even in degraded state. 3 replicas, one dies. System still serves from other two.
S
Soft State
State may change without input. Replicas may be inconsistent temporarily. Write to replica A, read from replica B immediately = stale read possible
E
Eventual Consistency
Given enough time without new updates, all replicas converge to same state. Post on Twitter: some friends see it immediately, others after 1-2 seconds. All eventually see it.
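Eventual consistency can be sketched with a toy last-write-wins merge: each replica stores (timestamp, value) pairs and gossips with peers until all copies agree (illustrative only; real systems use vector clocks, Merkle trees, and anti-entropy protocols):

```python
# Toy last-write-wins convergence across three replicas.
def merge(a, b):
    """Pairwise merge: for each key keep the (timestamp, value) with the newest timestamp."""
    for key in set(a) | set(b):
        newest = max(a.get(key, (0, None)), b.get(key, (0, None)))
        a[key] = b[key] = newest

r1 = {"post": (1, "draft")}
r2 = {"post": (2, "published")}   # a later write landed only on r2
r3 = {}

# Gossip rounds: after enough pairwise merges every replica converges.
for x, y in [(r1, r2), (r2, r3), (r1, r3)]:
    merge(x, y)

print(r1["post"])  # (2, 'published') on every replica
```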

📊 2.3: Consistency Levels - A Spectrum

Consistency isn't binary. It's a spectrum from strongest to weakest. Different databases offer different levels.

1
Strong

All reads see latest write

2
Causal

Preserve causal relationships

3
Session

Consistent within same session

4
Weak

No immediate guarantees

5
Eventual

Converges eventually, timeline unclear

🔒 Strong Consistency

Every read returns most recent write.

  • Used: PostgreSQL, MySQL
  • Cost: Slower writes
  • Best for: Banking, financial data
🧬 Causal Consistency

Causally related events seen in order.

  • Used: Some NoSQL systems
  • Cost: More coordination overhead
  • Best for: Social media posts & comments
👤 Session Consistency

User's own writes always visible within their session.

  • Used: DynamoDB, Cassandra (when configured for it)
  • Cost: Session tracking
  • Best for: Web applications
📉 Weak Consistency

No immediate consistency guarantees.

  • Used: Cache systems, memcached
  • Cost: Application must handle
  • Best for: Caching, performance
🌊 Eventual Consistency

Eventually converges, timeline unclear.

  • Used: Cassandra, Riak, DNS
  • Cost: Acceptable delay
  • Best for: Distributed systems, global scale
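Session (read-your-writes) consistency from the list above can be sketched as a client that remembers the version of its last write and rejects reads from replicas that lag behind it (names and structure are illustrative, not a real client API):

```python
# Read-your-writes sketch: the session tracks the highest version it has
# written or read, and treats any older replica as stale FOR THIS SESSION.
class Session:
    def __init__(self):
        self.last_seen = 0   # highest version this session has observed

    def read(self, replica):
        version, value = replica       # a replica is modeled as (version, value)
        if version < self.last_seen:
            return None                # stale for this session: retry elsewhere
        self.last_seen = version
        return value

s = Session()
s.last_seen = 5              # the session just wrote version 5
print(s.read((3, "old")))    # None  -> lagging replica rejected
print(s.read((5, "new")))    # 'new' -> replica has caught up
```

Other sessions reading the lagging replica would happily accept version 3 - the guarantee is per-session, not global.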

Part 3: The Four NoSQL Database Families

NoSQL databases are divided into 4 distinct families, each optimized for different data structures and access patterns. Understanding each family is crucial for choosing the right tool.

🔑

Key-Value Stores

Simple hash maps at scale

  • Examples: Redis, Memcached, DynamoDB
  • Perfect for: Caching, sessions, counters
  • Speed: Ultra-fast lookups
📄

Document Databases

JSON/BSON documents as first-class citizens

  • Examples: MongoDB, CouchDB, Firebase
  • Perfect for: Web/mobile apps, content management
  • Flexibility: Dynamic schemas
📊

Column-Family Stores

Wide-column distributed tables

  • Examples: Cassandra, HBase
  • Perfect for: Time-series, analytics
  • Scale: Petabytes across clusters
🕸️

Graph Databases

Relationships as first-class data

  • Examples: Neo4j, ArangoDB
  • Perfect for: Social networks, recommendations
  • Power: Relationship traversals

🔑 Family 1: Key-Value Stores

📖 Core Concept

The simplest NoSQL model: key → value mapping. Like a distributed hash table or dictionary. Access data by exact key lookup in O(1) time.

Data Model Visualization
Key: "user:1001"
Value: {"name": "Amine", "email": "amine@example.com", "age": 28}
Key: "session:xyz789"
Value: {"userId": 1001, "loginTime": "2025-10-21T20:00:00Z", "ip": "192.168.1.1"}
Key: "product:5678"
Value: {"name": "Laptop", "price": 999.99, "stock": 50}
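The key → value model above is essentially a hash map. A minimal sketch with O(1) lookups and Redis-style lazy TTL expiration (illustrative, not production code):

```python
import time

class KVStore:
    """Toy key-value store: dict lookup plus optional per-key TTL."""
    def __init__(self):
        self._data = {}                  # key -> (value, expires_at or None)

    def set(self, key, value, ttl=None):
        expires = time.time() + ttl if ttl else None
        self._data[key] = (value, expires)

    def get(self, key):
        item = self._data.get(key)       # O(1) hash lookup
        if item is None:
            return None
        value, expires = item
        if expires is not None and time.time() >= expires:
            del self._data[key]          # lazy expiration, like a cache
            return None
        return value

db = KVStore()
db.set("user:1001", {"name": "Amine", "age": 28})
print(db.get("user:1001")["name"])   # Amine
```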

🚀 DEEP DIVE: Redis - The King of Key-Value

What is Redis?
  • In-memory: Data stored in RAM for ultra-fast access
  • Persistent: Can dump to disk for durability
  • Single-threaded: No concurrency issues, atomic operations
  • Rich data types: Strings, lists, sets, hashes, sorted sets
  • Pub/Sub: Messaging capabilities built-in
Common Use Cases
  • 💾 Caching: Database query results
  • 👤 Sessions: User login information
  • 🏆 Leaderboards: Real-time rankings
  • 📊 Counters: Page views, likes
  • 🔔 Notifications: Message queues
  • 🌊 Rate limiting: API throttling
// STRING operations
SET mykey "Hello"        → Store string
GET mykey                → "Hello"
APPEND mykey " World"   → "Hello World"
STRLEN mykey             → 11

// COUNTER (atomic increment)
INCR page:views          → 1
INCR page:views          → 2
INCRBY page:views 10    → 12

// LIST operations (queue/stack)
LPUSH queue task1        → Add to left
RPUSH queue task2        → Add to right
LRANGE queue 0 -1     → [task1, task2]
LPOP queue               → task1 (remove & return)

// SET operations (unique values)
SADD users:online amine   → Add member
SADD users:online ahmed     → Add member
SMEMBERS users:online    → {amine, ahmed}
SISMEMBER users:online amine → true

// SORTED SET (leaderboard)
ZADD leaderboard 100 amine  → amine: 100 points
ZADD leaderboard 200 ahmed  → ahmed: 200 points
ZREVRANGE leaderboard 0 1 WITHSCORES
→ ahmed (200), amine (100)

// HASH operations (objects)
HSET user:1001 name amine   → Set field
HSET user:1001 age 28     → Set field
HGETALL user:1001         → {name: amine, age: 28}

// EXPIRATION (key disappears after timeout)
SETEX temp_data 3600 "value"  → Expires in 1 hour
TTL temp_data            → 3599 (seconds remaining)
        
⚡ Performance Characteristics

Read/Write Latency:
<1 millisecond per operation

Throughput:
100,000+ ops/sec per core

Data Size:
Limited by available RAM

Persistence:
RDB snapshots or AOF logs

Memcached
  • Pure caching: No persistence
  • Simple: Get/set/delete only
  • Distributed: Consistent hashing
  • Use: Database query cache
  • TTL: Auto-expire old items
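The consistent hashing mentioned above can be sketched as a hash ring with virtual nodes: each key maps to the first server clockwise from its hash, so adding or removing a server remaps only a fraction of keys (a simplified sketch of the technique, not Memcached's actual client code):

```python
import hashlib, bisect

class HashRing:
    """Minimal consistent-hash ring; vnodes smooth the key distribution."""
    def __init__(self, nodes, vnodes=100):
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        # First ring position clockwise from the key's hash (wrapping around).
        i = bisect.bisect(self._keys, self._hash(key)) % len(self._ring)
        return self._ring[i][1]

ring = HashRing(["cache-a", "cache-b", "cache-c"])
print(ring.node_for("user:1001"))   # same key always maps to the same node
```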
AWS DynamoDB
  • Serverless: Fully managed by AWS
  • Scalable: Unlimited capacity
  • Global: Multi-region replication
  • Features: Indexes, streams
  • Cost: Pay-per-request or provisioned

📄 Family 2: Document Databases

📖 Core Concept

Store semi-structured data as JSON/BSON documents. Each document can have different fields. Collections group related documents. Natural fit for object-oriented programming.

Document Structure Example
{
  "_id": ObjectId(),
  "name": "Amine",
  "email": "amine@example.com",
  "age": 28,
  "address": {
    "street": "12 cité St",
    "city": "Saida",
    "country": "Algeria"
  },
  "hobbies": ["reading", "coding", "hiking"],
  "createdAt": ISODate("2025-10-21T20:00:00Z")
}

🍃 DEEP DIVE: MongoDB - Most Popular Document DB

Key Features
  • Flexible schema: Add fields dynamically
  • Powerful queries: Rich query language
  • Indexing: B-tree indexes for speed
  • Aggregation: Pipeline processing
  • Transactions: ACID per document; multi-document transactions since 4.0
  • Replication: Replica sets built-in
Perfect For
  • 📱 Web apps: Rapid iteration
  • 📄 Content systems: Blog posts, articles
  • 🛒 E-commerce: Products, orders
  • 🔔 Real-time feeds: Social networks
  • 📊 Analytics: Event tracking
  • 🗃️ Data aggregation: Heterogeneous data
// CREATE (Insert)
db.users.insertOne({
  name: "Amine",
  email: "amine@example.com",
  age: 28
})

// READ (Query)
db.users.findOne({ name: "Amine" })
db.users.find({ age: { $gte: 25 } }) // age >= 25
db.users.find({ hobbies: "coding" })  // contains value

// UPDATE
db.users.updateOne(
  { name: "Amine" },
  { $set: { age: 29 } }
)

// DELETE
db.users.deleteOne({ name: "Amine" })

// AGGREGATION (complex queries)
db.users.aggregate([
  { $match: { age: { $gte: 25 } } },
  { $group: { _id: null, avgAge: { $avg: "$age" } } },
  { $sort: { avgAge: -1 } }
])

// CREATE INDEX for fast queries
db.users.createIndex({ email: 1 })
db.users.createIndex({ age: 1, city: 1 })  // compound
        
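The aggregation pipeline above, re-expressed in plain Python to show what $match, $group and $avg actually compute (the users list is made-up sample data):

```python
users = [
    {"name": "Amine", "age": 28},
    {"name": "Ahmed", "age": 32},
    {"name": "Mohamed", "age": 25},
]

# $match: { age: { $gte: 25 } }  -> filter the documents
matched = [u for u in users if u["age"] >= 25]

# $group: { _id: null, avgAge: { $avg: "$age" } }  -> one group, one average
avg_age = sum(u["age"] for u in matched) / len(matched)

print(avg_age)   # ≈ 28.33
```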

📊 Family 3: Column-Family Stores

📖 Core Concept

Organize data by columns and column families instead of fixed rows. Optimized for analytics and time-series workloads: similar values compress well, and queries read only the columns they need. Scales to petabytes across thousands of servers.

Traditional Row Storage
Row 1: Amine   | 28 | Saida
Row 2: Ahmed   | 32 | Oran
Row 3: Mohamed | 25 | Alger

Query: Get all names
→ Scan all rows & columns
          
Column-Family Storage
Names:  Amine, Ahmed, Mohamed
Ages:   28, 32, 25
Cities: Saida, Oran, Alger

Query: Get all names
→ Read only names column
          
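The layout difference above in plain Python: collecting all names from the row layout touches every record, while the columnar layout is a single contiguous read:

```python
# Row-oriented: each record stores all its fields together.
rows = [
    ("Amine", 28, "Saida"),
    ("Ahmed", 32, "Oran"),
    ("Mohamed", 25, "Alger"),
]

# Column-oriented: each column is stored contiguously.
columns = {
    "name": ["Amine", "Ahmed", "Mohamed"],
    "age": [28, 32, 25],
    "city": ["Saida", "Oran", "Alger"],
}

names_from_rows = [r[0] for r in rows]   # scans every row
names_from_columns = columns["name"]     # one contiguous read

print(names_from_rows == names_from_columns)  # True
```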

🔗 DEEP DIVE: Apache Cassandra

Key Characteristics
  • Distributed: Data spread across many servers
  • Highly Available: No single point of failure
  • Fault-tolerant: Survives node failures
  • Scalable: Linear scaling with nodes
  • Fast writes: Optimized for write-heavy
  • Eventual consistency: BASE model
Architecture Visualization
🖥️ Node 1 (Keyspace: users)
↕️ Ring topology (gossip protocol)
🖥️ Node 2 (Keyspace: users)
↕️ Replication Factor = 3
🖥️ Node 3 (Keyspace: users)
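Replica placement on the ring can be sketched as: hash the partition key onto the ring, then walk clockwise collecting RF distinct nodes (a simplification; real Cassandra uses per-node tokens plus rack- and datacenter-aware strategies):

```python
import hashlib

def replicas(key, nodes, rf=3):
    """Return the rf nodes responsible for key, Cassandra-ring style."""
    # Each node owns one position on the ring (real clusters use many tokens).
    ring = sorted((int(hashlib.md5(n.encode()).hexdigest(), 16), n) for n in nodes)
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    # First node clockwise from the key's hash, wrapping past the ring's end.
    start = next((i for i, (token, _) in enumerate(ring) if token >= h), 0)
    return [ring[(start + i) % len(ring)][1] for i in range(rf)]

nodes = ["node1", "node2", "node3", "node4"]
print(replicas("user:1001", nodes))   # 3 distinct nodes hold this partition
```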
Perfect For
  • 📊 Time-series: Metrics, logs
  • 📈 Analytics: Aggregate data
  • 🌍 Global scale: Multi-region
  • 📝 Immutable data: Append-only
Companies Using
  • 📱 Netflix (billions of events)
  • 📱 Uber (location tracking)
  • 📱 Apple (music history)
  • 📱 Instagram (feeds)

🕸️ Family 4: Graph Databases

📖 Core Concept

Store data as nodes (entities) and relationships (edges). Relationships are first-class citizens, not afterthoughts. Query relationships instantly without expensive JOINs. Perfect for connected data.

Social Network Graph Example
[Graph diagram: Amine, Ahmed and Moh linked by FRIENDS, KNOWS and WORKS_WITH edges]

Nodes: Amine, Ahmed, Moh (people)
Relationships: FRIENDS, KNOWS, WORKS_WITH (connections with properties)
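The same graph as an adjacency list, with breadth-first search computing hop distance - the traversal underlying shortest-path and friends-of-friends queries (the edges here are illustrative):

```python
from collections import deque

graph = {
    "Amine": ["Ahmed"],
    "Ahmed": ["Amine", "Moh"],
    "Moh": ["Ahmed"],
}

def hops(start, goal):
    """Breadth-first search: number of relationship hops between two people."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None   # no path

print(hops("Amine", "Moh"))   # 2 (Amine -> Ahmed -> Moh)
```

A graph database keeps these adjacency lists on disk per node, which is why traversals avoid the table-wide JOINs a relational database would need.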

⚡ DEEP DIVE: Neo4j - The Graph Database

Key Features
  • ACID transactions: Full consistency
  • Cypher query language: Intuitive, readable
  • Property graphs: Nodes and edges have properties
  • Indexes: Fast node and relationship lookup
  • Clustering: High availability
  • Real-time: Instant relationship queries
Perfect For
  • 🤝 Social networks: Friends, followers
  • 📍 Recommendations: Similar users/products
  • 🔐 Fraud detection: Suspicious patterns
  • 🗺️ Route planning: Shortest paths
  • 🏢 Org structures: Hierarchies
  • 📊 Knowledge graphs: Connected facts
Cypher Query Language Examples
// CREATE nodes
CREATE (amine:Person { name: 'Amine', age: 28 })
CREATE (ahmed:Person { name: 'Ahmed', age: 32 })

// CREATE relationships
MATCH (amine:Person {name: 'Amine'}), (ahmed:Person {name: 'Ahmed'})
CREATE (amine)-[:FRIENDS_WITH {since: 2020}]->(ahmed)

// QUERY: Find all friends of Amine
MATCH (amine:Person {name: 'Amine'})-[:FRIENDS_WITH]->(friend)
RETURN friend.name

// QUERY: Find friends of friends (2 hops)
MATCH (amine:Person {name: 'Amine'})-[:FRIENDS_WITH*2]->(friendOfFriend)
RETURN friendOfFriend.name

// QUERY: Find shortest path between two people
MATCH path=shortestPath(
  (amine:Person {name: 'Amine'})-[*]->(moh:Person {name: 'Moh'})
)
RETURN path

// QUERY: Recommendation engine - people who like what Amine likes
MATCH (amine:Person {name: 'Amine'})-[:LIKES]->(movie)<-[:LIKES]-(person)
WHERE person.name <> 'Amine'
RETURN person.name, count(*) as common_likes
ORDER BY common_likes DESC

// UPDATE relationship
MATCH (amine)-[r:FRIENDS_WITH]-(ahmed)
SET r.strength = 9

// DELETE
MATCH (amine)-[r:FRIENDS_WITH]-(ahmed)
DELETE r
      
Graph Algorithms for Advanced Analysis
PageRank

What: Importance of nodes by incoming relationships
Use: Google search ranking algorithm
Example: Which person is most connected?

Shortest Path

What: Quickest route between nodes
Use: Navigation, social connections
Example: How many steps from Amine to Moh?

Community Detection

What: Groups of tightly connected nodes
Use: Social groups, clusters
Example: Which friends hang out together?

Centrality

What: Most important nodes in network
Use: Influencers, bottlenecks
Example: Who's the connector between groups?
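PageRank from the list above can be shown with a tiny power iteration over the same three-person graph (damping factor 0.85 as in the original algorithm; the graph and iteration count are illustrative):

```python
edges = {"Amine": ["Ahmed"], "Ahmed": ["Amine", "Moh"], "Moh": ["Ahmed"]}
rank = {n: 1 / len(edges) for n in edges}   # start with equal rank
damping = 0.85

for _ in range(50):                          # iterate until ranks settle
    new = {n: (1 - damping) / len(edges) for n in edges}
    for node, outs in edges.items():
        for out in outs:                     # spread rank along outgoing edges
            new[out] += damping * rank[node] / len(outs)
    rank = new

print(max(rank, key=rank.get))   # Ahmed - everyone links to the middle node
```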

Part 4: Real-World Applications & Decision Framework

🏆 4.1: Real-World Case Studies

📺

Netflix

Problem

Recommend movies to 230M+ users. Need instant recommendations from massive dataset.

Solution Architecture
  • Cassandra: Store user viewing history (petabytes)
  • Spark: Batch compute recommendation algorithms
  • Redis: Cache hot recommendations
  • Elasticsearch: Search for content
Result

80% of watched content from recommendations = $1B+ annual savings

🚗

Uber

Problem

Match 15M daily trips instantly across 70+ countries. Real-time pricing and ETA.

Solution Architecture
  • PostgreSQL: Trip data, transactions
  • Redis: Real-time driver locations
  • HBase: Historical data warehouse
  • Neo4j: City network graphs for routing
Result

40% faster matchmaking, 15% efficiency increase, millions daily

💼

LinkedIn

Problem

Store 930M+ profiles with complex relationships. Find connections instantly.

Solution Architecture
  • Espresso (custom): Distributed document store
  • Kafka: Real-time activity streams
  • Voldemort: Key-value cache layer
  • Graph DB: Connection recommendations
Result

Sub-100ms latency for millions of searches

🎯 4.2: Database Selection Decision Framework

🤔 Ask These Questions

1 How much data?
  • GB → PostgreSQL fine
  • TB → Consider sharding
  • PB → NoSQL needed
  • Global → Distributed required
2 Data consistency?
  • Critical → SQL (ACID)
  • Important → NoSQL + logic
  • Loose → NoSQL (BASE)
  • Cache → Redis
3 Query patterns?
  • Complex → SQL
  • Key lookups → Key-Value
  • JSON objects → Document
  • Relationships → Graph
  • Time-series → Column-Family
4 Latency requirements?
  • <10ms → Redis/Memory
  • <100ms → NoSQL
  • <1s → SQL acceptable
  • Batch → Any (optimize later)
📊 Quick Decision Tree
START HERE: Are data relationships important?
  • NO relationships → Lookup by key?
      • YES → Redis / Memcached
      • NO → MongoDB / DynamoDB
  • YES, relationships → Are transactions critical?
      • YES → PostgreSQL / MySQL
      • NO → Neo4j
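The decision tree above as a small function - a rough heuristic mirroring the chart, not a substitute for real capacity planning:

```python
def pick_database(relationships, transactions=False, key_lookup=False):
    """Rough heuristic mirroring the decision tree above."""
    if relationships:
        # Relationships matter: ACID-critical -> relational, else graph.
        return "PostgreSQL/MySQL" if transactions else "Neo4j"
    # No relationships: exact-key access -> key-value, else document store.
    return "Redis/Memcached" if key_lookup else "MongoDB/DynamoDB"

print(pick_database(relationships=False, key_lookup=True))   # Redis/Memcached
print(pick_database(relationships=True, transactions=True))  # PostgreSQL/MySQL
```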

📋 Complete Database Comparison

Database   | Type          | Best For                       | Consistency | Scale              | Latency
PostgreSQL | SQL           | Complex queries, ACID          | 🟢 Strong   | TB (with sharding) | 10-100ms
Redis      | Key-Value     | Caching, sessions              | 🟡 Weak     | GB (RAM)           | <1ms
MongoDB    | Document      | Web apps, rapid dev            | 🟢 Strong   | TB+ (sharded)      | 1-10ms
Cassandra  | Column-Family | Time-series, analytics         | 🟡 Eventual | PB+ (unlimited)    | 1-10ms
Neo4j      | Graph         | Relationships, recommendations | 🟢 ACID     | TB (relationships) | 1-100ms

Part 5: Hands-On Labs & Exercises

🔬 Practical Exercises

Lab 1: Build a Caching Layer

Objective: Implement Redis caching for a blog API to reduce database hits by 90%

  • Create API endpoint that fetches blog posts
  • Check Redis cache first
  • If miss, query database and cache result (5 min TTL)
  • Measure improvement in response time
  • Implement cache invalidation on post update
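A cache-aside skeleton for this lab, with a plain dict standing in for Redis and `fake_db` for the real database (both are hypothetical stand-ins):

```python
import time

fake_db = {"post:1": "Hello NoSQL"}   # stand-in for the blog database
cache = {}                            # key -> (value, expires_at); stand-in for Redis
TTL = 300                             # 5-minute TTL, as in the lab

def get_post(key):
    hit = cache.get(key)
    if hit and hit[1] > time.time():
        return hit[0]                        # cache hit: no database access
    value = fake_db[key]                     # cache miss: query the database
    cache[key] = (value, time.time() + TTL)  # populate the cache with a TTL
    return value

def update_post(key, value):
    fake_db[key] = value
    cache.pop(key, None)                     # invalidate so readers see fresh data

print(get_post("post:1"))   # "Hello NoSQL" (miss -> database)
print(get_post("post:1"))   # "Hello NoSQL" (hit -> cache)
update_post("post:1", "Edited")
print(get_post("post:1"))   # "Edited" (invalidation forced a fresh read)
```

Swapping the dict for a real Redis client (`GET`/`SETEX`/`DEL`) keeps the same shape; measuring hit rate before and after completes the lab.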

Lab 2: Design MongoDB Schema

Objective: Model an e-commerce application with flexible product data

  • Design collections for Products, Orders, Users
  • Handle varying product attributes (book ≠ laptop)
  • Create indexes for common queries
  • Write aggregation pipeline for bestsellers
  • Load 1M+ products and measure performance

Lab 3: Build a Social Graph

Objective: Create Neo4j social network with recommendations

  • Create Person nodes with profiles
  • Create FRIENDS relationships
  • Find shortest path between users
  • Implement "friends of friends" feature
  • Write recommendation query for new connections
Last modified: Tuesday, 21 October 2025, 10:14 PM