Big Data Management and Analysis
Mini-Projects: NoSQL, Big Data & Machine Learning
🚀 Mini-Projects: Master 2
9 Projects Combining NoSQL, Big Data & Machine Learning
Choose ONE project • Work in groups • Combine NoSQL + Spark + ML • Present results
📋 Project Overview
📊 What is a Mini-Project?
A comprehensive hands-on project integrating NoSQL databases, distributed processing (Spark/MapReduce), and machine learning algorithms. Build real-world data pipelines!
👥 Team Requirements
Groups of 2-3 students. Pick ONE project from the 9 options. Each team works independently on their chosen problem.
⏱️ Timeline & Deliverables
Deliver: Working code, documentation, presentation, and live demo with results.
User Behavior Clustering
Technologies: MongoDB, PySpark, MLlib
🎯 Objective
Collect user activity logs, preprocess with Spark, and cluster users using KMeans to identify behavioral segments.
📚 Skills You'll Learn
- MongoDB aggregation pipelines
- PySpark DataFrames & transformations
- Feature scaling and normalization
- KMeans clustering & evaluation
📊 Deliverables
- MongoDB database with 100K+ logs
- Spark pipeline notebook
- Cluster profiles & insights
- Visualization of segments
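The scaling and clustering steps of this project can be tried out on a handful of rows before moving to Spark. The following is a minimal plain-Python sketch of min-max scaling followed by KMeans (Lloyd's algorithm); it is not the MLlib implementation, and the two user features and the function names are hypothetical choices made for illustration.

```python
import math
import random

def min_max_scale(rows):
    """Rescale each column to [0, 1] so no single feature dominates distances."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm: alternate between assigning points and moving centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

# Hypothetical per-user features: (sessions per week, average minutes per session).
users = [(1, 5), (2, 6), (1, 4), (20, 45), (22, 50), (19, 48)]
labels, centroids = kmeans(min_max_scale(users), k=2)
print(labels)  # light users share one label, heavy users the other
```

In the real pipeline the same two steps would typically be expressed with Spark's own scaler and KMeans stages so they run on the full log collection.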
Product Recommendation Engine
Technologies: Neo4j, GraphFrames, ALS
🎯 Objective
Build a social network in Neo4j, extract it to Spark, and use collaborative filtering (ALS) to recommend products.
📚 Skills You'll Learn
- Cypher query language (Neo4j)
- GraphFrames in Spark
- ALS (Alternating Least Squares)
- Recommendation evaluation metrics
📊 Deliverables
- Neo4j graph with users/products
- Rating matrix & ALS model
- Top-N recommendations
- Accuracy metrics (RMSE, MAE)
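The two accuracy metrics named in the deliverables compare predicted ratings against held-out true ratings. A minimal sketch of both, assuming the ratings are available as two equal-length Python lists (the sample values are made up):

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: penalizes large misses more heavily."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )

def mae(actual, predicted):
    """Mean absolute error: average size of a miss, in rating units."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical held-out ratings and model predictions.
actual = [5.0, 3.0, 4.0, 1.0]
predicted = [4.0, 3.0, 5.0, 3.0]

print(mae(actual, predicted))   # 1.0
print(rmse(actual, predicted))  # about 1.2247
```

RMSE is higher than MAE here because the single 2-point miss is squared before averaging.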
Real-Time Sentiment Analysis
Technologies: Redis Streams, Spark Streaming, NLP
🎯 Objective
Stream tweets into Redis, process them in real time with Spark, and classify their sentiment using Logistic Regression.
📚 Skills You'll Learn
- Redis Streams for ingestion
- Spark Streaming architecture
- Text preprocessing (tokenization)
- Sentiment classification model
📊 Deliverables
- Twitter API integration
- Real-time streaming pipeline
- Sentiment prediction model
- Live dashboard with results
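Before any classifier sees a tweet, the raw text has to be turned into tokens. The sketch below shows one possible preprocessing function in plain Python; the stop-word list and the cleaning rules (dropping URLs, mentions and hashtags) are assumptions for illustration, not requirements of the project.

```python
import re

# Hypothetical stop-word list; a real project would use a fuller one.
STOP_WORDS = {"the", "a", "an", "is", "to", "and", "of", "this"}

def tokenize(text):
    """Lowercase a tweet, strip noise, and return the remaining word tokens."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"[@#]\w+", " ", text)       # drop mentions and hashtags
    tokens = re.findall(r"[a-z']+", text)      # keep alphabetic words only
    return [t for t in tokens if t not in STOP_WORDS]

print(tokenize("This phone is GREAT!! @shop https://t.co/x #happy"))
# ['phone', 'great']
```

Whether hashtags should be dropped or kept as features is a design choice worth testing, since they often carry sentiment.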
Fraud Detection System
Technologies: Cassandra, PySpark, Anomaly Detection
🎯 Objective
Store transactions in Cassandra, use Spark for ETL and feature engineering, and detect fraud with Isolation Forest.
📚 Skills You'll Learn
- Cassandra data modeling (CQL)
- Spark ETL pipelines
- Feature engineering techniques
- Isolation Forest algorithm
📊 Deliverables
- Cassandra transaction store
- Feature extraction notebook
- Anomaly detection model
- Fraud patterns analysis
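Feature engineering for this project usually means describing each transaction relative to the account's normal behavior. The sketch below computes one such feature, a per-account z-score of the amount, in plain Python. It is not Isolation Forest itself; it only illustrates the kind of column that could be fed to an anomaly detector, and the `(account, amount)` record layout is an assumption.

```python
from collections import defaultdict
from statistics import mean, pstdev

def amount_zscores(transactions):
    """Score each transaction by how far its amount sits from the account's mean."""
    by_account = defaultdict(list)
    for account, amount in transactions:
        by_account[account].append(amount)
    # Per-account mean and population standard deviation.
    stats = {acc: (mean(v), pstdev(v)) for acc, v in by_account.items()}
    scored = []
    for account, amount in transactions:
        mu, sigma = stats[account]
        z = (amount - mu) / sigma if sigma > 0 else 0.0
        scored.append((account, amount, z))
    return scored

# Hypothetical transactions: (account id, amount).
transactions = [
    ("A", 20), ("A", 25), ("A", 22), ("A", 900),
    ("B", 50), ("B", 55),
]
scored = amount_zscores(transactions)
most_unusual = max(scored, key=lambda row: row[2])
print(most_unusual[:2])  # ('A', 900)
```

In Spark the same feature would typically be computed with a grouped aggregation joined back to the transaction rows.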
IoT Device Forecasting
Technologies: MongoDB, Spark, Linear Regression
🎯 Objective
Simulate IoT sensors, store the time series in MongoDB, extract features with Spark, and forecast future sensor values.
📚 Skills You'll Learn
- Time-series data modeling
- Feature lag creation
- Linear regression in Spark
- Forecasting evaluation
📊 Deliverables
- IoT sensor simulator
- MongoDB time-series DB
- Regression model (MAE, RMSE)
- Forecast visualization
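Lag features turn a time series into a supervised learning table: each row holds the previous few readings as inputs and the next reading as the target. A minimal sketch, assuming the readings arrive as an ordered Python list (the function name and sample values are illustrative):

```python
def make_lag_features(series, n_lags):
    """Build (features, target) rows: the previous n_lags values predict the next one."""
    rows = []
    for t in range(n_lags, len(series)):
        rows.append((series[t - n_lags:t], series[t]))
    return rows

# Hypothetical temperature readings from one sensor, in time order.
readings = [20.0, 20.5, 21.0, 21.5, 22.0]
rows = make_lag_features(readings, n_lags=2)
for features, target in rows:
    print(features, "->", target)
# [20.0, 20.5] -> 21.0
# [20.5, 21.0] -> 21.5
# [21.0, 21.5] -> 22.0
```

The first `n_lags` readings produce no row because they lack a full history, so the training table is always shorter than the raw series.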
Search Engine for Catalog
Technologies: Elasticsearch, Spark, TF-IDF
🎯 Objective
Index a product catalog in Elasticsearch and implement relevance ranking with Spark (TF-IDF, word embeddings).
📚 Skills You'll Learn
- Elasticsearch indexing & queries
- TF-IDF vectorization
- Text similarity algorithms
- Ranking optimization
📊 Deliverables
- Elasticsearch index
- Search API endpoint
- Relevance ranking model
- Search performance metrics
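TF-IDF ranking can be understood on a three-item catalog before it is scaled up. The sketch below is a plain-Python illustration, not the Elasticsearch or Spark scoring function: it weights terms by a smoothed inverse document frequency (one common variant, chosen here as an assumption) and ranks products by cosine similarity to the query.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict) per document, plus the IDF table."""
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: rarer terms get larger weights.
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: (c / len(tokens)) * idf[t] for t, c in tf.items()})
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0

def search(query, docs):
    """Return indices of matching documents, best match first."""
    vectors, idf = tfidf_vectors(docs)
    q_tf = Counter(query.lower().split())
    q_vec = {t: c * idf[t] for t, c in q_tf.items() if t in idf}
    scores = [(cosine(q_vec, vec), i) for i, vec in enumerate(vectors)]
    return [i for score, i in sorted(scores, reverse=True) if score > 0]

# Hypothetical product titles.
catalog = ["red running shoes", "blue running jacket", "red leather wallet"]
print(search("red shoes", catalog))  # [0, 2]
```

The shoes rank first because they match both query terms, and the jacket is excluded because it shares no term with the query.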
📚 3 More Projects Available
🛒 Project 7: Market Basket Analysis
Load transactions into Cassandra, then use Spark to perform frequent pattern mining (FP-Growth) and surface frequently co-purchased items and association rules that can inform customer segmentation.
Skills: Spark MLlib FP-Growth, Cassandra modeling
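The notion of support that frequent pattern mining relies on can be shown with brute-force pair counting. The sketch below is deliberately not FP-Growth (which avoids enumerating candidates by building a prefix tree); it only illustrates what "frequent" means, using made-up baskets and a hypothetical `min_support` threshold.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    """Return item pairs whose support (share of baskets containing both) meets the threshold."""
    counts = Counter()
    for basket in baskets:
        # Sort and deduplicate so each pair is counted once per basket.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    n = len(baskets)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

# Hypothetical transactions.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "eggs"],
]
pairs = frequent_pairs(baskets, min_support=0.5)
print(pairs)  # {('bread', 'milk'): 0.6}
```

Brute-force counting grows quadratically with basket size, which is the reason to switch to FP-Growth on the full dataset.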
👑 Project 8: Social Influence Ranking
Store social interactions in Neo4j, use Spark GraphFrames/PageRank to find top influencers and visualize network subgraphs.
Skills: Cypher export, PageRank, graph visualization
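PageRank scores a user by the rank of the users who point to them. The sketch below is a small power-iteration version in plain Python, useful for checking intuition on a toy graph; it is not the GraphFrames implementation, and the damping factor, iteration count and edge list are illustrative assumptions.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node receives a small baseline share of rank.
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            # Nodes without outgoing links spread their rank over all nodes.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank

# Hypothetical "follows" edges: (follower, followed).
follows = [
    ("alice", "carol"), ("bob", "carol"),
    ("dave", "carol"), ("carol", "alice"),
]
ranks = pagerank(follows)
top = max(ranks, key=ranks.get)
print(top)  # carol
```

Alice ends up close behind Carol even with a single follower, because that follower is the highest-ranked node in the graph.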
📰 Project 9: News Classification
Ingest news streams into Redis, classify articles in real time with Spark (e.g., Naive Bayes), and build a live topic dashboard.
Skills: Streaming ETL, text classification, Redis Streams
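Naive Bayes classifies a text by combining how common each class is with how typical each word is for that class. The sketch below is a minimal multinomial version with add-one smoothing in plain Python; it stands in for the Spark classifier only as an illustration, and the four training headlines and the function names are made up.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """Count documents per class and words per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def predict(model, text):
    """Return the class with the highest log-probability for the text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocab:  # ignore words never seen in training
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled headlines.
training = [
    ("team wins the final match", "sports"),
    ("striker scores late goal", "sports"),
    ("markets fall as rates rise", "business"),
    ("bank raises interest rates", "business"),
]
model = train_naive_bayes(training)
print(predict(model, "late goal wins match"))  # sports
print(predict(model, "bank rates fall"))       # business
```

In a streaming setting the model would be trained offline on a labeled batch and only `predict` would run on each incoming article.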
✅ Project Requirements
Technical Requirements
- ✓ NoSQL Database: One of: MongoDB, Neo4j, Redis, Cassandra, or Elasticsearch
- ✓ Big Data Processing: PySpark, Spark, or MapReduce
- ✓ ML Component: MLlib, scikit-learn, or TensorFlow for prediction/clustering
- ✓ Minimum Data: 100,000+ records or realistic simulation
- ✓ Code Quality: Documented, tested, production-ready
Submission Deliverables
- 📝 Code Repository: GitHub with full source code
- 📄 Documentation: README, architecture, setup guide
- 📊 Analysis Report: Results, insights, performance metrics
- 🎬 Presentation: 15-20 min slides + live demo
🚀 Getting Started
1️⃣ Form Your Team
Recruit 2-3 team members. Choose complementary skills (DB, backend, ML).
2️⃣ Pick Your Project
Read descriptions. Choose based on interests and available resources.
3️⃣ Set Up Environment
Install databases, Spark, Python. Clone starter code if available.
4️⃣ Execute Pipeline
Load data, run processing, train model, collect metrics.
5️⃣ Document & Present
Write report, create slides, prepare demo, submit code.
6️⃣ Present Results
Show your work, discuss insights, answer questions from faculty.
📚 Recommended Resources
MongoDB Documentation
- Official MongoDB docs & tutorial
- Aggregation pipeline reference
- MongoDB University free courses
Apache Spark
- PySpark API reference
- MLlib documentation
- DataCamp Spark courses
scikit-learn
- ML algorithms documentation
- Model evaluation metrics
- Tutorial examples & notebooks