📊 Chapter 4: Big Data Analytics

Big Data Analytics - Issues, Machine Learning, and Advanced Analytics

🎯 Chapter Overview

In this chapter, we explore the challenges and opportunities of Big Data Analytics, ranging from machine learning techniques to modern applications in data, stream, text, and web analytics.

  • 2.5 quintillion bytes created daily
  • 90% of data created since 2010
  • 40-50% big data project ROI

⚠️ Issues in Big Data Analytics

Major Challenges

🔴 Reality check: 70-80% of time is spent on data preparation and cleaning, not the analysis itself!

Typical Analysis Pipeline

1. Collection: Gather data from multiple sources
2. Cleaning: Remove anomalies and duplicates
3. Transformation: Convert to required format
4. Analysis: Extract patterns and insights
5. Visualization: Communicate results
6. Action: Make data-driven decisions
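
The cleaning step above can be sketched in a few lines of standard-library Python. This is only an illustration: the record layout (`sensor_id`, `ts`, `value`), the sample readings, and the 3.5 cutoff on the modified z-score are assumptions made for the example, not something the chapter prescribes.

```python
from statistics import median

def clean(records, cutoff=3.5):
    """Drop exact duplicates, then drop values far from the median (MAD rule)."""
    # Step 1: remove duplicates while keeping first-seen order.
    seen, unique = set(), []
    for r in records:
        key = (r["sensor_id"], r["ts"], r["value"])
        if key not in seen:
            seen.add(key)
            unique.append(r)

    # Step 2: flag anomalies with the median absolute deviation (MAD),
    # which is less distorted by outliers than mean and standard deviation.
    values = [r["value"] for r in unique]
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return unique  # no spread, so nothing can be flagged
    return [r for r in unique if 0.6745 * abs(r["value"] - med) / mad <= cutoff]

# Hypothetical sensor readings.
raw = [
    {"sensor_id": "a", "ts": 1, "value": 20.1},
    {"sensor_id": "a", "ts": 1, "value": 20.1},   # duplicate
    {"sensor_id": "a", "ts": 2, "value": 19.8},
    {"sensor_id": "a", "ts": 3, "value": 20.4},
    {"sensor_id": "a", "ts": 4, "value": 950.0},  # sensor glitch
    {"sensor_id": "a", "ts": 5, "value": 20.0},
]
cleaned = clean(raw)
```

In this toy input the duplicate and the 950.0 reading are removed, leaving four records. Real pipelines usually need domain-specific rules on top of a generic outlier test.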

Time Distribution in an Analytics Project

Heterogeneous Data Sources

| Source Type | Examples | Typical Volume | Main Challenge |
| --- | --- | --- | --- |
| Structured Data | Databases, CSV, Excel | Low to medium | Integration |
| Semi-structured Data | JSON, XML, logs | Medium | Heterogeneous parsing |
| Unstructured Data | Images, videos, audio, text | Very high | Meaning extraction |
| IoT Data | Sensors, mobile devices | Huge (real-time) | Latency and synchronization |
| Social Media Data | Twitter, Facebook, Instagram | Exabytes/day | Noise and spam |

🔍 Types of Analytics

Analytics Maturity Matrix

Analytics Maturity Levels

Descriptive Analytics

Question: "What happened?"

Retrospective analysis to understand historical trends.

Techniques:

Example: "Q3 revenue was 15% higher than Q2"

Diagnostic Analytics

Question: "Why did it happen?"

Causal analysis to identify contributing factors.

Techniques:

Example: "The increase is due to a 20% price reduction and the Q3 marketing campaign"

Predictive Analytics

Question: "What will happen?"

Uses historical data to forecast future trends.

Techniques:

Example: "Based on trends, sales will increase by 25% in Q4"
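
One of the simplest ways to produce such a forecast is to fit a straight line to past values and extrapolate it. The sketch below does this with ordinary least squares. The quarterly sales figures are hypothetical, and a linear trend is an assumption that real data often violates (seasonality, promotions and so on).

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical sales for Q1..Q3 (arbitrary units).
quarters = [1, 2, 3]
sales = [100.0, 110.0, 120.0]

slope, intercept = fit_line(quarters, sales)
forecast_q4 = slope * 4 + intercept  # extrapolate one quarter ahead
```

With these numbers the fitted slope is 10 per quarter, so the Q4 forecast is 130.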

Real-world Use Cases:

Prescriptive Analytics

Question: "What should we do?"

Recommends optimal actions to achieve goals.

Techniques:

Example: "Increase prices by 8-12% in high-demand regions to maximize profit"
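
A prescriptive recommendation like this usually comes from optimizing an objective over candidate actions. The sketch below searches a grid of prices for the one that maximizes profit. The linear demand model and every number in it (`unit_cost`, `base_demand`, `sensitivity`) are hypothetical. In practice the demand curve would itself come from a predictive model.

```python
def profit(price, unit_cost=6.0, base_demand=1000.0, sensitivity=50.0):
    """Profit under an assumed linear demand curve: demand falls as price rises."""
    demand = max(0.0, base_demand - sensitivity * price)
    return (price - unit_cost) * demand

# Candidate prices from 6.0 to 20.0 in steps of 0.5.
candidates = [p / 2 for p in range(12, 41)]

# The "prescription" is the candidate with the highest predicted profit.
best_price = max(candidates, key=profit)
best_profit = profit(best_price)
```

Under these assumptions the search picks a price of 13.0, with a predicted profit of 2450.0.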

⚠️ Important: Most organizations only use descriptive and diagnostic analytics. Few reach predictive, and even fewer reach prescriptive!

🤖 Machine Learning

What is Machine Learning?

Machine Learning is a subset of AI where systems learn from data without being explicitly programmed.

Types of Machine Learning:

1. Supervised Learning

The algorithm learns from labeled data (input → output).

| Type | Objective | Algorithms | Use Case |
| --- | --- | --- | --- |
| Regression | Predict a continuous value | Linear Regression, SVR, Random Forest | Price, temperature, sales prediction |
| Classification | Predict a category | Logistic Regression, SVM, XGBoost | Spam detection, medical diagnosis |
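
To make "learning from labeled data" concrete, here is a minimal sketch of logistic regression trained by batch gradient descent, written without any ML library. The single feature (a count of suspicious words), the six labeled examples, the learning rate and the epoch count are all assumptions chosen for the illustration. A real project would normally use a library such as scikit-learn.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit p(y=1|x) = sigmoid(w*x + b) by batch gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction error for every labeled example.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradient step on the weight and the bias.
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * sum(errors) / n
    return w, b

# Hypothetical labeled data: feature = suspicious-word count, label 1 = spam.
counts = [0, 1, 2, 3, 4, 5]
labels = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(counts, labels)

def predict(x):
    """Classify as spam (1) when the predicted probability is at least 0.5."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0
```

After training, messages with few suspicious words are classified as non-spam and messages with many as spam. The decision boundary is learned from the labels rather than hard-coded.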

2. Unsupervised Learning

The algorithm finds patterns in data without labels.

| Type | Objective | Algorithms | Use Case |
| --- | --- | --- | --- |
| Clustering | Group similar data | K-Means, DBSCAN, Hierarchical | Customer segmentation, document grouping |
| Dimensionality Reduction | Reduce number of variables | PCA, t-SNE, Autoencoders | Visualization, data compression |
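
As an illustration of clustering, the following is a bare-bones K-Means sketch in plain Python. The customer points and the choice of initial centroids are hypothetical. Production implementations add smarter initialization (for example k-means++) and a convergence check instead of a fixed iteration count.

```python
def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Index of the centroid with the smallest squared distance to p.
            idx = min(
                range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[idx].append(p)
        # Recompute centroids; keep the old one if a cluster ends up empty.
        centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical customers described by two numeric features.
customers = [(1, 2), (2, 1), (1, 1), (9, 8), (8, 9), (9, 9)]
centers, groups = kmeans(customers, centroids=[customers[0], customers[3]])
```

No labels are supplied. The two groups emerge only from the distances between points.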

3. Reinforcement Learning

The algorithm learns through interactions and rewards/penalties.

Applications: Game playing (AlphaGo), autonomous robots, resource optimization
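
The reward-driven loop can be illustrated with a multi-armed bandit, one of the simplest reinforcement learning settings. The sketch below uses an epsilon-greedy strategy. The three reward probabilities, the exploration rate and the fixed random seed are assumptions for the example.

```python
import random

def run_bandit(true_probs, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent: mostly pick the best-looking arm, sometimes explore."""
    rng = random.Random(seed)
    counts = [0] * len(true_probs)
    estimates = [0.0] * len(true_probs)
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(true_probs))       # explore a random arm
        else:
            arm = estimates.index(max(estimates))      # exploit the current best
        # The environment pays a reward of 1 with the arm's hidden probability.
        reward = 1.0 if rng.random() < true_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running mean reward for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts

# Hypothetical environment with three actions of unknown quality.
estimates, counts = run_bandit([0.2, 0.5, 0.8])
```

The agent is never told which arm is best. It infers this from the rewards it collects, and over time it pulls the highest-paying arm most often.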

Standard ML Process

1. Problem Definition
2. Data Collection & Preparation
3. Feature Engineering
4. Model Selection
5. Training
6. Evaluation
7. Tuning
8. Deployment

Key Evaluation Metrics

Common ML Challenges

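  • Overfitting: The model learns noise in the training data instead of general patterns
  • Underfitting: The model is too simple to capture the patterns
  • Data Leakage: Information from the test set leaks into the training set
  • Class Imbalance: Heavily skewed classes (e.g., 99% vs 1%)
  • Feature Engineering: Takes around 70% of the time and involves critical choices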
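
The class imbalance problem is easiest to see by comparing accuracy with precision, recall and F1. The sketch below computes all four for a hypothetical 99% vs 1% dataset and a trivial model that always predicts the majority class. The data and the "model" are made up for the illustration.

```python
def classification_report(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# 99 negatives and 1 positive; the "model" always predicts negative.
y_true = [0] * 99 + [1]
always_negative = [0] * 100
report = classification_report(y_true, always_negative)
```

Accuracy comes out at 0.99 even though the model never finds the single positive case (recall 0.0). This is why accuracy alone is misleading on imbalanced data.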

🌊 Stream Analytics (Data Streams)

What is Stream Analytics?

Real-time analysis of continuously arriving data (unlike traditional batch processing).

Use Cases

Specific Challenges

| Challenge | Description | Solution |
| --- | --- | --- |
| Latency | Data arrives fast, decisions must be made quickly | Optimized architecture, edge computing, caching |
| Out-of-order events | Events arrive out of order due to network delays | Watermarks, allowed lateness, event-time processing |
| State management | Maintaining complex state over the long term | Serialization, RocksDB, external stores |
| Exactly-once guarantee | Each event processed exactly once | Checkpointing, idempotent operations |
| Scalability | Processing billions of events/day | Parallelization, partitioning, distribution |
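
Two of the solutions in the table, watermarks with allowed lateness and idempotent processing, can be sketched without any streaming framework. The code below sums values in tumbling event-time windows. The event layout `(event_id, event_time, value)`, the 10-unit window and the 5-unit lateness are assumptions. Engines such as Flink or Spark implement the same ideas with distributed, fault-tolerant state.

```python
from collections import defaultdict

def windowed_sums(events, window=10, allowed_lateness=5):
    """Sum values per tumbling event-time window, processing in arrival order."""
    open_windows = defaultdict(float)  # window start -> running sum
    closed = {}                        # window start -> final sum
    dropped = []                       # ids of events that arrived too late
    seen_ids = set()                   # lets us ignore redelivered events
    watermark = float("-inf")
    for event_id, event_time, value in events:
        if event_id in seen_ids:
            continue                   # idempotence: a duplicate has no effect
        seen_ids.add(event_id)
        start = event_time - event_time % window
        if start in closed or start + window <= watermark:
            dropped.append(event_id)   # its window has already been finalized
        else:
            open_windows[start] += value
        # The watermark trails the largest event time seen by the lateness bound.
        watermark = max(watermark, event_time - allowed_lateness)
        for s in sorted(open_windows):
            if s + window <= watermark:
                closed[s] = open_windows.pop(s)
    for s in sorted(open_windows):     # end of stream: flush what is left
        closed[s] = open_windows.pop(s)
    return closed, dropped

# Hypothetical stream: e3 is out of order, then redelivered; e5 is too late.
stream = [
    ("e1", 1, 1.0), ("e2", 12, 2.0), ("e3", 8, 3.0),
    ("e3", 8, 3.0), ("e4", 27, 4.0), ("e5", 3, 5.0),
]
closed, dropped = windowed_sums(stream)
```

Here `e3` arrives out of order but inside the lateness bound, so it still counts toward window 0. Its redelivered copy is ignored, and `e5` is dropped because window 0 was already closed when it arrived.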

Popular Frameworks

| Framework | Latency | Language | Optimal Use Case |
| --- | --- | --- | --- |
| Apache Spark Streaming | 1-2 seconds | Python, Scala, Java | Micro-batch analytics, real-time analytics |
| Apache Flink | 100 ms | Python, Scala, Java | True streaming, real-time events |
| Apache Kafka Streams | 100 ms | Java | Lightweight, stream processing, event sourcing |
| Kinesis (AWS) | 1 second | Multi-language | Cloud-native, AWS integrated |
| Storm (Twitter) | 100 ms | Multi-language | Legacy, less popular now |

Typical Stream Architecture

Source → Message Queue → Processing → Storage → Visualization

Example: IoT Sensors → Kafka → Spark Streaming → MongoDB → Dashboard

📝 Text Analytics (NLP)

What is Text Analytics?

Extraction of meaning from unstructured text data using Natural Language Processing (NLP).

Key Applications

Standard NLP Pipeline

1. Tokenization: Split text into words/sentences
2. Stemming/Lemmatization: Reduce to canonical form
3. Remove Stopwords: Remove non-informative words
4. Vectorization: Convert to numbers (TF-IDF, embeddings)
5. Feature Extraction: Create features for ML
6. Model Training: Train on specific task
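
The first steps of this pipeline can be sketched with the standard library alone. The example below tokenizes text, removes stopwords and computes TF-IDF weights. The stopword list is a tiny illustrative one, the documents are invented, and stemming/lemmatization is skipped here because it normally relies on a dedicated NLP library.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "is", "a", "and", "was", "it"}  # tiny illustrative list

def tokenize(text):
    """Lowercase, split into word tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOPWORDS]

def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in tokenized for term in set(doc))
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

# Hypothetical mini-corpus of customer comments.
docs = [
    "The product is excellent",
    "The delivery was fast",
    "The product was disappointing",
]
vectors = tfidf(docs)
```

In the first document "excellent" receives a higher weight than "product", because "product" also appears in another document and is therefore less distinctive.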

Modern Techniques

💡 Practical Example - Sentiment Analysis:

Customer: "The product is excellent! Fast delivery!" → Positive (95%)
Customer: "Disappointing, mediocre quality, absent customer service" → Negative (92%)
Customer: "It's OK, neither good nor bad" → Neutral (70%)

NLP Challenges

🌐 Web Analytics

What is Web Analytics?

Measurement, collection, analysis, and reporting of user behavior data on websites and web applications.

Key Metrics

| Metric | Definition | Importance |
| --- | --- | --- |
| Users/Visitors | Unique user count (daily/monthly) | Audience growth |
| Sessions | Number of visits (a user can have multiple sessions) | Repeated engagement |
| Bounce Rate | % of visitors who leave after a single page | Content quality/UX |
| Average Session Duration | Average time spent per session | Engagement |
| Pages per Session | Average pages visited per session | Interest/navigation |
| Conversion Rate | % of visitors who complete the desired action | ROI |
| CAC | Customer Acquisition Cost (marketing cost / new customers) | Marketing profitability |
| LTV | Lifetime Value (total revenue from a customer minus the cost of serving them) | Long-term profitability |
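
Several of these metrics reduce to simple ratios over session records. The sketch below computes a few of them. The session fields (`user`, `pages`, `converted`), the sample data and the marketing figures are hypothetical. Bounce rate is computed per session here, which is how analytics tools commonly report it.

```python
def web_metrics(sessions, marketing_cost, new_customers):
    """Compute basic web metrics from a list of session records."""
    users = {s["user"] for s in sessions}
    converted_users = {s["user"] for s in sessions if s["converted"]}
    total = len(sessions)
    return {
        "users": len(users),
        "sessions": total,
        # A bounce is a session with exactly one page view.
        "bounce_rate": sum(1 for s in sessions if s["pages"] == 1) / total,
        "pages_per_session": sum(s["pages"] for s in sessions) / total,
        # Share of unique visitors who converted at least once.
        "conversion_rate": len(converted_users) / len(users),
        "cac": marketing_cost / new_customers,
    }

# Hypothetical session log.
sessions = [
    {"user": "u1", "pages": 1, "converted": False},
    {"user": "u1", "pages": 4, "converted": True},
    {"user": "u2", "pages": 1, "converted": False},
    {"user": "u3", "pages": 3, "converted": False},
]
metrics = web_metrics(sessions, marketing_cost=300.0, new_customers=1)
```

For this sample there are 3 users across 4 sessions, with a bounce rate of 0.5 and 2.25 pages per session.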

Web Analysis Techniques

1. Cohort Analysis

Analyze groups of users who signed up during the same period.

Example: Users who signed up in January 2024 have 40% retention after 3 months, vs 35% for the February cohort. Why?
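
A retention figure like this can be computed by grouping users by signup month and checking who is still active a fixed number of months later. The sketch below assumes months are encoded as integers (0 = January, 1 = February, ...) and that activity is available as `(user, month)` pairs. Both the encoding and the sample data are hypothetical.

```python
from collections import defaultdict

def retention_by_cohort(signups, activity, month_offset):
    """signups: user -> signup month; activity: set of (user, active month).
    Returns cohort month -> share of users active `month_offset` months later."""
    cohort_sizes = defaultdict(int)
    retained = defaultdict(int)
    for user, signup_month in signups.items():
        cohort_sizes[signup_month] += 1
        if (user, signup_month + month_offset) in activity:
            retained[signup_month] += 1
    return {month: retained[month] / size for month, size in cohort_sizes.items()}

# Hypothetical data: two January signups (month 0), three February signups (month 1).
signups = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 1}
activity = {("a", 3), ("c", 4), ("d", 2)}
retention = retention_by_cohort(signups, activity, month_offset=3)
```

Here half of the January cohort and one third of the February cohort are still active three months after signing up.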

2. Funnel Analysis

Track user progression through steps of a process.

Example: Landing Page (100%) → Sign Up (45%) → Activation (22%) → Subscription (8%). Where do we lose the most?
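
The question at the end of this example has two reasonable answers, depending on whether losses are measured against all visitors or against the previous step. The sketch below computes both from the percentages in the example. The function and field names are made up for the illustration.

```python
# Funnel from the example: share of all visitors reaching each step (in %).
funnel = [("Landing Page", 100), ("Sign Up", 45), ("Activation", 22), ("Subscription", 8)]

def drop_offs(steps):
    """For each transition, report the loss in percentage points of all visitors
    and the share of the previous step that did not continue."""
    rows = []
    for (prev_name, prev), (name, cur) in zip(steps, steps[1:]):
        rows.append({
            "step": f"{prev_name} -> {name}",
            "lost_points": prev - cur,
            "step_drop": 1 - cur / prev,
        })
    return rows

rows = drop_offs(funnel)
largest_absolute = max(rows, key=lambda r: r["lost_points"])["step"]
largest_relative = max(rows, key=lambda r: r["step_drop"])["step"]
```

In absolute terms the biggest loss is Landing Page -> Sign Up (55 points of all visitors). Relative to the users who reached the previous step, the worst transition is Activation -> Subscription, where about 64% do not continue.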

3. A/B Testing

Compare two versions (A vs B) to see which performs better.

Example: "Buy Now" button (blue) vs "Get Your Copy" (red). Red converts 12% better!
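
Before acting on a lift like this, it is worth checking whether the difference could be due to chance. One common check is a two-proportion z-test, sketched below with the standard library. The visitor and conversion counts are hypothetical numbers chosen to give a 12% relative lift. The test also assumes independent visitors and reasonably large samples.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: A converts 500 of 10,000 visitors, B converts 560 of 10,000.
z, p_value = two_proportion_z_test(500, 10_000, 560, 10_000)
```

With these counts z is roughly 1.89 and the p-value is just under 0.06, so the 12% lift would not clear a conventional 5% significance threshold. Note also that the example changes both the button text and its color, so the test alone cannot tell which change drove the difference.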

4. Churn Analysis

Identify and understand why users leave.

Example: Users who did not complete their profile within 24 hours have 80% churn, vs 15% for those who did.

Web Analytics Tools

Multi-channel Attribution

How to assign credit when a customer interacts via multiple channels before converting?
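
Three widely used rule-based answers are first-touch, last-touch and linear attribution. The sketch below implements all three for comparison. The channel names, conversion values and function name are hypothetical, and data-driven attribution models are outside its scope.

```python
from collections import defaultdict

def attribute(paths, model="linear"):
    """paths: list of (ordered channel list, conversion value).
    Returns channel -> credited value under the chosen rule."""
    credit = defaultdict(float)
    for channels, value in paths:
        if model == "first_touch":
            credit[channels[0]] += value          # all credit to the first contact
        elif model == "last_touch":
            credit[channels[-1]] += value         # all credit to the final contact
        elif model == "linear":
            for channel in channels:              # equal share for every contact
                credit[channel] += value / len(channels)
        else:
            raise ValueError(f"unknown model: {model}")
    return dict(credit)

# Hypothetical customer journeys that ended in a purchase.
paths = [
    (["social", "email", "search"], 90.0),
    (["search"], 30.0),
]
linear = attribute(paths, "linear")
last_touch = attribute(paths, "last_touch")
first_touch = attribute(paths, "first_touch")
```

The same journeys yield quite different pictures: last-touch credits everything to search, while the linear model spreads the first purchase evenly across social, email and search.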

🛠️ Analysis Tools & Technologies

Typical Tech Stack

  • 📊 Big Data Processing: Spark, Hadoop, Flink, Presto
  • 💾 Data Warehousing: Snowflake, BigQuery, Redshift, Presto
  • 🤖 Machine Learning: scikit-learn, TensorFlow, PyTorch, XGBoost
  • 📈 Visualization: Tableau, Looker, Power BI, Metabase
  • 📡 Stream Processing: Kafka, Kinesis, Pulsar, RabbitMQ
  • 📚 Notebooks: Jupyter, Databricks, Zeppelin

Popular Languages

| Language | Use Case | Pros | Cons |
| --- | --- | --- | --- |
| Python | Data science, ML, analytics | Simple, excellent libraries (pandas, scikit-learn), popular | Slow for CPU-intensive work, GIL |
| SQL | Data queries, exploration | Standardized, intuitive, fast on databases | Limited for complex ML |
| Scala | Spark, big data | Performant, functional programming, Spark native | Learning curve, less popular than Python |
| R | Statistics, visualization | Excellent for stats, ggplot2, tidyverse | Slow, variable package quality, niche |
| Java | Production, big data | Performant, ecosystem, Hadoop/Spark | Verbose, learning curve |

Stack Recommendations

For most organizations:
  • Exploration: Python (Jupyter) + SQL
  • ML: Python (scikit-learn, PyTorch, XGBoost)
  • Big Data: Spark (Python or Scala)
  • Streaming: Kafka + Spark/Flink
  • Warehouse: Snowflake or BigQuery
  • Viz: Tableau or Looker

📋 Chapter 4 Summary

Key Takeaways:
  • 70-80% of time is spent on data preparation, not analysis!
  • Analytics progresses: Descriptive → Diagnostic → Predictive → Prescriptive
  • ML requires good data quality and feature engineering
  • Stream analytics is critical for real-time cases
  • NLP transforms unstructured text into insights
  • Web analytics measures engagement and marketing ROI
  • Modern stack: Python + Spark + Kafka + Snowflake + Tableau