Big Data Analytics - Issues, Machine Learning, and Advanced Analytics
In this chapter, we explore the challenges and opportunities of Big Data Analytics, ranging from machine learning techniques to modern applications in data, stream, text, and web analytics.
Gather data from multiple sources
Remove anomalies and duplicates
Convert to required format
Extract patterns and insights
Communicate results
Make data-driven decisions
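
The six steps above can be sketched end to end on a toy data set. This is a minimal illustration using only the Python standard library; the record fields (`id`, `region`, `amount`), the sample values, and the 10,000 anomaly threshold are assumptions made for the example, not part of any real system.

```python
from collections import defaultdict

# Hypothetical raw records from two sources (illustrative values only).
source_a = [{"id": 1, "region": "north", "amount": "120.0"},
            {"id": 2, "region": "south", "amount": "80.5"}]
source_b = [{"id": 2, "region": "south", "amount": "80.5"},    # duplicate of id 2
            {"id": 3, "region": "north", "amount": "999999"},  # anomaly
            {"id": 4, "region": "south", "amount": "60.0"}]

def collect(*sources):
    """Gather data from multiple sources into one list."""
    return [row for src in sources for row in src]

def clean(rows, max_amount=10_000):
    """Drop duplicate ids and rows above an assumed anomaly threshold."""
    seen, out = set(), []
    for row in rows:
        if row["id"] in seen or float(row["amount"]) > max_amount:
            continue
        seen.add(row["id"])
        out.append(row)
    return out

def transform(rows):
    """Convert string amounts to floats."""
    return [{**row, "amount": float(row["amount"])} for row in rows]

def analyze(rows):
    """Extract a simple pattern: total amount per region."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)

def report(totals):
    """Communicate results as plain-text lines."""
    return [f"{region}: {total:.2f}" for region, total in sorted(totals.items())]

def decide(totals):
    """Pick the region with the highest total as the one to prioritize."""
    return max(totals, key=totals.get)

totals = analyze(transform(clean(collect(source_a, source_b))))
summary = report(totals)
priority_region = decide(totals)
```

Each stage is a plain function, so one stage can be replaced (for example, a different anomaly rule in `clean`) without touching the others.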
| Source Type | Characteristics | Typical Volume | Main Challenge |
|---|---|---|---|
| Structured Data | Databases, CSV, Excel | Low to medium | Integration |
| Semi-structured Data | JSON, XML, logs | Medium | Heterogeneous parsing |
| Unstructured Data | Images, videos, audio, text | Very high | Meaning extraction |
| IoT Data | Sensors, mobile devices | Very high (real-time) | Latency and synchronization |
| Social Media Data | Twitter, Facebook, Instagram | Petabytes/day | Noise and spam |
Retrospective analysis to understand historical trends.
Techniques:
Example: "Q3 revenue was 15% higher than Q2"
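
A statement like the one above comes from a simple period-over-period comparison. A minimal sketch, with hypothetical revenue figures chosen so that the result matches the example:

```python
def pct_change(previous, current):
    """Percentage change from `previous` to `current`."""
    return (current - previous) * 100.0 / previous

# Hypothetical quarterly revenue (illustrative values only).
quarterly_revenue = {"Q1": 180_000, "Q2": 200_000, "Q3": 230_000}

growth_q3 = pct_change(quarterly_revenue["Q2"], quarterly_revenue["Q3"])
message = f"Q3 revenue was {growth_q3:.0f}% higher than Q2"
```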
Causal analysis to identify contributing factors.
Techniques:
Example: "The increase is due to a 20% price reduction + Q3 marketing campaign"
Uses historical data to forecast future trends.
Techniques:
Example: "Based on trends, sales will increase by 25% in Q4"
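
One of the simplest ways to produce such a forecast is to fit a straight line to past values and extend it one period ahead. The sketch below uses ordinary least squares on hypothetical quarterly sales; real forecasts usually also need seasonality and uncertainty estimates, which are left out here.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical sales for four past quarters (illustrative values only).
quarters = [1, 2, 3, 4]
sales = [100.0, 112.0, 119.0, 131.0]

slope, intercept = fit_line(quarters, sales)
forecast_next = slope * 5 + intercept  # extrapolate to the fifth quarter
```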
Real-world Use Cases:
Recommends optimal actions to achieve goals.
Techniques:
Example: "Increase prices by 8-12% in high-demand regions to maximize profit"
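
A recommendation of this kind can come from searching candidate actions against a model of their outcome. The sketch below does a grid search over price increases under an assumed linear demand curve; the base price, unit cost, volume, and `sensitivity` value are all hypothetical, and a real system would estimate the demand response from data.

```python
def profit(pct_increase, base_price=100.0, unit_cost=70.0,
           base_units=1000.0, sensitivity=2.0):
    """Profit under an assumed linear demand curve.

    Demand is assumed to fall by `sensitivity` percent
    for each 1% increase in price.
    """
    change = pct_increase / 100.0
    price = base_price * (1.0 + change)
    units = base_units * (1.0 - sensitivity * change)
    return (price - unit_cost) * max(units, 0.0)

candidates = range(0, 21)  # price increases from 0% to 20% in 1-point steps
best_pct = max(candidates, key=profit)
best_profit = profit(best_pct)
```

With these assumed parameters the search settles on a 10% increase, inside the 8-12% range of the example above; different cost or sensitivity values move the optimum.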
Machine Learning is a subset of AI where systems learn from data without being explicitly programmed.
Types of Machine Learning:
The algorithm learns from labeled data (input → output).
| Type | Objective | Algorithms | Use Case |
|---|---|---|---|
| Regression | Predict a continuous value | Linear Regression, SVR, Random Forest | Price, temperature, sales prediction |
| Classification | Predict a category | Logistic Regression, SVM, XGBoost | Spam detection, medical diagnosis |
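
To make the classification row concrete, here is a minimal sketch of logistic regression trained by batch gradient descent on a single feature. The data set (count of suspicious keywords per message, with a spam label) is hypothetical, and the learning rate and epoch count are arbitrary choices; in practice a library such as scikit-learn would normally be used instead.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, lr=0.5, epochs=500):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by batch gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * sum(errors) / n
    return w, b

# Hypothetical labeled data: suspicious keyword count -> spam (1) or not (0).
keyword_counts = [0, 1, 2, 3, 4, 5, 6, 7]
is_spam = [0, 0, 0, 0, 1, 1, 1, 1]

# Standardize the feature so gradient descent behaves well.
mean = sum(keyword_counts) / len(keyword_counts)
std = math.sqrt(sum((x - mean) ** 2 for x in keyword_counts) / len(keyword_counts))
scaled = [(x - mean) / std for x in keyword_counts]

w, b = train_logistic(scaled, is_spam)

def predict(count):
    """Return 1 (spam) if the predicted probability is at least 0.5."""
    return int(sigmoid(w * (count - mean) / std + b) >= 0.5)
```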
The algorithm finds patterns in data without labels.
| Type | Objective | Algorithms | Use Case |
|---|---|---|---|
| Clustering | Group similar data | K-Means, DBSCAN, Hierarchical | Customer segmentation, document grouping |
| Dimensionality Reduction | Reduce number of variables | PCA, t-SNE, Autoencoders | Visualization, data compression |
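
As an illustration of clustering, the following is a minimal K-Means sketch in plain Python. The customer points (two made-up features per customer) are hypothetical, and the naive initialization from the first `k` points is a simplification; production implementations typically use smarter seeding such as k-means++.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=100):
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids

# Hypothetical customers described by two features (illustrative values only).
customers = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
             (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

labels, centroids = kmeans(customers, k=2)
```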
The algorithm learns through interactions and rewards/penalties.
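
A small way to see learning from rewards is the multi-armed bandit problem. The sketch below uses an epsilon-greedy strategy: most of the time it picks the action with the best estimated reward, and occasionally it explores at random. The reward probabilities, step count, `epsilon`, and seed are assumptions chosen for illustration.

```python
import random

def run_bandit(reward_probs, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a Bernoulli multi-armed bandit."""
    rng = random.Random(seed)
    n_arms = len(reward_probs)
    estimates = [0.0] * n_arms  # running mean reward per arm
    pulls = [0] * n_arms
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        reward = 1.0 if rng.random() < reward_probs[arm] else 0.0
        pulls[arm] += 1
        # Incremental update of the mean reward for the chosen arm.
        estimates[arm] += (reward - estimates[arm]) / pulls[arm]
    return estimates, pulls

# Hypothetical environment: three actions with hidden success probabilities.
estimates, pulls = run_bandit([0.2, 0.5, 0.8])
best_arm = max(range(len(estimates)), key=lambda a: estimates[a])
```

The agent never sees the true probabilities; it only observes rewards, yet its estimates should end up favoring the arm with the highest payoff.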
Real-time analysis of continuously arriving data (unlike traditional batch processing).
| Challenge | Description | Solution |
|---|---|---|
| Latency | Data arrives fast, decisions must be made quickly | Optimized architecture, edge computing, caching |
| Out-of-order events | Events arrive out of order due to network delays | Event-time processing, watermarks, allowed lateness |
| State management | Maintaining complex state over long term | Serialization, RocksDB, external stores |
| Exactly-once guarantee | Each event processed exactly once | Checkpointing, idempotent operations |
| Scalability | Processing billions of events/day | Parallelization, partitioning, distribution |
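
The out-of-order row can be illustrated with a simplified model of event-time windows and watermarks. In this sketch the watermark trails the largest event time seen by `max_delay`; a tumbling window is finalized once the watermark passes its end, and events for an already finalized window are dropped. The function name, parameters, and sample events are hypothetical, and real engines such as Flink implement this with far more machinery.

```python
def tumbling_window_sums(events, window_size=10, max_delay=5):
    """Sum values per event-time window, tolerating bounded disorder.

    `events` is an iterable of (event_time, value) in arrival order.
    """
    open_windows = {}  # window start -> running sum
    closed = {}        # window start -> final sum
    dropped = []       # events that arrived after their window closed
    watermark = float("-inf")
    for event_time, value in events:
        start = (event_time // window_size) * window_size
        if start + window_size <= watermark:
            dropped.append((event_time, value))
            continue
        open_windows[start] = open_windows.get(start, 0) + value
        watermark = max(watermark, event_time - max_delay)
        # Finalize every window whose end is at or before the watermark.
        for s in sorted(open_windows):
            if s + window_size <= watermark:
                closed[s] = open_windows.pop(s)
    # End of stream: flush whatever is still open.
    for s in sorted(open_windows):
        closed[s] = open_windows.pop(s)
    return closed, dropped

# Hypothetical stream; the events at times 8 and 3 arrive out of order.
events = [(1, 1), (4, 1), (12, 1), (8, 1), (16, 1), (3, 1), (25, 1)]
closed, dropped = tumbling_window_sums(events)
```

The event at time 8 is late but still within the tolerated delay, so it is counted; the event at time 3 arrives after its window was finalized and is dropped. A larger `max_delay` trades latency for completeness.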
| Framework | Latency | Language | Optimal Use Case |
|---|---|---|---|
| Apache Spark Streaming | 1-2 seconds | Python, Scala, Java | Micro-batch processing, near-real-time analytics |
| Apache Flink | 100ms | Python, Scala, Java | True streaming, real-time events |
| Apache Kafka Streams | 100ms | Java | Lightweight stream processing, event sourcing |
| Kinesis (AWS) | 1 second | Multi-language | Cloud-native, AWS integrated |
| Apache Storm (originally Twitter) | 100ms | Multi-language | Legacy, less popular now |
Source → Message Queue → Processing → Storage → Visualization
Example: IoT Sensors → Kafka → Spark Streaming → MongoDB → Dashboard
Extraction of meaning from unstructured text data using Natural Language Processing (NLP).
Split text into words/sentences
Reduce to canonical form
Remove non-informative words
Convert to numbers (TF-IDF, embeddings)
Create features for ML
Train on specific task
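
The first steps of this pipeline, up to vectorization, can be sketched with the standard library alone. The stop-word list is a tiny illustrative sample, `normalize` is a deliberately crude suffix stripper standing in for real stemming or lemmatization, and the TF-IDF variant shown (term frequency times `log(N / df)`) is one of several common formulations.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "is", "a", "and", "was", "it"}  # tiny illustrative list

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def normalize(token):
    """Crude suffix stripping; a stand-in for real stemming/lemmatization."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, remove stop words, and reduce tokens to a canonical form."""
    return [normalize(t) for t in tokenize(text) if t not in STOP_WORDS]

def tfidf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [preprocess(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

# Hypothetical documents (illustrative only).
docs = ["The delivery was fast",
        "The product is excellent",
        "Fast delivery, fast refunds"]
vectors = tfidf(docs)
```

The resulting dictionaries are the numeric features that a downstream model would be trained on.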
Customer: "The product is excellent! Fast delivery!" → Positive (95%)
Customer: "Disappointing, mediocre quality, absent customer service" → Negative (92%)
Customer: "It's OK, neither good nor bad" → Neutral (70%)
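
The labels above can be approximated with a very simple lexicon-based scorer. This is a sketch only: the word lists are hypothetical and tiny, negation and sarcasm are ignored, and it returns a label without the confidence percentages that a trained model would provide.

```python
import re

# Hypothetical, deliberately small sentiment lexicons.
POSITIVE = {"excellent", "fast", "good", "great"}
NEGATIVE = {"disappointing", "mediocre", "absent", "bad", "slow"}

def sentiment(text):
    """Label text by counting positive and negative lexicon words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"

reviews = [
    "The product is excellent! Fast delivery!",
    "Disappointing, mediocre quality, absent customer service",
    "It's OK, neither good nor bad",
]
labels = [sentiment(r) for r in reviews]
```

In the third review "good" and "bad" cancel out, which happens to give the right label here; a lexicon approach gets such cases right only by accident, which is one reason trained classifiers are preferred.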
Measurement, collection, analysis, and reporting of user behavior data on websites and web applications.
| Metric | Definition | Importance |
|---|---|---|
| Users/Visitors | Unique user count (daily/monthly) | Audience growth |
| Sessions | Number of visits (user can have multiple sessions) | Repeated engagement |
| Bounce Rate | % of sessions that end after a single page view | Content quality/UX |
| Average Session Duration | Average time spent per session | Engagement |
| Pages per Session | Average pages visited per session | Interest/navigation |
| Conversion Rate | % of visitors who complete desired action | ROI |
| CAC | Customer Acquisition Cost (marketing cost / new customers) | Marketing profitability |
| LTV | Customer Lifetime Value (total revenue from a customer minus the cost of acquiring and serving them) | Long-term profitability |
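
Most of the metrics in the table can be derived from a session log. The sketch below uses a hypothetical log with made-up field names and an assumed marketing spend; bounce rate is computed per session, and every converting user is treated as a new customer for the CAC calculation.

```python
# Hypothetical session log; field names and values are illustrative.
sessions = [
    {"user": "u1", "pages": 1, "seconds": 10, "converted": False},
    {"user": "u1", "pages": 4, "seconds": 200, "converted": True},
    {"user": "u2", "pages": 1, "seconds": 5, "converted": False},
    {"user": "u3", "pages": 3, "seconds": 125, "converted": False},
]
marketing_cost = 150.0  # assumed spend for the same period

users = {s["user"] for s in sessions}
converted_users = {s["user"] for s in sessions if s["converted"]}

metrics = {
    "users": len(users),
    "sessions": len(sessions),
    "bounce_rate": sum(s["pages"] == 1 for s in sessions) / len(sessions),
    "avg_session_seconds": sum(s["seconds"] for s in sessions) / len(sessions),
    "pages_per_session": sum(s["pages"] for s in sessions) / len(sessions),
    "conversion_rate": len(converted_users) / len(users),
    "cac": marketing_cost / len(converted_users),
}
```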
Analyze groups of users created during a similar period.
Track user progression through steps of a process.
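
A funnel can be computed by counting how many users remain after each step. In this sketch the step names and the per-user event sets are hypothetical, and event ordering is ignored: a user counts for a step if they completed it and every earlier step.

```python
funnel_steps = ["visit", "product_view", "add_to_cart", "purchase"]

# Hypothetical events per user (illustrative only).
user_events = {
    "u1": {"visit", "product_view", "add_to_cart", "purchase"},
    "u2": {"visit", "product_view"},
    "u3": {"visit"},
    "u4": {"visit", "product_view", "add_to_cart"},
}

def funnel_counts(steps, events_by_user):
    """Count users who completed each step and all steps before it."""
    counts = []
    remaining = set(events_by_user)
    for step in steps:
        remaining = {u for u in remaining if step in events_by_user[u]}
        counts.append((step, len(remaining)))
    return counts

counts = funnel_counts(funnel_steps, user_events)
# Share of users from the previous step who reached the next one.
step_rates = [later / earlier
              for (_, earlier), (_, later) in zip(counts, counts[1:])]
```

The step with the lowest rate is the first candidate for investigation.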
Compare two versions (A vs B) to see which performs better.
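
Whether B really performs better than A is usually checked with a significance test. A common choice for conversion rates is the two-proportion z-test, sketched below with hypothetical visitor and conversion counts; it assumes independent samples that are large enough for the normal approximation.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical experiment: 1000 visitors per variant.
z, p_value = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=130, n_b=1000)
significant = p_value < 0.05  # conventional threshold, an assumption here
```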
Identify and understand why users leave.
How to assign credit when a customer interacts via multiple channels before converting?
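
Two simple answers are last-touch attribution (all credit to the final channel) and linear attribution (equal credit to every channel on the path). The sketch below compares them on hypothetical conversion paths; other models, such as time-decay or data-driven attribution, are not shown.

```python
from collections import defaultdict

def last_touch(paths):
    """Give each conversion's full credit to the last channel."""
    credit = defaultdict(float)
    for path in paths:
        credit[path[-1]] += 1.0
    return dict(credit)

def linear(paths):
    """Split each conversion's credit equally across its channels."""
    credit = defaultdict(float)
    for path in paths:
        share = 1.0 / len(path)
        for channel in path:
            credit[channel] += share
    return dict(credit)

# Hypothetical channel sequences, one per converting customer.
paths = [["search", "social", "email"], ["social", "email"], ["search"]]

last_touch_credit = last_touch(paths)
linear_credit = linear(paths)
```

Last-touch gives "social" no credit at all, while the linear model credits it for both paths it appears in; the choice of model changes which channels look worth funding.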
Spark, Hadoop, Flink, Presto
Snowflake, BigQuery, Redshift, Presto
scikit-learn, TensorFlow, PyTorch, XGBoost
Tableau, Looker, Power BI, Metabase
Kafka, Kinesis, Pulsar, RabbitMQ
Jupyter, Databricks, Zeppelin
| Language | Use Case | Pros | Cons |
|---|---|---|---|
| Python | Data science, ML, analytics | Simple, excellent libraries (pandas, scikit-learn), popular | Slow for CPU-intensive work, GIL limits multithreading |
| SQL | Data queries, exploration | Standardized, intuitive, fast on DB | Limited for complex ML |
| Scala | Spark, big data | Performant, functional programming, Spark native | Learning curve, less popular than Python |
| R | Statistics, visualization | Strong for statistics, ggplot2, tidyverse | Slow, variable package quality, niche |
| Java | Production, big data | Performant, ecosystem, Hadoop/Spark | Verbose, learning curve |