Message Queues and Event-Driven Architecture
Core Question
When systems are tightly coupled and traffic spikes, how do you ensure the critical path remains stable? Message queues are the "buffer" and "decoupler" of modern distributed systems. This article uses real-world cases (restaurant queuing, package sorting, flash sale systems) to deeply understand the design philosophy and engineering practices of message queues.
1. Why "Message Queues"?
1.1 A Real-World Case: The Evolution of Taobao's Order System
In 2012, Taobao's order system suffered a severe outage. At midnight on Double 11, traffic flooded in instantly. The order service directly called the inventory service, payment service, logistics service... the entire chain collapsed like dominoes.
The architecture at the time (tight coupling):
User places order → Order service → Sync call inventory service → Sync call payment service → Sync call logistics service
↓ ↓ ↓
Response 200ms Response 500ms Response 300msFatal Problems with Tight Coupling
- Total response time = 200 + 500 + 300 = 1000ms (user waits 1 second)
- Inventory service down → Order service also goes down (thread pool exhausted)
- Payment service slows down → Entire chain is dragged down
- Cannot scale horizontally → Can only scale vertically (expensive and limited)
Improved architecture (introducing message queues):
User places order → Order service → Send "order created" message → Immediate return (50ms)
↓
Message queue (Kafka)
↓
┌─────────────┬─────────────┬─────────────┐
▼ ▼ ▼ ▼
Inventory Payment Logistics Notification
service service service service
(async deduct) (async process) (async create) (async send)Improvements After Changes
- User response time = 50ms (20x experience improvement)
- Inventory service down → Messages stay in queue, continue processing after recovery
- Payment service slows down → Doesn't affect order creation
- Can scale horizontally → Just add more consumer instances
1.2 Everyday Analogies for Message Queues
Restaurant Queuing System
Imagine going to a popular restaurant:
- No queuing system: Customers must stand at the window waiting; limited window space, long lines behind, restaurant under pressure
- With queuing system: After ordering, you get a number; you can sit down first, pick up food when your number is called
A message queue is the software system's "queuing system":
- Producer (the person ordering) → Puts messages (orders) into the queue
- Queue (the number dispenser) → Temporarily stores messages
- Consumer (the chef) → Processes messages at their own pace
After the traffic peak passes, the system keeps processing the backlog at full speed until the queue is empty. This is peak shaving.
2. What Is a Message Queue? (Definition + Core Three Elements)
2.1 What Is a "Message Queue"?
Terminology
Message Queue (MQ) is a container for storing messages. Producers put messages in, consumers take messages out for processing. It enables "asynchronous communication" — the sender doesn't need to wait for the receiver to finish processing.
Synchronous vs Asynchronous:
- Synchronous: Like a phone call — the other party must answer to communicate
- Asynchronous: Like sending a text — you send it, they read it when available
It's like calling a friend (synchronous) vs sending them a message (asynchronous).
2.2 The Three Core Elements of a Message Queue
Element 1: Producer
Responsibility: Create and send messages to the queue.
Analogy: The producer is like a "sender," delivering letters (messages) to the post office (queue).
Key Design Points
- Send method: Synchronous send (reliable but blocking) vs asynchronous send (high performance but needs callback handling)
- Message confirmation: Wait for Broker confirmation (At Least Once) vs fire-and-forget (At Most Once)
- Failure handling: Retry strategy, local log backup, dead letter queue
Element 2: Consumer
Responsibility: Get messages from the queue and process them.
Analogy: The consumer is like a "recipient," taking letters (messages) from the mailbox (queue) and processing them.
Key Design Points
- Consumption mode: Push mode (Broker actively pushes) vs Pull mode (consumer actively pulls)
- Consumption confirmation: Auto ACK (efficient but may lose messages) vs manual ACK (reliable but needs timeout handling)
- Concurrency control: Single-threaded sequential consumption vs multi-threaded parallel consumption
- Failure handling: Retry strategy, dead letter queue, compensation mechanism
Element 3: Broker (Message Broker)
Responsibility: Receive, store, and forward messages.
Analogy: The Broker is like a "post office" or "package sorting station," responsible for receiving, sorting, and delivering letters.
Key Design Points
- Storage model: In-memory storage (low latency) vs disk storage (high reliability)
- Replication strategy: Primary-secondary replication, multi-replica synchronization
- High availability: Cluster deployment, automatic failover
- Scalability: Partitions, Sharding
3. Core Problem 1: How to Decouple Systems and Avoid "Pulling One Thread and Moving the Whole System"?
3.1 The Tragedy of Tight Coupling: One Service Goes Down, Everything Falls
Scenario recreation: An e-commerce platform's early architecture
Order service directly calls downstream services:
┌─────────────┐
│ Order │
│ Service │
└──────┬──────┘
│
├───────────┬───────────┬───────────┐
▼ ▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│Inventory │ │Payment │ │Logistics │ │SMS │
│Service │ │Service │ │Service │ │Service │
│ 200ms │ │ 500ms │ │ 300ms │ │ 100ms │
└──────────┘ └──────────┘ └──────────┘ └──────────┘Pain Point Analysis
| Pain Point | Specific Manifestation | Consequence |
|---|---|---|
| Cascading failure | Inventory service goes down, order service sync call times out | Order service thread pool exhausted, cannot process new requests |
| Response latency | Must wait for all downstream service responses | User waits over 1 second, terrible experience |
| Difficult to extend | Adding a points service requires modifying order service code | Longer release cycles, increased risk |
| Resource waste | Order service must wait for SMS service | Database connections occupied for long periods |
3.2 Decoupling Solution: Introduce Message Queue as "Middle Layer"
Architecture after decoupling:
Order service only sends messages, doesn't care who consumes:
┌─────────────┐
│ Order │ ──Send "order created" message──┐
│ Service │ │
└─────────────┘ ▼
┌───────────────────┐
│ Message Queue │
│ (Kafka/RabbitMQ) │
│ - Reliable store │
│ - Multi-replica │
│ - Order guarantee│
└─────────┬─────────┘
│
┌───────────────────────┼───────────────────────┐
│ │ │
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Inventory │ │ Payment │ │ Logistics │
│ Service │ │ Service │ │ Service │
│ Subscribe │ │ Subscribe │ │ Subscribe │
│ order events│ │ order events│ │ order events│
└──────────────┘ └──────────────┘ └──────────────┘Benefits of Decoupling
| Dimension | Before Decoupling | After Decoupling |
|---|---|---|
| Fault isolation | Inventory down = Order down | Inventory down, messages stay in queue, consumed after recovery |
| Response time | 1000ms (synchronous wait) | 50ms (return after sending message) |
| Extensibility | New service requires changing order code | New service just subscribes to topic |
| System complexity | Order service tightly depends on downstream | Order service only depends on message queue |
3.3 The Essence of Decoupling: From "Direct Calls" to "Event-Driven"
Paradigm shift:
Traditional thinking (imperative):
"Order service commands inventory service: Deduct inventory for me!"
↓ Direct call
↓ High coupling, callee must be online
↓ Caller needs to know callee's interface
Event-driven thinking (declarative):
"Order service declares: Order has been created. Whoever cares, handle it."
↓ Send event to message queue
↓ Decoupled, consumers can be offline
↓ Producer doesn't need to know consumers exist4. Core Problem 2: How to Handle Traffic Spikes with Peak Shaving?
4.1 Flash Sale Scenario: How to Handle 100K QPS Smoothly?
Scenario recreation: An e-commerce platform's Double 11 flash sale, expected peak 100K QPS, but the database can only handle 1,000 QPS.
Consequences of direct impact:
User requests ──→ App server ──→ Database
100K/s 100K/s 1K/s (limit)
↓
Connection pool exhausted
Response timeout
Database crash
↓
Cascading failure (all services depending on DB go down)Terminology
QPS (Queries Per Second): Queries per second, a metric for measuring system throughput.
100K QPS means 100,000 requests per second, like 100,000 people rushing into a store simultaneously.
4.2 Peak Shaving Solution: Message Queue as "Reservoir"
Architecture design:
┌───────────────────────────────────────────────────────────────────────┐
│ Flash Sale System Architecture │
├───────────────────────────────────────────────────────────────────────┤
│ │
│ Layer 1: Gateway (hard rate limiting) │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ - Token bucket: 100K/s → 10K/s (drop 90% of requests) │ │
│ │ - CDN caches static resources (product detail pages) │ │
│ │ - CAPTCHA / queue page (first layer of peak shaving) │ │
│ └───────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Layer 2: Service (soft rate limiting) │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ - Nginx rate limiting: 10K/s → 5K/s │ │
│ │ - Redis pre-deduct inventory (atomic operation): │ │
│ │ * Use Lua script for atomicity │ │
│ │ * Insufficient stock → return "Sold out" directly │ │
│ │ - Generate order token (queue voucher) │ │
│ └───────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Layer 3: Message Queue (core peak shaving) │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ Kafka/RocketMQ: │ │
│ │ - Batch write: 5K/s → 1K/s (matching DB capacity) │ │
│ │ - Message persistence: disk write guarantees no message loss │ │
│ │ - Multi-partition parallel consumption: boost throughput │ │
│ │ - Consumer offset management: support failure recovery │ │
│ │ │ │
│ │ Key metrics monitoring: │ │
│ │ - Produce Rate │ │
│ │ - Consume Rate │ │
│ │ - Lag (message backlog) │ │
│ └───────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Layer 4: Consumer (async processing) │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ Order processing consumers (multiple instances): │ │
│ │ - Pull messages from Kafka (1K/s, matching DB capacity) │ │
│ │ - DB transaction: create order + deduct inventory │ │
│ │ - Update order status to "Created" │ │
│ │ - Send order creation success notification (email/SMS/push) │ │
│ │ - Confirm message consumption (ACK) │ │
│ │ │ │
│ │ Consumer scaling strategy: │ │
│ │ - When Lag > 10,000, auto-scale up consumer instances │ │
│ │ - When Lag < 1,000, scale down consumer instances (save cost) │ │
│ └───────────────────────────────────────────────────────────────┘ │
│ │
└───────────────────────────────────────────────────────────────────────┘After the traffic peak passes, the system keeps processing the backlog at full speed until the queue is empty. This is peak shaving.
4.3 Mathematical Principles of Peak Shaving
Traffic smoothing effect:
Original traffic (spike): Smoothed traffic:
100K/s │ ╱╲ 1K/s │████████████████
│ ╱ ╲ │
│ ╱ ╲ │
1K/s│╱ ╲ 0/s │
└─────────────── └────────────────
0s 1s 2s 0s 20s
Original: 100K/s peak, lasting 1 second
Smoothed: 1K/s constant rate, lasting 100 secondsKey formulas:
Queue length = Producer rate × Duration - Consumer rate × Duration
= 100,000 × 1 - 1,000 × 1
= 99,000 messages (peak queue backlog)
Time to consume all messages = Queue length / Consumer rate
= 99,000 / 1,000
= 99 seconds5. Core Problem 3: How to Ensure Messages Are Not Lost, Not Duplicated, and In Order?
5.1 Message Reliability: Three Lines of Defense
Messages can be lost at three stages: during producer sending, during Broker storage, and during consumer processing.
Three Lines of Defense
Defense 1: Producer ACK
- When sending a message, wait for the Broker to confirm receipt
- If no confirmation received, retry or log locally
Defense 2: Broker Persistence
- Write messages to disk, not just in memory
- Multi-replica synchronization to ensure no data loss
Defense 3: Consumer ACK
- After processing a message, manually confirm (ACK)
- If processing fails, don't confirm; Broker will redeliver
5.2 How to Handle Duplicate Message Consumption?
Message duplication can occur in the following scenarios:
- Producer retry: Producer sends message but doesn't receive ACK, retries sending the same message
- Consumer ACK timeout: Consumer finishes processing but ACK times out, Broker redelivers
- Network jitter: Consumer ACK doesn't reach Broker, Broker considers message unconsumed
- Consumer restart: After consumer restarts, re-consumes the same batch of messages
Idempotency
Idempotency: Executing the same operation multiple times produces the same result as executing it once.
Everyday idempotency examples:
- Idempotent: Pressing an elevator button (pressing 10 times or once, the elevator still comes)
- Non-idempotent: Bank transfer (transferring $10, executing twice transfers $20)
Technical solution: Generate a unique ID for each message; check if already processed before handling.
6. Practice: How to Choose a Message Queue?
6.1 Comparison of Four Mainstream Message Queues
| Feature | RabbitMQ | Kafka | RocketMQ | Redis Stream |
|---|---|---|---|---|
| Positioning | Traditional MQ | Distributed log stream | E-commerce-grade MQ | Lightweight queue |
| Throughput | ~10K/s | ~1M/s | ~100K/s | ~50K/s |
| Latency | Microseconds | Milliseconds | Milliseconds | Milliseconds |
| Reliability | High (persistence) | High (multi-replica) | High (sync flush) | Medium (AOF) |
| Message replay | Not supported | Supported | Supported | Supported |
| Transactional messages | Supported (weak) | Not supported | Supported (strong) | Not supported |
| Delayed messages | Supported | Not supported | Supported | Not supported |
| Use cases | Traditional enterprise apps | Logs, big data | E-commerce, finance | Small-scale apps |
Selection Recommendations
Decision tree:
Choosing a message queue:
│
├─ Need transactional messages (distributed transactions)?
│ ├─ Yes → RocketMQ (first choice) or RabbitMQ
│ └─ No → continue
│
├─ Need to process massive logs/real-time streams?
│ ├─ Yes → Kafka (first choice)
│ └─ No → continue
│
├─ QPS > 10K/s?
│ ├─ Yes → RocketMQ or Kafka
│ └─ No → continue
│
├─ Need complex routing (e.g., header matching)?
│ ├─ Yes → RabbitMQ
│ └─ No → continue
│
├─ Already have Redis infrastructure?
│ ├─ Yes → Redis Stream (quick start)
│ └─ No → RabbitMQ (full-featured, moderate learning curve)7. Summary: Message Queue Design Principles
7.1 Core Principles Review
| Principle | Meaning | Practice Points |
|---|---|---|
| Decoupling | Services don't directly depend on each other | Communicate via message queue; consumer failure doesn't affect producer |
| Peak shaving | Smooth traffic fluctuations | Message queue as reservoir; consumers process at constant rate |
| Reliability | Messages not lost | Producer ACK + Broker persistence + Consumer ACK |
| Idempotency | Duplicate consumption has no effect | Business-level idempotency guarantees (unique keys, state machines) |
| Ordering | Message order guarantee | Single-partition ordering or consumer-side sorting |
7.2 Design Checklist
Before introducing a message queue, ask yourself:
- [ ] Do you really need a message queue? (Simple async can use thread pools)
- [ ] Is message loss acceptable? (Determines reliability level)
- [ ] Will message duplication affect the business? (Determines idempotency investment)
- [ ] Is message order important? (Determines partition strategy)
- [ ] What's the consumer processing capacity? (Determines queue size and alert thresholds)
- [ ] How to handle consumption failures? (Determines retry and dead letter strategies)
8. Glossary
| Term | Full Name | Description |
|---|---|---|
| MQ | Message Queue | Middleware for asynchronous communication, decoupling producers and consumers. |
| Producer | - | The party that sends messages. |
| Consumer | - | The party that receives and processes messages. |
| Broker | - | The server program that stores and forwards messages. |
| Topic | - | Logical categorization of messages (e.g., "orders"). |
| Queue | - | Physical container storing messages. |
| Partition | - | A Kafka concept; one Topic can be split into multiple Partitions for higher concurrency. |
| ACK | Acknowledgment | Consumer confirms to Broker after processing a message. |
| Pub/Sub | Publish/Subscribe | A messaging pattern where one message can be received by multiple consumers. |
| P2P | Point-to-Point | A messaging pattern where one message can only be received by one consumer. |
| DLQ | Dead Letter Queue | Stores messages that cannot be consumed. |
| Idempotence | - | Multiple executions produce the same result. |
| Throughput | - | Number of messages processed per unit time. |
| Latency | - | Time difference from message send to receipt. |
| Persistence | - | Messages written to disk, not just stored in memory. |
| Replication | - | Messages copied to multiple nodes for high availability. |
| Transaction Message | - | Guarantees consistency between local transaction and message sending. |
| Backpressure | - | When consumers can't keep up, they notify producers to slow down. |
| Offset | - | The consumer's consumption position within a partition. |
| Rebalance | - | Reassigning partitions when consumer group members change. |