Message Queues and Event-Driven Architecture

Core Question

When systems are tightly coupled and traffic spikes, how do you ensure the critical path remains stable? Message queues are the "buffer" and "decoupler" of modern distributed systems. This article uses real-world cases (restaurant queuing, package sorting, flash sale systems) to deeply understand the design philosophy and engineering practices of message queues.

1. Why "Message Queues"?

1.1 A Real-World Case: The Evolution of Taobao's Order System

In 2012, Taobao's order system suffered a severe outage. At midnight on Double 11, traffic flooded in instantly. The order service directly called the inventory service, payment service, logistics service... the entire chain collapsed like dominoes.

The architecture at the time (tight coupling):

User places order → Order service → Sync call inventory service → Sync call payment service → Sync call logistics service
                    ↓                    ↓                    ↓
                 Response 200ms       Response 500ms       Response 300ms

Fatal Problems with Tight Coupling

Total response time = 200 + 500 + 300 = 1000ms (user waits 1 second)
Inventory service down → Order service also goes down (thread pool exhausted)
Payment service slows down → Entire chain is dragged down
Cannot scale horizontally → Can only scale vertically (expensive and limited)

Improved architecture (introducing message queues):

User places order → Order service → Send "order created" message → Immediate return (50ms)
                              ↓
                        Message queue (Kafka)
                              ↓
        ┌─────────────┬─────────────┬─────────────┐
        ▼             ▼             ▼             ▼
   Inventory      Payment       Logistics     Notification
   service        service       service       service
   (async deduct) (async process) (async create) (async send)

Improvements After Changes

User response time = 50ms (20x experience improvement)
Inventory service down → Messages stay in queue, continue processing after recovery
Payment service slows down → Doesn't affect order creation
Can scale horizontally → Just add more consumer instances

1.2 Everyday Analogies for Message Queues

Restaurant Queuing System

Imagine going to a popular restaurant:

No queuing system: Customers must stand at the window waiting; limited window space, long lines behind, restaurant under pressure
With queuing system: After ordering, you get a number; you can sit down first, pick up food when your number is called

A message queue is the software system's "queuing system":

Producer (the person ordering) → Puts messages (orders) into the queue
Queue (the number dispenser) → Temporarily stores messages
Consumer (the chef) → Processes messages at their own pace

Processing capacity (Consumer)200 req/s

Maximum backend processing speed

Queue capacity (Queue Size)2000

Maximum requests the message queue can buffer

Current inbound traffic

100 req/s

Queue backlog

0 msgs

Actual processing rate

0 req/s

Rejected requests (rate limited)

0 req

Inbound traffic (user requests)Processed traffic (system load)Queue backlog

💡

Core principle: When inbound traffic (blue) exceeds processing capacity (green line), extra requests are stored in the message queue (orange area).
After the traffic peak passes, the system keeps processing the backlog at full speed until the queue is empty. This is peak shaving.

2. What Is a Message Queue? (Definition + Core Three Elements)

2.1 What Is a "Message Queue"?

Terminology

Message Queue (MQ) is a container for storing messages. Producers put messages in, consumers take messages out for processing. It enables "asynchronous communication" — the sender doesn't need to wait for the receiver to finish processing.

Synchronous vs Asynchronous:

Synchronous: Like a phone call — the other party must answer to communicate
Asynchronous: Like sending a text — you send it, they read it when available

It's like calling a friend (synchronous) vs sending them a message (asynchronous).

2.2 The Three Core Elements of a Message Queue

Element 1: Producer

Responsibility: Create and send messages to the queue.

Analogy: The producer is like a "sender," delivering letters (messages) to the post office (queue).

Key Design Points

Send method: Synchronous send (reliable but blocking) vs asynchronous send (high performance but needs callback handling)
Message confirmation: Wait for Broker confirmation (At Least Once) vs fire-and-forget (At Most Once)
Failure handling: Retry strategy, local log backup, dead letter queue

Element 2: Consumer

Responsibility: Get messages from the queue and process them.

Analogy: The consumer is like a "recipient," taking letters (messages) from the mailbox (queue) and processing them.

Key Design Points

Consumption mode: Push mode (Broker actively pushes) vs Pull mode (consumer actively pulls)
Consumption confirmation: Auto ACK (efficient but may lose messages) vs manual ACK (reliable but needs timeout handling)
Concurrency control: Single-threaded sequential consumption vs multi-threaded parallel consumption
Failure handling: Retry strategy, dead letter queue, compensation mechanism

Element 3: Broker (Message Broker)

Responsibility: Receive, store, and forward messages.

Analogy: The Broker is like a "post office" or "package sorting station," responsible for receiving, sorting, and delivering letters.

Key Design Points

Storage model: In-memory storage (low latency) vs disk storage (high reliability)
Replication strategy: Primary-secondary replication, multi-replica synchronization
High availability: Cluster deployment, automatic failover
Scalability: Partitions, Sharding

3. Core Problem 1: How to Decouple Systems and Avoid "Pulling One Thread and Moving the Whole System"?

3.1 The Tragedy of Tight Coupling: One Service Goes Down, Everything Falls

Scenario recreation: An e-commerce platform's early architecture

Order service directly calls downstream services:
┌─────────────┐
│  Order       │
│  Service     │
└──────┬──────┘
       │
       ├───────────┬───────────┬───────────┐
       ▼           ▼           ▼           ▼
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│Inventory │ │Payment   │ │Logistics │ │SMS       │
│Service   │ │Service   │ │Service   │ │Service   │
│  200ms   │ │  500ms   │ │  300ms   │ │  100ms   │
└──────────┘ └──────────┘ └──────────┘ └──────────┘

Pain Point Analysis

Pain Point	Specific Manifestation	Consequence
Cascading failure	Inventory service goes down, order service sync call times out	Order service thread pool exhausted, cannot process new requests
Response latency	Must wait for all downstream service responses	User waits over 1 second, terrible experience
Difficult to extend	Adding a points service requires modifying order service code	Longer release cycles, increased risk
Resource waste	Order service must wait for SMS service	Database connections occupied for long periods

3.2 Decoupling Solution: Introduce Message Queue as "Middle Layer"

Architecture after decoupling:

Order service only sends messages, doesn't care who consumes:

┌─────────────┐
│  Order       │ ──Send "order created" message──┐
│  Service     │                                │
└─────────────┘                                ▼
                                    ┌───────────────────┐
                                    │   Message Queue    │
                                    │  (Kafka/RabbitMQ)  │
                                    │   - Reliable store │
                                    │   - Multi-replica  │
                                    │   - Order guarantee│
                                    └─────────┬─────────┘
                                              │
              ┌───────────────────────┼───────────────────────┐
              │                       │                       │
              ▼                       ▼                       ▼
       ┌──────────────┐      ┌──────────────┐      ┌──────────────┐
       │  Inventory   │      │  Payment     │      │  Logistics   │
       │  Service     │      │  Service     │      │  Service     │
       │  Subscribe   │      │  Subscribe   │      │  Subscribe   │
       │  order events│      │  order events│      │  order events│
       └──────────────┘      └──────────────┘      └──────────────┘

🔗System Decoupling DemoFrom tight coupling to loose coupling

❌ The fatal problem of tight coupling

Order service

Create order

Call inventory service

Call points service

Call notification service

Notification service

Send SMS/email

⚠️Strong dependency: notification failure blocks order creation

⚠️Slow response: total time = 300ms + 500ms + 400ms = 1200ms

⚠️Hard to extend: adding a service requires changing order code

💡Core idea:Synchronous calls create strong dependencies and slow responses; asynchronous messaging decouples systems, improves response speed, and scales more easily.

Benefits of Decoupling

Dimension	Before Decoupling	After Decoupling
Fault isolation	Inventory down = Order down	Inventory down, messages stay in queue, consumed after recovery
Response time	1000ms (synchronous wait)	50ms (return after sending message)
Extensibility	New service requires changing order code	New service just subscribes to topic
System complexity	Order service tightly depends on downstream	Order service only depends on message queue

3.3 The Essence of Decoupling: From "Direct Calls" to "Event-Driven"

Paradigm shift:

Traditional thinking (imperative):
"Order service commands inventory service: Deduct inventory for me!"
  ↓ Direct call
  ↓ High coupling, callee must be online
  ↓ Caller needs to know callee's interface

Event-driven thinking (declarative):
"Order service declares: Order has been created. Whoever cares, handle it."
  ↓ Send event to message queue
  ↓ Decoupled, consumers can be offline
  ↓ Producer doesn't need to know consumers exist

4. Core Problem 2: How to Handle Traffic Spikes with Peak Shaving?

4.1 Flash Sale Scenario: How to Handle 100K QPS Smoothly?

Scenario recreation: An e-commerce platform's Double 11 flash sale, expected peak 100K QPS, but the database can only handle 1,000 QPS.

Consequences of direct impact:

User requests ──→ App server ──→ Database
  100K/s         100K/s          1K/s (limit)
                              ↓
                         Connection pool exhausted
                         Response timeout
                         Database crash
                              ↓
                         Cascading failure (all services depending on DB go down)

Terminology

QPS (Queries Per Second): Queries per second, a metric for measuring system throughput.

100K QPS means 100,000 requests per second, like 100,000 people rushing into a store simultaneously.

4.2 Peak Shaving Solution: Message Queue as "Reservoir"

Architecture design:

┌───────────────────────────────────────────────────────────────────────┐
│                     Flash Sale System Architecture                    │
├───────────────────────────────────────────────────────────────────────┤
│                                                                       │
│  Layer 1: Gateway (hard rate limiting)                               │
│  ┌───────────────────────────────────────────────────────────────┐   │
│  │  - Token bucket: 100K/s → 10K/s (drop 90% of requests)       │   │
│  │  - CDN caches static resources (product detail pages)         │   │
│  │  - CAPTCHA / queue page (first layer of peak shaving)         │   │
│  └───────────────────────────────────────────────────────────────┘   │
│                            │                                          │
│                            ▼                                          │
│  Layer 2: Service (soft rate limiting)                               │
│  ┌───────────────────────────────────────────────────────────────┐   │
│  │  - Nginx rate limiting: 10K/s → 5K/s                         │   │
│  │  - Redis pre-deduct inventory (atomic operation):             │   │
│  │    * Use Lua script for atomicity                              │   │
│  │    * Insufficient stock → return "Sold out" directly           │   │
│  │  - Generate order token (queue voucher)                        │   │
│  └───────────────────────────────────────────────────────────────┘   │
│                            │                                          │
│                            ▼                                          │
│  Layer 3: Message Queue (core peak shaving)                          │
│  ┌───────────────────────────────────────────────────────────────┐   │
│  │  Kafka/RocketMQ:                                               │   │
│  │  - Batch write: 5K/s → 1K/s (matching DB capacity)           │   │
│  │  - Message persistence: disk write guarantees no message loss  │   │
│  │  - Multi-partition parallel consumption: boost throughput      │   │
│  │  - Consumer offset management: support failure recovery        │   │
│  │                                                                 │   │
│  │  Key metrics monitoring:                                        │   │
│  │  - Produce Rate                                                 │   │
│  │  - Consume Rate                                                 │   │
│  │  - Lag (message backlog)                                        │   │
│  └───────────────────────────────────────────────────────────────┘   │
│                            │                                          │
│                            ▼                                          │
│  Layer 4: Consumer (async processing)                                │
│  ┌───────────────────────────────────────────────────────────────┐   │
│  │  Order processing consumers (multiple instances):              │   │
│  │  - Pull messages from Kafka (1K/s, matching DB capacity)       │   │
│  │  - DB transaction: create order + deduct inventory              │   │
│  │  - Update order status to "Created"                             │   │
│  │  - Send order creation success notification (email/SMS/push)    │   │
│  │  - Confirm message consumption (ACK)                            │   │
│  │                                                                 │   │
│  │  Consumer scaling strategy:                                     │   │
│  │  - When Lag > 10,000, auto-scale up consumer instances          │   │
│  │  - When Lag < 1,000, scale down consumer instances (save cost) │   │
│  └───────────────────────────────────────────────────────────────┘   │
│                                                                       │
└───────────────────────────────────────────────────────────────────────┘

Processing capacity (Consumer)200 req/s

Maximum backend processing speed

Queue capacity (Queue Size)2000

Maximum requests the message queue can buffer

Current inbound traffic

100 req/s

Queue backlog

0 msgs

Actual processing rate

0 req/s

Rejected requests (rate limited)

0 req

Inbound traffic (user requests)Processed traffic (system load)Queue backlog

💡

4.3 Mathematical Principles of Peak Shaving

Traffic smoothing effect:

Original traffic (spike):              Smoothed traffic:

100K/s │    ╱╲                  1K/s │████████████████
       │   ╱  ╲                      │
       │  ╱    ╲                     │
   1K/s│╱        ╲               0/s │
       └───────────────               └────────────────
       0s   1s   2s                   0s              20s

Original: 100K/s peak, lasting 1 second
Smoothed: 1K/s constant rate, lasting 100 seconds

Key formulas:

Queue length = Producer rate × Duration - Consumer rate × Duration
            = 100,000 × 1 - 1,000 × 1
            = 99,000 messages (peak queue backlog)

Time to consume all messages = Queue length / Consumer rate
                             = 99,000 / 1,000
                             = 99 seconds

5. Core Problem 3: How to Ensure Messages Are Not Lost, Not Duplicated, and In Order?

5.1 Message Reliability: Three Lines of Defense

Messages can be lost at three stages: during producer sending, during Broker storage, and during consumer processing.

Three Lines of Defense

Defense 1: Producer ACK

When sending a message, wait for the Broker to confirm receipt
If no confirmation received, retry or log locally

Defense 2: Broker Persistence

Write messages to disk, not just in memory
Multi-replica synchronization to ensure no data loss

Defense 3: Consumer ACK

After processing a message, manually confirm (ACK)
If processing fails, don't confirm; Broker will redeliver

🛡️Message Reliability DemoThree defense lines keep messages from being lost

Line 1

Producer ACK

📤

Producer

Send message

📨

Message

✓

ACK confirmation

📦

Broker

Receive and store

Send message

💡 If no ACK is received, the producer retries or records a local log.

Line 2

Broker persistence

⚡

Memory storage

Fast, but lost on restart

❌ High risk

💾

Disk storage

Persisted to avoid loss

✅ Recommended

🔄 Multi-replica sync

Messages are synced to 3 nodes, so data is not lost even if 1 node fails.

Storage mode

✅ Message persisted and safe

Line 3

Consumer ACK

Pull message

Fetch message from Broker

→

Process message

Run business logic

→

Manual ACK

Confirm after processing

Auto ACK

Efficient but can lose messages

⚠️ Not recommended

Manual ACK

Reliable, ACK after processing

✅ Recommended

Simulate consumption

💡 If processing fails, no ACK is sent and the Broker redelivers.

🎯

All three defense lines are required:Producer ACK → Broker persistence → Consumer ACK

5.2 How to Handle Duplicate Message Consumption?

Message duplication can occur in the following scenarios:

Producer retry: Producer sends message but doesn't receive ACK, retries sending the same message
Consumer ACK timeout: Consumer finishes processing but ACK times out, Broker redelivers
Network jitter: Consumer ACK doesn't reach Broker, Broker considers message unconsumed
Consumer restart: After consumer restarts, re-consumes the same batch of messages

Idempotency

Idempotency: Executing the same operation multiple times produces the same result as executing it once.

Everyday idempotency examples:

Idempotent: Pressing an elevator button (pressing 10 times or once, the elevator still comes)
Non-idempotent: Bank transfer (transferring $10, executing twice transfers $20)

Technical solution: Generate a unique ID for each message; check if already processed before handling.

🔄Idempotence DemoRepeated consumption does not create side effects

❌ Non-idempotent operation: bank transfer

Repeated consumption can debit multiple times

Sender

Balance: ¥1000

💰

Transfer ¥100

Receiver

Balance: ¥500

Idempotence protection

Disabled

Processing log

No logs yet. Click the button to start.

❌ No idempotence protection

Debit ¥100

Duplicate consumption causes multiple debits

✅ With idempotence protection

Debit ¥100

Duplicate requests are filtered, so debit happens once

🎯

Core idempotence principle: Generate a unique ID for each message and check whether it was processed before doing the operation.

6. Practice: How to Choose a Message Queue?

6.1 Comparison of Four Mainstream Message Queues

Feature	RabbitMQ	Kafka	RocketMQ	Redis Stream
Positioning	Traditional MQ	Distributed log stream	E-commerce-grade MQ	Lightweight queue
Throughput	~10K/s	~1M/s	~100K/s	~50K/s
Latency	Microseconds	Milliseconds	Milliseconds	Milliseconds
Reliability	High (persistence)	High (multi-replica)	High (sync flush)	Medium (AOF)
Message replay	Not supported	Supported	Supported	Supported
Transactional messages	Supported (weak)	Not supported	Supported (strong)	Not supported
Delayed messages	Supported	Not supported	Supported	Not supported
Use cases	Traditional enterprise apps	Logs, big data	E-commerce, finance	Small-scale apps

Selection Recommendations

Decision tree:

Choosing a message queue:
│
├─ Need transactional messages (distributed transactions)?
│  ├─ Yes → RocketMQ (first choice) or RabbitMQ
│  └─ No → continue
│
├─ Need to process massive logs/real-time streams?
│  ├─ Yes → Kafka (first choice)
│  └─ No → continue
│
├─ QPS > 10K/s?
│  ├─ Yes → RocketMQ or Kafka
│  └─ No → continue
│
├─ Need complex routing (e.g., header matching)?
│  ├─ Yes → RabbitMQ
│  └─ No → continue
│
├─ Already have Redis infrastructure?
│  ├─ Yes → Redis Stream (quick start)
│  └─ No → RabbitMQ (full-featured, moderate learning curve)

7. Summary: Message Queue Design Principles

7.1 Core Principles Review

Principle	Meaning	Practice Points
Decoupling	Services don't directly depend on each other	Communicate via message queue; consumer failure doesn't affect producer
Peak shaving	Smooth traffic fluctuations	Message queue as reservoir; consumers process at constant rate
Reliability	Messages not lost	Producer ACK + Broker persistence + Consumer ACK
Idempotency	Duplicate consumption has no effect	Business-level idempotency guarantees (unique keys, state machines)
Ordering	Message order guarantee	Single-partition ordering or consumer-side sorting

7.2 Design Checklist

Before introducing a message queue, ask yourself:

[ ] Do you really need a message queue? (Simple async can use thread pools)
[ ] Is message loss acceptable? (Determines reliability level)
[ ] Will message duplication affect the business? (Determines idempotency investment)
[ ] Is message order important? (Determines partition strategy)
[ ] What's the consumer processing capacity? (Determines queue size and alert thresholds)
[ ] How to handle consumption failures? (Determines retry and dead letter strategies)

8. Glossary

Term	Full Name	Description
MQ	Message Queue	Middleware for asynchronous communication, decoupling producers and consumers.
Producer	-	The party that sends messages.
Consumer	-	The party that receives and processes messages.
Broker	-	The server program that stores and forwards messages.
Topic	-	Logical categorization of messages (e.g., "orders").
Queue	-	Physical container storing messages.
Partition	-	A Kafka concept; one Topic can be split into multiple Partitions for higher concurrency.
ACK	Acknowledgment	Consumer confirms to Broker after processing a message.
Pub/Sub	Publish/Subscribe	A messaging pattern where one message can be received by multiple consumers.
P2P	Point-to-Point	A messaging pattern where one message can only be received by one consumer.
DLQ	Dead Letter Queue	Stores messages that cannot be consumed.
Idempotence	-	Multiple executions produce the same result.
Throughput	-	Number of messages processed per unit time.
Latency	-	Time difference from message send to receipt.
Persistence	-	Messages written to disk, not just stored in memory.
Replication	-	Messages copied to multiple nodes for high availability.
Transaction Message	-	Guarantees consistency between local transaction and message sending.
Backpressure	-	When consumers can't keep up, they notify producers to slow down.
Offset	-	The consumer's consumption position within a partition.
Rebalance	-	Reassigning partitions when consumer group members change.

Message Queues and Event-Driven Architecture ​

1. Why "Message Queues"? ​

1.1 A Real-World Case: The Evolution of Taobao's Order System ​

1.2 Everyday Analogies for Message Queues ​

2. What Is a Message Queue? (Definition + Core Three Elements) ​

2.1 What Is a "Message Queue"? ​

2.2 The Three Core Elements of a Message Queue ​

Element 1: Producer ​

Element 2: Consumer ​

Element 3: Broker (Message Broker) ​

3. Core Problem 1: How to Decouple Systems and Avoid "Pulling One Thread and Moving the Whole System"? ​

3.1 The Tragedy of Tight Coupling: One Service Goes Down, Everything Falls ​

3.2 Decoupling Solution: Introduce Message Queue as "Middle Layer" ​

3.3 The Essence of Decoupling: From "Direct Calls" to "Event-Driven" ​

4. Core Problem 2: How to Handle Traffic Spikes with Peak Shaving? ​

4.1 Flash Sale Scenario: How to Handle 100K QPS Smoothly? ​

4.2 Peak Shaving Solution: Message Queue as "Reservoir" ​

4.3 Mathematical Principles of Peak Shaving ​

5. Core Problem 3: How to Ensure Messages Are Not Lost, Not Duplicated, and In Order? ​

5.1 Message Reliability: Three Lines of Defense ​

5.2 How to Handle Duplicate Message Consumption? ​

6. Practice: How to Choose a Message Queue? ​

6.1 Comparison of Four Mainstream Message Queues ​

7. Summary: Message Queue Design Principles ​

7.1 Core Principles Review ​

7.2 Design Checklist ​

8. Glossary ​

Message Queues and Event-Driven Architecture

1. Why "Message Queues"?

1.1 A Real-World Case: The Evolution of Taobao's Order System

1.2 Everyday Analogies for Message Queues

2. What Is a Message Queue? (Definition + Core Three Elements)

2.1 What Is a "Message Queue"?

2.2 The Three Core Elements of a Message Queue

Element 1: Producer

Element 2: Consumer

Element 3: Broker (Message Broker)

3. Core Problem 1: How to Decouple Systems and Avoid "Pulling One Thread and Moving the Whole System"?

3.1 The Tragedy of Tight Coupling: One Service Goes Down, Everything Falls

3.2 Decoupling Solution: Introduce Message Queue as "Middle Layer"

3.3 The Essence of Decoupling: From "Direct Calls" to "Event-Driven"

4. Core Problem 2: How to Handle Traffic Spikes with Peak Shaving?

4.1 Flash Sale Scenario: How to Handle 100K QPS Smoothly?

4.2 Peak Shaving Solution: Message Queue as "Reservoir"

4.3 Mathematical Principles of Peak Shaving

5. Core Problem 3: How to Ensure Messages Are Not Lost, Not Duplicated, and In Order?

5.1 Message Reliability: Three Lines of Defense

5.2 How to Handle Duplicate Message Consumption?

6. Practice: How to Choose a Message Queue?

6.1 Comparison of Four Mainstream Message Queues

7. Summary: Message Queue Design Principles

7.1 Core Principles Review

7.2 Design Checklist

8. Glossary