Load Balancing & Gateway

🎯 Core Question

When a single server can't handle the load, how do you "smartly" distribute traffic across multiple server instances? Load balancing is the "dispatcher" of modern distributed systems. This article uses real-world analogies (bubble tea shop checkout, package sorting, traffic control) to deeply explore the design philosophy and engineering practices of load balancing.

1. Why "Load Balancing"?

1.1 A Real-World Case: The Architecture Evolution of a Website

A startup encountered severe performance issues as its user base grew rapidly:

Scenario:

Phase 1: Single Server
Users → Server (1 vCPU, 2 GB RAM)
       ↓
  1,000 DAU → Peak: 1,000 concurrent users
       ↓
  Problem: CPU at 100%, slow responses, frequent crashes

⚠️ The Fatal Flaws of a Single Server

Performance bottleneck: CPU at 100%, response time > 5 seconds
Single point of failure: If the server goes down, the entire site is unavailable
Scaling difficulty: Only vertical scaling is possible (adding CPU, RAM), which is expensive and limited

Improved Architecture (with Load Balancing):

Phase 2: Multiple Servers + Load Balancer
Users → Load Balancer (Nginx)
       ↓
     ├→ Server 1 (1 vCPU, 2 GB RAM)
     ├→ Server 2 (1 vCPU, 2 GB RAM)
     └→ Server 3 (1 vCPU, 2 GB RAM)

✨ Improvements

Better performance: 3 servers processing in parallel, response time < 1 second
High availability: If one server fails, others continue serving
Horizontal scaling: Need more capacity? Just add more servers

1.2 Load Balancing in Everyday Terms

The Bubble Tea Shop Counter

Imagine you run a popular bubble tea shop:

1 counter: Customers queue up, those at the back get impatient, negative reviews follow
3 counters: Staff direct customers to different counters, efficiency triples

The load balancer is the "counter assignment person":

Users (customers) → make requests
Load balancer (assignment person) → distributes requests to different servers
Servers (counters) → handle requests
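The "counter assignment person" can be sketched in a few lines of Python. This is a minimal round-robin illustration, not how any particular product works; the class name `RoundRobinBalancer` and the server names are hypothetical.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands each incoming request to the next server in a fixed rotation."""

    def __init__(self, servers):
        servers = list(servers)
        if not servers:
            raise ValueError("at least one server is required")
        self._rotation = cycle(servers)

    def pick(self):
        # Return the server that should handle the next request.
        return next(self._rotation)


# Hypothetical server names standing in for the three counters.
balancer = RoundRobinBalancer(["server-1", "server-2", "server-3"])
assignments = [balancer.pick() for _ in range(6)]
```

Six consecutive requests are spread evenly, two per server, which is the "three counters" effect from the analogy.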

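[Interactive figure: a traditional single-point architecture (one web server at 95% load) compared with a distributed load-balanced architecture (an L4 load balancer in front of several servers).]

Layer 4 Load Balancing (L4)

How it works: Traffic is distributed using transport-layer information (IP address + port). The balancer ignores application-layer content and only does "package sorting," which is why it is extremely fast.

Typical products: LVS (Linux Virtual Server), HAProxy (TCP mode), AWS NLB, Azure Load Balancer

Use cases:

Workloads that need extremely high throughput
TCP/UDP traffic distribution
Workloads that need no content inspection
Communication between microservices

[Figure: comparison table of load balancer types by processing layer, performance, flexibility, and cost. Rows: hardware load balancing (L4/L7, cost $$$$$), Layer 4 load balancing (L4, transport layer), Layer 7 load balancing (L7, application layer, cost $$$), software load balancing (L4/L7).]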
2. What Is Load Balancing?

2.1 Layer 4 Load Balancing (L4): Only Looking at the Address

Operates at the transport layer (TCP/UDP) — like a delivery driver who only looks at your address (IP + port) without caring about who you are or what you do.

Characteristics:

Extremely fast: Simple address forwarding without parsing packet content
Use cases: Database connections, Redis caching, long-connection game servers
Representative products: LVS (Linux Virtual Server), AWS NLB, Azure Load Balancer

How It Works

Client request → L4 Load Balancer → Backend server
                    ↓
           Only looks at IP + Port
                    ↓
           Fast forwarding (no content inspection)
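As a rough illustration of address-only forwarding, the Python sketch below picks a backend by hashing the connection's source and destination addresses and never reads the payload. The backend list and the function name `pick_backend_l4` are hypothetical. Real L4 balancers such as LVS make this decision in the kernel with their own scheduling algorithms, not in application code.

```python
import hashlib

# Hypothetical backend pool (for example, three Redis instances).
BACKENDS = ["10.0.1.10:6379", "10.0.1.11:6379", "10.0.1.12:6379"]


def pick_backend_l4(src_ip: str, src_port: int, dst_ip: str, dst_port: int) -> str:
    """Choose a backend from connection addresses only; the payload is never inspected."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(BACKENDS)
    return BACKENDS[index]


# Every packet of the same connection maps to the same backend.
first = pick_backend_l4("203.0.113.7", 51234, "198.51.100.1", 6379)
second = pick_backend_l4("203.0.113.7", 51234, "198.51.100.1", 6379)
```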

2.2 Layer 7 Load Balancing (L7): Inspecting the Package Contents

Operates at the application layer (HTTP/HTTPS) — like a delivery driver who not only checks the address but also opens the package to inspect its contents before deciding how to deliver.

Characteristics:

Intelligent routing: Can perform fine-grained routing based on URL paths, HTTP headers, Cookies, etc.
Advanced features: SSL termination, content caching, compression, WAF security
Use cases: Web applications, API gateways, microservice architectures
Representative products: Nginx, HAProxy, AWS ALB, Envoy

How It Works

Client request → L7 Load Balancer → Parses HTTP content
                    ↓
          Inspects URL, Header, Cookie
                    ↓
          Intelligent routing to specific server
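The routing decision above might look like the following sketch, which assumes the HTTP request has already been parsed. The pool names, the `X-Beta-User` header, and the `region` cookie are hypothetical examples chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class HttpRequest:
    path: str
    headers: dict = field(default_factory=dict)
    cookies: dict = field(default_factory=dict)


def route_l7(request: HttpRequest) -> str:
    """Return the name of the backend pool for a parsed HTTP request."""
    # Route by URL path first.
    if request.path.startswith("/api/"):
        return "api-pool"
    if request.path.startswith("/static/"):
        return "static-pool"
    # Then by header, then by cookie (both names are hypothetical).
    if request.headers.get("X-Beta-User") == "1":
        return "beta-pool"
    if request.cookies.get("region") == "eu":
        return "eu-pool"
    return "web-pool"
```

For example, `route_l7(HttpRequest("/api/orders"))` selects `api-pool`, while a request for `/home` with no special header or cookie falls through to `web-pool`. An L4 balancer cannot make any of these distinctions because it never sees the path, headers, or cookies.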

2.3 L4 vs L7 Comparison

Dimension	L4 Load Balancing	L7 Load Balancing
OSI Layer	Transport Layer (TCP/UDP)	Application Layer (HTTP/HTTPS)
Routing Basis	IP address + port	URL, Header, Cookie, Body
Processing Speed	Extremely fast (often kernel-space, e.g., LVS)	Fast, but slower than L4 (requests must be parsed, typically in user space)
Feature Richness	Basic forwarding	SSL termination, caching, compression, WAF
Typical Scenarios	Databases, gaming, long connections	Web apps, API gateways, microservices
Representative Products	LVS, AWS NLB	Nginx, HAProxy, AWS ALB

3. Core Problem 1: How to Prevent "Broken" Servers from Continuing to Serve?

3.1 Health Checks: Don't Let "Sick" Servers Drag Down the System

Imagine one of your checkout counters breaks, but the assignment person doesn't know and keeps sending customers there. The queue grows longer, and customers grow angrier.

Health checks are the "sentinels" that prevent this scenario. They periodically "examine" each server, immediately removing any "sick" ones from the pool and bringing them back once they "recover."

3.2 Active Health Checks vs Passive Health Checks

Active Health Check: The load balancer actively "knocks on the door" asking the server, "Are you still there?"

Periodically sends probe requests (e.g., an HTTP GET to /health or a TCP connection attempt)
Considers the server unhealthy if the response times out or returns an error code
Advantage: Accurate and reliable detection
Disadvantage: Generates additional probe traffic

Passive Health Check: The load balancer "observes" the response patterns of real business traffic

Tracks response times and error rates of actual requests
Considers the server unhealthy after multiple consecutive failures
Advantage: No additional traffic generated
Disadvantage: Requires sufficient traffic samples to make a determination
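A passive check can be sketched as a counter of consecutive failures fed by real responses. This is a simplified illustration under two assumptions: only timeouts and 5xx responses count as failures (a 4xx on real traffic is usually the client's fault), and a single success resets the counter. The class name `PassiveHealthTracker` is hypothetical.

```python
class PassiveHealthTracker:
    """Marks a server unhealthy after several consecutive failed real requests."""

    def __init__(self, max_consecutive_failures: int = 3):
        self.max_consecutive_failures = max_consecutive_failures
        self._failures = {}  # server -> current run of consecutive failures

    def record(self, server: str, status_code: int, timed_out: bool = False) -> None:
        # Assumption: only timeouts and 5xx responses count against the server.
        failed = timed_out or status_code >= 500
        if failed:
            self._failures[server] = self._failures.get(server, 0) + 1
        else:
            self._failures[server] = 0  # any success resets the run

    def is_healthy(self, server: str) -> bool:
        return self._failures.get(server, 0) < self.max_consecutive_failures
```

A server that receives no traffic is never re-evaluated by this tracker, which is the "requires sufficient traffic samples" drawback in practice.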

Threshold Settings

Metric	Healthy Threshold	Unhealthy Threshold	Notes
HTTP Status Code	200-399	400+ or timeout	4xx/5xx are all considered failures
TCP Connection	Successfully established	Connection timeout	Checks whether the port is reachable
Response Time	< 500 ms	> 2000 ms	Timeout typically set to 2-5 seconds
Consecutive Failures	-	3 times	Avoids false positives from transient blips
Check Interval	-	5 s	Too frequent increases load
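The thresholds in the table can be combined into a small state machine. The sketch below is one possible reading, not a specific product's behavior: the probe function is injected so the logic stays testable without a network, and `healthy_after` (the number of successes needed to return to the pool) is an assumption, since the table does not give one. A caller would invoke `check` once per interval, for example every 5 seconds.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# (HTTP status code, or None if the connection failed; elapsed milliseconds)
ProbeResult = Tuple[Optional[int], float]


@dataclass
class ActiveHealthCheck:
    probe: Callable[[str], ProbeResult]  # hypothetical probe, e.g. GET /health
    timeout_ms: float = 2000.0           # from the table: > 2000 ms is unhealthy
    unhealthy_after: int = 3             # from the table: 3 consecutive failures
    healthy_after: int = 2               # assumption: successes needed to recover

    def __post_init__(self):
        self._state = {}  # server -> (healthy, consecutive failures, consecutive successes)

    def check(self, server: str) -> bool:
        """Run one probe and return whether the server is currently in the pool."""
        status, elapsed_ms = self.probe(server)
        ok = status is not None and 200 <= status <= 399 and elapsed_ms <= self.timeout_ms
        healthy, fails, successes = self._state.get(server, (True, 0, 0))
        if ok:
            fails, successes = 0, successes + 1
            if not healthy and successes >= self.healthy_after:
                healthy = True
        else:
            fails, successes = fails + 1, 0
            if healthy and fails >= self.unhealthy_after:
                healthy = False
        self._state[server] = (healthy, fails, successes)
        return healthy
```

Requiring several consecutive results in each direction is what keeps a single slow probe from removing a server and a single lucky probe from restoring it.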

💡 Common Pitfall: Thresholds Set Too "Sensitive"

A team set the health check response-time threshold to 100 ms, but their application's average response time fluctuated between 80 and 120 ms. As a result, servers were frequently marked "unhealthy" and flapped in and out of the pool, traffic kept shifting between them, and overall system availability actually dropped.

The right approach: Thresholds should be set to 2-3x the P99 response time, leaving enough buffer for normal fluctuations.
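To make the rule concrete, here is a small worked example that computes a P99 from latency samples and derives a threshold from it. The sample data is invented for illustration, and the 2.5x multiplier is simply a value inside the 2-3x range.

```python
def p99(samples_ms):
    """Nearest-rank 99th percentile of a list of latencies in milliseconds."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) using integers
    return ordered[rank - 1]


def health_check_threshold_ms(samples_ms, multiplier=2.5):
    # Leave headroom above normal fluctuation: 2-3x the observed P99.
    return multiplier * p99(samples_ms)


# Invented latencies: 990 requests between 80 and 120 ms, plus a slow tail of 10.
samples = [80 + (i % 41) for i in range(990)] + [150] * 10
threshold = health_check_threshold_ms(samples)  # P99 is 120 ms, so this is 300.0 ms
```

With these samples a 100 ms threshold sits inside the normal 80-120 ms band, which is why it misfires, while 300 ms only trips on genuinely abnormal responses.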

4. Core Problem 2: How to Ensure "Returning Customers" Always Get the Same "Cashier"?

4.1 Session Persistence: Let "Returning Customers" Always Find the Same "Cashier"

Imagine you're a regular at a bubble tea shop, and the same staff member serves you every time. She knows your preferences (half sugar, no ice) and serves you quickly and thoughtfully. But if you get a new person every time, you have to repeat the same requests over and over — a huge efficiency loss.

Session persistence (sticky sessions) solves this problem: ensuring that requests from the same user are always routed to the same backend server.

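[Interactive figure: Users A, B, and C send requests to the load balancer, which keeps sessions by inserting an HTTP cookie. Its session mapping table reads sess_abc123 → Server 1, sess_def456 → Server 2, sess_ghi789 → Server 1. Backends: Server 1 (10.0.1.10, healthy, selected), Server 2 (10.0.1.11, healthy), Server 3 (10.0.1.12, unavailable).]

Pros and cons of the three mechanisms shown in the figure:

🍪 Cookie Insert

✓ Unaffected by client IP changes
✓ Sessions persist from the first request
✗ The client must support cookies
✗ Cookies may be disabled

#️⃣ IP Hash

✓ Needs no client-side support
✓ Stateless, so an LB restart does not affect sessions
✗ Sessions are lost when the client IP changes
✗ Hard to achieve truly even load distribution

📝 Sticky Session

✓ Combines the strengths of the cookie and IP approaches
✓ Supports session replication and failover
✗ Complex to implement and needs application support
✗ Session replication adds performance overhead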

4.2 Comparison of Three Session Persistence Mechanisms

Mechanism	How It Works	Advantages	Disadvantages	Use Cases
Cookie Insert	LB inserts a cookie in the response; subsequent requests carry this cookie	Unaffected by IP changes; persists from the first request	Client must support cookies; may be disabled	Shopping carts, login sessions
IP Hash	Hashes the client IP and maps it to a specific server	No client-side support needed; stateless	Session lost if IP changes; hard to distribute evenly	Cookie-free environments, WebSocket
Sticky Session Table	LB maintains a mapping table of sessions to servers	Supports session replication and failover	Consumes LB memory; requires additional synchronization	Scenarios with strict high-availability requirements
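The IP Hash row can be illustrated with a few lines of Python. This is a minimal sketch using simple modulo hashing; the server addresses and the function name `server_for_ip` are illustrative.

```python
import hashlib

# Illustrative backend addresses.
SERVERS = ["10.0.1.10", "10.0.1.11", "10.0.1.12"]


def server_for_ip(client_ip: str, servers=SERVERS) -> str:
    """Map a client IP to a server; the same IP always lands on the same server."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return servers[int.from_bytes(digest[:4], "big") % len(servers)]


# No table is stored: the mapping is recomputed from the IP on every request.
chosen = server_for_ip("203.0.113.7")
```

Because the mapping depends on `len(servers)`, adding or removing a server remaps most clients; consistent hashing is a common way to limit that. Many users behind one NAT address also hash to the same server, which is one reason the distribution is hard to keep even.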

💡 Usage Recommendations

Cookie Insert: The preferred option; best compatibility
IP Hash: Only for special scenarios like WebSocket
Sticky Session Table: Combined with cookies to provide failover capability
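Combining a cookie with a session table might look like the sketch below. It is an illustration under stated assumptions: the cookie name `lb_session`, the class `StickyCookieBalancer`, and the round-robin fallback are all hypothetical, and cookies are modeled as plain dictionaries instead of real HTTP headers.

```python
import secrets
from itertools import cycle


class StickyCookieBalancer:
    COOKIE_NAME = "lb_session"  # hypothetical cookie name

    def __init__(self, servers):
        self._servers = list(servers)
        self._healthy = set(self._servers)
        self._rotation = cycle(self._servers)
        self._sessions = {}  # session id -> pinned server

    def mark_down(self, server):
        self._healthy.discard(server)

    def _next_healthy(self):
        # Walk the rotation at most once looking for a healthy server.
        for _ in range(len(self._servers)):
            candidate = next(self._rotation)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy servers")

    def route(self, cookies: dict):
        """Return (server, cookies_to_set) for one request."""
        session_id = cookies.get(self.COOKIE_NAME)
        server = self._sessions.get(session_id)
        if server in self._healthy:
            return server, {}
        # New client, unknown session, or pinned server is down: (re)assign.
        if session_id not in self._sessions:
            session_id = secrets.token_hex(8)
        server = self._next_healthy()
        self._sessions[session_id] = server
        return server, {self.COOKIE_NAME: session_id}
```

When the pinned server is marked down, the session is re-pinned to a healthy one. Note that this only moves the routing; any state held in the failed server's memory is lost unless the application replicates sessions or stores them externally.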

5. Core Problem 3: How to Achieve Zero-Downtime Deployment?

5.1 Blue-Green Deployment: "One-Click Switch" for Zero-Downtime Releases

Core idea: Maintain two identical production environments (blue and green) simultaneously, but only one serves live traffic at any given time.

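[Interactive figure: the blue environment runs v1.0.0 and receives 100% of traffic; the green environment runs v1.1.0 and receives 0%. User traffic enters the load balancer, which currently points to blue. Each environment has three servers (B1-B3 and G1-G3).]

Blue-green deployment flow:

Deploy to green: Deploy the new version to the green environment and run smoke tests.
Switch traffic: Point the load balancer to the green environment; traffic switches instantly.
Monitor: Watch the green environment and confirm there are no anomalies.
Upgrade blue: Deploy the new version to the blue environment in preparation for the next switch.

✅ Pros

Zero downtime: The traffic switch completes in milliseconds and is invisible to users.
Fast rollback: If a problem appears, switch back to the original environment immediately; risk stays controlled.
Full pre-release testing: The new environment can be tested completely before it takes over traffic.
Data consistency: There is no need to handle compatibility between old and new versions running at the same time.

❌ Cons

High resource cost: Two complete environments must be maintained, which doubles server cost.
Database compatibility: Schema changes need special compatibility handling.
Warm-up: A newly started environment may need time to warm caches, connection pools, and similar resources.
Poor fit for stateful services: Long-lived connections and strict session persistence are complex to handle.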

Workflow:

Initial state: Blue environment runs v1.0 (production), green environment stands by.
Deploy new version: Deploy v1.1 to the green environment and run internal smoke tests.
Switch traffic: Point the load balancer to the green environment; traffic instantly switches to v1.1.
Monitor: Observe the green environment's behavior and confirm no anomalies.
Keep old version: Keep the blue environment on v1.0 for a period (e.g., 24 hours) as a safety net for rapid rollback.
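The workflow above can be sketched as a router that holds a single "active" pointer. The class `BlueGreenRouter` and its method names are hypothetical; in practice the pointer would be a load balancer target or a DNS record, not a Python attribute.

```python
class BlueGreenRouter:
    def __init__(self, blue_version, green_version=None):
        self.environments = {"blue": blue_version, "green": green_version}
        self.active = "blue"

    @property
    def idle(self):
        return "green" if self.active == "blue" else "blue"

    def deploy_to_idle(self, version, smoke_test):
        """Install a version on the idle environment; return True if smoke tests pass."""
        self.environments[self.idle] = version
        return bool(smoke_test(self.idle))

    def switch(self):
        # Flip the pointer; all new requests go to the other environment.
        if self.environments[self.idle] is None:
            raise RuntimeError("idle environment has nothing deployed")
        self.active = self.idle

    def rollback(self):
        # The previous environment still runs the old version, so rollback is another flip.
        self.switch()

    def route(self):
        return self.active, self.environments[self.active]


router = BlueGreenRouter("v1.0")
if router.deploy_to_idle("v1.1", smoke_test=lambda env: True):  # stand-in smoke test
    router.switch()
```

Keeping the old environment untouched after the switch is what makes `rollback` a pointer flip instead of a redeployment.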

✨ Pros and Cons

Pros	Cons
✅ Zero downtime; switch completes in milliseconds	❌ High resource cost; requires maintaining two environments
✅ Fast rollback; switch back immediately if issues are found	❌ Database schema changes require special compatibility handling
✅ New environment can be fully tested before taking over traffic	❌ Not suitable for stateful services (e.g., WebSocket long connections)

5.2 Canary Release: A "Small Steps, Fast Iteration" Strategy

The canary release is named after the "coal mine canary": miners brought canaries into the mines, and if a canary showed signs of distress, it signaled a toxic gas leak and the miners evacuated immediately. In software releases, a canary release means exposing a small subset of users to the new version first, observing for issues, and then gradually expanding the rollout.

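[Interactive figure: a slider sets the traffic split between versions, here stable v1.0.0 at 90% and canary v1.1.0 at 10%. User requests pass through the load balancer (canary: 10%) to the backend services: stable servers S1-S3 and canary servers C1-C2.]

Canary Release Best Practices

📊 Progressive rollout

1% → 5% → 10% → 25% → 50% → 100%
Observe each stage for at least 15-30 minutes
Key metrics: error rate, latency, throughput

🎯 Precise user selection

Internal employees and test users go first
By region: select users in specific regions
By user attribute: VIP users or regular users
By device type: iOS/Android/Web

🛡️ Automatic rollback

Roll back automatically when the error rate exceeds a threshold
Raise an alert when P99 latency is abnormal
Roll back automatically when key business metrics drop
One-click rollback: restore the old version within 30 seconds

📈 Monitoring and metrics

Infrastructure: CPU, memory, disk, network
Application metrics: QPS, error rate, latency distribution
Business metrics: conversion rate, order volume, revenue
User experience: page load time, interaction latency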

Core idea:

Small traffic first: Route 1% of traffic to the new version servers initially.
Observe metrics: Continuously monitor error rates, latency, and key business metrics.
Gradual rollout: If everything is normal, gradually increase the proportion to 5%, 10%, 25%, 50%, and 100%.
Rapid rollback: If any anomaly is detected, immediately switch all traffic back to the old version.
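The four steps above can be sketched as a staged rollout with an automatic rollback rule. This is an illustration under assumptions: the 1% error-rate limit, the class name `CanaryRollout`, and the choice to bucket users by a hash of their ID (so each user sees one version consistently) are not taken from the text.

```python
import hashlib

STAGES = [1, 5, 10, 25, 50, 100]  # percent of traffic, as listed in the text


class CanaryRollout:
    def __init__(self, max_error_rate: float = 0.01):  # assumption: 1% error limit
        self.max_error_rate = max_error_rate
        self._stage = 0
        self.rolled_back = False

    @property
    def canary_percent(self) -> int:
        return 0 if self.rolled_back else STAGES[self._stage]

    def version_for(self, user_id: str) -> str:
        """Send a stable slice of users to the canary, chosen by hashing the user id."""
        digest = hashlib.sha256(user_id.encode()).digest()
        bucket = int.from_bytes(digest[:4], "big") % 100
        return "canary" if bucket < self.canary_percent else "stable"

    def evaluate(self, requests: int, errors: int) -> int:
        """Advance one stage if the canary looks healthy, otherwise roll back."""
        if requests == 0 or self.rolled_back:
            return self.canary_percent  # no data yet, or already rolled back: hold
        if errors / requests > self.max_error_rate:
            self.rolled_back = True  # all traffic returns to the stable version
        elif self._stage < len(STAGES) - 1:
            self._stage += 1
        return self.canary_percent
```

A caller would run `evaluate` once per observation window with the canary's request and error counts. A fuller version would also compare latency and business metrics before advancing.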

💡 Advantages of Canary Releases

Advantage	Description
🎯 Controlled risk	Even if the new version has severe bugs, only a small number of users are affected
📊 Real-world validation	Validated in the real production environment, more reliable than staging
🚀 Fast iteration	Teams can release new features more confidently and frequently
💰 Resource-friendly	Doesn't require two complete environments like blue-green deployment

6. Core Problem 4: How to Make the System "Breathe" on Its Own?

6.1 Auto Scaling: Let the System "Flexibly Schedule" Like a Restaurant

Imagine you run a restaurant:

Lunch peak: You need 10 servers, but at 3 PM during the afternoon lull, you only need 2
If you always keep 10: labor costs explode
If you always keep only 2: customers at peak hours give up waiting and leave

Auto Scaling lets the system "flexibly schedule" like a restaurant — automatically adding servers when busy and removing them when idle.

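Scaling metrics:

[Interactive figure: real-time monitoring panel.]

CPU usage: 45% (scale out above 70%, scale in below 30%)
Memory usage: 60% (scale out above 75%, scale in below 40%)
QPS: 650 req/s (scale-out threshold 1,000 req/s, target 800 req/s)
Running instances: 3 (minimum 2, maximum 10)

Scaling history (last 5 operations): scaled out from 2 to 3 instances because CPU usage exceeded 70%.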
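A threshold-based scaling decision can be sketched as a single function. The defaults below reuse the CPU thresholds (70% and 30%) and instance bounds (2 to 10) from the monitoring panel; the function name `desired_instances` and the one-instance step size are assumptions.

```python
def desired_instances(current: int, cpu_percent: float,
                      scale_out_at: float = 70.0, scale_in_at: float = 30.0,
                      minimum: int = 2, maximum: int = 10) -> int:
    """Return the instance count to run next, given current CPU usage."""
    if cpu_percent > scale_out_at:
        target = current + 1  # busy: add one instance
    elif cpu_percent < scale_in_at:
        target = current - 1  # idle: remove one instance
    else:
        target = current      # inside the band: do nothing
    # Never go below the minimum or above the maximum.
    return max(minimum, min(maximum, target))
```

With 2 instances at 75% CPU the function returns 3, matching the history entry above. The gap between the two thresholds keeps the system from adding and removing instances on every small fluctuation; production autoscalers typically add a cooldown period for the same reason.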