
Load Balancing & Gateway

🎯 Core Question

When a single server can't handle the load, how do you "smartly" distribute traffic across multiple server instances? Load balancing is the "dispatcher" of modern distributed systems. This article uses real-world analogies (bubble tea shop checkout, package sorting, traffic control) to deeply explore the design philosophy and engineering practices of load balancing.


1. Why "Load Balancing"?

1.1 A Real-World Case: The Architecture Evolution of a Website

A startup encountered severe performance issues as its user base grew rapidly:

Scenario:

Phase 1: Single Server
Users → Server (1 vCPU, 2 GB RAM)

  1,000 DAU → Peak: 1,000 concurrent users

  Problem: CPU at 100%, slow responses, frequent crashes

⚠️ The Fatal Flaws of a Single Server

  • Performance bottleneck: CPU at 100%, response time > 5 seconds
  • Single point of failure: If the server goes down, the entire site is unavailable
  • Scaling difficulty: Only vertical scaling is possible (adding CPU, RAM), which is expensive and limited

Improved Architecture (with Load Balancing):

Phase 2: Multiple Servers + Load Balancer
Users → Load Balancer (Nginx)

     ├→ Server 1 (1 vCPU, 2 GB RAM)
     ├→ Server 2 (1 vCPU, 2 GB RAM)
     └→ Server 3 (1 vCPU, 2 GB RAM)

✨ Improvements

  • Better performance: 3 servers processing in parallel, response time < 1 second
  • High availability: If one server fails, others continue serving
  • Horizontal scaling: Need more capacity? Just add more servers

1.2 Load Balancing in Everyday Terms

The Bubble Tea Shop Counter

Imagine you run a popular bubble tea shop:

  • 1 counter: Customers queue up, those at the back get impatient, negative reviews follow
  • 3 counters: Staff direct customers to different counters, efficiency triples

The load balancer is the "counter assignment person":

  • Users (customers) → make requests
  • Load balancer (assignment person) → distributes requests to different servers
  • Servers (counters) → handle requests
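The "assignment person" role can be sketched in a few lines. This is a minimal illustration assuming the simplest possible policy (round-robin rotation); the server names and the `RoundRobinBalancer` class are hypothetical and not tied to any particular product.

```python
from collections import Counter
from itertools import cycle

# Hypothetical backend names, matching the three-server example above.
SERVERS = ["server-1", "server-2", "server-3"]


class RoundRobinBalancer:
    """Hands out backends in a fixed rotation, one per request."""

    def __init__(self, servers):
        self._rotation = cycle(list(servers))

    def pick(self):
        # Each call returns the next server in the rotation.
        return next(self._rotation)


balancer = RoundRobinBalancer(SERVERS)
assignments = [balancer.pick() for _ in range(9)]
per_server = Counter(assignments)  # how many requests each server received
```

With nine requests and three servers, each server ends up with the same share, which is the "three counters" effect from the analogy.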
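
Load Balancer Types

From Layer 4 to Layer 7, and from hardware to software.

[Diagram: a traditional single-server architecture (one web server at 95% load) next to a load-balanced architecture (an L4 load balancer in front of servers S1, S2, and S3).]

Layer 4 Load Balancing (L4)

How it works: Traffic is distributed based on transport-layer information (IP address + port). The load balancer ignores application-layer content and only does "package sorting," which is why it is extremely fast.

Typical products: LVS (Linux Virtual Server), HAProxy (TCP mode), AWS NLB, Azure Load Balancer

Use cases:

  • Scenarios that need extremely high throughput
  • TCP/UDP traffic distribution
  • Scenarios that do not need content inspection
  • Communication between microservices

Comparison at a Glance

Type | Processing Layer | Cost
Hardware load balancer | L4/L7 | $$$$$
Layer 4 load balancer | L4 (transport layer) | $$
Layer 7 load balancer | L7 (application layer) | $$$
Software load balancer | L4/L7 | $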

2. What Is Load Balancing?

2.1 Layer 4 Load Balancing (L4): Only Looking at the Address

Operates at the transport layer (TCP/UDP) — like a delivery driver who only looks at your address (IP + port) without caring about who you are or what you do.

Characteristics:

  • Extremely fast: Simple address forwarding without parsing packet content
  • Use cases: Database connections, Redis caching, long-connection game servers
  • Representative products: LVS (Linux Virtual Server), AWS NLB, Azure Load Balancer
How It Works
Client request → L4 Load Balancer → Backend server

           Only looks at IP + Port

           Fast forwarding (no content inspection)
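A rough sketch of the idea, assuming the balancer hashes the connection's addressing information to choose a backend. The backend addresses, the `pick_backend_l4` name, and the use of SHA-256 are illustrative assumptions; real L4 balancers use their own schedulers and typically run in the kernel or in hardware.

```python
import hashlib

# Hypothetical backend pool (IP, port), e.g. three Redis instances.
BACKENDS = [("10.0.1.10", 6379), ("10.0.1.11", 6379), ("10.0.1.12", 6379)]


def pick_backend_l4(src_ip, src_port, dst_ip, dst_port, backends=BACKENDS):
    """Choose a backend from connection addressing only; the payload is never read."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    # Turn the first 8 bytes of the digest into an index into the pool.
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


first = pick_backend_l4("203.0.113.7", 51000, "198.51.100.1", 6379)
again = pick_backend_l4("203.0.113.7", 51000, "198.51.100.1", 6379)
```

Because the decision depends only on addresses and ports, packets from the same connection always map to the same backend without any content parsing.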

2.2 Layer 7 Load Balancing (L7): Inspecting the Package Contents

Operates at the application layer (HTTP/HTTPS) — like a delivery driver who not only checks the address but also opens the package to inspect its contents before deciding how to deliver.

Characteristics:

  • Intelligent routing: Can perform fine-grained routing based on URL paths, HTTP headers, Cookies, etc.
  • Advanced features: SSL termination, content caching, compression, WAF security
  • Use cases: Web applications, API gateways, microservice architectures
  • Representative products: Nginx, HAProxy, AWS ALB, Envoy
How It Works
Client request → L7 Load Balancer → Parses HTTP content

          Inspects URL, Header, Cookie

          Intelligent routing to specific server
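The routing step can be illustrated with a small rule function. This is a sketch under assumptions: the pool names, the `beta` cookie, and the `X-Client-Type` header are invented for the example and do not come from any specific gateway.

```python
from dataclasses import dataclass, field


@dataclass
class HttpRequest:
    """Only the fields an L7 balancer would inspect in this sketch."""
    path: str
    headers: dict = field(default_factory=dict)
    cookies: dict = field(default_factory=dict)


def choose_pool(request):
    """Pick a backend pool by inspecting application-layer fields."""
    # Cookie rule: users flagged for the beta go to the beta pool.
    if request.cookies.get("beta") == "1":
        return "beta"
    # Header rule: mobile clients get their own pool.
    if request.headers.get("X-Client-Type") == "mobile":
        return "mobile"
    # Path rules: API calls and static assets are served by different pools.
    if request.path.startswith("/api/"):
        return "api"
    if request.path.startswith("/static/"):
        return "static"
    # Everything else goes to the default web pool.
    return "web"
```

For example, `choose_pool(HttpRequest("/api/orders"))` selects the `api` pool, while the same path with a `beta=1` cookie is routed to `beta` because the cookie rule is checked first.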

2.3 L4 vs L7 Comparison

Dimension | L4 Load Balancing | L7 Load Balancing
OSI Layer | Transport Layer (TCP/UDP) | Application Layer (HTTP/HTTPS)
Routing Basis | IP address + port | URL, Header, Cookie, Body
Processing Speed | Extremely fast (kernel-space) | Fast (user-space parsing)
Feature Richness | Basic forwarding | SSL termination, caching, compression, WAF
Typical Scenarios | Databases, gaming, long connections | Web apps, API gateways, microservices
Representative Products | LVS, AWS NLB | Nginx, HAProxy, AWS ALB

3. Core Problem 1: How to Prevent "Broken" Servers from Continuing to Serve?

3.1 Health Checks: Don't Let "Sick" Servers Drag Down the System

Imagine one of your checkout counters breaks, but the assignment person doesn't know and keeps sending customers there. The queue grows longer, and customers grow angrier.

Health checks are the "sentinels" that prevent this scenario. They periodically "examine" each server, immediately removing any "sick" ones from the pool and bringing them back once they "recover."

3.2 Active Health Checks vs Passive Health Checks

Active Health Check: The load balancer actively "knocks on the door" asking the server, "Are you still there?"

  • Periodically sends probe requests (e.g., an HTTP GET to /health or a TCP connection attempt)
  • Considers the server unhealthy if the response times out or returns an error code
  • Advantage: Accurate and reliable detection
  • Disadvantage: Generates additional probe traffic

Passive Health Check: The load balancer "observes" the response patterns of real business traffic

  • Tracks response times and error rates of actual requests
  • Considers the server unhealthy after multiple consecutive failures
  • Advantage: No additional traffic generated
  • Disadvantage: Requires sufficient traffic samples to make a determination

Threshold Settings

Metric | Healthy Threshold | Unhealthy Threshold | Notes
HTTP Status Code | 200-399 | 400+ or timeout | 4xx/5xx are all considered failures
TCP Connection | Successfully established | Connection timeout | Checks whether the port is reachable
Response Time | < 500 ms | > 2000 ms | Timeout typically set to 2-5 seconds
Consecutive Failures | - | 3 times | Avoids false positives from transient blips
Check Interval | - | 5 s | Too frequent increases load
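The consecutive-failure rule can be sketched as a small state tracker. The 200-399 status range and the failure count of 3 come from the table; the `rise` setting (successes needed before a server is readmitted) and the class name are assumptions added for illustration.

```python
from dataclasses import dataclass


@dataclass
class HealthTracker:
    fall: int = 3         # consecutive failures before marking unhealthy (from the table)
    rise: int = 2         # consecutive successes before readmitting (assumed value)
    healthy: bool = True
    _failures: int = 0
    _successes: int = 0

    def record(self, status_code, timed_out=False):
        """Record one probe result and return the current health state."""
        ok = (not timed_out) and status_code is not None and 200 <= status_code <= 399
        if ok:
            self._successes += 1
            self._failures = 0
            if not self.healthy and self._successes >= self.rise:
                self.healthy = True
        else:
            self._failures += 1
            self._successes = 0
            if self.healthy and self._failures >= self.fall:
                self.healthy = False
        return self.healthy
```

A single failed probe followed by a success resets the failure counter, so a transient blip does not remove the server from the pool.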

💡 Common Pitfall: Thresholds Set Too "Sensitive"

A team set the health check response time threshold to 100ms, but their application's average response time fluctuated between 80-120ms. As a result, servers were frequently marked as "unhealthy," causing traffic to oscillate between healthy and unhealthy states, and overall system availability actually dropped.

The right approach: Thresholds should be set to 2-3x the P99 response time, leaving enough buffer for normal fluctuations.
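As a worked example of that rule of thumb, the sketch below derives a threshold from latency samples. The nearest-rank percentile method, the 2.5x multiplier, and the synthetic 80-120 ms samples are all assumptions chosen to mirror the scenario above.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def health_check_threshold_ms(samples_ms, multiplier=2.5):
    """Threshold = multiplier x P99, with the multiplier expected to be 2-3."""
    return percentile(samples_ms, 99) * multiplier


# Synthetic latencies cycling through 80..120 ms, like the team in the example.
latencies_ms = [80 + (i % 41) for i in range(1000)]
p99 = percentile(latencies_ms, 99)
threshold = health_check_threshold_ms(latencies_ms)
```

For these samples the P99 is 120 ms, so the derived threshold is 300 ms, comfortably above normal fluctuation, whereas the 100 ms setting sat inside it.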


4. Core Problem 2: How to Ensure "Returning Customers" Always Get the Same "Cashier"?

4.1 Session Persistence: Let "Returning Customers" Always Find the Same "Cashier"

Imagine you're a regular at a bubble tea shop, and the same staff member serves you every time. She knows your preferences (half sugar, no ice) and serves you quickly and thoughtfully. But if you get a new person every time, you have to repeat the same requests over and over — a huge efficiency loss.

Session persistence (sticky sessions) solves this problem: ensuring that requests from the same user are always routed to the same backend server.

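Session Persistence Mechanisms

A technical comparison of cookies, IP hash, and sticky sessions.

[Diagram: users A, B, and C send requests to the load balancer, which uses cookie insertion (session persistence via an HTTP cookie) and a session mapping table to choose among Server 1 (10.0.1.10), Server 2 (10.0.1.11), and Server 3 (10.0.1.12).]

Session mapping table:

Session ID | Server
sess_abc123 | Server 1
sess_def456 | Server 2
sess_ghi789 | Server 1

Pros and cons of the three mechanisms:

🍪 Cookie Insert

  • Pro: Unaffected by client IP changes
  • Pro: Session is persisted from the first request
  • Con: The client must support cookies
  • Con: Cookies may be disabled

#️⃣ IP Hash

  • Pro: Needs no client-side support
  • Pro: Stateless, so restarting the load balancer does not affect sessions
  • Con: The session is lost if the client IP changes
  • Con: Hard to achieve truly even load distribution

📝 Sticky Sessions

  • Pro: Combines the strengths of the cookie and IP approaches
  • Pro: Supports session replication and failover
  • Con: Complex to implement and requires application support
  • Con: Session replication adds performance overhead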

4.2 Comparison of Three Session Persistence Mechanisms

Mechanism | How It Works | Advantages | Disadvantages | Use Cases
Cookie Insert | LB inserts a cookie in the response; subsequent requests carry this cookie | Unaffected by IP changes; persists from the first request | Client must support cookies; may be disabled | Shopping carts, login sessions
IP Hash | Hashes the client IP and maps it to a specific server | No client-side support needed; stateless | Session lost if IP changes; hard to distribute evenly | Cookie-free environments, WebSocket
Sticky Session Table | LB maintains a mapping table of sessions to servers | Supports session replication and failover | Consumes LB memory; requires additional synchronization | Scenarios with strict high-availability requirements
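The first two mechanisms can be sketched side by side. This is a simplified illustration: the cookie name `lb_server` is hypothetical, and the cookie here stores the server address directly for readability, whereas a production balancer would normally use an opaque or signed identifier.

```python
import hashlib
from itertools import cycle

SERVERS = ["10.0.1.10", "10.0.1.11", "10.0.1.12"]
COOKIE_NAME = "lb_server"  # hypothetical cookie name

_rotation = cycle(SERVERS)


def route_with_cookie(request_cookies):
    """Cookie insert: return (server, cookies_to_set_on_the_response)."""
    pinned = request_cookies.get(COOKIE_NAME)
    if pinned in SERVERS:
        # Returning client: honor the pinned server, nothing new to set.
        return pinned, {}
    # First request: pick a server and ask the client to remember it.
    server = next(_rotation)
    return server, {COOKIE_NAME: server}


def route_by_ip_hash(client_ip):
    """IP hash: the same client IP always maps to the same server."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return SERVERS[int.from_bytes(digest[:4], "big") % len(SERVERS)]
```

The cookie path keeps working when the client's IP changes, while the IP-hash path needs no client cooperation but breaks stickiness as soon as the address changes.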

💡 Usage Recommendations

  • Cookie Insert: Preferred recommendation; best compatibility
  • IP Hash: Only for special scenarios like WebSocket
  • Sticky Session Table: Combined with cookies to provide failover capability

5. Core Problem 3: How to Achieve Zero-Downtime Deployment?

5.1 Blue-Green Deployment: "One-Click Switch" for Zero-Downtime Releases

Core idea: Maintain two identical production environments (blue and green) simultaneously, but only one serves live traffic at any given time.

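Blue-Green Deployment

A classic zero-downtime release strategy: two environments with an instant switch.

[Diagram: user traffic reaches the load balancer, which currently points to the blue environment. Blue (v1.0.0, servers B1-B3) receives 100% of traffic; green (v1.1.0, servers G1-G3) receives 0%.]

Blue-green deployment process:

  1. Deploy to green: Deploy the new version to the green environment and run smoke tests.
  2. Switch traffic: Point the load balancer at the green environment; traffic switches over instantly.
  3. Monitor: Watch the green environment and confirm there are no anomalies.
  4. Upgrade blue: Once the new version is confirmed stable, deploy it to the blue environment to prepare for the next switch.

Pros:

  • Zero downtime: The traffic switch completes in milliseconds and users do not notice it
  • Fast rollback: If a problem appears, switch back to the original environment immediately, keeping risk under control
  • Full pre-release testing: The new environment can be tested end to end before it takes over traffic
  • Data consistency: No need to handle compatibility between old and new versions running at the same time

Cons:

  • High resource cost: Two complete environments must be maintained at once, doubling server cost
  • Database compatibility: Schema changes require special compatibility handling
  • Warm-up: A freshly started environment may need time to warm up caches, connection pools, and so on
  • Poor fit for stateful services: Long-lived connections and strict session persistence are complicated to handle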

Workflow:

  1. Initial state: Blue environment runs v1.0 (production), green environment stands by.
  2. Deploy new version: Deploy v1.1 to the green environment and run internal smoke tests.
  3. Switch traffic: Point the load balancer to the green environment; traffic instantly switches to v1.1.
  4. Monitor: Observe the green environment's behavior and confirm no anomalies.
  5. Keep old version: Keep the blue environment on v1.0 for a period (e.g., 24 hours) as a safety net for rapid rollback.
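The workflow above can be sketched as a tiny router with an "active" pointer. The class and method names are hypothetical; a real switch would be a load balancer configuration change or a DNS update rather than an in-process attribute.

```python
class BlueGreenRouter:
    """Tracks two environments and which one currently serves live traffic."""

    def __init__(self, blue_version, green_version=None):
        self.environments = {"blue": blue_version, "green": green_version}
        self.active = "blue"

    @property
    def standby(self):
        return "green" if self.active == "blue" else "blue"

    def deploy_to_standby(self, version):
        # Step 2: the new version goes to the idle environment only.
        self.environments[self.standby] = version

    def switch(self):
        # Step 3 (and rollback): flip the pointer to the other environment.
        if self.environments[self.standby] is None:
            raise RuntimeError("standby environment has nothing deployed")
        self.active = self.standby

    def route(self):
        """Return (environment, version) that a request would reach right now."""
        return self.active, self.environments[self.active]
```

Calling `switch()` a second time is the rollback: because the old environment is kept intact (step 5), traffic returns to it with the same single pointer flip.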

✨ Pros and Cons

Pros | Cons
✅ Zero downtime; switch completes in milliseconds | ❌ High resource cost; requires maintaining two environments
✅ Fast rollback; switch back immediately if issues are found | ❌ Database schema changes require special compatibility handling
✅ New environment can be fully tested before taking over traffic | ❌ Not suitable for stateful services (e.g., WebSocket long connections)

5.2 Canary Release: A "Small Steps, Fast Iteration" Gradual Rollout Strategy

The canary release is named after the historical "coal mine canary" — miners brought canaries into the mines; if the canary showed signs of distress, it indicated toxic gas leakage, and miners would evacuate immediately. In software releases, a canary release means exposing a small subset of users to the new version first, observing for issues, and then gradually expanding the rollout.

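Canary Release

A gradual rollout strategy: validate the new version on a small share of traffic first.

[Interactive diagram: a slider sets the traffic split between the stable version (v1.0.0, 90%, servers S1-S3) and the canary version (v1.1.0, 10%, servers C1-C2) behind the load balancer.]

Canary release best practices:

📊 Progressive rollout

  • 1% → 5% → 10% → 25% → 50% → 100%
  • Observe each stage for at least 15-30 minutes
  • Key metrics: error rate, latency, throughput

🎯 Precise user selection

  • Internal employees and test users go first
  • By region: select users in specific regions
  • By user attribute: VIP users or regular users
  • By device type: iOS/Android/Web

🛡️ Automatic rollback

  • Roll back automatically when the error rate exceeds a threshold
  • Raise an alert when P99 latency is abnormal
  • Roll back automatically when key business metrics drop
  • One-click rollback: restore the old version within 30 seconds

📈 Monitoring and metrics

  • Infrastructure: CPU, memory, disk, network
  • Application: QPS, error rate, latency distribution
  • Business: conversion rate, order volume, revenue
  • User experience: page load time, interaction latency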

Core idea:

  1. Small traffic first: Route 1% of traffic to the new version servers initially.
  2. Observe metrics: Continuously monitor error rates, latency, and key business metrics.
  3. Gradual rollout: If everything is normal, gradually increase the proportion to 5%, 10%, 25%, 50%, and 100%.
  4. Rapid rollback: If any anomaly is detected, immediately switch all traffic back to the old version.
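The four steps above can be sketched as two small functions: one that splits traffic and one that decides the next rollout stage. The stage list comes from the text; hashing the user ID into a stable bucket and the 1% error-rate limit are assumptions made for the example.

```python
import hashlib

STAGES = [1, 5, 10, 25, 50, 100]  # canary traffic share in percent


def bucket(user_id):
    """Map a user ID to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(user_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def choose_version(user_id, canary_percent):
    """Users whose bucket falls below the canary share see the new version."""
    return "canary" if bucket(user_id) < canary_percent else "stable"


def next_canary_percent(current, error_rate, max_error_rate=0.01):
    """Advance one stage when healthy; return 0 (full rollback) otherwise."""
    if error_rate > max_error_rate:
        return 0
    later = [stage for stage in STAGES if stage > current]
    return later[0] if later else 100
```

Bucketing by user ID rather than picking randomly per request means a given user stays on the same version as the share grows, which keeps their experience consistent during the rollout.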

💡 Advantages of Canary Releases

Advantage | Description
🎯 Controlled risk | Even if the new version has severe bugs, only a small number of users are affected
📊 Real-world validation | Validated in the real production environment, more reliable than staging
🚀 Fast iteration | Teams can release new features more confidently and frequently
💰 Resource-friendly | Doesn't require two complete environments like blue-green deployment

6. Core Problem 4: How to Make the System "Breathe" on Its Own?

6.1 Auto Scaling: Let the System "Flexibly Schedule" Like a Restaurant

Imagine you run a restaurant:

  • Lunch peak: You need 10 servers, but at 3 PM during the afternoon lull, you only need 2
  • If you always keep 10: labor costs explode
  • If you always keep only 2: customers during peak hours can't wait and all leave

Auto Scaling lets the system "flexibly schedule" like a restaurant — automatically adding servers when busy and removing them when idle.

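Auto Scaling

Elastic scaling driven by CPU, memory, and QPS.

[Sample dashboard from the interactive demo:]

  • CPU utilization: 45% (scale-up threshold: 70%, scale-down threshold: 30%)
  • Memory utilization: 60% (scale-up threshold: 75%, scale-down threshold: 40%)
  • QPS: 650 req/s (scale-up threshold: 1000/s, target: 800/s)
  • Running instances: 3 (min: 2, max: 10)

Scaling history (last 5 operations):

Time | Operation | Trigger
10:23 | Scale up: 2 → 3 instances | CPU utilization above 70%
09:15 | Scale down: 4 → 3 instances | CPU utilization below 30%
08:42 | Scale up: 3 → 4 instances | QPS reached 1000/s
07:30 | Scale up: 2 → 3 instances | Memory utilization above 75%
06:20 | Scale down: 5 → 4 instances | Traffic dropped

Auto scaling best practices:

  • ⏱️ Cooldown: Set an appropriate cooldown (typically 3-5 minutes) to avoid oscillation caused by scaling too frequently
  • 📊 Multiple metrics: Do not rely on a single metric; combine CPU, memory, QPS, connection count, and other dimensions
  • 🎯 Target utilization: Set a reasonable target utilization (e.g., 70%) to leave enough buffer for traffic bursts
  • Fast scale-up: Scale-up should be more aggressive than scale-down so the system can respond quickly to traffic growth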

6.2 Choosing Scaling Metrics

The core question of auto scaling is: When should you add machines? When should you remove them?

Common decision metrics:

Metric | Scale-Up Threshold | Scale-Down Threshold | Use Case
CPU Utilization | > 70% | < 30% | Compute-intensive applications
Memory Utilization | > 75% | < 40% | Memory-intensive applications
QPS (Queries/sec) | > 1000/s | < 400/s | API gateways, web services
Connection Count | > 5000 | < 1000 | Databases, message queues
Custom Business Metrics | Depends on business | Depends on business | Specific business scenarios
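The thresholds in the table can be turned into a decision function. The sketch below assumes one possible policy: scale up if any metric is above its threshold, and scale down only if every reported metric is below its threshold. That combination rule and the metric key names are assumptions, not something the table prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    scale_up: float
    scale_down: float


# Values taken from the table; the key names are hypothetical.
THRESHOLDS = {
    "cpu_percent": Threshold(70, 30),
    "memory_percent": Threshold(75, 40),
    "qps": Threshold(1000, 400),
    "connections": Threshold(5000, 1000),
}


def scaling_decision(metrics):
    """Return 'scale_up', 'scale_down', or 'hold' for one metrics snapshot."""
    known = {name: value for name, value in metrics.items() if name in THRESHOLDS}
    if not known:
        return "hold"
    # Any single hot metric is enough to add capacity.
    if any(value > THRESHOLDS[name].scale_up for name, value in known.items()):
        return "scale_up"
    # Remove capacity only when every metric agrees the system is idle.
    if all(value < THRESHOLDS[name].scale_down for name, value in known.items()):
        return "scale_down"
    return "hold"
```

Making scale-up easier to trigger than scale-down biases the system toward having slightly too much capacity rather than too little.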

💡 Scaling Strategy Pitfalls and Solutions

Pitfall 1: Scaling responds too slowly; the traffic surge crashes the system before new capacity arrives

During a major e-commerce promotion, the team set CPU > 80% as the scale-up trigger, but metric collection had a 1-minute delay, and new instance startup took 3 minutes. Traffic arrived too fast — before scaling completed, the servers were already overwhelmed.

Solutions:

  • Scale up proactively: Predict traffic peaks based on historical data and start scaling 30 minutes in advance
  • Multi-level thresholds: Set 60% as a warning (start warming new instances), 70% for formal scaling, 80% for emergency scaling
  • Fast scaling: Use containerized deployment so new instances start within 30 seconds (vs. 3-5 minutes for VMs)
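A back-of-the-envelope calculation shows why the 80% trigger failed. The sketch assumes CPU climbs linearly during the surge, and the growth rate of 10 percentage points per minute is an invented figure; the 1-minute metric delay and the 3-minute and 30-second startup times come from the text.

```python
def latest_safe_trigger(growth_points_per_min, metric_delay_min, startup_min,
                        saturation=100.0):
    """Highest CPU trigger (in %) that still leaves time to add capacity.

    Assumes CPU rises linearly; real surges are rarely this tidy.
    """
    reaction_time_min = metric_delay_min + startup_min
    return saturation - growth_points_per_min * reaction_time_min


# VM-based scaling: 1 min metric delay + 3 min instance startup.
vm_trigger = latest_safe_trigger(10, metric_delay_min=1, startup_min=3)
# Container-based scaling: 1 min metric delay + 30 s startup.
container_trigger = latest_safe_trigger(10, metric_delay_min=1, startup_min=0.5)
```

Under these assumptions the VM setup would need to trigger at 60% or lower, while containers could wait until 85%, which is consistent with both the multi-level thresholds and the fast-startup advice above.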

Pitfall 2: Scaling is too aggressive; costs explode

A startup set an aggressive auto-scaling policy: scale up if CPU > 50%. As a result, a normal business fluctuation triggered scaling, and the server count ballooned from 5 to 30. The end-of-month cloud bill terrified the CTO.

Solutions:

  • Set a scale-up cooldown period: After one scale-up, wait at least 5 minutes before scaling again
  • Set a maximum instance count: for example, cap it at twice the usual instance count, to prevent unlimited growth
  • Distinguish spikes from trends: Only scale up if the threshold is exceeded for 3 consecutive periods, to avoid triggering on single-point spikes
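The three safeguards can be combined in one small guard object. This is a sketch under assumptions: the class name, the one-instance-per-step policy, and the default values (70% threshold, 3 periods, 300-second cooldown, cap of 10) are illustrative choices.

```python
class ScaleUpGuard:
    """Scale up only on a sustained breach, outside cooldown, and below the cap."""

    def __init__(self, threshold=70.0, required_periods=3, cooldown_s=300,
                 max_instances=10):
        self.threshold = threshold
        self.required_periods = required_periods
        self.cooldown_s = cooldown_s
        self.max_instances = max_instances
        self._breaches = 0
        self._last_scale_up = None

    def observe(self, cpu_percent, now_s, instances):
        """Process one metric sample and return the new instance count."""
        # A single sample below the threshold resets the streak (spike filter).
        self._breaches = self._breaches + 1 if cpu_percent > self.threshold else 0
        if self._breaches < self.required_periods:
            return instances
        in_cooldown = (self._last_scale_up is not None
                       and now_s - self._last_scale_up < self.cooldown_s)
        if in_cooldown or instances >= self.max_instances:
            return instances
        self._last_scale_up = now_s
        self._breaches = 0
        return instances + 1
```

With one sample per minute, a lone spike never triggers scaling, and a sustained surge adds at most one instance per cooldown window until the cap is reached.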

Pitfall 3: Scaling down too fast; newly added machines are removed immediately

A team set CPU < 30% as the scale-down trigger. After scaling up, traffic was still settling, and CPU briefly dropped to 25%, triggering a scale-down. Right after scaling down, CPU spiked back to 80%, triggering another scale-up — the system oscillated wildly in a "scale-up, scale-down, scale-up" loop.

Solutions:

  • More conservative scale-down: Scale-up threshold at 70%, scale-down threshold at 25%, leaving enough buffer in between
  • Longer scale-down cooldown: Wait at least 10 minutes after scaling up before scaling down
  • Gradual scale-down: Remove only 1 instance at a time, observe, then decide whether to continue
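The scale-down side can be sketched as a single conservative step function. The 25% threshold and the 10-minute cooldown come from the solutions above; the function name and the minimum of 2 instances are assumptions.

```python
def scale_down_step(cpu_percent, instances, seconds_since_last_scale_up,
                    scale_down_threshold=25.0, cooldown_s=600, min_instances=2):
    """Return the new instance count, removing at most one instance per call."""
    if instances <= min_instances:
        return instances  # never go below the floor
    if seconds_since_last_scale_up < cooldown_s:
        return instances  # too soon after a scale-up; traffic may still be settling
    if cpu_percent >= scale_down_threshold:
        return instances  # not idle enough
    return instances - 1  # gradual: one instance at a time
```

In the oscillation scenario above, a drop to 20% CPU two minutes after scaling up would leave the instance count unchanged, because the cooldown has not yet expired.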

7. Practical Guide: How to Choose a Load Balancer?

7.1 Comparison of Mainstream Load Balancers

Feature | Nginx | HAProxy | Envoy | Cloud Provider LB
Positioning | High-performance reverse proxy / LB | Open-source load balancer | Cloud-native proxy | Managed load balancer
Performance | Extremely high (C, event-driven) | High (event-driven) | High (C++) | Extremely high
Feature Richness | Basic LB, static files, caching | Rich LB algorithms | Advanced routing, observability | Full-featured
Configuration | Config file (nginx.conf) | Config file (haproxy.cfg) | API / config file | UI console
Extensibility | C modules / Lua scripts | Lua scripts | WASM / Filters | Limited (provider-defined features)
Use Cases | Static assets, L7 LB, SSL termination | L7 LB, high availability | Service mesh, multi-cloud | Quick start

💡 Selection Guide

Decision Tree:

Choose a load balancer:

├─ Only need basic L4 load balancing?
│  ├─ Yes → LVS (open-source, free) or cloud provider NLB
│  └─ No → Continue

├─ Need service mesh or multi-cloud deployment?
│  ├─ Yes → Envoy
│  └─ No → Continue

├─ Need extremely complex configuration and plugins?
│  ├─ Yes → HAProxy
│  └─ No → Continue

├─ Need high performance + simple configuration?
│  ├─ Yes → Nginx (first choice)
│  └─ No → Continue

└─ Want managed operations?
   ├─ Yes → Cloud provider LB (AWS ALB, Alibaba Cloud SLB)
   └─ No → Self-host Nginx

8. Summary: Core Mindset of Load Balancing

8.1 Core Principles Recap

Principle | Meaning | Key Practice Points
Layering | L4 handles "package sorting" (fast but simple); L7 "inspects the package contents" (flexible but slower) | L4 for databases, gaming; L7 for web, APIs
Redundancy | Single points of failure are the enemy | Improve availability through multi-instance, multi-region deployment
Gradualism | Don't release new versions with "one big cut" | Blue-green deployment for zero downtime; canary for controlled risk
Elasticity | The system should "breathe" like a living organism | Automatically scale up when busy, scale down when idle

8.2 Design Checklist

Before introducing load balancing, ask yourself the following questions:

  • [ ] Is load balancing really needed? (Is single-machine performance truly insufficient?)
  • [ ] Choose L4 or L7? (Based on business scenario)
  • [ ] How to handle session persistence? (Cookie, IP hash, session table)
  • [ ] How to implement health checks? (Active, passive, threshold settings)
  • [ ] How to achieve zero downtime? (Blue-green deployment, canary)
  • [ ] How to implement elasticity? (Scaling metrics, cooldown periods, max instance count)

9. Glossary

Term | Chinese Translation | Explanation
Load Balancer | 负载均衡器 | A device or software that distributes traffic across multiple backend servers
L4 Load Balancing | 四层负载均衡 | Load balancing based on the transport layer (TCP/UDP)
L7 Load Balancing | 七层负载均衡 | Load balancing based on the application layer (HTTP/HTTPS)
Health Check | 健康检查 | A mechanism that periodically checks the health status of backend servers
Session Persistence | 会话保持 | Ensures requests from the same user are always routed to the same server
Sticky Session | 粘性会话 | Another term for Session Persistence
Blue-Green Deployment | 蓝绿部署 | A zero-downtime release strategy using two environments that switch over
Canary Release | 金丝雀发布 | A gradual release strategy that validates the new version on a small portion of traffic first
Auto Scaling | 自动扩缩容 | Automatically increasing or decreasing the number of servers based on load
Horizontal Scaling | 水平扩展 | Increasing server count to improve processing capacity
Vertical Scaling | 垂直扩展 | Upgrading individual machine specs (CPU, RAM) to improve processing capacity
Multi-Region | 多区域 | Deploying services across multiple geographic regions
Active-Active | 多活 | Multiple regions simultaneously serving traffic
Active-Standby | 主备 | Only one region serves traffic; others are on standby
Data Replication | 数据同步 | Cross-region data replication mechanism
RTO | 恢复时间目标 | Recovery Time Objective: the target time within which a system must recover after failure
RPO | 恢复点目标 | Recovery Point Objective: the acceptable amount of data loss after a system failure