Rate Limiting and Backpressure Control

Introduction

On Double 11 at midnight, hundreds of millions of users flood in at once. Can the servers handle it? Every system has a finite processing capacity. When request volume exceeds that capacity and nothing controls it, the service degrades for everyone until nobody can use it. Rate limiting and backpressure are the two lines of defense that keep a system from being overwhelmed.

What will you learn in this chapter?

After reading this chapter, you will gain:

Why rate limiting is necessary: Understand why you need to proactively reject some requests to protect the system
Rate limiting algorithms: Master the principles and differences of three core algorithms — token bucket, leaky bucket, and sliding window
Backpressure mechanisms: Understand handling strategies when upstream speed exceeds downstream speed
Multi-layer rate limiting: Learn the multi-layer rate limiting architecture from client to gateway to service
Practical skills: Know which rate limiting strategy to choose for which scenario

Section	Content	Core Concepts
0	Why rate limiting is needed	Cascading failure, service protection
1	Rate limiting algorithms	Token bucket, leaky bucket, sliding window
2	Backpressure control	Buffer, drop strategy, elastic scaling
3	Multi-layer rate limiting architecture	Client, gateway, server side
4	Practice and selection	Nginx, Redis, Sentinel

0. The Big Picture: Why "Reject" Users?

This sounds counterintuitive — shouldn't we serve every user well? But the reality is: if you don't reject some requests, all requests will fail.

Imagine a restaurant that can only seat 100 people, and suddenly 1,000 people rush in. Without rate limiting, the result isn't that all 1,000 get to eat — it's that the kitchen crashes, the servers are overwhelmed, and nobody gets fed. The right approach is to queue and limit at the door, letting 100 people in first while the rest wait.

Core Goals of Rate Limiting

Protect the system: Prevent overload from causing complete service unavailability
Fair allocation: Share limited capacity among clients so that no single caller crowds out the rest and accepted requests can be processed normally
Graceful degradation: Rate-limited requests receive a clear 429 status code, rather than a timeout or 500 error

1. Rate Limiting Algorithms: Three Classic Approaches

The core question of rate limiting is: within a unit of time, what is the maximum number of requests allowed through? Different algorithms make different trade-offs in precision, burst traffic handling, and implementation complexity.

Token bucket

Tokens are added to the bucket at a fixed rate, up to the bucket's capacity; tokens that arrive while the bucket is full are discarded. Each request consumes one token and is rejected when none is left, so stored tokens allow short bursts.
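The description above can be sketched in a few lines of Python. This is a minimal, single-threaded illustration rather than a production limiter: the class name `TokenBucket`, the `allow` method, and the injectable `clock` parameter are assumptions made for the example, and the bucket is assumed to start full.

```python
import time


class TokenBucket:
    """Token bucket: refills at `rate` tokens per second, up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock              # injectable so the sketch can be tested
        self.tokens = float(capacity)   # assumption: start full, so a burst is allowed
        self.last = clock()

    def allow(self, cost=1):
        now = self.clock()
        # Lazily add the tokens earned since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# A fake clock makes the behaviour deterministic.
fake_now = [0.0]
bucket = TokenBucket(rate=1, capacity=5, clock=lambda: fake_now[0])
print([bucket.allow() for _ in range(6)])  # [True, True, True, True, True, False]
fake_now[0] = 2.0                          # two seconds later: two tokens refilled
print([bucket.allow() for _ in range(3)])  # [True, True, False]
```

Refilling lazily on each call avoids a background timer; a multi-threaded service would additionally need a lock around `allow`.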

Algorithm	Principle	Burst Traffic	Precision	Implementation Complexity
Token bucket	Tokens added at a fixed rate; requests consume tokens	Allowed (when bucket has surplus)	High	Medium
Leaky bucket	Requests queue up; processed at a fixed rate	Not allowed (fully smoothed)	High	Medium
Sliding window	Counts requests within a time window	Partially allowed	Fairly high	Low
Fixed window	Counts by fixed time window	May burst at boundaries	Low	Lowest
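The boundary burst noted for the fixed window can be made concrete with a small comparison. The sketch below is illustrative only: the class names `FixedWindow` and `SlidingWindowLog` are made up for the example, timestamps are passed in explicitly instead of being read from a clock, and the sliding window is implemented as a timestamp log, which is one of several possible variants.

```python
from collections import deque


class FixedWindow:
    """Counts requests per fixed window; the counter resets at each boundary."""

    def __init__(self, limit, window):
        self.limit, self.window = limit, window
        self.current = None   # index of the window being counted
        self.count = 0

    def allow(self, now):
        idx = int(now // self.window)
        if idx != self.current:
            self.current, self.count = idx, 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False


class SlidingWindowLog:
    """Keeps the timestamps of admitted requests from the last `window` seconds."""

    def __init__(self, limit, window):
        self.limit, self.window = limit, window
        self.stamps = deque()

    def allow(self, now):
        # Forget requests that are at least one full window old.
        while self.stamps and self.stamps[0] <= now - self.window:
            self.stamps.popleft()
        if len(self.stamps) < self.limit:
            self.stamps.append(now)
            return True
        return False


# Five requests just before the 1-second boundary, five just after it.
burst = [0.9] * 5 + [1.1] * 5
fixed = FixedWindow(limit=5, window=1.0)
sliding = SlidingWindowLog(limit=5, window=1.0)
fixed_passed = sum(fixed.allow(t) for t in burst)
sliding_passed = sum(sliding.allow(t) for t in burst)
print(fixed_passed, sliding_passed)  # 10 5
```

With a limit of 5 per second, the fixed window admits all 10 requests within 0.2 seconds because its counter resets at the boundary, while the sliding window admits only 5.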

Which Algorithm to Choose?

API rate limiting: Token bucket is most commonly used, allowing reasonable burst traffic
Traffic shaping: Leaky bucket suits scenarios requiring constant output rate
Simple counting: Sliding window is easy to implement, suitable for most web applications
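To contrast with the token bucket, here is a rough sketch of a leaky bucket used for traffic shaping. It does not hold real requests; it only computes when each one would be released. The names `LeakyBucket` and `schedule`, and the interpretation of `capacity` as the number of requests the bucket can hold at once, are assumptions of this example.

```python
class LeakyBucket:
    """Leaky bucket as a queue: requests leave at most once every 1/rate seconds."""

    def __init__(self, rate, capacity):
        self.interval = 1.0 / rate
        self.capacity = capacity   # assumption: max requests held in the bucket
        self.next_free = 0.0       # earliest time the next request may leave

    def schedule(self, now):
        """Return the release time for a request arriving at `now`, or None if rejected."""
        start = max(now, self.next_free)
        ahead = (start - now) / self.interval   # requests already queued ahead
        if ahead >= self.capacity:
            return None                         # bucket is full: overflow
        self.next_free = start + self.interval
        return start


bucket = LeakyBucket(rate=2, capacity=3)
# Five requests arrive at the same instant.
print([bucket.schedule(0.0) for _ in range(5)])  # [0.0, 0.5, 1.0, None, None]
```

However bursty the arrivals are, the releases are spaced 0.5 seconds apart, which is the constant output rate that traffic shaping calls for. A token bucket with the same parameters would have let the first three requests through immediately.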

2. Backpressure Control: When Upstream Is Faster Than Downstream

Rate limiting solves the problem of "too many external requests," while backpressure solves the problem of "internal component speed mismatch."

When a producer generates data faster than a consumer can process it, the intermediate buffer keeps growing, eventually leading to memory overflow or data loss. Backpressure mechanisms allow consumers to "notify upstream to slow down."
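One simple way for a consumer to slow its producer down is a bounded buffer whose write operation blocks when full. The sketch below uses Python's standard `queue.Queue` for this; the function name `run_pipeline`, its parameters, and the `None` sentinel are choices made for the example.

```python
import queue
import threading
import time


def run_pipeline(n_items, buffer_size, consume_delay):
    """Push n_items through a bounded buffer; return (consumed, max_depth)."""
    buf = queue.Queue(maxsize=buffer_size)   # bounded: put() blocks when full
    consumed = []
    max_depth = 0

    def consumer():
        while True:
            item = buf.get()
            if item is None:                 # sentinel: the producer is done
                return
            time.sleep(consume_delay)        # simulate slow downstream work
            consumed.append(item)

    worker = threading.Thread(target=consumer)
    worker.start()
    for i in range(n_items):
        buf.put(i)                           # blocks here whenever the buffer is full
        max_depth = max(max_depth, buf.qsize())
    buf.put(None)
    worker.join()
    return consumed, max_depth


consumed, max_depth = run_pipeline(n_items=20, buffer_size=3, consume_delay=0.005)
print(len(consumed), max_depth <= 3)  # 20 True
```

The producer could generate items far faster than the consumer handles them, yet the buffer never grows beyond its bound and nothing is lost; the cost is that the producer spends time waiting.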

Interactive demo: a producer emitting 6 items/s feeds a buffer of capacity 20, which a consumer drains at 3 items/s.

Four Backpressure Strategies

Drop: When the buffer is full, discard new or old data; suitable for scenarios with strict real-time requirements where some data loss is tolerable. Example: log collection, real-time metrics
Block: Make the producer wait until the consumer frees space in the buffer; suitable for scenarios where data cannot be lost. Example: Go channels, Java BlockingQueue
Sample: Process only a portion of the data and skip the rest; suitable for high-frequency data streams. Example: downsampling high-frequency sensor data
Elastic Scaling: Dynamically increase the number of consumers; suitable for cloud-native environments. Example: Kubernetes HPA autoscaling
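The drop and sample strategies differ mainly in which data survives. The following sketch shows the two drop variants (discard new data or discard old data) and a fixed-ratio sampler side by side; the function names and the idea of replaying a list of arrivals against a stalled consumer are simplifications made for illustration.

```python
from collections import deque


def drop_newest(items, capacity):
    """Keep the first `capacity` items; later arrivals are discarded."""
    buf = []
    for item in items:
        if len(buf) < capacity:
            buf.append(item)
    return buf


def drop_oldest(items, capacity):
    """Keep the most recent `capacity` items; deque(maxlen=...) evicts the oldest."""
    buf = deque(maxlen=capacity)
    for item in items:
        buf.append(item)
    return list(buf)


def sample_every(items, n):
    """Process one item out of every n and skip the rest."""
    return [item for i, item in enumerate(items) if i % n == 0]


readings = list(range(10))   # ten items arrive while the consumer is stalled
print(drop_newest(readings, 4))    # [0, 1, 2, 3]
print(drop_oldest(readings, 4))    # [6, 7, 8, 9]
print(sample_every(readings, 3))   # [0, 3, 6, 9]
```

Dropping the oldest data tends to suit real-time metrics, where the latest value matters most, while sampling preserves coverage of the whole stream at lower resolution.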

3. Multi-Layer Rate Limiting Architecture

In production environments, rate limiting at a single point is not enough — you need multi-layer protection, with each layer solving problems at a different granularity.

Layer	Location	Rate Limiting Granularity	Tools
Client	Frontend/App	Button debounce, request throttling	lodash.throttle, debounce
CDN/WAF	Edge nodes	IP-level, region-level	Cloudflare Rate Limiting
API Gateway	Entry gateway	Route-level, user-level	Nginx limit_req, Kong
Server side	Inside application	Interface-level, resource-level	Sentinel, Resilience4j
Database	Storage layer	Connection count, QPS	Connection pool configuration, slow query circuit breaking
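How layers with different granularity interact can be sketched with two stacked limiters, one keyed by IP and one keyed by user and route. Everything here is hypothetical: `LayerLimit`, `first_rejecting_layer`, the request dictionary, and the quotas are invented for the example, and the quotas never refill so that the output stays deterministic.

```python
from collections import defaultdict


class LayerLimit:
    """Hypothetical per-key quota for one layer (no refill, to keep the sketch short)."""

    def __init__(self, name, limit, key_fn):
        self.name = name
        self.limit = limit
        self.key_fn = key_fn            # picks the granularity: IP, user, route, ...
        self.used = defaultdict(int)

    def allow(self, request):
        key = self.key_fn(request)
        if self.used[key] >= self.limit:
            return False
        self.used[key] += 1
        return True


def first_rejecting_layer(request, layers):
    """Walk the layers outermost-first; return the rejecting layer's name or None."""
    for layer in layers:
        if not layer.allow(request):
            return layer.name
    return None


layers = [
    LayerLimit("edge", limit=3, key_fn=lambda r: r["ip"]),
    LayerLimit("gateway", limit=2, key_fn=lambda r: (r["user"], r["route"])),
]
req = {"ip": "203.0.113.7", "user": "alice", "route": "/orders"}
results = [first_rejecting_layer(req, layers) for _ in range(4)]
print(results)  # [None, None, 'gateway', 'edge']
```

The third request is stopped by the finer-grained gateway limit, and the fourth never reaches the gateway because the coarser edge limit is already exhausted, so the inner layers are spared that load.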

HTTP Specification for Rate Limiting

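Rate-limited requests should return a 429 Too Many Requests status code (defined in RFC 6585). The response usually carries the headers below. Retry-After is a standard HTTP header, while the X-RateLimit-* headers are a widely used convention rather than part of the HTTP standard, and their exact semantics vary between providers:

Retry-After: How long the client should wait before retrying (a number of seconds or an HTTP date)
X-RateLimit-Limit: Rate limit ceiling
X-RateLimit-Remaining: Remaining quota
X-RateLimit-Reset: Quota reset time (a Unix timestamp or a number of seconds, depending on the provider)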
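As a rough illustration, a server-side helper might assemble these headers as follows. The function name `rate_limit_response` and its parameters are invented for the example, and it assumes that the reset time is expressed as a Unix timestamp in seconds, which is only one of the conventions in use.

```python
import math


def rate_limit_response(allowed, limit, remaining, reset_at, now):
    """Return (status, headers) for a request, given the limiter's decision.

    reset_at and now are Unix timestamps in seconds (an assumption of this sketch).
    """
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(remaining, 0)),
        "X-RateLimit-Reset": str(int(reset_at)),
    }
    if allowed:
        return 200, headers
    # Retry-After in whole seconds, rounded up so the client never retries early.
    headers["Retry-After"] = str(max(1, math.ceil(reset_at - now)))
    return 429, headers


status, headers = rate_limit_response(
    False, limit=100, remaining=0, reset_at=1700000060, now=1700000042.5
)
print(status, headers["Retry-After"])  # 429 18
```

Sending the quota headers on successful responses as well lets well-behaved clients slow down before they hit the limit.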

4. Practical Selection

Scenario	Recommended Solution	Notes
Nginx entry rate limiting	`limit_req_zone` with `limit_req`	Based on the leaky bucket algorithm; simple configuration
Distributed rate limiting	Redis + Lua script	Token bucket or sliding window, multi-instance shared counting
Java microservices	Sentinel / Resilience4j	Both support rate limiting and circuit breaking; Sentinel also supports degradation and hotspot parameter rate limiting
Node.js API	express-rate-limit	Easy to use, supports Redis storage
Go services	golang.org/x/time/rate	Token bucket implementation from the official Go extended packages (not part of the standard library)

Summary

Rate limiting and backpressure are two critical lines of defense for protecting system stability. Rate limiting controls the rate of incoming external traffic, while backpressure coordinates the processing speed of internal components.

Key takeaways from this chapter:

Necessity of rate limiting: Without rejecting some requests, all requests will fail
Three core algorithms: Token bucket (allows bursts), leaky bucket (fully smoothed), sliding window (simple and precise)
Backpressure mechanisms: Four strategies — drop, block, sample, scale
Multi-layer protection: From client to database, each layer solves problems at a different granularity
429 specification: Return standard status code and rate limit headers when rate-limited
