
System Design Methodology

Introduction

System design is not about sketching architecture diagrams on a whim — it's a structured methodology. Whether it's a system design interview question or real-world architecture design, both follow a similar thinking framework: first understand the problem, then estimate the scale, then design the solution, and finally dive deep into optimization.

What will you learn from this article?

After reading this chapter, you will gain:

  • Design Process: Master the four-step framework for system design
  • Capacity Estimation: Learn the art of "back-of-envelope estimation"
  • Common Patterns: Get familiar with core patterns like caching, database sharding, and message queues
  • Trade-off Thinking: Understand the trade-off mindset in architecture design
  • Practical Case Studies: Understand the design process through cases like URL shorteners and feed systems
| Chapter | Content | Core Concepts |
| --- | --- | --- |
| Chapter 1 | Four-Step Design Method | Requirements clarification, capacity estimation, architecture design, deep optimization |
| Chapter 2 | Capacity Estimation | QPS, storage, bandwidth, back-of-envelope estimation |
| Chapter 3 | Core Design Patterns | Caching, database sharding, message queues, CDN |
| Chapter 4 | Trade-off Thinking | Consistency vs. availability, performance vs. cost |
| Chapter 5 | Classic Case Studies | URL shortener, feed system, flash sale system |

1. The Four-Step System Design Method

System design is not about drawing architecture diagrams right away. Whether in an interview or in practice, you should follow a structured process.

The four steps and their approximate time budgets:

  1. Clarify requirements (~5 min)
  2. Estimate capacity (~5 min)
  3. Design architecture (~15 min)
  4. Deep dive (~10 min)

Step 1: Clarify requirements

Do not rush into drawing architecture diagrams. First clarify the problem, scale, core features, and non-functional requirements.

  • What are the core features? (MVP scope)
  • Expected user scale? DAU / QPS
  • Read/write ratio?
  • Data volume? How much data must be stored?
  • Availability target? How many nines?
  • Latency target? What P99 latency is acceptable?

Example: designing a URL shortener. Create short links (write) and redirect (read), roughly 100:1 read/write ratio, 100 million redirects per day, links never expire.

Why Clarify Requirements First?

Many people start drawing diagrams as soon as they get the prompt, only to design a system that is "correct but not what the interviewer wanted." Spending 5 minutes clarifying requirements can prevent 30 minutes of rework later.

Common clarification questions:

  • What are the core features of the system? (Don't design every feature)
  • What is the user scale? (Determines whether distribution is needed)
  • What is the read/write ratio? (Determines caching strategy)
  • How long does data need to be retained? (Determines the storage solution)

2. Capacity Estimation: The Art of Back-of-Envelope Calculations

"Back-of-envelope estimation" is a core skill in system design. You don't need precise calculations — just knowing the order of magnitude is enough.

Back-of-the-Envelope Estimator (sample values)

| Metric | Value |
| --- | --- |
| Daily requests | 20 million |
| Average QPS | 231 |
| Peak QPS | 693 |
| Daily bandwidth | 102.4 GB |
| Peak bandwidth | 3.5 MB/s |

Common estimation references

| Quantity | Rule of thumb |
| --- | --- |
| 1 day | 86,400 seconds |
| 1 month | ~2.5M seconds |
| 1,000 QPS | ~1 eight-core server |
| 100M requests/day | ~1,200 QPS |
| Single MySQL node | ~5,000 QPS |
| Single Redis node | ~100,000 QPS |

Quick Reference for Common Conversions

| Magnitude | Conversion | Memory Trick |
| --- | --- | --- |
| 1 day | 86,400 seconds | ≈ 100K seconds |
| 100M requests/day | ≈ 1,200 QPS | Divide by 100K |
| 1 KB × 100M | ≈ 100 GB | 100M small records |
| 1 MB × 1M | ≈ 1 TB | 1M images |
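
The conversions above can be sketched as a few lines of arithmetic. This is a minimal illustration, not a sizing tool: the function names are made up for this example, the peak factor of 3 is an assumption (it matches the peak-to-average ratio used elsewhere in this chapter), and storage uses decimal units (1 GB = 10^9 bytes).

```python
SECONDS_PER_DAY = 86_400


def estimate_load(daily_requests: int, peak_factor: float = 3.0) -> dict:
    """Convert a daily request count into average and peak QPS.

    peak_factor is an assumed multiplier; measure real traffic when you can.
    """
    avg_qps = daily_requests / SECONDS_PER_DAY
    return {"avg_qps": avg_qps, "peak_qps": avg_qps * peak_factor}


def storage_gb(record_count: int, bytes_per_record: int) -> float:
    """Total storage in decimal gigabytes for fixed-size records."""
    return record_count * bytes_per_record / 1e9


load = estimate_load(100_000_000)          # 100M requests per day
print(round(load["avg_qps"]))              # 1157, i.e. roughly 1,200 QPS
print(round(load["peak_qps"]))             # 3472
print(storage_gb(100_000_000, 1_000))      # 100.0 (1 KB x 100M records)
```

Rounding 1,157 up to 1,200 is deliberate: at this stage only the order of magnitude matters.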

The 80/20 Rule in Estimation

Many systems roughly follow the 80/20 rule: 20% of the data serves 80% of the requests. This means:

  • Cache size ≈ Total data volume × 20%
  • Hot-key QPS ≈ Total QPS × 80%, concentrated on the hottest 20% of keys
  • Cache hit rate target ≈ 80%+ (a lower rate suggests a caching strategy problem)
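
As a rough sketch, the first two rules of thumb reduce to two multiplications. The function name and parameters below are hypothetical, and the 20%/80% defaults are the heuristic from this section rather than measured values.

```python
def cache_plan(total_data_gb: float, total_qps: float,
               hot_fraction: float = 0.2,
               hot_traffic_share: float = 0.8) -> dict:
    """Apply the 80/20 heuristic to size a cache and its expected load."""
    return {
        # Memory needed to hold the hot subset of the data.
        "cache_size_gb": total_data_gb * hot_fraction,
        # Request rate expected to land on that hot subset.
        "hot_key_qps": total_qps * hot_traffic_share,
    }


plan = cache_plan(total_data_gb=100, total_qps=1200)
print(round(plan["cache_size_gb"]))  # 20
print(round(plan["hot_key_qps"]))    # 960
```

Real access patterns are often more skewed or flatter than 80/20, so treat the result as a starting point and adjust it against the observed hit rate.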

3. Core Design Patterns

These patterns appear repeatedly in system design; mastering them prepares you for most scenarios.

3.1 Caching Patterns

| Pattern | Read Path | Write Path | Use Cases |
| --- | --- | --- | --- |
| Cache-Aside | Check cache first; on miss, query DB and backfill | Write DB first, then invalidate cache | General purpose, most commonly used |
| Read-Through | Cache layer automatically loads from DB | Same as Cache-Aside | Requires caching framework support |
| Write-Behind | Same as Cache-Aside | Write to cache first, async write to DB | Write-heavy, can tolerate data loss |

Why "Invalidate Cache" Instead of "Update Cache"?

Updating the cache in place is prone to inconsistency under concurrency. Suppose threads A and B update the same key at nearly the same time: A writes to the DB first and B second, so the DB ends with B's value, but B updates the cache first and A second, so the cache ends with A's now-stale value. Invalidating the cache instead forces the next read to reload from the DB, which sidesteps this race.
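
The Cache-Aside read and write paths can be sketched with two dictionaries standing in for the database and the cache. The class and attribute names are hypothetical, and the sketch is single-threaded; a real deployment would use a database client and something like Redis, plus TTLs.

```python
class CacheAsideStore:
    """Cache-Aside: read through the cache, invalidate on write."""

    def __init__(self):
        self.db = {}       # stand-in for the database
        self.cache = {}    # stand-in for the cache (e.g. Redis)
        self.db_reads = 0  # counts how often reads fall through to the DB

    def read(self, key):
        if key in self.cache:          # cache hit
            return self.cache[key]
        self.db_reads += 1             # cache miss: query the DB
        value = self.db.get(key)
        if value is not None:
            self.cache[key] = value    # backfill the cache
        return value

    def write(self, key, value):
        self.db[key] = value           # 1. write the DB first
        self.cache.pop(key, None)      # 2. invalidate, do not update


store = CacheAsideStore()
store.write("user:1", "Alice")
store.read("user:1")            # miss, loads from the DB and backfills
store.read("user:1")            # hit
store.write("user:1", "Bob")    # invalidates the cached "Alice"
print(store.read("user:1"))     # Bob
print(store.db_reads)           # 2
```

Because `write` never puts a value into the cache, two concurrent writers cannot leave a stale value behind by racing on the cache update; the next reader always repopulates from the database.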

3.2 Database Sharding

When a single table exceeds tens of millions of rows, or when a single database's QPS hits a bottleneck, it's time to consider database sharding.

| Strategy | Approach | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Vertical sharding | Split databases by business domain | Business decoupling, independent scaling | Cross-database JOINs are difficult |
| Horizontal sharding | Split one table into multiple tables by rule | Controllable data volume per table | Shard key selection is critical |
| Vertical table splitting | Split large columns into a separate table | Reduces I/O, improves query efficiency | Requires additional JOINs |

Shard Key Selection Principles:

  • Choose the most frequently queried field (e.g., user_id)
  • Data distribution should be even to avoid hotspots
  • Try to keep the same user's data on the same shard (minimizes cross-shard queries)
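
One common way to apply these principles is to hash the shard key and take the result modulo the shard count. The sketch below assumes `user_id` as the shard key and a hypothetical fixed count of 4 shards; it uses SHA-256 only as a stable, well-distributed hash, not for security.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4  # hypothetical shard count for illustration


def shard_for(user_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Map a user_id to a shard index that is stable across processes."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# The same user always lands on the same shard.
print(shard_for(42) == shard_for(42))  # True

# Rows spread roughly evenly across shards.
distribution = Counter(shard_for(uid) for uid in range(10_000))
print(sorted(distribution))            # [0, 1, 2, 3]
```

Plain modulo routing means that changing the shard count remaps most keys, so the count is usually chosen generously up front or replaced with consistent hashing when resharding is expected.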

3.3 Message Queues

Message queues are the "shock absorbers" of distributed systems. Their core roles are decoupling, async processing, and peak shaving.

| Scenario | Without Queue | With Queue |
| --- | --- | --- |
| Send notification after order | Order API calls notification service synchronously; notification failure causes order failure | Send message after order success; notification service consumes asynchronously |
| Flash sale | Burst traffic overwhelms the database | Requests enter queue first; backend processes at its own pace |
| Data synchronization | Service A calls Service B's API directly | Service A publishes event; Service B subscribes and handles it |
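
The first row of the table can be illustrated with an in-process queue. This is only a sketch: `queue.Queue` and a thread stand in for a real broker (Kafka, RabbitMQ, and so on) and a separate notification service, and the function names are invented for the example.

```python
import queue
import threading

notifications = queue.Queue()  # stand-in for a message broker topic
sent = []                      # records what the consumer processed


def place_order(order_id: int) -> str:
    """Producer: finish the order, then enqueue the notification."""
    # Persisting the order would happen here (omitted).
    notifications.put({"order_id": order_id})
    return "ok"  # returns without waiting for the notification


def notification_worker() -> None:
    """Consumer: drains the queue at its own pace."""
    while True:
        message = notifications.get()
        try:
            if message is None:  # sentinel: shut down
                break
            sent.append(f"notified order {message['order_id']}")
        finally:
            notifications.task_done()


worker = threading.Thread(target=notification_worker, daemon=True)
worker.start()

results = [place_order(i) for i in range(3)]
notifications.put(None)   # ask the worker to stop
notifications.join()      # wait until every message is processed
worker.join()

print(results)  # ['ok', 'ok', 'ok']
print(sent)     # ['notified order 0', 'notified order 1', 'notified order 2']
```

The order path succeeds as soon as the message is enqueued, so a slow or failing notification consumer no longer fails the order itself.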

4. Trade-off Thinking: There Are No Silver Bullets

The essence of architecture design is trade-offs. Every decision has a cost — the key is understanding the cost and making choices appropriate for the current stage.

| Trade-off Dimension | Option A | Option B | Decision Basis |
| --- | --- | --- | --- |
| Consistency vs. Availability | Strong consistency (CP) | High availability (AP) | Can the business tolerate brief inconsistency? |
| Performance vs. Cost | Full caching | On-demand caching | Data volume and budget |
| Simplicity vs. Flexibility | Monolithic architecture | Microservices | Team size and business complexity |
| Real-time vs. Batch | Stream processing | Batch processing | Data timeliness requirements |
| Self-managed vs. Hosted | Run your own MySQL | Use a managed cloud database (RDS) | Operations capability and cost |

Architecture Decision Records (ADR)

Every important architecture decision should be documented: what was the context, what options were considered, why this one was chosen, and what the trade-offs are. This isn't about assigning blame — it's about helping future teams understand "why it was designed this way."

The format is simple:

  • Title: Using X instead of Y
  • Context: What problem we encountered
  • Decision: What solution we chose
  • Rationale: Why we chose this
  • Consequences: The drawbacks and risks of this decision

Common Trade-off Mistakes

| Mistake | Manifestation | Correct Approach |
| --- | --- | --- |
| Premature optimization | Sharding at 1,000 daily active users | Start with a single database; shard when you hit bottlenecks |
| Technology-driven | "I want to use Kafka" instead of "I need async processing" | Start from the problem, not the technology |
| Ignoring operations cost | Choosing the optimal solution that the team can't maintain | Solutions must match team capability |
| Pursuing perfect consistency | Using distributed transactions for every scenario | Eventual consistency is sufficient for most scenarios |

5. Classic Case Studies

Let's connect the methodology we've learned through three classic examples.

5.1 URL Shortener (TinyURL)

The URL shortener is a classic system design interview question — small but comprehensive.

Requirements Clarification:

  • Core features: Long URL → short URL (write), short URL → redirect (read)
  • Read/write ratio: approximately 100:1 (reads far outnumber writes)
  • Daily redirects: 100 million
  • Short URLs never expire

Capacity Estimation:

| Metric | Calculation | Result |
| --- | --- | --- |
| Write QPS | 100M / 100 / 86,400 | ≈ 12 QPS |
| Read QPS | 100M / 86,400 | ≈ 1,200 QPS |
| Peak read QPS | 1,200 × 3 | ≈ 3,600 QPS |
| 5-year storage | 1M/day × 365 × 5 × 100 B | ≈ 180 GB |
| Cache (20%) | 180 GB × 20% | ≈ 36 GB |

Architecture Design:

Write path: Client → API Server → ID Generator → Base62 Encode → Write to MySQL + Redis
Read path: Client → CDN → API Server → Redis lookup → 302 redirect
                                    ↓ (cache miss)
                                  MySQL query → backfill Redis

Key Design Decisions:

  • Short code generation: Snowflake distributed ID + Base62 encoding to avoid hash collisions
  • Caching strategy: Cache-Aside, CDN acceleration for hot short URLs
  • Database: About 1.8 billion rows (~180 GB) accumulate over five years. A single table indexed by short code is enough to start, but by the row-count threshold in section 3.2 it should be sharded by short code as it grows
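
The Base62 step in the write path is plain base conversion of the numeric ID. The sketch below assumes the ID arrives as a non-negative integer (for example from a Snowflake-style generator, which is not shown) and uses one possible alphabet ordering; any fixed ordering of the 62 characters works as long as encoder and decoder agree.

```python
import string

# 0-9, a-z, A-Z: 62 characters. The ordering is an arbitrary choice.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase


def base62_encode(number: int) -> str:
    """Encode a non-negative integer ID as a Base62 short code."""
    if number < 0:
        raise ValueError("ID must be non-negative")
    if number == 0:
        return ALPHABET[0]
    chars = []
    while number:
        number, remainder = divmod(number, 62)
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))


def base62_decode(code: str) -> int:
    """Invert base62_encode to recover the numeric ID."""
    number = 0
    for ch in code:
        number = number * 62 + ALPHABET.index(ch)
    return number


print(base62_encode(61))                 # Z
print(base62_encode(62))                 # 10
print(len(base62_encode(2**63 - 1)))     # 11
print(base62_decode(base62_encode(123456789)) == 123456789)  # True
```

Since each unique ID maps to exactly one code, there is nothing to collide, which is the advantage over hashing the long URL. A 63-bit ID needs at most 11 characters.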

5.2 Feed System

Social platform feeds (WeChat Moments, Twitter home timeline) are another classic question.

Core Challenge: When a user publishes a post, how do all their followers see it?

| Approach | How It Works | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Pull model | Aggregate followees' posts in real time at read time | Simple writes, less storage | Slow reads; high latency with many followees |
| Push model | Write to all followers' inboxes at publish time | Extremely fast reads | Severe write amplification for accounts with many followers |
| Hybrid (Push-Pull) | Push for regular users, pull for celebrities | Balanced read/write performance | Complex implementation |

Hybrid Push-Pull Approach:

  • Followers < 10K: Push to all followers' feed caches at publish time (push model)
  • Followers ≥ 10K: Don't push; followers pull in real time when reading (pull model)
  • When a user opens their feed: Merge pushed content + real-time pulled celebrity content, sorted by time
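
The three rules above can be sketched in memory as follows. Everything here is illustrative: the class, its fields, and the tuple layout `(timestamp, author, text)` are invented for the example, dictionaries stand in for the feed cache and post store, and the demo lowers the celebrity threshold to 2 so the behavior is visible with a handful of users.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 10_000  # follower count at which pushing stops


class FeedService:
    def __init__(self, threshold: int = CELEBRITY_THRESHOLD):
        self.threshold = threshold
        self.followers = defaultdict(set)  # author -> users following them
        self.following = defaultdict(set)  # user -> authors they follow
        self.inbox = defaultdict(list)     # user -> posts pushed to them
        self.posts = defaultdict(list)     # author -> their own posts

    def follow(self, user: str, author: str) -> None:
        self.followers[author].add(user)
        self.following[user].add(author)

    def is_celebrity(self, author: str) -> bool:
        return len(self.followers[author]) >= self.threshold

    def publish(self, author: str, timestamp: int, text: str) -> None:
        post = (timestamp, author, text)
        self.posts[author].append(post)
        if not self.is_celebrity(author):        # push model
            for follower in self.followers[author]:
                self.inbox[follower].append(post)

    def read_feed(self, user: str, limit: int = 20) -> list:
        pulled = [post                           # pull model
                  for author in self.following[user]
                  if self.is_celebrity(author)
                  for post in self.posts[author]]
        # A set removes duplicates if an author crossed the threshold
        # after some of their posts had already been pushed.
        merged = set(self.inbox[user]) | set(pulled)
        return sorted(merged, reverse=True)[:limit]  # newest first


svc = FeedService(threshold=2)
svc.follow("bob", "alice")                    # alice: 1 follower, regular
svc.follow("bob", "star")
svc.follow("carol", "star")                   # star: 2 followers, celebrity
svc.publish("alice", 1, "hi")                 # pushed to bob's inbox
svc.publish("star", 2, "hello")               # stored only, not pushed
print(svc.read_feed("bob"))    # [(2, 'star', 'hello'), (1, 'alice', 'hi')]
print(svc.inbox["carol"])      # []
```

Write cost for a celebrity post stays constant regardless of follower count, while regular users' posts are still served from the precomputed inbox.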

5.3 Flash Sale System

The core challenge of a flash sale: instant ultra-high concurrency + inventory must not be oversold.

Traffic Characteristics:

  • Before the sale starts: Many users refresh the page waiting
  • At the moment of sale: QPS can be 100x or more above normal
  • After the sale ends: Traffic drops quickly

Layered Peak Shaving Strategy:

User request → CDN (static pages) → Gateway (rate limiting) → Message queue (peak shaving) → Inventory service (deduction)

| Layer | Strategy | Effect |
| --- | --- | --- |
| Frontend | Button gray-out + random delay + CAPTCHA | Filters bots, disperses requests |
| CDN | Static resource caching | Reduces 90% of page requests |
| Gateway | Token bucket rate limiting | Only allows traffic the system can handle |
| Message queue | Requests queued, processed asynchronously | Peak shaving, protects the database |
| Inventory service | Redis pre-deduction + Lua atomic operations | Prevents overselling, millisecond response |
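
The gateway row names token bucket rate limiting. A minimal single-process sketch of that algorithm follows; the class name and parameters are hypothetical, the clock is injectable so the demo is deterministic, and the code is not thread-safe. Production gateways typically provide this as built-in configuration.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens/second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity  # start full
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill for the elapsed time, capped at the bucket capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller should reject or queue the request


# Deterministic demo using a fake clock instead of real time.
fake_now = [0.0]
bucket = TokenBucket(rate=2, capacity=5, clock=lambda: fake_now[0])

burst = [bucket.allow() for _ in range(7)]
print(burst)   # [True, True, True, True, True, False, False]

fake_now[0] = 1.0  # one second later, 2 tokens have been refilled
later = [bucket.allow() for _ in range(3)]
print(later)   # [True, True, False]
```

The capacity bounds the burst the backend must absorb, and the rate bounds the sustained load, which is how the gateway admits only the traffic the system can handle.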

Core Principles of Flash Sales

  1. Intercept upstream whenever possible: If you can block it at the CDN, don't let it reach the application layer
  2. Separate reads and writes: Product detail pages use cache; only orders go to the database
  3. Async processing: After the user clicks "buy," immediately return "queuing" and process in the background
  4. Fallback plans: Rate limiting, circuit breaking, degradation — every layer needs a Plan B
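
The no-overselling requirement comes down to making "check stock, then decrement" a single atomic step, which is what the Redis + Lua approach in the table provides. The sketch below shows the same idea in one process, with a `threading.Lock` standing in for Redis's atomic script execution; the class and method names are invented for the example.

```python
import threading


class Inventory:
    """Atomic check-and-decrement, so stock never goes negative."""

    def __init__(self, stock: int):
        self._stock = stock
        self._lock = threading.Lock()

    def try_deduct(self, quantity: int = 1) -> bool:
        # The check and the decrement happen under one lock, so no other
        # buyer can slip in between them.
        with self._lock:
            if self._stock >= quantity:
                self._stock -= quantity
                return True
            return False  # sold out

    @property
    def stock(self) -> int:
        with self._lock:
            return self._stock


inventory = Inventory(stock=10)
outcomes = []
outcomes_lock = threading.Lock()


def buyer() -> None:
    succeeded = inventory.try_deduct()
    with outcomes_lock:
        outcomes.append(succeeded)


threads = [threading.Thread(target=buyer) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sum(outcomes))    # 10 successful purchases
print(inventory.stock)  # 0
```

With 100 concurrent buyers and 10 items, exactly 10 deductions succeed. Separating the check from the decrement (read the stock, then write it back) is what allows overselling under concurrency.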

Summary

System design is a highly practical skill. The core lies in structured thinking and making trade-offs.

Key takeaways from this chapter:

  1. Four-Step Framework: Requirements clarification → capacity estimation → architecture design → deep optimization — every step is essential
  2. Back-of-Envelope Estimation: Precision isn't needed — just knowing the order of magnitude guides architecture decisions
  3. Core Patterns: Caching, database sharding, message queues, CDN, rate limiting, circuit breaking — these are the "building blocks" of system design
  4. Trade-off Thinking: There are no perfect solutions, only solutions appropriate for the current stage — document the rationale and cost of every decision
  5. Classic Cases: URL shorteners for fundamentals, feed systems for push-pull models, flash sales for high concurrency — mastering these three lets you reason by analogy

Further Reading