Concurrency, Async, and Multithreading

💡 Learning Guide: Concurrent programming is the "Achilles' heel" for many backend engineers — you get stumped in interviews, hit bugs in production, and have no clue where to start with performance tuning. This chapter revolves around one core question: When 100,000 users hit your service simultaneously, will your code collapse?

Before diving in, make sure you have these two foundational pieces:

What are CPU, Memory, and I/O: If you're fuzzy on these basics, brush up on operating system fundamentals.
What is blocking/non-blocking: If you're not yet comfortable with synchronous/asynchronous concepts, get a feel for them through hands-on programming first.

0. Introduction: Why Does Your Service "Freeze" During Peak Traffic?

Process / Thread / Coroutine Comparison

Multi-ProcessMulti-ThreadCoroutine

Memory Usage

400

Context Switches

Completed Tasks

Elapsed Time

CPU 1

CPU 2

CPU 3

CPU 4

Task Queue

Task 1

Task 2

Task 3

Task 4

Task 5

Task 6

Task 7

Task 8

Task 9

Task 10

Task 11

Task 12

Task 13

Task 14

Task 15

Task 16

Multi-Process Model

Each process has its own independent memory space, strong isolation but high overhead. Inter-process communication requires IPC mechanisms. Suitable for scenarios requiring strong isolation, such as browser tabs and sandbox programs.

Many developers encounter situations like these in practice:

The service responds lightning-fast in local testing, but becomes a "slideshow" once deployed;
You bought high-spec servers, yet CPU utilization never goes up;
During promotional peaks, the service "avalanches," forcing you to degrade or circuit-break.

Intuitively, we think: "The server isn't powerful enough." But most of the time, the problem isn't that the hardware is "too slow" — it's that we didn't design the concurrency model properly.

The Core Dilemma:

Without concurrent processing: requests queue up, user experience is terrible;
With reckless multithreading: lock contention and context-switching overhead actually degrade performance.

Faced with these challenges, simply "adding more machines" is no longer enough. We need a systematic approach to concurrency design that ensures both performance and stability under high concurrency. That's exactly what this chapter aims to address.

1. Core Concepts: Processes, Threads, and Coroutines — What's the Difference?

1.1 A Restaurant Analogy

Imagine you run a restaurant and need to serve many customers at once:

Concept	Restaurant Analogy	Technical Meaning
Process	An independent restaurant branch	Has its own memory space and resource allocation. It is the basic unit of OS resource allocation. A crash in one process does not affect others.
Thread	A chef within a branch	The basic unit of CPU scheduling, sharing memory space within a process. Threads in the same process can share data, but one thread crashing can bring down the entire process.
Coroutine	A chef's "cloning technique"	A user-space lightweight thread, scheduled by the program itself rather than the OS. Switching overhead is minimal; you can create millions of them.

1.2 In-Depth Comparison: The Essential Differences

Process Memory Isolation Demo

System Memory

Process Isolation

Each process has its own independent virtual address space. A crash in one process does not affect other processes. Click "Create Process" to start the demo.

Process: The "Container" of Resource Isolation

Key Characteristics:

Strong isolation: Each process has its own independent virtual address space
High overhead: Creation/switching requires OS intervention, taking ~1-10ms
Complex communication: Inter-Process Communication (IPC) requires special mechanisms (pipes, message queues, shared memory, etc.)

Use Cases:

Services requiring strong isolation (e.g., browser tabs, sandboxed programs)
Multi-language hybrid deployments
Service units that need independent restart/upgrade

Thread: The "Light Cavalry" of Shared Memory

Thread Scheduling Demo

FIFO (First Come First Served)Round RobinPriority Scheduling

Timeline

0ms

100ms

200ms

300ms

400ms

500ms

Completed Threads

Context Switches

0ms

Avg Wait Time

Throughput (threads/s)

Current Scheduling Algorithm: Round Robin (Time Slice)

Each thread takes turns executing for a time slice. When the slice expires, it switches to the next thread. Good responsiveness, suitable for interactive systems.

Key Characteristics:

Shared memory: Threads within the same process share code, data, and heap segments
Independent stack space: Each thread has its own stack (typically ~1MB)
Fast switching: Thread switching takes ~1-10μs, about 1000× faster than process switching
Synchronization required: Shared data must be protected with locks

Use Cases:

CPU-intensive tasks (computation, image processing)
Concurrent tasks that need to share a lot of data
Latency-sensitive background tasks

Coroutine: The User-Space "Green Thread"

Coroutine Lightweight Comparison Demo

Memory Usage ComparisonSwitching Overhead ComparisonCreation Speed Comparison

1000 coroutines

Thread Model

Memory Usage

1000 MB

Creation Time

100 ms

Context Switch

~1-10 us

Coroutine Model

Memory Usage

2000 MB

Creation Time

10 ms

Context Switch

~100 ns

Saves -100% Memory

Medium Scale

Using coroutines can save -100% of memory (about -1000MB), with 10x faster creation speed.

Key Characteristics:

User-space scheduling: Scheduled by the program/runtime library, not the OS
Extremely lightweight: Coroutine stacks are typically only a few KB; you can create millions
Extremely fast switching: Coroutine switching takes ~100ns, about 100× faster than thread switching
Non-preemptive: Coroutines voluntarily yield the CPU (cooperative multitasking)

Use Cases:

I/O-intensive high-concurrency services (web servers, gateways)
Scenarios requiring many long-lived connections (IM, game servers)
Streaming data processing, pipeline workflows

2. Case Study: An E-Commerce Promotion's "Concurrency Nightmare"

2.1 Lessons Learned: Evolution from "Single Machine" to "Distributed"

Let's look at the evolution story of a real e-commerce system:

Phase 1: The Single-Machine Era (1K DAU)

python

# A simple Flask application
from flask import Flask

app = Flask(__name__)

@app.route('/order')
def create_order():
    # Query inventory
    stock = db.query("SELECT stock FROM products WHERE id=1")
    if stock > 0:
        # Deduct inventory
        db.execute("UPDATE products SET stock = stock - 1 WHERE id=1")
        # Create order
        db.execute("INSERT INTO orders ...")
        return "Order created!"
    return "Out of stock!"

# Start: flask run

Problems:

Single process, single thread — can only handle one request at a time
Inventory deduction is not locked, leading to overselling under concurrency
Limited database connections; the connection pool is quickly exhausted

Phase 2: The Multi-Process Era (10K DAU)

python

# Deploy with Gunicorn multi-process
gunicorn -w 4 -k sync app:app

# 4 worker processes, each handling requests independently

New Problems:

4 processes all query inventory simultaneously, all see stock=1, all deduct successfully — 3 oversold items!
Need to introduce distributed locks

python

import redis

# Use Redis distributed lock
lock = redis_client.lock("stock_lock", timeout=10)
if lock.acquire():
    try:
        stock = db.query("SELECT stock FROM products WHERE id=1")
        if stock > 0:
            db.execute("UPDATE products SET stock = stock - 1 WHERE id=1")
    finally:
        lock.release()

Phase 3: The Coroutine Era (100K DAU)

python

# Use FastAPI + asyncio
from fastapi import FastAPI
import asyncio

app = FastAPI()

async def check_stock(product_id: int) -> int:
    # Async database query, non-blocking
    result = await db.fetch_one(
        "SELECT stock FROM products WHERE id = :id",
        {"id": product_id}
    )
    return result["stock"]

@app.get("/order")
async def create_order(product_id: int):
    # Concurrently check inventory and user info
    stock_task = check_stock(product_id)
    user_task = get_user_info(request.user_id)

    stock, user = await asyncio.gather(stock_task, user_task)

    if stock > 0:
        # Async inventory deduction
        await db.execute(
            "UPDATE products SET stock = stock - 1 WHERE id = :id",
            {"id": product_id}
        )
        return {"status": "success"}

    return {"status": "out_of_stock"}

# Start: uvicorn main:app --workers 4
# Each worker can handle thousands of concurrent coroutines

Advantages:

Thousands of concurrent connections within a single thread
Voluntarily yields CPU during I/O operations without blocking other requests
Extremely low memory footprint, ideal for high-concurrency long-connection scenarios

2.2 Concurrency Model Evolution Comparison Table

Phase	Concurrency Model	DAU Supported	Core Problem	Solution
Monolith	Single process, single thread	1K	Cannot handle concurrency	Introduce multi-process
Multi-Process	Multi-process synchronous	10K	Data races, overselling	Distributed locks
Multi-Threaded	Multi-threaded + locks	50K	Context-switching overhead, deadlocks	Thread pools, lock-free queues
Coroutine	Async I/O	100K+	Code complexity, debugging difficulty	Framework encapsulation, distributed tracing
Hybrid	Multi-process + coroutines	1M+	Architectural complexity	Service governance, elastic scaling

3. Deep Dive: How Various Concurrency Models Work

3.1 Process Model: Isolation and Communication

Memory Isolation Mechanism

Process Memory Isolation Demo

System Memory

Process Isolation

Each process has its own independent virtual address space. A crash in one process does not affect other processes. Click "Create Process" to start the demo.

Each process has its own independent virtual address space:

Process A Virtual Memory      Process B Virtual Memory
+----------------+        +----------------+
|  Kernel Space  |        |  Kernel Space  |  <-- Shared (read-only)
|  (shared)      |        |  (shared)      |
+----------------+        +----------------+
|  Stack         |        |  Stack         |  <-- Independent
|  (grows down)  |        |  (grows down)  |
+----------------+        +----------------+
|  Heap          |        |  Heap          |  <-- Independent
|  (grows up)    |        |  (grows up)    |
+----------------+        +----------------+
|  Data Segment  |        |  Data Segment  |  <-- Independent
|  (.bss/.data)  |        |  (.bss/.data)  |
+----------------+        +----------------+
|  Code Segment  |        |  Code Segment  |  <-- Independent
|  (.text)       |        |  (.text)       |
+----------------+        +----------------+

Inter-Process Communication (IPC) Methods

Method	Principle	Speed	Use Case
Pipe	Kernel buffer, unidirectional stream	Medium	Parent-child process communication
Message Queue	Kernel message linked list	Medium	Async message passing
Shared Memory	Same physical memory mapped	Fastest	Large data sharing
Semaphore	Kernel counter	-	Synchronization and mutual exclusion
Socket	Network protocol stack	Slower	Cross-machine communication
Signal	Soft interrupt	-	Event notification

3.2 Thread Model: Scheduling and Synchronization

Thread Scheduling Principles

Thread Scheduling Demo

FIFO (First Come First Served)Round RobinPriority Scheduling

Timeline

0ms

100ms

200ms

300ms

400ms

500ms

Completed Threads

Context Switches

0ms

Avg Wait Time

Throughput (threads/s)

Current Scheduling Algorithm: Round Robin (Time Slice)

Each thread takes turns executing for a time slice. When the slice expires, it switches to the next thread. Good responsiveness, suitable for interactive systems.

How the OS thread scheduler works:

Ready Queue                  Running                  Waiting Queue
+--------+                +--------+               +--------+
| Thread B|  <-- timeslice| Thread A|  <-- I/O req | Thread C|
| Thread D|      expires  |(running)|              | Thread E|
| Thread F|                +--------+              |(blocked)|
+--------+                                         +--------+
    |                                                  |
    v                                                  v
Scheduler picks next to run by priority    Moved back to ready queue when I/O completes

Common Thread Synchronization Mechanisms

Mechanism	Principle	Pros	Cons
Mutex	Binary state, exclusive access	Simple to implement	Poor performance under heavy contention
RWLock	Shared for reads, exclusive for writes	Efficient for read-heavy workloads	Complex implementation, risk of write starvation
Spinlock	Busy-waiting, doesn't release CPU	Efficient when wait time is short	Wastes CPU when wait time is long
Condition Variable	Wait for a specific condition to be met	Avoids busy-waiting	Must be used with a lock
Semaphore	Counter controls access count	Can limit concurrency	Error-prone if misused
Atomic Operations	CPU instruction-level atomicity	Lock-free, highest performance	Only works with simple data types
Lock-Free Queue	Implemented via CAS operations	Excellent performance under high concurrency	Complex implementation, ABA problem

3.3 Coroutine Model: User-Space Scheduling

Coroutine Lightweight Comparison Demo

Memory Usage ComparisonSwitching Overhead ComparisonCreation Speed Comparison

1000 coroutines

Thread Model

Memory Usage

1000 MB

Creation Time

100 ms

Context Switch

~1-10 us

Coroutine Model

Memory Usage

2000 MB

Creation Time

10 ms

Context Switch

~100 ns

Saves -100% Memory

Medium Scale

Using coroutines can save -100% of memory (about -1000MB), with 10x faster creation speed.

Core Advantages of Coroutines

Traditional Multithreading      vs              Coroutine Model

+------------+                       +------------+
|  Thread 1  |                       | Event Loop |
| (1MB stack)|                       | (Scheduler)|
+------------+                       +------------+
     |                                     |
     v                                     v
+------------+                       +------------+
|  Thread 2  |                       | Coroutine A|
| (1MB stack)|                       | (few KB)   |
+------------+                       +------------+
     |                                     |
     v                                     v
+------------+                       +------------+
|  Thread 3  |                       | Coroutine B|
| (1MB stack)|                       | (few KB)   |
+------------+                       +------------+

Overhead: N MB                       Overhead: N KB
Creation: ~10μs                      Creation: ~100ns
Switching: ~1μs                      Switching: ~100ns

How async/await Works

async/await Mechanism Demo

Show Details

Python asyncio Example

import asyncio

async def fetch_data(url):
    # await suspends, yields CPU
    response = await aiohttp.get(url)
    # Continue after I/O completes
    return response.json()

async def main():
    # Concurrent execution
    tasks = [fetch_data(url) for url in urls]
    results = await asyncio.gather(*tasks)

Execution Timeline

0ms

50ms

100ms

150ms

200ms

Event Loop

Scheduling

Task 1

Exec

I/O

Exec

I/O

Exec

I/O

Task 2

Exec

I/O

Exec

I/O

Exec

I/O

Task 3

Exec

I/O

Exec

I/O

Exec

I/O

Task 4

Exec

I/O

Exec

I/O

Exec

I/O

Task 5

Exec

I/O

Exec

I/O

Exec

I/O

Concurrent Tasks

Total Time

100ms

I/O Wait Time

60ms

CPU Utilization

40%

Advantages of async/await

When a task encounters an I/O operation (such as a network request), await yields the CPU, and the event loop schedules other tasks to execute. After I/O completes, the task resumes from the suspension point. This approach allows a single thread to handle thousands of concurrent tasks.

python

import asyncio

async def fetch_data(url):
    # When await is hit, the coroutine suspends and yields the CPU
    response = await aiohttp.get(url)
    # After I/O completes, the event loop wakes the coroutine, resuming from here
    return response.json()

async def main():
    # Create 3 coroutine tasks
    tasks = [
        fetch_data("https://api1.example.com"),
        fetch_data("https://api2.example.com"),
        fetch_data("https://api3.example.com")
    ]
    # Execute concurrently; total time ≈ the slowest request
    results = await asyncio.gather(*tasks)
    return results

# Start the event loop
asyncio.run(main())

Execution Flow:

Timeline -------------------------------------------------------------------->

Coroutine A: [Prepare]--[await suspend]=======[Response received]--[Process]
                     |
Coroutine B:         [Prepare]--[await suspend]=======[Response received]--[Process]
                                  |
Coroutine C:                      [Prepare]--[await suspend]=======[Response received]
                                               |
                                               ↓
                                         All I/O complete

Legend: [ ] = CPU execution, === = I/O waiting, | = coroutine switch

3.4 Event Loop: The "Heart" of Coroutines

Event Loop Demo

Call Stack

Stack Empty

Event Loop

Check

1Execute synchronous code in the call stack

2Execute all microtasks

3Render UI (if needed)

4Execute macrotask

Task Queue

Microtask Queue

Queue Empty

Macrotask Queue

Queue Empty

The event loop is the core mechanism of coroutine scheduling:

python

import selectors
import heapq

class EventLoop:
    def __init__(self):
        self.selector = selectors.DefaultSelector()
        self.ready = []  # Ready queue
        self.scheduled = []  # Scheduled task queue
        self.current = None

    def run(self):
        while True:
            # 1. Process scheduled tasks
            now = time.time()
            while self.scheduled and self.scheduled[0][0] <= now:
                _, callback = heapq.heappop(self.scheduled)
                self.ready.append(callback)

            # 2. Wait for I/O events
            timeout = 0 if self.ready else 0.1
            events = self.selector.select(timeout)

            for key, mask in events:
                callback = key.data
                self.ready.append(callback)

            # 3. Execute ready callbacks
            while self.ready:
                callback = self.ready.popleft()
                callback()

3.5 Concurrency vs. Parallelism: Not the Same Thing

Concurrency vs Parallelism Demo

Single-Core ConcurrentMulti-Core ParallelHybrid Mode

CPU Core (Single Core)

CPU 1

Idle

CPU 2

Idle

CPU 3

Idle

CPU 4

Idle

Task Execution

Task 1

40ms

Task 2

30ms

Task 3

50ms

Task 4

35ms

Concurrency vs Parallelism

🔄

Concurrency

Multiple tasks alternate execution, progressing simultaneously at a macro level

Examples: Single-core CPU multi-threading, coroutine scheduling, async I/O

⚡

Parallelism

Multiple tasks execute truly simultaneously

Examples: Multi-core CPU computing, GPU parallel computing, distributed processing

What Conditions Are Needed?

✓Concurrency: A single-core CPU is sufficient

✓Parallelism: Requires multi-core CPU or multiple machines

Concept	Meaning	Analogy	Requirements
Concurrency	Multiple tasks interleave execution, appearing to progress simultaneously	One person alternating between cooking multiple dishes	Single-core CPU is enough
Parallelism	Multiple tasks truly execute at the same time	Multiple people cooking different dishes simultaneously	Multi-core CPU or multiple machines

Illustration:

Single-Core CPU - Concurrency
Time →  1    2    3    4    5    6    7    8
Task A: [Run ][Run ]      [Run ][Run ]
Task B:      [Run ][Run ]      [Run ][Run ]

Two tasks interleave execution, appearing to progress "simultaneously"

========================================

Multi-Core CPU - Parallelism
Time →  1    2    3    4    5    6    7    8
Core 1: [Task A][Task A][Task A][Task A]
Core 2: [Task B][Task B][Task B][Task B]

Two tasks truly execute "simultaneously"

========================================

In reality, it's often: Concurrency + Parallelism
Time →  1    2    3    4    5    6    7    8
Core 1: [A1][A1][B1][B1][C1][C1][D1][D1]
Core 2: [A2][A2][B2][B2][C2][C2][D2][D2]

Multiple tasks are concurrently scheduled to different cores, then run in parallel on those cores

4. In Practice: Go Goroutines and Green Threads

4.1 Go's Concurrency Philosophy

Go Goroutine & GMP Scheduling Demo

OverviewGMP SchedulingChannel Communication

Global Queue (G)3

P (Processors) - 4 Total

P0Running

Local Queue

Bound to M0

P1Idle

Local Queue

P2Idle

Local Queue

G10

G11

G12

P3Idle

Local Queue

M (Machine Threads) - 4 Total

M0Running

M1Sleeping

M2Sleeping

M3Sleeping

GMP Scheduling Model

G (Goroutine): Tasks to be executed. M (Machine): OS threads that execute G. P (Processor): Logical processor providing execution context. G is first placed in P's local queue. After P binds to M, M fetches G from P for execution. When the local queue is empty, it steals tasks from the global queue or other P's.

Go's concurrency design philosophy: Don't communicate by sharing memory; share memory by communicating.

package main

import (
    "fmt"
    "time"
)

// Producer
func producer(ch chan<- int, id int) {
    for i := 0; i < 5; i++ {
        fmt.Printf("Producer %d sending: %d\n", id, i)
        ch <- i  // Send data to channel
        time.Sleep(100 * time.Millisecond)
    }
}

// Consumer
func consumer(ch <-chan int, id int) {
    for val := range ch {  // Receive data from channel
        fmt.Printf("Consumer %d received: %d\n", id, val)
    }
}

func main() {
    // Create a buffered channel
    ch := make(chan int, 10)

    // Start 2 producer goroutines
    for i := 0; i < 2; i++ {
        go producer(ch, i)
    }

    // Start 2 consumer goroutines
    for i := 0; i < 2; i++ {
        go consumer(ch, i)
    }

    // Wait for a while
    time.Sleep(3 * time.Second)
    close(ch)
}

4.2 Goroutine Scheduler: The GMP Model

Go's scheduler uses the GMP model:

Component	Meaning	Role
G (Goroutine)	Coroutine	The task to execute, lightweight (2KB stack, dynamically resizable)
M (Machine)	OS Thread	The carrier that actually executes G, 1:1 mapping with kernel threads
P (Processor)	Logical Processor	Scheduling context containing a runnable G queue; count defaults to the number of CPU cores

Scheduling Flow:

Global Queue
+----------------+
|  G1  |  G2  |  G3  |
+----------------+

P0 Local Queue      P1 Local Queue      P2 Local Queue      P3 Local Queue
+----------+       +----------+       +----------+       +----------+
| G4 | G5  |       | G6 | G7  |       | G8 | G9  |       | G10| G11 |
+----------+       +----------+       +----------+       +----------+
    |                     |                     |                     |
    v                     v                     v                     v
+----------+       +----------+       +----------+       +----------+
|    M0    |       |    M1    |       |    M2    |       |    M3    |
| (OS Thrd)|       | (OS Thrd)|       | (OS Thrd)|       | (OS Thrd)|
+----------+       +----------+       +----------+       +----------+

Scheduling Strategy:
1. Each P maintains a local G queue to reduce lock contention
2. P takes G from its local queue and hands it to M for execution
3. When the local queue is empty, "steal" half the G's from another P (Work Stealing)
4. The global queue serves as a fallback, checked periodically

5. Practical Code Templates

5.1 Python asyncio High-Concurrency Template

python

import asyncio
import aiohttp
from typing import List, Dict
import time

class AsyncHTTPClient:
    """High-performance HTTP client based on asyncio"""

    def __init__(self, max_connections: int = 100, timeout: int = 30):
        self.timeout = aiohttp.ClientTimeout(total=timeout)
        # Limit concurrent connections to avoid overwhelming the target service
        connector = aiohttp.TCPConnector(
            limit=max_connections,
            limit_per_host=10,  # Connection limit per host
            enable_cleanup_closed=True,
            force_close=True,
        )
        self.session = aiohttp.ClientSession(
            connector=connector,
            timeout=self.timeout,
        )

    async def fetch(self, url: str, method: str = 'GET', **kwargs) -> Dict:
        """Send a single request"""
        try:
            async with self.session.request(method, url, **kwargs) as response:
                return {
                    'url': url,
                    'status': response.status,
                    'data': await response.text(),
                    'error': None
                }
        except asyncio.TimeoutError:
            return {'url': url, 'status': None, 'data': None, 'error': 'Timeout'}
        except Exception as e:
            return {'url': url, 'status': None, 'data': None, 'error': str(e)}

    async def fetch_many(self, urls: List[str], concurrency: int = 10) -> List[Dict]:
        """Fetch multiple URLs concurrently, with a concurrency limit"""
        semaphore = asyncio.Semaphore(concurrency)

        async def fetch_with_limit(url):
            async with semaphore:
                return await self.fetch(url)

        # Execute all requests concurrently
        tasks = [fetch_with_limit(url) for url in urls]
        return await asyncio.gather(*tasks, return_exceptions=True)

    async def close(self):
        await self.session.close()


# Usage example
async def main():
    client = AsyncHTTPClient(max_connections=50)

    # List of URLs to fetch
    urls = [
        "https://api.github.com/users/github",
        "https://api.github.com/users/google",
        "https://api.github.com/users/microsoft",
        # ... more URLs
    ] * 10  # Simulate 300 requests

    start = time.time()
    results = await client.fetch_many(urls, concurrency=20)
    elapsed = time.time() - start

    # Summarize results
    success = sum(1 for r in results if r.get('status') == 200)
    failed = len(results) - success

    print(f"Total requests: {len(results)}")
    print(f"Success: {success}, Failed: {failed}")
    print(f"Elapsed: {elapsed:.2f}s")
    print(f"QPS: {len(results)/elapsed:.1f}")

    await client.close()

if __name__ == "__main__":
    asyncio.run(main())

5.2 Go High-Concurrency Service Template

package main

import (
	"context"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"runtime"
	"time"

	"golang.org/x/sync/errgroup"
)

// Request/Response structures
type OrderRequest struct {
	UserID    int64   `json:"user_id"`
	ProductID int64   `json:"product_id"`
	Quantity  int     `json:"quantity"`
	Price     float64 `json:"price"`
}

type OrderResponse struct {
	OrderID   int64   `json:"order_id"`
	Status    string  `json:"status"`
	Total     float64 `json:"total"`
	CreatedAt string  `json:"created_at"`
}

// Simulated database operations
type Database struct {
	orders map[int64]*OrderResponse
	mutex  chan struct{}
}

func NewDatabase() *Database {
	db := &Database{
		orders: make(map[int64]*OrderResponse),
		mutex:  make(chan struct{}, 1), // Simulated mutex
	}
	return db
}

func (db *Database) CreateOrder(ctx context.Context, req *OrderRequest) (*OrderResponse, error) {
	// Acquire lock
	select {
	case db.mutex <- struct{}{}:
		defer func() { <-db.mutex }()
	case <-ctx.Done():
		return nil, ctx.Err()
	}

	// Simulate database operation latency
	select {
	case <-time.After(50 * time.Millisecond):
	case <-ctx.Done():
		return nil, ctx.Err()
	}

	order := &OrderResponse{
		OrderID:   time.Now().UnixNano(),
		Status:    "created",
		Total:     req.Price * float64(req.Quantity),
		CreatedAt: time.Now().Format(time.RFC3339),
	}
	db.orders[order.OrderID] = order
	return order, nil
}

// HTTP handler
type Handler struct {
	db *Database
}

func NewHandler(db *Database) *Handler {
	return &Handler{db: db}
}

func (h *Handler) CreateOrder(w http.ResponseWriter, r *http.Request) {
	// Set request timeout
	ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
	defer cancel()

	var req OrderRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	order, err := h.db.CreateOrder(ctx, &req)
	if err != nil {
		if err == context.DeadlineExceeded {
			http.Error(w, "Request timeout", http.StatusGatewayTimeout)
			return
		}
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}

	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(order)
}

func (h *Handler) Health(w http.ResponseWriter, r *http.Request) {
	info := map[string]interface{}{
		"status":    "ok",
		"goroutine": runtime.NumGoroutine(),
		"cpu":       runtime.NumCPU(),
		"version":   runtime.Version(),
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(info)
}

// Batch processing example
func BatchProcess(ctx context.Context, items []int) ([]int, error) {
	g, ctx := errgroup.WithContext(ctx)
	g.SetLimit(10) // Limit concurrency to 10

	results := make([]int, len(items))

	for i, item := range items {
		i, item := i, item // Avoid closure capture pitfall
		g.Go(func() error {
			select {
			case <-ctx.Done():
				return ctx.Err()
			default:
				// Simulate processing
				time.Sleep(100 * time.Millisecond)
				results[i] = item * 2
				return nil
			}
		})
	}

	if err := g.Wait(); err != nil {
		return nil, err
	}
	return results, nil
}

func main() {
	// Initialize database
	db := NewDatabase()

	// Create handler
	handler := NewHandler(db)

	// Set up routes
	mux := http.NewServeMux()
	mux.HandleFunc("/order", handler.CreateOrder)
	mux.HandleFunc("/health", handler.Health)

	// Create server
	server := &http.Server{
		Addr:         ":8080",
		Handler:      mux,
		ReadTimeout:  5 * time.Second,
		WriteTimeout: 10 * time.Second,
		IdleTimeout:  120 * time.Second,
	}

	fmt.Println("Server starting on :8080")
	fmt.Printf("Go version: %s\n", runtime.Version())
	fmt.Printf("CPU cores: %d\n", runtime.NumCPU())

	if err := server.ListenAndServe(); err != nil {
		log.Fatal(err)
	}
}

6. Summary Comparison Tables

6.1 Core Concept Comparison

Feature	Process	Thread	Coroutine
Scheduler	OS	OS	User program / runtime
Switch Overhead	~1-10ms	~1-10μs	~100ns
Memory Footprint	~10MB+	~1MB	~2KB
Communication	IPC	Shared memory	Shared memory / Channel
Sync Required	No	Locks needed	Locks / Cooperative
Crash Impact	This process only	Entire process	Controllable
Use Case	Strong isolation, multi-tenant	CPU-intensive	I/O-intensive
Typical Languages	All languages	All languages	Go, Python, JS, Rust

6.2 Concurrency Model Selection Guide

Scenario	Recommended Model	Rationale
Web service gateway	Coroutine + async I/O	High concurrent connections, low memory footprint
Real-time communication	Coroutine + long connections	Maintain many WebSocket connections
Data processing pipelines	Multi-process + coroutines	Utilize multi-core, I/O non-blocking
Scientific computing	Multi-threaded / multi-process	CPU-intensive, needs parallel computation
Microservice architecture	Multi-process + coroutines	Inter-service isolation, internal high concurrency
Embedded systems	Coroutine / single-threaded	Resource-constrained, deterministic scheduling

6.3 Terminology Reference

English Term	Chinese Translation	Explanation
Process	进程	Basic unit of OS resource allocation, with independent memory space
Thread	线程	Basic unit of CPU scheduling, shares process memory space
Coroutine	协程	User-space lightweight thread, scheduled by the program itself
Concurrency	并发	Multiple tasks interleave execution, appearing to progress simultaneously
Parallelism	并行	Multiple tasks truly execute simultaneously, requires multi-core support
Context Switch	上下文切换	The process of the CPU switching from one task to another
Blocking I/O	阻塞 I/O	Initiates an I/O request and waits for completion; the thread is suspended
Non-blocking I/O	非阻塞 I/O	Initiates an I/O request and returns immediately without waiting for the result
Async I/O	异步 I/O	Notifies the caller via callback or notification mechanism when I/O completes
Event Loop	事件循环	Coroutine scheduling mechanism that continuously listens for events and dispatches them
Goroutine	Go 协程	Go language's lightweight thread implementation
Channel	通道	Go's mechanism for communication between goroutines
Mutex	互斥锁	Synchronization primitive for protecting shared resources
Semaphore	信号量	Controls the number of threads concurrently accessing a resource
Deadlock	死锁	Multiple threads waiting for each other to release resources, causing permanent blocking
Race Condition	竞态条件	Multiple threads accessing shared data simultaneously, leading to non-deterministic results
Thread Pool	线程池	Pre-create a set of threads and reuse them to reduce creation/destruction overhead
Work Stealing	工作窃取	Idle threads "steal" tasks from busy threads' queues for execution
Zero-copy	零拷贝	Data transferred between kernel space and user space without CPU copy
C10K Problem	C10K 问题	The challenge of handling 10,000 connections on a single machine
C10M Problem	C10M 问题	The ultimate challenge of handling 10 million connections on a single machine

7. Closing Thoughts

7.1 The Golden Rules of Concurrent Programming

Don't optimize prematurely: Make the code work correctly first, then consider performance optimization
Avoid shared state: "Don't communicate by sharing memory; share memory by communicating"
Let errors surface early: Concurrency bugs are often hard to reproduce; expose them as much as possible during testing
Limit concurrency: Unlimited concurrency is no protection at all; use semaphores or connection pools to cap it
Monitor and observe: Concurrent systems must have comprehensive monitoring to quickly locate issues

7.2 Learning Roadmap

Phase 1: Fundamental Understanding
    ├── Understand basic concepts of processes and threads
    ├── Learn synchronization primitives (locks, semaphores, condition variables)
    └── Write simple multi-threaded programs

Phase 2: Deep Dive into Principles
    ├── Understand memory models and visibility
    ├── Learn lock-free programming and atomic operations
    ├── Understand thread pools and work stealing
    └── Analyze deadlocks and race conditions

Phase 3: Advanced Applications
    ├── Master coroutines and async programming
    ├── Learn Go/Python/Rust concurrency models
    ├── Understand concurrency in distributed systems
    └── Performance tuning and capacity planning

Phase 4: Expert Level
    ├── Design high-concurrency system architectures
    ├── Solve complex concurrency bugs
    ├── Develop concurrency programming frameworks
    └── Share and spread concurrency knowledge

I hope this guide helps you build a systematic understanding of concurrent programming. Remember, concurrency is not the goal — it's the means. The real objective is to build high-performance, highly available services. Understand the principles, choose the right model, write solid code, and you'll go far on the concurrency journey.

Concurrency, Async, and Multithreading ​

0. Introduction: Why Does Your Service "Freeze" During Peak Traffic? ​

Process / Thread / Coroutine Comparison

Task Queue

1. Core Concepts: Processes, Threads, and Coroutines — What's the Difference? ​

1.1 A Restaurant Analogy ​

1.2 In-Depth Comparison: The Essential Differences ​

Process Memory Isolation Demo

Process: The "Container" of Resource Isolation ​

Thread: The "Light Cavalry" of Shared Memory ​

Thread Scheduling Demo

Current Scheduling Algorithm: Round Robin (Time Slice)

Coroutine: The User-Space "Green Thread" ​

Coroutine Lightweight Comparison Demo

Thread Model

Coroutine Model

2. Case Study: An E-Commerce Promotion's "Concurrency Nightmare" ​

2.1 Lessons Learned: Evolution from "Single Machine" to "Distributed" ​

Phase 1: The Single-Machine Era (1K DAU) ​

Phase 2: The Multi-Process Era (10K DAU) ​

Phase 3: The Coroutine Era (100K DAU) ​

2.2 Concurrency Model Evolution Comparison Table ​

3. Deep Dive: How Various Concurrency Models Work ​

3.1 Process Model: Isolation and Communication ​

Memory Isolation Mechanism ​

Process Memory Isolation Demo

Inter-Process Communication (IPC) Methods ​

3.2 Thread Model: Scheduling and Synchronization ​

Thread Scheduling Principles ​

Thread Scheduling Demo

Current Scheduling Algorithm: Round Robin (Time Slice)

Common Thread Synchronization Mechanisms ​

3.3 Coroutine Model: User-Space Scheduling ​

Coroutine Lightweight Comparison Demo

Thread Model

Coroutine Model

Core Advantages of Coroutines ​

How async/await Works ​

async/await Mechanism Demo

Execution Timeline

3.4 Event Loop: The "Heart" of Coroutines ​

Event Loop Demo

3.5 Concurrency vs. Parallelism: Not the Same Thing ​

Concurrency vs Parallelism Demo

4. In Practice: Go Goroutines and Green Threads ​

4.1 Go's Concurrency Philosophy ​

Go Goroutine & GMP Scheduling Demo

4.2 Goroutine Scheduler: The GMP Model ​

5. Practical Code Templates ​

5.1 Python asyncio High-Concurrency Template ​

5.2 Go High-Concurrency Service Template ​

6. Summary Comparison Tables ​

6.1 Core Concept Comparison ​

6.2 Concurrency Model Selection Guide ​

6.3 Terminology Reference ​

7. Closing Thoughts ​

7.1 The Golden Rules of Concurrent Programming ​

7.2 Learning Roadmap ​

Concurrency, Async, and Multithreading

0. Introduction: Why Does Your Service "Freeze" During Peak Traffic?

1. Core Concepts: Processes, Threads, and Coroutines — What's the Difference?

1.1 A Restaurant Analogy

1.2 In-Depth Comparison: The Essential Differences

Process: The "Container" of Resource Isolation

Thread: The "Light Cavalry" of Shared Memory

Coroutine: The User-Space "Green Thread"

2. Case Study: An E-Commerce Promotion's "Concurrency Nightmare"

2.1 Lessons Learned: Evolution from "Single Machine" to "Distributed"

Phase 1: The Single-Machine Era (1K DAU)

Phase 2: The Multi-Process Era (10K DAU)

Phase 3: The Coroutine Era (100K DAU)

2.2 Concurrency Model Evolution Comparison Table

3. Deep Dive: How Various Concurrency Models Work

3.1 Process Model: Isolation and Communication

Memory Isolation Mechanism

Inter-Process Communication (IPC) Methods

3.2 Thread Model: Scheduling and Synchronization

Thread Scheduling Principles

Common Thread Synchronization Mechanisms

3.3 Coroutine Model: User-Space Scheduling

Core Advantages of Coroutines

How async/await Works

3.4 Event Loop: The "Heart" of Coroutines

3.5 Concurrency vs. Parallelism: Not the Same Thing

4. In Practice: Go Goroutines and Green Threads

4.1 Go's Concurrency Philosophy

4.2 Goroutine Scheduler: The GMP Model

5. Practical Code Templates

5.1 Python asyncio High-Concurrency Template

5.2 Go High-Concurrency Service Template

6. Summary Comparison Tables

6.1 Core Concept Comparison

6.2 Concurrency Model Selection Guide

6.3 Terminology Reference

7. Closing Thoughts

7.1 The Golden Rules of Concurrent Programming

7.2 Learning Roadmap