
Concurrency, Async, and Multithreading

💡 Learning Guide: Concurrent programming is the "Achilles' heel" for many backend engineers — you get stumped in interviews, hit bugs in production, and have no clue where to start with performance tuning. This chapter revolves around one core question: When 100,000 users hit your service simultaneously, will your code collapse?

Before diving in, make sure you have these two foundational pieces:

  • What are CPU, Memory, and I/O: If you're fuzzy on these basics, brush up on operating system fundamentals.
  • What is blocking/non-blocking: If you're not yet comfortable with synchronous/asynchronous concepts, get a feel for them through hands-on programming first.

0. Introduction: Why Does Your Service "Freeze" During Peak Traffic?

[Interactive demo: Process / Thread / Coroutine Comparison. Tracks memory usage, context switches, completed tasks, and elapsed time while a queue of 16 tasks is scheduled across 4 CPUs.]

Many developers encounter situations like these in practice:

  • The service responds lightning-fast in local testing, but becomes a "slideshow" once deployed;
  • You bought high-spec servers, yet CPU utilization never goes up;
  • During promotional peaks, the service "avalanches," forcing you to degrade or circuit-break.

Intuitively, we think: "The server isn't powerful enough." But most of the time, the problem isn't that the hardware is "too slow" — it's that we didn't design the concurrency model properly.

The Core Dilemma:

  • Without concurrent processing: requests queue up, user experience is terrible;
  • With reckless multithreading: lock contention and context-switching overhead actually degrade performance.

Faced with these challenges, simply "adding more machines" is no longer enough. We need a systematic approach to concurrency design that ensures both performance and stability under high concurrency. That's exactly what this chapter aims to address.


1. Core Concepts: Processes, Threads, and Coroutines — What's the Difference?

1.1 A Restaurant Analogy

Imagine you run a restaurant and need to serve many customers at once:

| Concept | Restaurant Analogy | Technical Meaning |
|---|---|---|
| Process | An independent restaurant branch | Has its own memory space and resource allocation. It is the basic unit of OS resource allocation. A crash in one process does not affect others. |
| Thread | A chef within a branch | The basic unit of CPU scheduling, sharing memory space within a process. Threads in the same process can share data, but one thread crashing can bring down the entire process. |
| Coroutine | A chef's "cloning technique" | A user-space lightweight thread, scheduled by the program itself rather than the OS. Switching overhead is minimal; you can create millions of them. |

1.2 In-Depth Comparison: The Essential Differences

[Interactive demo: Process Memory Isolation, showing how system memory is divided among processes.]

Process: The "Container" of Resource Isolation

Key Characteristics:

  • Strong isolation: Each process has its own independent virtual address space
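  • High overhead: Creation and switching both go through the OS. Creating a process (new address space, page tables) typically costs on the order of a millisecond; a process switch costs microseconds, plus the TLB and cache misses that follow an address-space change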
  • Complex communication: Inter-Process Communication (IPC) requires special mechanisms (pipes, message queues, shared memory, etc.)

Use Cases:

  • Services requiring strong isolation (e.g., browser tabs, sandboxed programs)
  • Multi-language hybrid deployments
  • Service units that need independent restart/upgrade
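
To make the isolation point concrete, here is a minimal sketch that runs the same function once in a child process and once in a thread. It assumes a POSIX system where the `fork` start method is available; the names `bump`, `run_in_child_process`, and `run_in_thread` are made up for this illustration.

```python
import multiprocessing as mp
import threading

counter = 0  # module-level state used to observe isolation vs. sharing


def bump(amount: int) -> None:
    """Add `amount` to the counter of whichever address space runs this."""
    global counter
    counter += amount


def run_in_child_process() -> int:
    """Run bump() in a forked child and return the parent's counter afterwards."""
    ctx = mp.get_context("fork")  # assumption: POSIX system with fork()
    child = ctx.Process(target=bump, args=(5,))
    child.start()
    child.join()
    return counter  # the child only changed its own copy of the variable


def run_in_thread() -> int:
    """Run bump() in a thread of this process and return the shared counter."""
    worker = threading.Thread(target=bump, args=(5,))
    worker.start()
    worker.join()
    return counter  # same address space, so the write is visible here


if __name__ == "__main__":
    print("parent counter after child process:", run_in_child_process())
    print("parent counter after thread:", run_in_thread())
```

The child's write never reaches the parent, which is exactly why processes need an explicit IPC mechanism to exchange data, while the thread's write is visible immediately.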

Thread: The "Light Cavalry" of Shared Memory

[Interactive demo: Thread Scheduling. Plots threads on a 0-500 ms timeline and reports completed threads, context switches, average wait time, and throughput (threads/s). Scheduling algorithm shown: Round Robin (time slice).]

With Round Robin, each thread takes turns executing for a time slice. When the slice expires, the scheduler switches to the next thread. Responsiveness is good, which suits interactive systems.

Key Characteristics:

  • Shared memory: Threads within the same process share code, data, and heap segments
  • Independent stack space: Each thread has its own stack (typically ~1MB)
  • Fast switching: Thread switching takes ~1-10μs; switching between threads of the same process keeps the address space, so it is cheaper than a process switch
  • Synchronization required: Shared data must be protected with locks
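
As a small illustration of the last point, the sketch below guards a shared counter with a mutex so that each read-modify-write happens as one critical section. The class and function names are hypothetical, chosen only for this example.

```python
import threading


class Counter:
    """A shared counter whose increments are guarded by a mutex."""

    def __init__(self) -> None:
        self._value = 0
        self._lock = threading.Lock()

    def increment(self) -> None:
        # The read-modify-write must run as one critical section.
        with self._lock:
            self._value += 1

    @property
    def value(self) -> int:
        with self._lock:
            return self._value


def run_workers(num_threads: int = 8, increments_per_thread: int = 10_000) -> int:
    """Increment one shared Counter from several threads and return the total."""
    counter = Counter()

    def work() -> None:
        for _ in range(increments_per_thread):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


if __name__ == "__main__":
    print(run_workers())  # 8 threads x 10,000 increments each
```

Without the lock the final total is not guaranteed, because two threads may read the same old value before either writes back.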

Use Cases:

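  • CPU-intensive tasks (computation, image processing), provided the runtime lets threads run in parallel (standard CPython's GIL, for example, serializes Python bytecode)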
  • Concurrent tasks that need to share a lot of data
  • Latency-sensitive background tasks

Coroutine: The User-Space "Green Thread"

[Interactive demo: Coroutine Lightweight Comparison. Contrasts memory usage, creation time, and context-switch cost (~1-10 μs for threads vs ~100 ns for coroutines) when running 1000 threads versus 1000 coroutines.]

Key Characteristics:

  • User-space scheduling: Scheduled by the program/runtime library, not the OS
  • Extremely lightweight: Coroutine stacks are typically only a few KB; you can create millions
  • Extremely fast switching: Coroutine switching takes ~100ns, roughly 10-100× faster than thread switching
  • Usually non-preemptive: Coroutines voluntarily yield the CPU (cooperative multitasking); some runtimes add preemption on top, as Go has done since version 1.14
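
A quick way to get a feel for how cheap coroutines are is to start thousands of them on one thread. The following is a minimal sketch using `asyncio`; `tiny_task` and `spawn_many` are illustrative names, and `asyncio.sleep(0)` stands in for a point where a real task would wait on I/O.

```python
import asyncio


async def tiny_task(i: int) -> int:
    # Yield to the event loop once, as a task would while waiting on I/O.
    await asyncio.sleep(0)
    return i


async def spawn_many(n: int) -> int:
    """Create n tasks on a single thread and return how many completed."""
    tasks = [asyncio.create_task(tiny_task(i)) for i in range(n)]
    results = await asyncio.gather(*tasks)
    return len(results)


if __name__ == "__main__":
    print(asyncio.run(spawn_many(10_000)))
```

Starting the same number of OS threads would reserve a stack for each one; here every task is just a small object scheduled by the event loop.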

Use Cases:

  • I/O-intensive high-concurrency services (web servers, gateways)
  • Scenarios requiring many long-lived connections (IM, game servers)
  • Streaming data processing, pipeline workflows

2. Case Study: An E-Commerce Promotion's "Concurrency Nightmare"

2.1 Lessons Learned: Evolution from "Single Machine" to "Distributed"

Let's look at the evolution story of a real e-commerce system:

Phase 1: The Single-Machine Era (1K DAU)

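python
# A simple Flask application.
# NOTE: `db` stands for a database helper that this listing leaves undefined.
from flask import Flask

app = Flask(__name__)

@app.route('/order')
def create_order():
    # Query inventory (the "check" step)
    stock = db.query("SELECT stock FROM products WHERE id=1")
    if stock > 0:
        # Deduct inventory (the "act" step). Nothing makes check and act atomic,
        # so two concurrent requests can both pass the check above.
        db.execute("UPDATE products SET stock = stock - 1 WHERE id=1")
        # Create order
        db.execute("INSERT INTO orders ...")
        return "Order created!"
    return "Out of stock!"

# Start: flask run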

Problems:

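  • A single process running Flask's development server, which is not built for production load; every request occupies its worker for the full duration of the blocking database calls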
  • Inventory deduction is not locked, leading to overselling under concurrency
  • Limited database connections; the connection pool is quickly exhausted

Phase 2: The Multi-Process Era (10K DAU)

bash
# Deploy with Gunicorn multi-process
gunicorn -w 4 -k sync app:app

# 4 worker processes, each handling requests independently

New Problems:

  • 4 processes all query inventory simultaneously, all see stock=1, all deduct successfully — 3 oversold items!
  • Need to introduce distributed locks
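
python
import redis

# Assumes a Redis server reachable at the default localhost:6379;
# `db` is the same undefined database helper as in Phase 1.
redis_client = redis.Redis()

# Use Redis distributed lock; it expires after 10 s if the holder dies
lock = redis_client.lock("stock_lock", timeout=10)
# Give up after 5 s instead of waiting for the lock indefinitely
if lock.acquire(blocking_timeout=5):
    try:
        stock = db.query("SELECT stock FROM products WHERE id=1")
        if stock > 0:
            db.execute("UPDATE products SET stock = stock - 1 WHERE id=1")
    finally:
        lock.release()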
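
For this particular overselling bug, a lock is not the only option. A common alternative is to fold the check into the UPDATE itself so the database performs check-and-deduct as one atomic statement. The sketch below shows the idea with an in-memory SQLite database; the table layout and the helper names (`make_db`, `try_deduct`, `current_stock`) are assumptions made for this example, not part of the system described above.

```python
import sqlite3


def make_db(initial_stock: int) -> sqlite3.Connection:
    """Create an in-memory database holding one product with the given stock."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, stock INTEGER NOT NULL)")
    conn.execute("INSERT INTO products (id, stock) VALUES (1, ?)", (initial_stock,))
    conn.commit()
    return conn


def try_deduct(conn: sqlite3.Connection, product_id: int) -> bool:
    """Deduct one unit only if stock is still positive; return True on success."""
    cur = conn.execute(
        "UPDATE products SET stock = stock - 1 WHERE id = ? AND stock > 0",
        (product_id,),
    )
    conn.commit()
    # rowcount is 1 if the row matched (stock was > 0), otherwise 0.
    return cur.rowcount == 1


def current_stock(conn: sqlite3.Connection, product_id: int) -> int:
    row = conn.execute("SELECT stock FROM products WHERE id = ?", (product_id,)).fetchone()
    return row[0]


if __name__ == "__main__":
    conn = make_db(initial_stock=1)
    print(try_deduct(conn, 1), try_deduct(conn, 1), current_stock(conn, 1))
```

Because the condition `stock > 0` is evaluated inside the UPDATE, a second buyer cannot deduct from an already empty stock, whichever worker process the request lands on.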

Phase 3: The Coroutine Era (100K DAU)

python
# Use FastAPI + asyncio.
# NOTE: `db` is assumed to be an async database client exposing fetch_one() and
# execute() with named parameters; it is not defined in this listing.
from fastapi import FastAPI
import asyncio

app = FastAPI()

async def check_stock(product_id: int) -> int:
    # Async database query, non-blocking
    result = await db.fetch_one(
        "SELECT stock FROM products WHERE id = :id",
        {"id": product_id}
    )
    return result["stock"]

async def get_user_info(user_id: int) -> dict:
    # Placeholder for the user lookup that the handler relies on
    return {"id": user_id}

@app.post("/order")
async def create_order(product_id: int, user_id: int):
    # Concurrently check inventory and user info
    stock_task = check_stock(product_id)
    user_task = get_user_info(user_id)

    stock, user = await asyncio.gather(stock_task, user_task)

    if stock > 0:
        # Async inventory deduction. The check above and this update are still
        # two separate steps, so overselling has to be prevented separately.
        await db.execute(
            "UPDATE products SET stock = stock - 1 WHERE id = :id",
            {"id": product_id}
        )
        return {"status": "success"}

    return {"status": "out_of_stock"}

# Start: uvicorn main:app --workers 4
# Each worker can handle thousands of concurrent coroutines

Advantages:

  • Thousands of concurrent connections within a single thread
  • Voluntarily yields CPU during I/O operations without blocking other requests
  • Extremely low memory footprint, ideal for high-concurrency long-connection scenarios
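
These advantages only hold while nothing blocks the event loop thread. One synchronous call (a blocking driver, `time.sleep`, heavy computation) stalls every coroutine on that worker. A minimal sketch of one way around it, assuming Python 3.9+ for `asyncio.to_thread`; `blocking_query` and `heartbeat` are hypothetical stand-ins.

```python
import asyncio
import time


def blocking_query(delay: float) -> str:
    """Stand-in for a synchronous driver call that blocks its thread."""
    time.sleep(delay)
    return "row"


async def heartbeat(ticks: list, interval: float, count: int) -> None:
    """Append a tick every `interval` seconds; only progresses if the loop is free."""
    for i in range(count):
        await asyncio.sleep(interval)
        ticks.append(i)


async def main() -> tuple:
    ticks: list = []
    beat = asyncio.create_task(heartbeat(ticks, interval=0.02, count=3))
    # Offload the blocking call to a worker thread so the loop keeps running.
    result = await asyncio.to_thread(blocking_query, 0.3)
    ticks_during_query = len(ticks)  # heartbeats that ran while the query blocked
    await beat
    return result, ticks_during_query


if __name__ == "__main__":
    print(asyncio.run(main()))
```

If `blocking_query(0.3)` were called directly inside `main`, no heartbeat could run until it returned; offloading it lets the heartbeat keep ticking during the query.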

2.2 Concurrency Model Evolution Comparison Table

| Phase | Concurrency Model | DAU Supported | Core Problem | Solution |
|---|---|---|---|---|
| Monolith | Single process, single thread | 1K | Cannot handle concurrency | Introduce multi-process |
| Multi-Process | Multi-process synchronous | 10K | Data races, overselling | Distributed locks |
| Multi-Threaded | Multi-threaded + locks | 50K | Context-switching overhead, deadlocks | Thread pools, lock-free queues |
| Coroutine | Async I/O | 100K+ | Code complexity, debugging difficulty | Framework encapsulation, distributed tracing |
| Hybrid | Multi-process + coroutines | 1M+ | Architectural complexity | Service governance, elastic scaling |

3. Deep Dive: How Various Concurrency Models Work

3.1 Process Model: Isolation and Communication

Memory Isolation Mechanism

Each process has its own independent virtual address space:

Process A Virtual Memory      Process B Virtual Memory
+----------------+        +----------------+
|  Kernel Space  |        |  Kernel Space  |  <-- Shared (read-only)
|  (shared)      |        |  (shared)      |
+----------------+        +----------------+
|  Stack         |        |  Stack         |  <-- Independent
|  (grows down)  |        |  (grows down)  |
+----------------+        +----------------+
|  Heap          |        |  Heap          |  <-- Independent
|  (grows up)    |        |  (grows up)    |
+----------------+        +----------------+
|  Data Segment  |        |  Data Segment  |  <-- Independent
|  (.bss/.data)  |        |  (.bss/.data)  |
+----------------+        +----------------+
|  Code Segment  |        |  Code Segment  |  <-- Independent
|  (.text)       |        |  (.text)       |
+----------------+        +----------------+

Inter-Process Communication (IPC) Methods

| Method | Principle | Speed | Use Case |
|---|---|---|---|
| Pipe | Kernel buffer, unidirectional stream | Medium | Parent-child process communication |
| Message Queue | Kernel message linked list | Medium | Async message passing |
| Shared Memory | Same physical memory mapped | Fastest | Large data sharing |
| Semaphore | Kernel counter | N/A | Synchronization and mutual exclusion |
| Socket | Network protocol stack | Slower | Cross-machine communication |
| Signal | Soft interrupt | N/A | Event notification |
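
As a concrete instance of the first row, the sketch below sends text to a child process through a pipe and reads the reply through a second one. It relies only on the standard `subprocess` module; the child's one-line program and the function name `shout_via_child` are invented for the example.

```python
import subprocess
import sys

# Code run by the child: read everything from stdin, write it back upper-cased.
CHILD_CODE = "import sys; sys.stdout.write(sys.stdin.read().upper())"


def shout_via_child(text: str) -> str:
    """Send text to a child process over a pipe and return what it writes back."""
    completed = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        input=text,           # written to the child's stdin pipe
        capture_output=True,  # the child's stdout/stderr are pipes read by the parent
        text=True,
        check=True,
    )
    return completed.stdout


if __name__ == "__main__":
    print(shout_via_child("hello from the parent"))
```

Each pipe carries data in one direction only, which is why a request/response exchange like this one uses two of them.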

3.2 Thread Model: Scheduling and Synchronization

Thread Scheduling Principles

How the OS thread scheduler works:

Ready Queue                  Running                  Waiting Queue
+--------+                +--------+               +--------+
| Thread B|  <-- timeslice| Thread A|  <-- I/O req | Thread C|
| Thread D|      expires  |(running)|              | Thread E|
| Thread F|                +--------+              |(blocked)|
+--------+                                         +--------+
    |                                                  |
    v                                                  v
Scheduler picks next to run by priority    Moved back to ready queue when I/O completes
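
The ready-queue half of this picture can be simulated in a few lines. The sketch below models time-slice (Round Robin) scheduling only, ignoring priorities and I/O waits; `round_robin` and its arguments are hypothetical names for this illustration.

```python
from collections import deque
from typing import Dict, List, Tuple


def round_robin(burst_times: Dict[str, int], quantum: int) -> List[Tuple[str, int]]:
    """Return the execution order as (thread name, time units run) slices."""
    ready = deque(burst_times.items())  # ready queue of (name, remaining time)
    timeline: List[Tuple[str, int]] = []
    while ready:
        name, remaining = ready.popleft()
        ran = min(quantum, remaining)
        timeline.append((name, ran))
        if remaining > ran:
            # Time slice expired before the thread finished: back of the queue.
            ready.append((name, remaining - ran))
    return timeline


if __name__ == "__main__":
    for name, ran in round_robin({"A": 5, "B": 2, "C": 4}, quantum=2):
        print(f"{name} runs for {ran}")
```

With a quantum of 2, thread A (5 units) is interrupted twice before it finishes, and each interruption corresponds to one context switch in a real scheduler.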

Common Thread Synchronization Mechanisms

| Mechanism | Principle | Pros | Cons |
|---|---|---|---|
| Mutex | Binary state, exclusive access | Simple to implement | Poor performance under heavy contention |
| RWLock | Shared for reads, exclusive for writes | Efficient for read-heavy workloads | Complex implementation, risk of write starvation |
| Spinlock | Busy-waiting, doesn't release CPU | Efficient when wait time is short | Wastes CPU when wait time is long |
| Condition Variable | Wait for a specific condition to be met | Avoids busy-waiting | Must be used with a lock |
| Semaphore | Counter controls access count | Can limit concurrency | Error-prone if misused |
| Atomic Operations | CPU instruction-level atomicity | Lock-free, highest performance | Only works with simple data types |
| Lock-Free Queue | Implemented via CAS operations | Excellent performance under high concurrency | Complex implementation, ABA problem |
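
To show one of these mechanisms in action, here is a minimal sketch of a semaphore capping how many threads may be inside a section at once, combined with a mutex protecting the bookkeeping. The function name `run_limited` and the 10 ms of simulated work are assumptions made for the example.

```python
import threading
import time


def run_limited(num_workers: int, limit: int) -> int:
    """Start num_workers threads; return the peak number inside the guarded section."""
    slots = threading.Semaphore(limit)  # at most `limit` holders at any moment
    state_lock = threading.Lock()       # protects the two counters below
    active = 0
    peak = 0

    def worker() -> None:
        nonlocal active, peak
        with slots:  # blocks while `limit` workers already hold a slot
            with state_lock:
                active += 1
                peak = max(peak, active)
            time.sleep(0.01)  # simulated work
            with state_lock:
                active -= 1

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak


if __name__ == "__main__":
    print("peak concurrency:", run_limited(num_workers=10, limit=3))
```

The mutex and the semaphore do different jobs here: the mutex makes each counter update exclusive, while the semaphore lets up to `limit` threads proceed together.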

3.3 Coroutine Model: User-Space Scheduling

Core Advantages of Coroutines

Traditional Multithreading      vs              Coroutine Model

+------------+                       +------------+
|  Thread 1  |                       | Event Loop |
| (1MB stack)|                       | (Scheduler)|
+------------+                       +------------+
     |                                     |
     v                                     v
+------------+                       +------------+
|  Thread 2  |                       | Coroutine A|
| (1MB stack)|                       | (few KB)   |
+------------+                       +------------+
     |                                     |
     v                                     v
+------------+                       +------------+
|  Thread 3  |                       | Coroutine B|
| (1MB stack)|                       | (few KB)   |
+------------+                       +------------+

Memory: ~1MB per thread              Memory: a few KB per coroutine
Creation: ~10μs                      Creation: ~100ns
Switching: ~1-10μs                   Switching: ~100ns

How async/await Works

[Interactive demo: async/await Mechanism. An execution timeline (0-200 ms) in which the event loop interleaves the Exec and I/O phases of 5 concurrent tasks. The demo reports a total time of 100 ms, of which 60 ms is I/O wait, and 40% CPU utilization.]
python
import asyncio
import aiohttp

async def fetch_data(session, url):
    # When await is hit, the coroutine suspends and yields the CPU
    async with session.get(url) as response:
        # After I/O completes, the event loop wakes the coroutine, resuming from here
        return await response.json()

async def main():
    # One shared session lets the requests reuse pooled connections
    async with aiohttp.ClientSession() as session:
        # Create 3 coroutine tasks
        tasks = [
            fetch_data(session, "https://api1.example.com"),
            fetch_data(session, "https://api2.example.com"),
            fetch_data(session, "https://api3.example.com")
        ]
        # Execute concurrently; total time ≈ the slowest request
        results = await asyncio.gather(*tasks)
    return results

# Start the event loop
asyncio.run(main())

Execution Flow:

Timeline -------------------------------------------------------------------->

Coroutine A: [Prepare]--[await suspend]=======[Response received]--[Process]
                     |
Coroutine B:         [Prepare]--[await suspend]=======[Response received]--[Process]
                                  |
Coroutine C:                      [Prepare]--[await suspend]=======[Response received]
                                               |

                                         All I/O complete

Legend: [ ] = CPU execution, === = I/O waiting, | = coroutine switch
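
The overlap shown in the timeline can be reproduced without any network access. In this minimal sketch `asyncio.sleep` stands in for the I/O wait, and the names `fake_request` and `fetch_all` as well as the delays are made up for the illustration.

```python
import asyncio
import time


async def fake_request(name: str, delay: float) -> str:
    # asyncio.sleep stands in for waiting on the network.
    await asyncio.sleep(delay)
    return f"{name} done"


async def fetch_all() -> tuple:
    """Run three simulated requests concurrently; return results and elapsed time."""
    start = time.perf_counter()
    results = await asyncio.gather(
        fake_request("A", 0.10),
        fake_request("B", 0.20),
        fake_request("C", 0.15),
    )
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = asyncio.run(fetch_all())
    print(results, f"{elapsed:.2f}s")
```

The waits overlap, so the elapsed time tracks the slowest request (about 0.20 s) rather than the sum of all three (0.45 s). `gather` also returns the results in the order the coroutines were passed in, not the order they finished.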

3.4 Event Loop: The "Heart" of Coroutines

[Interactive demo: Event Loop. Visualizes a call stack, a microtask queue, and a macrotask queue. Each iteration of the loop: 1) execute the synchronous code on the call stack, 2) execute all microtasks, 3) render the UI if needed, 4) execute a macrotask.]

The event loop is the core mechanism of coroutine scheduling:

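python
import heapq
import itertools
import selectors
import time
from collections import deque

class EventLoop:
    def __init__(self):
        self.selector = selectors.DefaultSelector()
        self.ready = deque()  # Ready queue (deque provides O(1) popleft)
        self.scheduled = []  # Scheduled task queue: heap of (when, seq, callback)
        self._seq = itertools.count()  # Tie-breaker so callbacks are never compared
        self.current = None

    def call_soon(self, callback):
        self.ready.append(callback)

    def call_later(self, delay, callback):
        when = time.monotonic() + delay
        heapq.heappush(self.scheduled, (when, next(self._seq), callback))

    def run(self):
        # Stop once nothing is ready, scheduled, or registered for I/O
        while self.ready or self.scheduled or self.selector.get_map():
            # 1. Process scheduled tasks
            now = time.monotonic()
            while self.scheduled and self.scheduled[0][0] <= now:
                _, _, callback = heapq.heappop(self.scheduled)
                self.ready.append(callback)

            # 2. Wait for I/O events (just sleep when no file object is registered)
            timeout = 0 if self.ready else 0.1
            if self.selector.get_map():
                events = self.selector.select(timeout)
            else:
                time.sleep(timeout)
                events = []

            for key, mask in events:
                callback = key.data
                self.ready.append(callback)

            # 3. Execute ready callbacks
            while self.ready:
                callback = self.ready.popleft()
                callback()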
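
The loop above dispatches plain callbacks. To see how coroutines fit into such a ready queue, here is a minimal sketch in which generator functions play the role of coroutines: every `yield` hands control back to the scheduler. It leaves out timers and I/O, and the names `worker` and `run_until_complete` are assumptions for this example, not asyncio's API.

```python
from collections import deque
from typing import Generator, Iterable, List


def worker(name: str, steps: int, log: List[str]) -> Generator[None, None, None]:
    """A generator-based coroutine: each `yield` returns control to the loop."""
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield


def run_until_complete(coroutines: Iterable[Generator]) -> None:
    """Round-robin over the coroutines until every one of them has finished."""
    ready = deque(coroutines)
    while ready:
        coro = ready.popleft()
        try:
            next(coro)  # resume the coroutine until its next yield
        except StopIteration:
            continue    # finished, so it is not rescheduled
        ready.append(coro)


if __name__ == "__main__":
    log: List[str] = []
    run_until_complete([worker("A", 2, log), worker("B", 3, log)])
    print(log)
```

The switch between A and B is nothing more than returning from `next()` and calling it on another generator, which is why coroutine switches avoid the cost of a kernel context switch.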

3.5 Concurrency vs. Parallelism: Not the Same Thing

[Interactive demo: Concurrency vs Parallelism. Runs four tasks (40 ms, 30 ms, 50 ms, 35 ms) either interleaved on a single core or spread across four cores.]

Typical examples of concurrency: single-core CPU multi-threading, coroutine scheduling, async I/O. Typical examples of parallelism: multi-core CPU computing, GPU parallel computing, distributed processing.

| Concept | Meaning | Analogy | Requirements |
|---|---|---|---|
| Concurrency | Multiple tasks interleave execution, appearing to progress simultaneously | One person alternating between cooking multiple dishes | Single-core CPU is enough |
| Parallelism | Multiple tasks truly execute at the same time | Multiple people cooking different dishes simultaneously | Multi-core CPU or multiple machines |

Illustration:

Single-Core CPU - Concurrency
Time →  1    2    3    4    5    6    7    8
Task A: [Run ][Run ]      [Run ][Run ]
Task B:      [Run ][Run ]      [Run ][Run ]

Two tasks interleave execution, appearing to progress "simultaneously"

========================================

Multi-Core CPU - Parallelism
Time →  1    2    3    4    5    6    7    8
Core 1: [Task A][Task A][Task A][Task A]
Core 2: [Task B][Task B][Task B][Task B]

Two tasks truly execute "simultaneously"

========================================

In reality, it's often: Concurrency + Parallelism
Time →  1    2    3    4    5    6    7    8
Core 1: [A1][A1][B1][B1][C1][C1][D1][D1]
Core 2: [A2][A2][B2][B2][C2][C2][D2][D2]

Multiple tasks are concurrently scheduled to different cores, then run in parallel on those cores

4. In Practice: Go Goroutines and Green Threads

4.1 Go's Concurrency Philosophy

[Interactive demo: Go Goroutine & GMP Scheduling. Shows a global queue holding G1-G3, four Ps with local queues (P0: G4-G6, running and bound to M0; P1: G7-G9, idle; P2: G10-G12, idle; P3: empty, idle), and four Ms (M0 running, M1-M3 sleeping).]

Go's concurrency design philosophy: Don't communicate by sharing memory; share memory by communicating.

go
package main

import (
    "fmt"
    "sync"
    "time"
)

// Producer
func producer(ch chan<- int, id int, wg *sync.WaitGroup) {
    defer wg.Done()
    for i := 0; i < 5; i++ {
        fmt.Printf("Producer %d sending: %d\n", id, i)
        ch <- i // Send data to channel
        time.Sleep(100 * time.Millisecond)
    }
}

// Consumer
func consumer(ch <-chan int, id int, wg *sync.WaitGroup) {
    defer wg.Done()
    for val := range ch { // Receive until the channel is closed and drained
        fmt.Printf("Consumer %d received: %d\n", id, val)
    }
}

func main() {
    // Create a buffered channel
    ch := make(chan int, 10)
    var producers, consumers sync.WaitGroup

    // Start 2 producer goroutines
    for i := 0; i < 2; i++ {
        producers.Add(1)
        go producer(ch, i, &producers)
    }

    // Start 2 consumer goroutines
    for i := 0; i < 2; i++ {
        consumers.Add(1)
        go consumer(ch, i, &consumers)
    }

    // Close the channel only after every producer has finished sending,
    // then wait for the consumers to drain it (no fixed sleep needed)
    producers.Wait()
    close(ch)
    consumers.Wait()
}
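
The same "share memory by communicating" pattern can be approximated in Python, where a bounded `asyncio.Queue` plays a role similar to a buffered channel. This is a rough analogue for comparison rather than an equivalent of Go channels; the function names are chosen for the example.

```python
import asyncio


async def producer(queue: asyncio.Queue, producer_id: int, count: int) -> None:
    for i in range(count):
        await queue.put((producer_id, i))  # waits while the queue is full


async def consumer(queue: asyncio.Queue, received: list) -> None:
    while True:
        item = await queue.get()
        received.append(item)
        queue.task_done()  # lets queue.join() know this item is handled


async def main() -> list:
    queue: asyncio.Queue = asyncio.Queue(maxsize=10)
    received: list = []
    consumers = [asyncio.create_task(consumer(queue, received)) for _ in range(2)]
    # Run both producers to completion.
    await asyncio.gather(*(producer(queue, pid, 5) for pid in range(2)))
    await queue.join()  # block until every queued item has been processed
    for task in consumers:
        task.cancel()   # consumers loop forever, so stop them explicitly
    await asyncio.gather(*consumers, return_exceptions=True)
    return received


if __name__ == "__main__":
    print(asyncio.run(main()))
```

An `asyncio.Queue` has no close operation, so the sketch waits on `queue.join()` and then cancels the consumers, where the Go version closes the channel.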

4.2 Goroutine Scheduler: The GMP Model

Go's scheduler uses the GMP model:

| Component | Meaning | Role |
|---|---|---|
| G (Goroutine) | Coroutine | The task to execute, lightweight (2KB stack, dynamically resizable) |
| M (Machine) | OS Thread | The carrier that actually executes G, 1:1 mapping with kernel threads |
| P (Processor) | Logical Processor | Scheduling context containing a runnable G queue; count defaults to the number of CPU cores |

Scheduling Flow:

Global Queue
+--------------------+
|  G1  |  G2  |  G3  |
+--------------------+

P0 Local Queue      P1 Local Queue      P2 Local Queue      P3 Local Queue
+----------+       +----------+       +----------+       +----------+
| G4 | G5  |       | G6 | G7  |       | G8 | G9  |       | G10| G11 |
+----------+       +----------+       +----------+       +----------+
    |                     |                     |                     |
    v                     v                     v                     v
+----------+       +----------+       +----------+       +----------+
|    M0    |       |    M1    |       |    M2    |       |    M3    |
| (OS Thrd)|       | (OS Thrd)|       | (OS Thrd)|       | (OS Thrd)|
+----------+       +----------+       +----------+       +----------+

Scheduling Strategy:
1. Each P maintains a local G queue to reduce lock contention
2. P takes G from its local queue and hands it to M for execution
3. When the local queue is empty, "steal" half the G's from another P (Work Stealing)
4. The global queue serves as a fallback, checked periodically
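
The strategy can be modelled with a few deques. The sketch below is a simplified toy model, not the Go runtime's code: it ignores M threads, and the lookup order it uses (local queue, then global queue, then stealing) is an assumption of the sketch. `steal_half` and `next_task` are hypothetical names.

```python
from collections import deque
from typing import Deque, List, Optional


def steal_half(victim: Deque[str], thief: Deque[str]) -> int:
    """Move half of the victim's queued tasks (rounded up) to the thief."""
    count = (len(victim) + 1) // 2
    for _ in range(count):
        thief.append(victim.popleft())
    return count


def next_task(
    p_id: int, local_queues: List[Deque[str]], global_queue: Deque[str]
) -> Optional[str]:
    """Pick the next G for processor p_id, or None if there is no work anywhere."""
    local = local_queues[p_id]
    if local:                      # 1. the P's own queue, no cross-P locking needed
        return local.popleft()
    if global_queue:               # 2. fall back to the shared global queue
        return global_queue.popleft()
    for other_id, other in enumerate(local_queues):
        if other_id != p_id and other:
            steal_half(other, local)  # 3. work stealing from another P
            return local.popleft()
    return None


if __name__ == "__main__":
    queues = [deque(), deque(["G1", "G2", "G3", "G4"])]
    print(next_task(0, queues, deque()), list(queues[0]), list(queues[1]))
```

Stealing half of a queue instead of a single task means the idle P does not have to come back immediately for more work.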

5. Practical Code Templates

5.1 Python asyncio High-Concurrency Template

python
import asyncio
import aiohttp
from typing import List, Dict
import time

class AsyncHTTPClient:
    """High-performance HTTP client based on asyncio"""

    def __init__(self, max_connections: int = 100, timeout: int = 30):
        self.timeout = aiohttp.ClientTimeout(total=timeout)
        # Limit concurrent connections to avoid overwhelming the target service
        connector = aiohttp.TCPConnector(
            limit=max_connections,
            limit_per_host=10,  # Connection limit per host
            # Keep-alive stays enabled (no force_close) so pooled connections are reused
        )
        self.session = aiohttp.ClientSession(
            connector=connector,
            timeout=self.timeout,
        )

    async def fetch(self, url: str, method: str = 'GET', **kwargs) -> Dict:
        """Send a single request"""
        try:
            async with self.session.request(method, url, **kwargs) as response:
                return {
                    'url': url,
                    'status': response.status,
                    'data': await response.text(),
                    'error': None
                }
        except asyncio.TimeoutError:
            return {'url': url, 'status': None, 'data': None, 'error': 'Timeout'}
        except Exception as e:
            return {'url': url, 'status': None, 'data': None, 'error': str(e)}

    async def fetch_many(self, urls: List[str], concurrency: int = 10) -> List[Dict]:
        """Fetch multiple URLs concurrently, with a concurrency limit"""
        semaphore = asyncio.Semaphore(concurrency)

        async def fetch_with_limit(url):
            async with semaphore:
                return await self.fetch(url)

        # Execute all requests concurrently
        tasks = [fetch_with_limit(url) for url in urls]
        return await asyncio.gather(*tasks, return_exceptions=True)

    async def close(self):
        await self.session.close()


# Usage example
async def main():
    client = AsyncHTTPClient(max_connections=50)

    # List of URLs to fetch
    urls = [
        "https://api.github.com/users/github",
        "https://api.github.com/users/google",
        "https://api.github.com/users/microsoft",
        # ... more URLs
    ] * 10  # Simulate 30 requests (3 URLs x 10)

    try:
        start = time.perf_counter()
        results = await client.fetch_many(urls, concurrency=20)
        elapsed = time.perf_counter() - start

        # Summarize results (gather may return exception objects, so check the type)
        success = sum(
            1 for r in results if isinstance(r, dict) and r.get('status') == 200
        )
        failed = len(results) - success

        print(f"Total requests: {len(results)}")
        print(f"Success: {success}, Failed: {failed}")
        print(f"Elapsed: {elapsed:.2f}s")
        print(f"QPS: {len(results)/elapsed:.1f}")
    finally:
        # Always release the connection pool, even if a request raised
        await client.close()

if __name__ == "__main__":
    asyncio.run(main())

5.2 Go High-Concurrency Service Template

go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"runtime"
	"time"

	"golang.org/x/sync/errgroup"
)

// Request/Response structures
type OrderRequest struct {
	UserID    int64   `json:"user_id"`
	ProductID int64   `json:"product_id"`
	Quantity  int     `json:"quantity"`
	Price     float64 `json:"price"`
}

type OrderResponse struct {
	OrderID   int64   `json:"order_id"`
	Status    string  `json:"status"`
	Total     float64 `json:"total"`
	CreatedAt string  `json:"created_at"`
}

// Simulated database operations
type Database struct {
	orders map[int64]*OrderResponse
	mutex  chan struct{}
}

func NewDatabase() *Database {
	db := &Database{
		orders: make(map[int64]*OrderResponse),
		mutex:  make(chan struct{}, 1), // Simulated mutex
	}
	return db
}

func (db *Database) CreateOrder(ctx context.Context, req *OrderRequest) (*OrderResponse, error) {
	// Acquire lock
	select {
	case db.mutex <- struct{}{}:
		defer func() { <-db.mutex }()
	case <-ctx.Done():
		return nil, ctx.Err()
	}

	// Simulate database operation latency
	select {
	case <-time.After(50 * time.Millisecond):
	case <-ctx.Done():
		return nil, ctx.Err()
	}

	order := &OrderResponse{
		OrderID:   time.Now().UnixNano(),
		Status:    "created",
		Total:     req.Price * float64(req.Quantity),
		CreatedAt: time.Now().Format(time.RFC3339),
	}
	db.orders[order.OrderID] = order
	return order, nil
}

// HTTP handler
type Handler struct {
	db *Database
}

func NewHandler(db *Database) *Handler {
	return &Handler{db: db}
}

func (h *Handler) CreateOrder(w http.ResponseWriter, r *http.Request) {
	// Set request timeout
	ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
	defer cancel()

	var req OrderRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	order, err := h.db.CreateOrder(ctx, &req)
	if err != nil {
		if err == context.DeadlineExceeded {
			http.Error(w, "Request timeout", http.StatusGatewayTimeout)
			return
		}
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}

	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(order)
}

func (h *Handler) Health(w http.ResponseWriter, r *http.Request) {
	info := map[string]interface{}{
		"status":    "ok",
		"goroutine": runtime.NumGoroutine(),
		"cpu":       runtime.NumCPU(),
		"version":   runtime.Version(),
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(info)
}

// Batch processing example
func BatchProcess(ctx context.Context, items []int) ([]int, error) {
	g, ctx := errgroup.WithContext(ctx)
	g.SetLimit(10) // Limit concurrency to 10

	results := make([]int, len(items))

	for i, item := range items {
		i, item := i, item // Avoid closure capture pitfall
		g.Go(func() error {
			select {
			case <-ctx.Done():
				return ctx.Err()
			default:
				// Simulate processing
				time.Sleep(100 * time.Millisecond)
				results[i] = item * 2
				return nil
			}
		})
	}

	if err := g.Wait(); err != nil {
		return nil, err
	}
	return results, nil
}

func main() {
	// Initialize database
	db := NewDatabase()

	// Create handler
	handler := NewHandler(db)

	// Set up routes
	mux := http.NewServeMux()
	mux.HandleFunc("/order", handler.CreateOrder)
	mux.HandleFunc("/health", handler.Health)

	// Create server
	server := &http.Server{
		Addr:         ":8080",
		Handler:      mux,
		ReadTimeout:  5 * time.Second,
		WriteTimeout: 10 * time.Second,
		IdleTimeout:  120 * time.Second,
	}

	fmt.Println("Server starting on :8080")
	fmt.Printf("Go version: %s\n", runtime.Version())
	fmt.Printf("CPU cores: %d\n", runtime.NumCPU())

	if err := server.ListenAndServe(); err != nil {
		log.Fatal(err)
	}
}

6. Summary Comparison Tables

6.1 Core Concept Comparison

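| Feature | Process | Thread | Coroutine |
|---|---|---|---|
| Scheduler | OS | OS | User program / runtime |
| Switch Overhead | Microseconds, plus TLB/cache refill (creation is closer to ~1ms) | ~1-10μs | ~100ns |
| Memory Footprint | ~10MB+ | ~1MB | ~2KB |
| Communication | IPC | Shared memory | Shared memory / Channel |
| Sync Required | No | Locks needed | Locks / Cooperative |
| Crash Impact | This process only | Entire process | Depends on the runtime (an unrecovered Go panic ends the process; a failed asyncio task does not) |
| Use Case | Strong isolation, multi-tenant | CPU-intensive | I/O-intensive |
| Typical Languages | All languages | All languages | Go, Python, JS, Rust |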

6.2 Concurrency Model Selection Guide

| Scenario | Recommended Model | Rationale |
|---|---|---|
| Web service gateway | Coroutine + async I/O | High concurrent connections, low memory footprint |
| Real-time communication | Coroutine + long connections | Maintain many WebSocket connections |
| Data processing pipelines | Multi-process + coroutines | Utilize multi-core, I/O non-blocking |
| Scientific computing | Multi-threaded / multi-process | CPU-intensive, needs parallel computation |
| Microservice architecture | Multi-process + coroutines | Inter-service isolation, internal high concurrency |
| Embedded systems | Coroutine / single-threaded | Resource-constrained, deterministic scheduling |

6.3 Terminology Reference

| English Term | Chinese Translation | Explanation |
|---|---|---|
| Process | 进程 | Basic unit of OS resource allocation, with independent memory space |
| Thread | 线程 | Basic unit of CPU scheduling, shares process memory space |
| Coroutine | 协程 | User-space lightweight thread, scheduled by the program itself |
| Concurrency | 并发 | Multiple tasks interleave execution, appearing to progress simultaneously |
| Parallelism | 并行 | Multiple tasks truly execute simultaneously, requires multi-core support |
| Context Switch | 上下文切换 | The process of the CPU switching from one task to another |
| Blocking I/O | 阻塞 I/O | Initiates an I/O request and waits for completion; the thread is suspended |
| Non-blocking I/O | 非阻塞 I/O | Initiates an I/O request and returns immediately without waiting for the result |
| Async I/O | 异步 I/O | Notifies the caller via callback or notification mechanism when I/O completes |
| Event Loop | 事件循环 | Coroutine scheduling mechanism that continuously listens for events and dispatches them |
| Goroutine | Go 协程 | Go language's lightweight thread implementation |
| Channel | 通道 | Go's mechanism for communication between goroutines |
| Mutex | 互斥锁 | Synchronization primitive for protecting shared resources |
| Semaphore | 信号量 | Controls the number of threads concurrently accessing a resource |
| Deadlock | 死锁 | Multiple threads waiting for each other to release resources, causing permanent blocking |
| Race Condition | 竞态条件 | Multiple threads accessing shared data simultaneously, leading to non-deterministic results |
| Thread Pool | 线程池 | Pre-create a set of threads and reuse them to reduce creation/destruction overhead |
| Work Stealing | 工作窃取 | Idle threads "steal" tasks from busy threads' queues for execution |
| Zero-copy | 零拷贝 | Data transferred between kernel space and user space without CPU copy |
| C10K Problem | C10K 问题 | The challenge of handling 10,000 connections on a single machine |
| C10M Problem | C10M 问题 | The ultimate challenge of handling 10 million connections on a single machine |

7. Closing Thoughts

7.1 The Golden Rules of Concurrent Programming

  1. Don't optimize prematurely: Make the code work correctly first, then consider performance optimization
  2. Avoid shared state: "Don't communicate by sharing memory; share memory by communicating"
  3. Let errors surface early: Concurrency bugs are often hard to reproduce; expose them as much as possible during testing
  4. Limit concurrency: Unbounded concurrency leaves the service and its dependencies unprotected; cap it with semaphores or connection pools
  5. Monitor and observe: Concurrent systems must have comprehensive monitoring to quickly locate issues

7.2 Learning Roadmap

Phase 1: Fundamental Understanding
    ├── Understand basic concepts of processes and threads
    ├── Learn synchronization primitives (locks, semaphores, condition variables)
    └── Write simple multi-threaded programs

Phase 2: Deep Dive into Principles
    ├── Understand memory models and visibility
    ├── Learn lock-free programming and atomic operations
    ├── Understand thread pools and work stealing
    └── Analyze deadlocks and race conditions

Phase 3: Advanced Applications
    ├── Master coroutines and async programming
    ├── Learn Go/Python/Rust concurrency models
    ├── Understand concurrency in distributed systems
    └── Performance tuning and capacity planning

Phase 4: Expert Level
    ├── Design high-concurrency system architectures
    ├── Solve complex concurrency bugs
    ├── Develop concurrency programming frameworks
    └── Share and spread concurrency knowledge

I hope this guide helps you build a systematic understanding of concurrent programming. Remember, concurrency is not the goal — it's the means. The real objective is to build high-performance, highly available services. Understand the principles, choose the right model, write solid code, and you'll go far on the concurrency journey.