Monitoring, Logging & Alerting

💡 Learning Guide: This chapter requires no programming background. Through interactive demos, you'll gain a comprehensive understanding of operations — from monitoring and alerting to troubleshooting, from capacity planning to automated operations, mastering all the skills needed to run production systems.

0. Introduction: Deployment Is Just the Beginning

Many beginners think: "Once the code is deployed, the job is done."

That couldn't be more wrong!

Deployment is merely the starting point of operations work. It's like buying a new car — the real work of maintenance, repairs, and refueling is what follows.

Operations has three goals:

  1. Stability: The system stays up and services remain available
  2. Performance: Fast responses and a great user experience
  3. Security: No data leaks and protection against attacks

1. Monitoring

Monitoring is the "eyes" of operations. A system without monitoring is like driving blind — you won't even know when something goes wrong.

1.1 The Three Layers of Monitoring

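Monitoring Dashboard demo (the "eyes" of operations: system status at a glance). Sample readings:

  • CPU usage: 45% (normal)
  • Memory usage: 62% (normal)
  • Disk usage: 78% (warning)
  • Network bandwidth: 34% (normal)
  • Disk I/O: 55% (normal)
  • Load balancer: 42% (normal)

Status levels: normal, warning, critical.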

Infrastructure Monitoring: Tracking server hardware resources

  • CPU usage
  • Memory usage
  • Disk space and I/O
  • Network bandwidth

Application Monitoring: Tracking software runtime state

  • QPS (Queries Per Second)
  • Response time (latency)
  • Error rate
  • Dependency service call status

Business Monitoring: Tracking business health

  • DAU/MAU (Daily/Monthly Active Users)
  • Order volume
  • Payment success rate
  • User retention rate
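
To make the application-layer metrics concrete, the sketch below computes QPS, error rate, and p95 latency from a small in-memory sample of requests. The record shape (`status`, `latencyMs`), the sample values, and the 10-second window are assumptions for illustration; in practice a monitoring system such as Prometheus collects and aggregates these numbers for you.

```javascript
// Hypothetical request records observed during a 10-second window.
const windowSeconds = 10
const requests = [
  { status: 200, latencyMs: 40 },
  { status: 200, latencyMs: 55 },
  { status: 500, latencyMs: 300 },
  { status: 200, latencyMs: 35 },
  { status: 200, latencyMs: 60 },
]

// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b)
  const rank = Math.ceil((p / 100) * sorted.length)
  return sorted[Math.max(rank, 1) - 1]
}

// Summarize a window of requests into the three core application metrics.
function summarize(records, seconds) {
  const errors = records.filter((r) => r.status >= 500).length
  return {
    qps: records.length / seconds,
    errorRate: errors / records.length,
    p95LatencyMs: percentile(records.map((r) => r.latencyMs), 95),
  }
}

const summary = summarize(requests, windowSeconds)
console.log(summary)
```

Note how a single slow, failing request dominates the p95 latency even though the average looks healthy. That is one reason percentiles are usually preferred over averages for latency.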

1.2 Monitoring Tool Stack

| Tool | Purpose | Characteristics |
| --- | --- | --- |
| Prometheus | Metric collection & storage | Time-series database, ideal for monitoring data |
| Grafana | Visualization dashboards | Powerful charts and dashboards |
| Zabbix | Comprehensive monitoring | Veteran tool with full-featured capabilities |
| Datadog | SaaS monitoring platform | One-stop solution, paid |

Key Point: Monitoring must be layered, covering everything from infrastructure to business to avoid blind spots.


2. Alerting

Once monitoring detects an issue, operations staff need to be notified promptly — that's alerting.

2.1 Alerting Flow

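Alerting Flow demo: the automated path from detecting an anomaly to notifying the on-call engineer.

  1. Metric collection: Prometheus scrapes metrics at a fixed interval (every 15s in this demo)
  2. Rule evaluation: Prometheus evaluates its alerting rules and sends firing alerts to Alertmanager
  3. Grouping: Alertmanager merges similar alerts to avoid flooding the recipient
  4. Silencing: Alertmanager checks whether the alert falls inside a silence (such as a maintenance window)
  5. Routing: Alerts are dispatched to different receivers based on their labels
  6. Notification: On-call staff are notified via DingTalk, email, or SMS

Severity levels used in the demo:

  • P0: Highest priority, handle immediately (e.g., core service down)
  • P1: High priority, handle within 30 minutes (e.g., partial feature failure)
  • P2: Medium priority, handle the same day (e.g., degraded performance)
  • P3: Low priority, handle within the week (e.g., elevated resource usage)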

2.2 Alert Severity Levels

Proper alert classification helps prevent "alert fatigue":

| Level | Response Time | Typical Scenario | Notification Channels |
| --- | --- | --- | --- |
| P0 | Immediate (within 5 min) | Core service down, payment failures | Phone + SMS + IM |
| P1 | Within 30 minutes | Partial feature outage, severe performance degradation | SMS + IM + Email |
| P2 | Same day | High resource usage, occasional errors | IM + Email |
| P3 | Within the week | Non-critical issues, optimization suggestions | Email |
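
The table above is essentially a routing configuration. As a minimal sketch, it can be expressed as a lookup from severity level to notification channels. The channel identifiers (`phone`, `sms`, `im`, `email`) and the function name are hypothetical and not tied to any particular alerting tool.

```javascript
// Channels per severity level, copied from the table above.
const CHANNELS_BY_LEVEL = {
  P0: ['phone', 'sms', 'im'],
  P1: ['sms', 'im', 'email'],
  P2: ['im', 'email'],
  P3: ['email'],
}

// Return the channels for a level, failing loudly on an unknown level
// so a typo never silently drops an alert.
function channelsFor(level) {
  const channels = CHANNELS_BY_LEVEL[level]
  if (!channels) throw new Error(`Unknown severity level: ${level}`)
  return channels
}

console.log(channelsFor('P0'))
console.log(channelsFor('P3'))
```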

2.3 Alert Deduplication & Noise Reduction

Pain Point: A single small issue can trigger hundreds or thousands of alerts, numbing on-call staff.

Solutions:

  1. Alert Grouping: Merge similar alerts (e.g., multiple issues on the same server combined into one)
  2. Alert Suppression: If a parent issue has already fired, don't repeat alerts for child issues
  3. Silence Rules: Automatically suppress alerts during maintenance windows
  4. Rate Limiting: Don't repeat the same alert notification within a short time window
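
Two of these techniques, grouping and rate limiting, can be sketched in a few lines. The alert shape (`host`, `name`), the choice of `host` as the grouping key, and the 5-minute repeat interval are assumptions for illustration; tools such as Alertmanager offer the same ideas as configuration rather than code.

```javascript
// Grouping: collect alerts by host so one notification covers many alerts.
function groupAlerts(alerts) {
  const groups = new Map()
  for (const alert of alerts) {
    const list = groups.get(alert.host) ?? []
    list.push(alert)
    groups.set(alert.host, list)
  }
  return groups
}

// Rate limiting: remember when each group was last notified and stay quiet
// until the repeat interval has passed.
function createNotifier(repeatIntervalMs) {
  const lastSentAt = new Map()
  return function shouldNotify(groupKey, nowMs) {
    const last = lastSentAt.get(groupKey)
    if (last !== undefined && nowMs - last < repeatIntervalMs) return false
    lastSentAt.set(groupKey, nowMs)
    return true
  }
}

const alerts = [
  { host: 'web-1', name: 'HighCPU' },
  { host: 'web-1', name: 'HighMemory' },
  { host: 'db-1', name: 'DiskAlmostFull' },
]
const groups = groupAlerts(alerts)
const shouldNotify = createNotifier(5 * 60 * 1000) // assumed 5-minute repeat interval

const decisions = [
  shouldNotify('web-1', 0), // first notification for web-1: sent
  shouldNotify('web-1', 60_000), // one minute later: suppressed
  shouldNotify('web-1', 300_000), // five minutes later: sent again
]
console.log(groups.size, decisions)
```

Three alerts collapse into two groups, and the repeated notification for `web-1` is held back until the interval has elapsed.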

Key Point: Alerts should be "few but meaningful" — every alert must be worth acting on.


3. Logging

Logs are the "black box" for troubleshooting.

3.1 Log Levels

javascript
console.debug('Verbose debug info')  // Used during development
console.info('General information')   // Normal flow logging
console.warn('Warning')               // Potential issues
console.error('Error')                // Errors that need attention

3.2 Structured Logging

Traditional logging (not ideal):

2024-01-15 10:23:45 ERROR User john failed to login, attempts=3, ip=192.168.1.100

Structured logging (recommended):

json
{
  "timestamp": "2024-01-15T10:23:45Z",
  "level": "ERROR",
  "message": "User login failed",
  "user": "john",
  "attempts": 3,
  "ip": "192.168.1.100",
  "service": "auth-service"
}

3.3 The ELK Stack

ELK = Elasticsearch + Logstash + Kibana

  • Logstash: Log collection and filtering
  • Elasticsearch: Log storage and search
  • Kibana: Log visualization and querying

Best Practices:

  • ✅ Don't log sensitive information (passwords, tokens)
  • ✅ Critical operations (login, payment, permission changes) must be logged
  • ✅ Logs should include context (user ID, request ID, timestamp)
  • ✅ Regularly purge expired logs to avoid running out of disk space
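
The sketch below combines structured logging with two of the practices above: it attaches context to every entry and masks sensitive fields before they are written. The helper names and the list of sensitive keys are assumptions for illustration, and the redaction only covers top-level fields.

```javascript
// Keys whose values must never appear in logs (an assumed, non-exhaustive list).
const SENSITIVE_KEYS = new Set(['password', 'token', 'authorization'])

// Replace the values of sensitive top-level keys with a placeholder.
function redact(fields) {
  const safe = {}
  for (const [key, value] of Object.entries(fields)) {
    safe[key] = SENSITIVE_KEYS.has(key.toLowerCase()) ? '[REDACTED]' : value
  }
  return safe
}

// Build one JSON log line with a timestamp, level, message, and context fields.
function formatLog(level, message, context = {}, now = new Date()) {
  return JSON.stringify({
    timestamp: now.toISOString(),
    level,
    message,
    ...redact(context),
  })
}

const line = formatLog(
  'ERROR',
  'User login failed',
  { user: 'john', attempts: 3, requestId: 'req-42', password: 'hunter2' },
  new Date('2024-01-15T10:23:45Z')
)
console.log(line)
```

Because every entry is a single JSON object, a log platform can filter on fields such as `requestId` or `level` directly instead of parsing free text.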

4. Distributed Tracing

In a microservices architecture, a single request may pass through dozens of services — how do you trace its complete path?

Trace ID and Span ID

  • Trace ID: The unique identifier for an entire request chain (like a package tracking number)
  • Span ID: The identifier for a single service call (like each transfer hub)
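
The relationship between the two IDs can be sketched as follows: every span in a request carries the same Trace ID, gets its own Span ID, and records its parent's Span ID. The helper names are hypothetical and the random generator is a simplification. The ID lengths (32 and 16 hex characters) follow the W3C Trace Context format; real tracing SDKs generate and propagate these IDs for you.

```javascript
// Random hex string; fine for a demo, not a substitute for a real ID generator.
function randomHex(length) {
  let out = ''
  for (let i = 0; i < length; i++) {
    out += Math.floor(Math.random() * 16).toString(16)
  }
  return out
}

// Start a trace: the root span gets a fresh Trace ID and has no parent.
function startTrace(name) {
  return { traceId: randomHex(32), spanId: randomHex(16), parentSpanId: null, name }
}

// Start a child span: same Trace ID, new Span ID, parent set to the caller's span.
function startChildSpan(parent, name) {
  return { traceId: parent.traceId, spanId: randomHex(16), parentSpanId: parent.spanId, name }
}

const gateway = startTrace('API Gateway')
const userService = startChildSpan(gateway, 'User Service')
const orderService = startChildSpan(gateway, 'Order Service')
console.log([gateway, userService, orderService])
```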

4.1 Distributed Tracing Demo

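Distributed Tracing demo: the complete path of one request as it flows between microservices.

Trace ID: a1b2c3d4-e5f6-7890-abcd-ef1234567890
Total duration: 450 ms
Services called: 6

| Service | Operation | Duration |
| --- | --- | --- |
| API Gateway | POST /api/order/create | 450 ms |
| User Service | Verify user identity | 45 ms |
| Product Service | Look up product information | 85 ms |
| Inventory Service | Deduct stock | 120 ms |
| Payment Service | Create payment order | 95 ms |
| Order Service | Save order record | 25 ms |

Legend: normal (≤200 ms), slow call (>200 ms), error.

Observations:

  • The "performance bottleneck" scenario shows the latency caused by a slow database query
  • The "error tracing" scenario shows how an Inventory Service failure affects the whole chain
  • Every span has a unique Span ID, and spans are linked by the shared Trace ID
  • The longer the time bar, the longer that service took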

4.2 The OpenTelemetry Standard

OpenTelemetry (OTel) is the industry standard for distributed tracing, providing a unified API and SDK.

javascript
// Example: Recording a Span with OpenTelemetry
import { trace, SpanStatusCode } from '@opentelemetry/api'

const tracer = trace.getTracer('my-service')

async function processOrder(orderId) {
  // Create a Span
  const span = tracer.startSpan('processOrder')

  try {
    // Set attributes
    span.setAttribute('order.id', orderId)

    // Business logic...
    await validateOrder(orderId)
    await saveToDatabase(orderId)

    span.setStatus({ code: SpanStatusCode.OK })
  } catch (error) {
    span.recordException(error)
    span.setStatus({ code: SpanStatusCode.ERROR, message: error.message })
    throw error // Rethrow so the caller still sees the failure
  } finally {
    span.end() // End the Span
  }
}

Key Point: Distributed tracing quickly identifies performance bottlenecks and failure points — an essential tool for microservices.
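
As a rough illustration of bottleneck hunting, the sketch below takes the downstream span durations from the demo in section 4.1 and picks out the slowest one. The data structure and function names are assumptions; tracing backends provide this kind of analysis in their UI.

```javascript
// Downstream spans of one trace; durations taken from the demo in section 4.1.
const spans = [
  { service: 'User Service', durationMs: 45 },
  { service: 'Product Service', durationMs: 85 },
  { service: 'Inventory Service', durationMs: 120 },
  { service: 'Payment Service', durationMs: 95 },
  { service: 'Order Service', durationMs: 25 },
]

// The bottleneck candidate is the span with the longest duration.
function slowestSpan(list) {
  return list.reduce((worst, span) => (span.durationMs > worst.durationMs ? span : worst))
}

// Share of the summed downstream time spent in each service, as a rounded percentage.
function timeShare(list) {
  const total = list.reduce((sum, span) => sum + span.durationMs, 0)
  return list.map((span) => ({
    service: span.service,
    percent: Math.round((span.durationMs / total) * 100),
  }))
}

const bottleneck = slowestSpan(spans)
const shares = timeShare(spans)
console.log(bottleneck.service)
console.log(shares)
```

Here the Inventory Service accounts for roughly a third of the downstream time, so it is the first place to look.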


5. Troubleshooting Process

Production incidents are inevitable. The key is fast response and fast recovery.

5.1 Incident Response Process

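Incident Response demo: how a professional team handles a production incident.

  1. Detection (T+0 min): The monitoring system automatically detects abnormal metrics
     Key actions:
       • Monitoring detects the order service error rate jumping from 0.1% to 8.5%
       • Alertmanager immediately sends a P1 alert
       • On-call staff are notified via DingTalk and SMS
     Common tools: Prometheus, Grafana, Alertmanager
  2. Rapid response (T+3 min): The on-call engineer confirms the incident and starts the emergency process
  3. Diagnosis (T+8 min): Logs and traces are analyzed to find the root cause
  4. Mitigation (T+18 min): A temporary fix is applied to restore service
  5. Verification (T+21 min): The team confirms the service has fully recovered
  6. Post-mortem (T+48 hours): Lessons are summarized and an improvement plan is defined

Incident handling best practices:

  • Rapid response: Establish a 15-minute response mechanism; P0 incidents trigger an immediate phone call
  • Communication: Share progress regularly with users and internal teams to avoid speculation
  • Preserve evidence: Keep incident data (logs, monitoring) intact for later analysis
  • Blameless culture: Post-mortems focus on process improvement, not individual blame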

5.2 Common Troubleshooting Tools

| Tool | Purpose | Typical Scenario |
| --- | --- | --- |
| tcpdump | Packet capture analysis | Network issues, packet loss |
| strace | System call tracing | Process hanging, file permission issues |
| Arthas | Java diagnostics | CPU spikes, memory leaks, deadlocks |
| top/htop | System resource monitoring | High CPU/memory usage |
| netstat | Network connection inspection | Port conflicts, abnormal connection counts |
| lsof | Open file inspection | File locks, disk full |

Arthas Example (Alibaba's open-source Java diagnostic tool):

bash
# These commands run inside the Arthas console, after attaching to the target JVM
# View the top 5 busiest threads by CPU usage
$ thread -n 5

# Trace the execution time of a method
$ trace com.example.OrderService createOrder

# View a class's static fields
$ getstatic com.example.Config MAX_CONNECTIONS

# Hot-reload code (no restart needed)
$ mc /tmp/Test.java
$ redefine /tmp/Test.class

5.3 Post-mortem Analysis

A post-mortem is not a blame session!

The purpose of a post-mortem is:

  1. Reconstruct the incident timeline
  2. Identify the root cause (Root Cause Analysis)
  3. Summarize lessons learned
  4. Define improvement actions

The 5 Whys Analysis:

Keep asking "why" (about five times is typical) until you reach the root cause:

  • Why did the service go down?
    • Because of an out-of-memory error
  • Why did it run out of memory?
    • Because cached data grew too large
  • Why was cached data too large?
    • Because no expiration time was set
  • Why was no expiration time set?
    • Because it was overlooked during development
  • Why was it overlooked?
    • Because no code review or test checked for it
  • Root cause: Lack of code review and test coverage

Key Point: Build a blameless culture — focus on process improvement, not individual accountability.


6. Performance Optimization

6.1 Performance Bottleneck Analysis

Top-down optimization approach:

User Experience
  ↓
Frontend Optimization (reduce requests, CDN, lazy loading)
  ↓
Network Optimization (HTTP/2, compression, persistent connections)
  ↓
Backend Optimization (caching, async, batching)
  ↓
Database Optimization (indexes, query tuning, sharding)
  ↓
System Optimization (kernel parameters, JVM tuning)

6.2 Database Optimization

Index Optimization:

sql
-- Slow query (no index)
SELECT * FROM orders WHERE user_id = 12345;

-- Often orders of magnitude faster once user_id is indexed (the gain depends on table size and selectivity)
CREATE INDEX idx_user_id ON orders(user_id);

Query Optimization:

sql
-- ❌ Avoid SELECT *
SELECT * FROM users WHERE id = 123;

-- ✅ Only query needed fields
SELECT id, name, email FROM users WHERE id = 123;

-- ❌ Avoid overly large IN clauses
SELECT * FROM orders WHERE user_id IN (1, 2, 3, ..., 10000);

-- ✅ Use a JOIN against a table of IDs (here a temporary table user_ids), or query in batches
SELECT * FROM orders o JOIN user_ids u ON o.user_id = u.id;

6.3 Cache Optimization

Multi-level Cache Architecture:

Browser Cache / CDN
  ↓
Local Cache (in-memory, e.g. Guava)
  ↓
Distributed Cache (Redis/Memcached)
  ↓
Database (MySQL/PostgreSQL)

Cache Update Strategies:

| Strategy | Pros | Cons | Use Case |
| --- | --- | --- | --- |
| Cache-Aside | Simple, reliable | Slow on first query | Read-heavy, write-light |
| Write-Through | Good data consistency | Slow writes | Balanced read/write |
| Write-Behind | Extremely fast writes | Potential data loss | Write-heavy, tolerates brief inconsistency |
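
As a minimal sketch of Cache-Aside, the code below checks an in-memory cache first, falls back to the data source on a miss, and stores the result with a TTL. The `createCacheAside` helper, the fake database, and the synchronous loader are simplifications for illustration; a real service would typically use Redis or a similar store with asynchronous calls.

```javascript
// Cache-aside: read from the cache first, load from the database only on a miss.
function createCacheAside(loadFromDb, ttlMs) {
  const cache = new Map() // key -> { value, expiresAt }
  return {
    get(key, nowMs = Date.now()) {
      const entry = cache.get(key)
      if (entry && entry.expiresAt > nowMs) return entry.value // cache hit

      const value = loadFromDb(key) // cache miss: take the slow path
      cache.set(key, { value, expiresAt: nowMs + ttlMs })
      return value
    },
    // After a write to the database, drop the cached copy so the next read reloads it.
    invalidate(key) {
      cache.delete(key)
    },
  }
}

// Hypothetical "database" that counts how often it is read.
let dbReads = 0
const fakeDb = new Map([['user:1', 'john']])
const users = createCacheAside((key) => {
  dbReads += 1
  return fakeDb.get(key)
}, 60_000)

users.get('user:1', 0) // miss: reads the database
users.get('user:1', 1_000) // hit: served from the cache
users.get('user:1', 61_000) // entry expired: reads the database again

fakeDb.set('user:1', 'johnny') // write to the database first...
users.invalidate('user:1') // ...then invalidate the cached copy
const fresh = users.get('user:1', 62_000) // miss: picks up the new value
console.log(dbReads, fresh)
```

The first read of each key is slow because it always goes to the database, which is the "slow on first query" trade-off listed in the table.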

Key Point: Caching is not a silver bullet — consider consistency, avalanche, and penetration issues (refer to the "System Cache Design" chapter).


7. Capacity Planning

7.1 Capacity Assessment

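Capacity Planning calculator demo: estimate how many servers the system needs to meet demand.

Tip: Peak traffic is typically 2-3 times the average; reserve 50-100% headroom for traffic bursts.

Sample result shown in the demo:

  • Total daily requests: 5,000,000 per day
  • Peak QPS (target): 75 per second
  • Servers required in theory: 1
  • Recommended configuration (with redundancy): 2

Estimated monthly cost (cloud servers):

  • Alibaba Cloud (4 cores, 8 GB): ¥600/month
  • Tencent Cloud (4 cores, 8 GB): ¥560/month
  • AWS (t3.xlarge): ¥1,000/month

Key points:

  1. Plan for the peak: Size for peak traffic (usually 2-3 times the average), never for the average
  2. Reserve headroom: Keep at least 50% spare capacity for bursts, server failures, and maintenance windows
  3. Validate with regular load tests: Run stress tests every quarter to confirm that real capacity matches the estimate
  4. Scale elastically: Use cloud auto-scaling to add instances automatically during peaks

Formulas:

  • Daily requests = DAU × requests per user
  • Average QPS = daily requests ÷ 86,400 seconds
  • Peak QPS = average QPS × peak factor (2-3)
  • Servers required = peak QPS × redundancy factor ÷ QPS per server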
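
The formulas above translate directly into a small calculator. The input values below (1,000,000 DAU, 5 requests per user, a peak factor of 3, a redundancy factor of 1.5, and 100 QPS per server) are assumed sample numbers, not measurements; per-server QPS in particular should come from your own stress tests.

```javascript
const SECONDS_PER_DAY = 86400

// Apply the four capacity formulas in order and round the server count up.
function estimateCapacity({ dau, requestsPerUser, peakFactor, redundancyFactor, qpsPerServer }) {
  const dailyRequests = dau * requestsPerUser
  const averageQps = dailyRequests / SECONDS_PER_DAY
  const peakQps = averageQps * peakFactor
  const servers = Math.ceil((peakQps * redundancyFactor) / qpsPerServer)
  return { dailyRequests, averageQps, peakQps, servers }
}

const plan = estimateCapacity({
  dau: 1_000_000,
  requestsPerUser: 5,
  peakFactor: 3, // peak traffic assumed to be 3x the average
  redundancyFactor: 1.5, // 50% headroom
  qpsPerServer: 100, // assumed single-server capacity
})
console.log(plan)
```

With these inputs the average is about 58 QPS, the peak about 174 QPS, and the estimate rounds up to 3 servers.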

7.2 Stress Testing

Tool Selection:

| Tool | Characteristics | Use Case |
| --- | --- | --- |
| JMeter | Feature-rich, visual | HTTP API stress testing |
| wrk/ab | Lightweight, command-line | Quick benchmarking |
| Locust | Python scripting, distributed | Complex scenario testing |
| k6 | Modern, JS scripting | CI/CD integration |

wrk Example:

bash
# Install wrk
$ brew install wrk  # macOS
$ apt install wrk   # Ubuntu

# Stress test an HTTP endpoint (10 threads, 100 connections, 30 seconds)
$ wrk -t10 -c100 -d30s http://example.com/api/users

# Example output (illustrative numbers):
# Running 30s test @ http://example.com/api/users
#   10 threads and 100 connections
#   Thread Stats   Avg      Stdev     Max   +/- Stdev
#     Latency    45.32ms   12.45ms 120.50ms   87.56%
#     Req/Sec     2.12k   123.45    3.45k    89.01%
#   632450 requests in 30.00s, 1.23GB read
# Requests/sec:  21081.67

7.3 Elastic Scaling

Auto-scaling in the cloud-native era:

yaml
# Kubernetes HPA (Horizontal Pod Autoscaler)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

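When average CPU utilization across the pods (measured against their CPU requests) rises above the 70% target, the HPA adds pods, up to a maximum of 10. When load drops, it scales back down, but never below 2.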
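
The scaling decision follows the formula documented for the HPA: desired replicas = ceil(current replicas × current metric ÷ target metric), clamped to the configured minimum and maximum. The sketch below reproduces only that core calculation; the function name is hypothetical, and the real controller also applies a tolerance band and stabilization windows that are left out here.

```javascript
// Core HPA calculation: scale proportionally to how far the metric is from its target,
// then clamp the result to the [minReplicas, maxReplicas] range.
function desiredReplicas(currentReplicas, currentUtilization, targetUtilization, minReplicas, maxReplicas) {
  const raw = Math.ceil(currentReplicas * (currentUtilization / targetUtilization))
  return Math.min(maxReplicas, Math.max(minReplicas, raw))
}

// Values match the manifest above: target 70%, between 2 and 10 replicas.
const scaleUp = desiredReplicas(4, 90, 70, 2, 10) // 4 pods at 90% CPU
const scaleDown = desiredReplicas(4, 20, 70, 2, 10) // 4 pods at 20% CPU
const capped = desiredReplicas(8, 140, 70, 2, 10) // 8 pods at 140% CPU
console.log(scaleUp, scaleDown, capped)
```

Four pods at 90% CPU scale to 6, four pods at 20% scale down to the minimum of 2, and eight pods at 140% would need 16 but are capped at 10.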

Key Point: Combine business forecasting (e.g., Black Friday sales) with proactive scaling to avoid last-minute scrambling.


8. Security Operations

8.1 Access Control

Principle of Least Privilege:

  • Developers can only access the development environment
  • Operations staff can access production only with approval
  • Sensitive database operations require secondary confirmation

Jump Server (Bastion Host):

All operations tasks go through the bastion host, which records complete operation logs.

8.2 Data Backup

The 3-2-1 Backup Rule:

  • 3 copies of data (1 original + 2 backups)
  • 2 different storage media (local disk + cloud storage)
  • 1 offsite backup (to prevent single-point disasters)

Backup Strategies:

| Type | Frequency | Retention | RTO | RPO |
| --- | --- | --- | --- | --- |
| Full Backup | Weekly | 1 month | 4 hours | 24 hours |
| Incremental Backup | Daily | 1 week | 2 hours | 1 hour |
| Real-time Backup | Per second | 7 days | Minutes | Seconds |

RTO (Recovery Time Objective): The maximum acceptable downtime duration

RPO (Recovery Point Objective): The maximum acceptable data loss

8.3 Vulnerability Scanning

Regular Scanning:

  • Code Scanning: SonarQube, ESLint (detect potential vulnerabilities)
  • Dependency Scanning: npm audit, Snyk (detect third-party library vulnerabilities)
  • Container Scanning: Trivy, Clair (detect image vulnerabilities)
bash
# npm audit example
$ npm audit

found 3 vulnerabilities (1 moderate, 2 high)

Package         Severity  Vulnerable versions
lodash          high      <4.17.21
express         moderate  4.0.0 - 4.18.2

# Auto-fix
$ npm audit fix

9. Automated Operations (DevOps)

9.1 CI/CD Pipeline

yaml
# .gitlab-ci.yml example
stages:
  - test
  - build
  - deploy

test:
  stage: test
  script:
    - npm install
    - npm test
  tags:
    - docker

build:
  stage: build
  script:
    - docker build -t registry.example.com/myapp:$CI_COMMIT_SHA . # Tag must match the name pushed below
    - docker push registry.example.com/myapp:$CI_COMMIT_SHA
  only:
    - main

deploy:
  stage: deploy
  script:
    - kubectl set image deployment/myapp myapp=registry.example.com/myapp:$CI_COMMIT_SHA
  environment:
    name: production
  when: manual # Manually triggered deployment

9.2 Infrastructure as Code (IaC)

Terraform Example (managing cloud resources):

hcl
# main.tf
resource "aws_instance" "web" {
  ami                    = "ami-0c55b159cbfafe1f0"
  instance_type          = "t2.micro"
  vpc_security_group_ids = [aws_security_group.web.id] # Attach the security group defined below

  tags = {
    Name = "WebServer"
    Env  = "production"
  }
}

resource "aws_security_group" "web" {
  name = "web-sg"

  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

Advantages:

  • ✅ Version Control: All configuration lives in Git
  • ✅ Reproducibility: Environment consistency
  • ✅ Auditability: Clear change history
  • ✅ Rollback: Quickly revert to previous versions

9.3 GitOps Practices

GitOps = Git + IaC + Automation

Core principle: The Git repository is the single source of truth for infrastructure

Workflow:

1. Modify config files (push to Git)

2. Git repository changes trigger CI/CD

3. Automatically run terraform apply / kubectl apply

4. Infrastructure updates automatically

5. Monitor and reconcile actual state vs. desired state
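
Step 5 is the heart of GitOps: a controller repeatedly compares the desired state declared in Git with the actual state of the system and works out what to change. The sketch below shows that comparison for a simplified, hypothetical state format (a map of resource name to spec); tools such as ArgoCD and Flux do this against real Kubernetes objects.

```javascript
// Compare desired state (from Git) with actual state (from the cluster)
// and list the actions needed to make them match.
function reconcile(desired, actual) {
  const actions = []
  for (const [name, spec] of Object.entries(desired)) {
    if (!(name in actual)) {
      actions.push({ action: 'create', name })
    } else if (JSON.stringify(actual[name]) !== JSON.stringify(spec)) {
      // Simplified comparison: assumes both specs list their keys in the same order.
      actions.push({ action: 'update', name })
    }
  }
  for (const name of Object.keys(actual)) {
    if (!(name in desired)) actions.push({ action: 'delete', name })
  }
  return actions
}

const desired = {
  web: { image: 'myapp:v2', replicas: 3 },
  worker: { image: 'worker:v1', replicas: 1 },
}
const actual = {
  web: { image: 'myapp:v1', replicas: 3 },
  cron: { image: 'cron:v1', replicas: 1 },
}

const actions = reconcile(desired, actual)
console.log(actions)
```

The result is an update for `web`, a create for `worker`, and a delete for `cron`. Running the same comparison again after applying those actions yields an empty list, which is what "actual state matches desired state" means.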

Tools: ArgoCD, Flux (Kubernetes deployment)


10. Summary & Best Practices

Operations is a vast domain, but the core can be distilled into the following:

10.1 Operations Maturity Model

| Level | Characteristics | Practices |
| --- | --- | --- |
| Beginner | Reactive, manual operations | Fix issues only when they arise, manual deploys |
| Intermediate | Automated, standardized | CI/CD, monitoring & alerting, documentation |
| Advanced | Proactive, self-healing | Capacity planning, chaos drills, auto-scaling |
| Expert | Intelligent, unattended | AIOps, chaos engineering, serverless |

10.2 A Day in the Life of an SRE

09:00 - Review overnight alerts, confirm system status
10:00 - Handle user-reported issues
11:00 - Attend engineering weekly, assess operational risk of new proposals
14:00 - Optimize slow queries, improve performance
15:00 - Code review
16:00 - Write deployment docs, update monitoring rules
17:00 - Chaos engineering drills
18:00 - On-call handoff

10.3 Learning Roadmap

Beginner Stage (1–3 months):

  • Learn common Linux commands
  • Understand monitoring systems (Prometheus + Grafana)
  • Master log querying (ELK)

Intermediate Stage (3–6 months):

  • Deep dive into container technology (Docker + K8s)
  • Master a diagnostic tool (Arthas, tcpdump)
  • Practice CI/CD pipelines

Advanced Stage (6–12 months):

  • Performance tuning (database, JVM, network)
  • Capacity planning and cost optimization
  • Post-mortems and process improvement

Expert Stage (1+ year):

  • Architecture design (high availability, disaster recovery)
  • Chaos engineering (proactively inject failures)
  • AIOps (intelligent operations)

11. Glossary

| Term | Full Name | Explanation |
| --- | --- | --- |
| Monitoring | - | Real-time observation of system health. |
| Alerting | - | Notifying relevant personnel when anomalies occur. |
| Logging | - | Recording events during system operation. |
| Tracing | - | Tracking the full path of a request across a distributed system. |
| QPS | Queries Per Second | Number of requests handled per second; a measure of system throughput. |
| Latency | - | The time from request initiation to response. |
| RTO | Recovery Time Objective | Maximum acceptable downtime duration. |
| RPO | Recovery Point Objective | Maximum acceptable data loss. |
| Post-mortem | - | Incident review to analyze root causes and improvement actions. |
| CI/CD | Continuous Integration / Continuous Delivery | Automated testing and deployment. |
| IaC | Infrastructure as Code | Managing servers, networks, and other resources via code. |
| GitOps | - | Git-driven operations, with Git as the single source of truth. |
| ELK | Elasticsearch + Logstash + Kibana | The log collection, storage, and visualization trifecta. |
| SLA | Service Level Agreement | Committed service availability (e.g., 99.9%). |
| Blameless | - | A no-blame culture where post-mortems focus on process over individuals. |

12. Further Reading