Incident Response and Troubleshooting

Foreword

At 3 AM, your phone buzzes frantically — the entire online service is down. What do you do? For any internet team, it's not a matter of "whether incidents will happen," but "when they will happen." Great teams aren't those that never have incidents — they're the ones that can respond quickly, recover efficiently, and learn from failures to avoid repeating them.

What will you learn from this chapter?

After completing this chapter, you will gain:

Severity classification awareness: Master the P0 to P4 incident severity grading standards
Response process: Understand the complete incident response timeline from detection to recovery
Organizational collaboration: Learn about role assignments and collaboration mechanisms in the incident command system
Alerting system: Master alert escalation strategies to ensure critical issues are not missed
Postmortem methodology: Learn to use the "Five Whys" to dig into root causes and write valuable postmortem reports

Chapter	Content	Core Concepts
Chapter 1	Severity Classification	P0 to P4, impact scope assessment
Chapter 2	Response Timeline	Detection → Response → Recovery → Postmortem
Chapter 3	Command System	IC, Communications Lead, Tech Lead
Chapter 4	Alert Escalation	Tiered alerts, progressive escalation
Chapter 5	Postmortem	Five Whys, blameless culture

0. Big Picture: Failures Are the Best Teachers

Netflix has a famous tool called Chaos Monkey — it randomly kills production servers. It sounds crazy, but the logic is clear: rather than waiting for failures to find you, proactively create failures to train your team's incident response capabilities.

Incident response is not improvisation. It relies on a systematic approach in which processes, roles, and tools work together. Fire departments are not formed when a fire breaks out; they train, drill, and maintain equipment regularly.

Four Core Elements of Incident Response

Rapid Detection: Comprehensive monitoring and alerting systems to ensure issues are detected before users notice
Efficient Collaboration: Clear role assignments and communication mechanisms to avoid duplicated effort during chaos
Fast Recovery: Prioritize service restoration over root cause analysis. Stop the bleeding first, then treat the disease
Continuous Improvement: Every incident is a learning opportunity. Improve systems and processes through postmortems

1. Severity Classification: Not Every Incident Requires "All Hands on Deck"

A button displaying the wrong color and the entire payment system being down are clearly not at the same level of severity. Incident classification exists so that teams can respond to issues at the appropriate level — neither overreacting and wasting resources, nor underestimating problems and allowing damage to escalate.

Critical Incident (P0)

Definition: Core business is completely unavailable, many users are affected, and there is serious financial loss or data loss risk.
Response time: Immediate response, staffed within 5 minutes
Notification channels: Phone, SMS, chat, email

Examples

Primary database is down and all reads/writes fail
Payment system is unavailable and users cannot order
Large-scale user data leakage

Response requirements

✓ Incident commander joins within 5 minutes
✓ Update management every 15 minutes
✓ All relevant teams cancel leave and assist immediately
✓ Complete postmortem within 24 hours

Level Comparison

Level	User impact	Response time	On-call requirement
P0	All users	Immediate response, staffed within 5 minutes	All hands
P1	Many users	Respond within 15 minutes	Core team
P2	Some users	Respond within 1 hour	On-call engineer
P3	Very few users	Acknowledge today, handle this week	Normal planning
P4	No direct impact	Schedule by priority	No on-call needed

Level	Name	Impact Scope	Response Requirement	Example
P0	Critical	Core business completely unavailable	Immediate response, all hands on deck	Payment system down, data breach
P1	Severe	Core functionality severely impaired	Respond within 15 minutes	Login failure rate > 50%, widespread API timeouts
P2	Major	Some features malfunctioning	Respond within 1 hour	Inaccurate search results, some pages returning 500
P3	Minor	Non-core features malfunctioning	Handle during business hours	Avatar loading failures, non-critical notification delays
P4	Low	UX issues	Schedule for next iteration	UI misalignment, copy errors

Key Principles of Classification

Number of affected users: A P2 affecting 100% of users may be more urgent than a P1 affecting 1% of users
Business impact: Issues directly affecting revenue (payments, orders) have higher priority
Workaround availability: If a temporary workaround mitigates the impact, the severity can be downgraded accordingly
Dynamic adjustment: As investigation progresses, the level may be upgraded or downgraded
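
The principles above can be sketched as a small classification helper. This is an illustrative sketch only: the scope names, the `WIDE_REACH` threshold of 50%, and the one-level adjustments are assumptions made for the example, not values defined in this chapter. Business impact (revenue) is not modeled separately here; it is assumed to be reflected in the scope.

```python
# Base level by impact scope, following the level table above.
# The scope keys are hypothetical names chosen for this sketch.
BASE_LEVEL = {
    "core_down": 0,      # core business completely unavailable
    "core_impaired": 1,  # core functionality severely impaired
    "feature": 2,        # some features malfunctioning
    "non_core": 3,       # non-core features malfunctioning
    "cosmetic": 4,       # UX issues only
}

WIDE_REACH = 0.5  # assumed threshold: half of all users or more


def classify(scope, affected_ratio, has_workaround=False):
    """Return a starting severity such as "P1"; re-run as facts change."""
    index = BASE_LEVEL[scope]
    if affected_ratio >= WIDE_REACH:
        index -= 1  # wide user reach raises urgency by one level
    if has_workaround:
        index += 1  # a working mitigation lowers urgency by one level
    return f"P{max(0, min(4, index))}"  # clamp to the P0..P4 range


if __name__ == "__main__":
    print(classify("feature", 1.0))              # feature bug hitting everyone
    print(classify("core_impaired", 0.01, True)) # narrow reach, workaround exists
```

Because the level is computed from the current facts, calling `classify` again after new findings gives the dynamic adjustment described in the last principle.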

2. Response Timeline: The Complete Process from Detection to Postmortem

An incident response is like a relay race — each stage has clear objectives and handoff points. A clear timeline keeps the team organized even in chaos.

Detect (T+0) → Triage (T+5min) → Mitigate (T+15min) → Resolve (T+1h) → Postmortem (T+48h)

Five Stages of Incident Response

Detection: Discover anomalies through monitoring alerts, user reports, or internal inspections. Goal: Detect as early as possible, minimize MTTD (Mean Time to Detect).
Response: Confirm the incident, assess severity, assemble the response team, and establish communication channels. Goal: Quickly organize an effective response force.
Mitigation: Take temporary measures to restore service, such as rolling back deployments, switching to backup nodes, or rate limiting/degrading. Goal: Stop the bleeding first, restore user experience.
Resolution: Find the root cause and fix it permanently. Goal: Eliminate the underlying issue, prevent recurrence.
Postmortem: Review the entire process, analyze root causes, and develop improvement measures. Goal: Learn from failures, make the system more resilient.

Metric	Meaning	Optimization Direction
MTTD	Mean Time to Detect	Improve monitoring coverage, tune alert thresholds
MTTR	Mean Time to Recover	Automate recovery, rehearse response plans
MTBF	Mean Time Between Failures	Improve system reliability, eliminate single points of failure
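
To make the three metrics concrete, here is a minimal sketch that computes them from incident records. The records are made-up sample data, and the conventions are assumptions: MTTR is measured from failure start to recovery, and MTBF from one recovery to the next failure start. Teams often define these slightly differently, so treat the formulas as one possible choice.

```python
from datetime import datetime
from statistics import mean

# Hypothetical records, one tuple per incident: (started, detected, recovered).
INCIDENTS = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 4), datetime(2024, 1, 1, 10, 30)),
    (datetime(2024, 1, 11, 10, 0), datetime(2024, 1, 11, 10, 2), datetime(2024, 1, 11, 10, 20)),
    (datetime(2024, 1, 31, 10, 0), datetime(2024, 1, 31, 10, 6), datetime(2024, 1, 31, 10, 40)),
]


def _minutes(delta):
    return delta.total_seconds() / 60


def mttd(records):
    """Mean minutes from failure start to detection."""
    return mean(_minutes(detected - started) for started, detected, _ in records)


def mttr(records):
    """Mean minutes from failure start to recovery (assumed convention)."""
    return mean(_minutes(recovered - started) for started, _, recovered in records)


def mtbf(records):
    """Mean minutes of uptime between a recovery and the next failure.

    Needs at least two records; statistics.mean raises on empty input.
    """
    ordered = sorted(records)  # tuples sort by their first field, the start time
    gaps = [_minutes(nxt[0] - prev[2]) for prev, nxt in zip(ordered, ordered[1:])]
    return mean(gaps)


if __name__ == "__main__":
    print(f"MTTD: {mttd(INCIDENTS):.1f} min")
    print(f"MTTR: {mttr(INCIDENTS):.1f} min")
    print(f"MTBF: {mtbf(INCIDENTS) / 60:.1f} h")
```

With the sample data, MTTD comes out at 4 minutes and MTTR at 30 minutes; tracking how these numbers move over time matters more than any single value.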

3. Command System: Who Commands This "Battle"?

In a major incident, the biggest fear isn't technical challenges but chaos — a dozen people investigating simultaneously, nobody knowing what others are doing, critical information fragmented across various chat groups. The Incident Command System exists to solve this problem.

Roles: 🎖️ Incident Commander, 📢 Communications Lead, 🔧 Operations Lead, 💻 Development Lead

🎖️ Incident Commander

Core responsibilities

1. Coordinate the entire incident response
2. Make key decisions such as rollback, traffic shifting, and degradation
3. Keep roles collaborating without confusion
4. Control response rhythm and synchronize progress regularly

Key skills: Big-picture view, decision making, coordination, stress management

Typical phrase: "Current status: payment service is unavailable. Ops checks the database, backend prepares rollback, comms updates every 10 minutes."

Scenario: P0 Payment System Incident

Time	Role	Event
14:02	Monitoring	Payment success rate drops from 99.9% to 12%, triggering a P0 alert.
14:03	Commander	Confirms the P0 incident, opens an incident channel, and gathers the roles.
14:05	Comms	Notifies management and updates the status page to "degraded service".
14:08	Ops	Finds the primary DB CPU at 100% and the connection pool exhausted.
14:10	Dev	Identifies yesterday's release, which introduced a slow query, as the root cause.
14:12	Commander	Decision: roll back yesterday's change and perform a DB failover immediately.
14:15	Ops	Database failover complete; connections recover.
14:18	Dev	Code rollback deployment complete.
14:20	Comms	Payment success rate recovers to 99.8%; service recovery announced.
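
A timeline like the one in this scenario is easier to review when each entry also shows the time elapsed since the first alert. The sketch below is one way to derive that; the tuple layout and function names are assumptions made for illustration, the event text is abbreviated, and all entries are assumed to fall on the same day.

```python
from datetime import datetime

# (time, role, event) entries taken from the scenario, with shortened event text.
TIMELINE = [
    ("14:02", "Monitoring", "Payment success rate drops, P0 alert fires"),
    ("14:03", "Commander", "Confirms P0, opens incident channel"),
    ("14:05", "Comms", "Notifies management, updates status page"),
    ("14:08", "Ops", "Finds primary DB CPU at 100%"),
    ("14:10", "Dev", "Identifies yesterday's release as root cause"),
    ("14:12", "Commander", "Decides on rollback and DB failover"),
    ("14:15", "Ops", "Database failover complete"),
    ("14:18", "Dev", "Code rollback deployed"),
    ("14:20", "Comms", "Recovery announced"),
]


def minutes_between(start, end):
    """Whole minutes between two HH:MM strings on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def elapsed_report(timeline):
    """Return (minutes since first entry, role, event) for each entry."""
    first = timeline[0][0]
    return [(minutes_between(first, t), role, event) for t, role, event in timeline]


if __name__ == "__main__":
    for offset, role, event in elapsed_report(TIMELINE):
        print(f"T+{offset:>2}min  {role:<10} {event}")
```

The last entry lands at T+18min, which is the total time from the first alert to the recovery announcement.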

Three Core Roles

Incident Commander (IC): The person with overall responsibility for the incident response: making decisions, coordinating resources, and setting the pace. The IC doesn't need to be the most technically skilled person, but must be the calmest and have the best big-picture view.
Communications Lead: Responsible for external communication — updating status pages, notifying customers, briefing management. This allows the IC and technical staff to focus on solving the problem without being interrupted by communication tasks.
Tech Lead: Responsible for technical investigation and remediation. Organizes technical staff in division of labor and reports progress and solutions to the IC.

4. Alert Escalation: Ensuring Critical Issues Are Not Missed

The alert system is the "eyes" of incident response. But too few alerts lead to missed issues, while too many cause "alert fatigue" — when you receive hundreds of alerts daily, the truly important one can easily get buried. Alert escalation strategies are the key to solving this problem.

Step	Elapsed	Action
Monitoring detects issue	T+0s	Prometheus detects exhausted DB connection pool and query timeouts. Automatically triggers P0 alert.
On-call engineer	T+30s	Phone, SMS, and chat notify the on-call DBA at the same time.
Team leads	T+5min	Automatically escalates to database and backend team leads.
Engineering director	T+15min	Issue is not mitigated, so it escalates to director.
VP / CTO	T+30min	Major incident escalates to executives for external communication.

Escalation Rules

P3/P4 alerts: notify only the on-call engineer; no escalation needed.

P2 alerts: escalate to team lead if not acknowledged within 15 minutes.

P1 alerts: escalate after 5 minutes unacknowledged, then to director after 30 minutes unresolved.

P0 alerts: notify the whole chain immediately; escalate to VP/CTO if not mitigated within 15 minutes.
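
The rules above can be expressed as a small decision function. This is a hedged sketch, not a real paging integration: the function name and role labels are hypothetical, the first P1 escalation is assumed to go to the team lead, "the whole chain" for P0 is assumed to mean on-call engineer, team lead, and director, and "unresolved" is treated the same as "not mitigated".

```python
def escalation_targets(severity, minutes_elapsed, acknowledged, mitigated):
    """Return who should be notified given the alert's current state."""
    targets = ["on-call engineer"]
    if severity in ("P3", "P4"):
        return targets  # no escalation for low-severity alerts
    if severity == "P2":
        if not acknowledged and minutes_elapsed >= 15:
            targets.append("team lead")
        return targets
    if severity == "P1":
        if not acknowledged and minutes_elapsed >= 5:
            targets.append("team lead")  # assumed first escalation target
        if not mitigated and minutes_elapsed >= 30:
            targets.append("director")
        return targets
    if severity == "P0":
        # Assumed meaning of "notify the whole chain immediately".
        targets += ["team lead", "director"]
        if not mitigated and minutes_elapsed >= 15:
            targets.append("VP/CTO")
        return targets
    raise ValueError(f"unknown severity: {severity}")


if __name__ == "__main__":
    print(escalation_targets("P2", 20, acknowledged=False, mitigated=False))
    print(escalation_targets("P0", 15, acknowledged=True, mitigated=False))
```

The function is stateless, so a scheduler would call it periodically; a real system would also need to remember which notifications were already sent so that nobody is paged twice.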

Three Tiers of Alert Escalation

Tier 1 Response (L1): When an alert triggers, first notify the on-duty engineer. If not acknowledged within 15 minutes, automatically escalate.
Tier 2 Escalation (L2): Notify team leads and relevant domain experts. If not mitigated within 30 minutes, continue escalating.
Tier 3 Escalation (L3): Notify technical directors and management, activate full emergency response.

Alert Level	Notification Method	Response Deadline	Escalation Condition
Warning	IM message	Handle during business hours	Unresolved for 30 minutes
Critical	Phone + IM	Acknowledge within 15 minutes	Unacknowledged or unmitigated
Fatal	Repeated phone calls + SMS	Respond within 5 minutes	Auto-escalate to management

5. Postmortem: Learning from Failures

After an incident is resolved, the most important step is the postmortem. A postmortem is not about assigning blame — it's about finding systemic improvement opportunities. Companies like Google and Meta practice a "blameless postmortem" culture — focusing on "why the system allowed this error to happen," not "who made this error."

Symptom: Payment system was completely unavailable for 18 minutes during peak traffic.

Postmortem Template

1. Incident summary
2. Timeline
3. Impact assessment
4. Root cause analysis
5. Improvements
6. Lessons learned
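
One lightweight way to keep postmortems consistent is to generate the skeleton from the six sections listed in the template. The helper names below are hypothetical and the plain-text layout is an assumption; the point is only that missing sections become visible before the report is published.

```python
# The six sections of the postmortem template, in order.
SECTIONS = [
    "Incident summary",
    "Timeline",
    "Impact assessment",
    "Root cause analysis",
    "Improvements",
    "Lessons learned",
]


def missing_sections(content):
    """List template sections that have no text yet."""
    return [s for s in SECTIONS if not content.get(s, "").strip()]


def render_postmortem(title, content):
    """Render a plain-text report, marking unfilled sections as TODO."""
    lines = [f"Postmortem: {title}", ""]
    for number, section in enumerate(SECTIONS, start=1):
        lines.append(f"{number}. {section}")
        lines.append(content.get(section, "").strip() or "TODO")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    draft = {
        "Incident summary": "Payment system unavailable for 18 minutes.",
        "Timeline": "14:02 alert fired; 14:20 recovery announced.",
    }
    print(render_postmortem("Payment outage", draft))
    print("Still missing:", missing_sections(draft))
```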

"Five Whys" Analysis Method

Starting from the surface symptom, repeatedly ask "why" until you find the root cause:

Why did the service go down? → Database connection pool exhausted
Why was the connection pool exhausted? → Slow queries holding connections without releasing them
Why were there slow queries? → Missing indexes, causing full table scans
Why were indexes missing? → No DBA review when new tables went live
Why was there no review? → No mandatory SQL review process

The root cause is not "someone forgot to add an index" but "there's no SQL review process." Fixing the root cause prevents recurrence.
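
The chain above can also be recorded as structured data, which makes it easy to attach to a postmortem report. This is a minimal sketch; the `Why` class and helper names are assumptions for illustration, and the questions and answers are copied from the example.

```python
from dataclasses import dataclass


@dataclass
class Why:
    question: str
    answer: str


# The five question/answer pairs from the example above.
CHAIN = [
    Why("Why did the service go down?", "Database connection pool exhausted"),
    Why("Why was the connection pool exhausted?",
        "Slow queries holding connections without releasing them"),
    Why("Why were there slow queries?", "Missing indexes, causing full table scans"),
    Why("Why were indexes missing?", "No DBA review when new tables went live"),
    Why("Why was there no review?", "No mandatory SQL review process"),
]


def root_cause(chain):
    """The root cause is the answer to the last 'why' in the chain."""
    return chain[-1].answer


def format_chain(chain):
    """Render the chain with increasing indentation, one level per 'why'."""
    lines = []
    for depth, step in enumerate(chain):
        indent = "  " * depth
        lines.append(f"{indent}{step.question} -> {step.answer}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_chain(CHAIN))
    print("Root cause:", root_cause(CHAIN))
```

Keeping every intermediate answer, not only the last one, preserves the reasoning so reviewers can challenge any step in the chain.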

Summary

Incident response and troubleshooting is an essential capability for every technical team. It doesn't rely on heroic individual efforts, but on systematic processes, clear role assignments, and continuous postmortem-driven improvement.

Key takeaways from this chapter:

Tiered response: P0 to P4 classification ensures the appropriate level of effort for each level of issue
Clear timeline: Detection → Response → Mitigation → Resolution → Postmortem, with clear objectives at each stage
Command system: IC + Communications Lead + Tech Lead, with divided responsibilities to avoid chaos
Alert escalation: Tiered alerts + automatic escalation to ensure critical issues are not missed
Blameless postmortem: Use the "Five Whys" to dig into root causes, focus on system improvement rather than individual blame
