
Incident Response and Troubleshooting

Foreword

At 3 AM, your phone buzzes frantically — the entire online service is down. What do you do? For any internet team, it's not a matter of "whether incidents will happen," but "when they will happen." Great teams aren't those that never have incidents — they're the ones that can respond quickly, recover efficiently, and learn from failures to avoid repeating them.

What will you learn from this article?

After completing this chapter, you will gain:

  • Severity classification awareness: Master the P0~P4 incident severity grading standards
  • Response process: Understand the complete incident response timeline from detection to recovery
  • Organizational collaboration: Learn about role assignments and collaboration mechanisms in the incident command system
  • Alerting system: Master alert escalation strategies to ensure critical issues are not missed
  • Postmortem methodology: Learn to use the "Five Whys" to dig into root causes and write valuable postmortem reports

Chapter | Content | Core Concepts
Chapter 1 | Severity Classification | P0~P4, impact scope assessment
Chapter 2 | Response Timeline | Detection → Response → Recovery → Postmortem
Chapter 3 | Command System | IC, Communications Lead, Tech Lead
Chapter 4 | Alert Escalation | Tiered alerts, progressive escalation
Chapter 5 | Postmortem | Five Whys, blameless culture

0. Big Picture: Failures Are the Best Teachers

Netflix has a famous tool called Chaos Monkey — it randomly kills production servers. It sounds crazy, but the logic is clear: rather than waiting for failures to find you, proactively create failures to train your team's incident response capabilities.

Incident response is not improvisation. It relies on a systematic approach in which processes, roles, and tools work together. A fire department is not assembled when a fire breaks out; it trains, drills, and maintains its equipment on a regular basis.

Four Core Elements of Incident Response

  • Rapid Detection: Comprehensive monitoring and alerting systems to ensure issues are detected before users notice
  • Efficient Collaboration: Clear role assignments and communication mechanisms to avoid duplicated effort during chaos
  • Fast Recovery: Prioritize service restoration over root cause analysis. Stop the bleeding first, then treat the disease
  • Continuous Improvement: Every incident is a learning opportunity. Improve systems and processes through postmortems

1. Severity Classification: Not Every Incident Requires "All Hands on Deck"

A button displaying the wrong color and the entire payment system being down are clearly not at the same level of severity. Incident classification exists so that teams can respond to issues at the appropriate level — neither overreacting and wasting resources, nor underestimating problems and allowing damage to escalate.

Incident Severity Levels

P0: Critical Incident

Core business is completely unavailable, many users are affected, and there is serious financial loss or data loss risk.

  • Response: immediate, staffed within 5 minutes
  • Notification channels: phone, SMS, chat, email
  • Examples: the primary database is down and all reads/writes fail; the payment system is unavailable and users cannot place orders; large-scale user data leakage
  • Requirements: the incident commander joins within 5 minutes; management is updated every 15 minutes; all relevant teams cancel leave and assist immediately; the postmortem is completed within 24 hours

Level Comparison

Level | User impact | Response time | On-call requirement
P0 | All users | Immediate response, staffed within 5 minutes | All hands
P1 | Many users | Respond within 15 minutes | Core team
P2 | Some users | Respond within 1 hour | On-call engineer
P3 | Very few users | Acknowledge today, handle this week | Normal planning
P4 | No direct impact | Schedule by priority | No on-call needed

Level | Name | Impact Scope | Response Requirement | Example
P0 | Critical | Core business completely unavailable | Immediate response, all hands on deck | Payment system down, data breach
P1 | Severe | Core functionality severely impaired | Respond within 15 minutes | Login failure rate > 50%, widespread API timeouts
P2 | Major | Some features malfunctioning | Respond within 1 hour | Inaccurate search results, some pages returning 500
P3 | Minor | Non-core features malfunctioning | Handle during business hours | Avatar loading failures, non-critical notification delays
P4 | Low | UX issues | Schedule for next iteration | UI misalignment, copy errors

Key Principles of Classification

  • Number of affected users: A P2 affecting 100% of users may be more urgent than a P1 affecting 1% of users
  • Business impact: Issues directly affecting revenue (payments, orders) have higher priority
  • Workaround available: If a temporary workaround mitigates the impact, the severity can be downgraded accordingly
  • Dynamic adjustment: As investigation progresses, the level may be upgraded or downgraded
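
The principles above can be sketched as a small classification function. This is a minimal illustration, not a standard: the `Impact` fields, the 5% user threshold, and the one-level adjustments for revenue impact and workarounds are all assumptions made for the example, and real teams define their own criteria.

```python
from dataclasses import dataclass

LEVELS = ["P0", "P1", "P2", "P3", "P4"]


@dataclass
class Impact:
    """Hypothetical description of an incident's impact (all fields are assumptions)."""
    core_unavailable: bool       # core business completely unavailable
    core_impaired: bool          # core functionality severely impaired
    affected_user_ratio: float   # share of users affected, 0.0 to 1.0
    revenue_affected: bool       # payments, orders, or similar are hit
    has_workaround: bool         # a temporary workaround mitigates the impact


def classify(impact: Impact) -> str:
    """Return a severity level from P0 to P4 for the given impact."""
    # Start from the impact scope.
    if impact.core_unavailable:
        level = 0
    elif impact.core_impaired:
        level = 1
    elif impact.affected_user_ratio >= 0.05:  # assumed threshold
        level = 2
    elif impact.affected_user_ratio > 0:
        level = 3
    else:
        level = 4

    # Business impact: revenue-affecting issues move up one level.
    if impact.revenue_affected and level > 0:
        level -= 1
    # Workaround available: severity can move down one level.
    if impact.has_workaround and level < 4:
        level += 1
    return LEVELS[level]


# Example: a full payment outage with no workaround.
print(classify(Impact(True, False, 1.0, True, False)))
```

Dynamic adjustment then amounts to calling `classify` again whenever the investigation changes what is known about the impact.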

2. Response Timeline: The Complete Process from Detection to Postmortem

An incident response is like a relay race — each stage has clear objectives and handoff points. A clear timeline keeps the team organized even in chaos.

Incident Response Timeline

  1. Detect: T+0
  2. Triage: T+5min
  3. Mitigate: T+15min
  4. Resolve: T+1h
  5. Postmortem: T+48h

Five Stages of Incident Response

  1. Detection: Discover anomalies through monitoring alerts, user reports, or internal inspections. Goal: Detect as early as possible, minimize MTTD (Mean Time to Detect).
  2. Response: Confirm the incident, assess severity, assemble the response team, and establish communication channels. Goal: Quickly organize an effective response force.
  3. Mitigation: Take temporary measures to restore service, such as rolling back deployments, switching to backup nodes, or rate limiting/degrading. Goal: Stop the bleeding first, restore user experience.
  4. Resolution: Find the root cause and fix it permanently. Goal: Eliminate the underlying issue, prevent recurrence.
  5. Postmortem: Review the entire process, analyze root causes, and develop improvement measures. Goal: Learn from failures, make the system more resilient.

Metric | Meaning | Optimization Direction
MTTD | Mean Time to Detect | Improve monitoring coverage, lower alert thresholds
MTTR | Mean Time to Recover | Automate recovery, rehearse response plans
MTBF | Mean Time Between Failures | Improve system reliability, eliminate single points of failure
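
As a worked example, the three metrics can be computed from a list of incident records. This is a sketch under stated assumptions: the `Incident` fields and the sample history are made up, MTTR is measured here from fault start to recovery, and MTBF is measured from one recovery to the next fault. Teams define these boundaries differently, so treat the definitions as one possible choice.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime    # when the fault began
    detected: datetime   # when monitoring or a user report caught it
    recovered: datetime  # when service was restored


def _mean(deltas: list[timedelta]) -> timedelta:
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents: list[Incident]) -> timedelta:
    """Mean time from fault start to detection."""
    return _mean([i.detected - i.started for i in incidents])


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time from fault start to recovery (assumed definition)."""
    return _mean([i.recovered - i.started for i in incidents])


def mtbf(incidents: list[Incident]) -> timedelta:
    """Mean healthy time between one recovery and the next fault."""
    ordered = sorted(incidents, key=lambda i: i.started)
    gaps = [nxt.started - prev.recovered for prev, nxt in zip(ordered, ordered[1:])]
    if not gaps:
        raise ValueError("MTBF needs at least two incidents")
    return _mean(gaps)


# Made-up history for illustration only.
HISTORY = [
    Incident(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 4),
             datetime(2024, 1, 1, 10, 30)),
    Incident(datetime(2024, 1, 11, 10, 30), datetime(2024, 1, 11, 10, 36),
             datetime(2024, 1, 11, 10, 50)),
]

print("MTTD:", mttd(HISTORY))
print("MTTR:", mttr(HISTORY))
print("MTBF:", mtbf(HISTORY))
```

With this sample history, detection takes 4 and 6 minutes (MTTD of 5 minutes), recovery takes 30 and 20 minutes (MTTR of 25 minutes), and the single gap between incidents is 10 days.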

3. Command System: Who Commands This "Battle"?

In a major incident, the biggest fear isn't technical challenges but chaos — a dozen people investigating simultaneously, nobody knowing what others are doing, critical information fragmented across various chat groups. The Incident Command System exists to solve this problem.

Incident Command System

Roles: 🎖️ Incident Commander, 📢 Communications Lead, 🔧 Operations Lead, 💻 Development Lead

🎖️ Incident Commander

  1. Coordinate the entire incident response
  2. Make key decisions such as rollback, traffic shifting, and degradation
  3. Keep roles collaborating without confusion
  4. Control the response rhythm and synchronize progress regularly

Key skills: big-picture view, decision making, coordination, stress management

Example directive: "Current status: payment service is unavailable. Ops checks the database, backend prepares rollback, comms updates every 10 minutes."

Scenario: P0 Payment System Incident

Time | Role | Event
14:02 | Monitoring | Payment success rate drops from 99.9% to 12%, triggering a P0 alert.
14:03 | Commander | Confirms the P0 incident, opens an incident channel, and gathers the roles.
14:05 | Comms | Notifies management and updates the status page to "degraded service".
14:08 | Ops | Finds the primary DB CPU at 100% and the connection pool exhausted.
14:10 | Dev | Identifies yesterday's release, which introduced a slow query, as the root cause.
14:12 | Commander | Decision: roll back yesterday's change and fail over the database immediately.
14:15 | Ops | Database failover is complete and connections recover.
14:18 | Dev | Code rollback deployment is complete.
14:20 | Comms | Payment success rate recovers to 99.8%; service recovery is announced.
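
To see how the scenario maps onto a T+ timeline, the clock times can be converted into minutes since the first alert. The sketch below copies the times and roles from the scenario and shortens the event descriptions; the function name `minutes_since_alert` is hypothetical, and the code assumes all events fall on the same day.

```python
from datetime import datetime

# Times and roles copied from the scenario; descriptions shortened.
TIMELINE = [
    ("14:02", "Monitoring", "P0 alert triggered"),
    ("14:03", "Commander", "incident confirmed, channel opened"),
    ("14:05", "Comms", "management notified, status page updated"),
    ("14:08", "Ops", "DB CPU at 100%, connection pool exhausted"),
    ("14:10", "Dev", "root cause identified"),
    ("14:12", "Commander", "rollback and failover decided"),
    ("14:15", "Ops", "database failover complete"),
    ("14:18", "Dev", "code rollback deployed"),
    ("14:20", "Comms", "recovery announced"),
]


def minutes_since_alert(timeline):
    """Return (offset_minutes, role, event) tuples relative to the first entry."""
    start = datetime.strptime(timeline[0][0], "%H:%M")
    result = []
    for clock, role, event in timeline:
        elapsed = datetime.strptime(clock, "%H:%M") - start
        result.append((int(elapsed.total_seconds() // 60), role, event))
    return result


for offset, role, event in minutes_since_alert(TIMELINE):
    print(f"T+{offset:>2}min  {role:<10} {event}")
```

With these timestamps the root cause is identified at T+8min and recovery is announced at T+18min. Offsets like these are the raw material for the timeline section of a postmortem.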

Three Core Roles

  1. Incident Commander (IC): The overall person in charge of the incident response. Responsible for decision-making, coordinating resources, and setting the pace. The IC doesn't need to be the most technically skilled person, but must be the calmest and have the best big-picture view.
  2. Communications Lead: Responsible for external communication — updating status pages, notifying customers, briefing management. This allows the IC and technical staff to focus on solving the problem without being interrupted by communication tasks.
  3. Tech Lead: Responsible for technical investigation and remediation. Organizes technical staff in division of labor and reports progress and solutions to the IC.

4. Alert Escalation: Ensuring Critical Issues Are Not Missed

The alert system is the "eyes" of incident response. But too few alerts lead to missed issues, while too many cause "alert fatigue" — when you receive hundreds of alerts daily, the truly important one can easily get buried. Alert escalation strategies are the key to solving this problem.

Alert Escalation

  1. 📡 T+0s, monitoring detects the issue: Prometheus detects an exhausted DB connection pool and query timeouts, and automatically triggers a P0 alert.
  2. 📱 T+30s, on-call engineer: Phone, SMS, and chat notify the on-call DBA at the same time.
  3. 👥 T+5min, team leads: The alert automatically escalates to the database and backend team leads.
  4. 🎖️ T+15min, engineering director: The issue is not mitigated, so it escalates to the director.
  5. 🏢 T+30min, VP / CTO: The major incident escalates to executives for external communication.

Escalation Rules

  • P3/P4 alerts: notify only the on-call engineer; no escalation needed.
  • P2 alerts: escalate to the team lead if not acknowledged within 15 minutes.
  • P1 alerts: escalate after 5 minutes unacknowledged, then to the director after 30 minutes unresolved.
  • P0 alerts: notify the whole chain immediately; escalate to VP/CTO if not mitigated within 15 minutes.
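
The escalation rules can be expressed as a small function that answers "who should have been notified by now?". This is a sketch under assumptions, not a real paging tool's API: the names `CHAIN`, `escalation_tier`, and `recipients` are hypothetical, "the whole chain" for P0 is assumed to mean everyone up to the director, and "unresolved" for P1 is treated the same as "not mitigated".

```python
CHAIN = ["on-call engineer", "team lead", "director", "VP/CTO"]


def escalation_tier(severity: str, minutes: int,
                    acknowledged: bool, mitigated: bool) -> int:
    """Return the index in CHAIN of the highest tier that should be notified."""
    if severity in ("P3", "P4"):
        return 0  # on-call only, no escalation
    if severity == "P2":
        return 1 if (not acknowledged and minutes >= 15) else 0
    if severity == "P1":
        if not mitigated and minutes >= 30:
            return 2  # director after 30 minutes unresolved
        if not acknowledged and minutes >= 5:
            return 1  # team lead after 5 minutes unacknowledged
        return 0
    if severity == "P0":
        # Assumption: the "whole chain" paged immediately ends at the director.
        return 3 if (not mitigated and minutes >= 15) else 2
    raise ValueError(f"unknown severity: {severity}")


def recipients(severity: str, minutes: int,
               acknowledged: bool, mitigated: bool) -> list[str]:
    """Everyone from the on-call engineer up to the current escalation tier."""
    tier = escalation_tier(severity, minutes, acknowledged, mitigated)
    return CHAIN[: tier + 1]


# Example: a P1 that was acknowledged but is still not mitigated after 35 minutes.
print(recipients("P1", 35, acknowledged=True, mitigated=False))
```

Keeping the rules in one pure function makes them easy to unit test, which matters because escalation logic is rarely exercised until a real incident.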

Three Tiers of Alert Escalation

  1. Tier 1 Response (L1): When an alert triggers, first notify the on-call engineer. If not acknowledged within 15 minutes, automatically escalate.
  2. Tier 2 Escalation (L2): Notify team leads and relevant domain experts. If not mitigated within 30 minutes, continue escalating.
  3. Tier 3 Escalation (L3): Notify technical directors and management, activate full emergency response.

Alert Level | Notification Method | Response Deadline | Escalation Condition
Warning | IM message | Handle during business hours | Unresolved for 30 minutes
Critical | Phone + IM | Acknowledge within 15 minutes | Unacknowledged or unmitigated
Fatal | Repeated phone calls + SMS | Respond within 5 minutes | Auto-escalate to management

5. Postmortem: Learning from Failures

After an incident is resolved, the most important step is the postmortem. A postmortem is not about assigning blame — it's about finding systemic improvement opportunities. Companies like Google and Meta practice a "blameless postmortem" culture — focusing on "why the system allowed this error to happen," not "who made this error."

Postmortem: 5 Whys Analysis

Symptom: Payment system was completely unavailable for 18 minutes during peak traffic.

Postmortem Template

  1. Incident summary
  2. Timeline
  3. Impact assessment
  4. Root cause analysis
  5. Improvements
  6. Lessons learned

"Five Whys" Analysis Method

Starting from the surface symptom, repeatedly ask "why" until you find the root cause:

  1. Why did the service go down? → Database connection pool exhausted
  2. Why was the connection pool exhausted? → Slow queries holding connections without releasing them
  3. Why were there slow queries? → Missing indexes, causing full table scans
  4. Why were indexes missing? → No DBA review when new tables went live
  5. Why was there no review? → No mandatory SQL review process

The root cause is not "someone forgot to add an index" but "there's no SQL review process." Fixing the root cause prevents recurrence.
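
For a postmortem document, the chain above can be kept as structured data so that the root cause is always the last answer rather than the first one someone thinks of. The sketch below is one possible representation; `FIVE_WHYS`, `root_cause`, and `render` are hypothetical names, and the question and answer text is taken from the example above.

```python
# Each entry is (question, answer); the answer feeds the next question.
FIVE_WHYS = [
    ("Why did the service go down?", "Database connection pool exhausted"),
    ("Why was the connection pool exhausted?",
     "Slow queries holding connections without releasing them"),
    ("Why were there slow queries?", "Missing indexes, causing full table scans"),
    ("Why were indexes missing?", "No DBA review when new tables went live"),
    ("Why was there no review?", "No mandatory SQL review process"),
]


def root_cause(chain):
    """The root cause is the answer to the deepest 'why'."""
    if not chain:
        raise ValueError("the chain needs at least one question")
    return chain[-1][1]


def render(chain):
    """Format the chain as indented plain text for a postmortem report."""
    lines = []
    for depth, (question, answer) in enumerate(chain, start=1):
        indent = "  " * (depth - 1)
        lines.append(f"{indent}{depth}. {question} -> {answer}")
    lines.append(f"Root cause: {root_cause(chain)}")
    return "\n".join(lines)


print(render(FIVE_WHYS))
```

The indentation makes the depth of each question visible, which helps reviewers spot a chain that stopped too early, for example at "missing indexes".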


Summary

Incident response and troubleshooting is an essential capability for every technical team. It doesn't rely on heroic individual efforts, but on systematic processes, clear role assignments, and continuous postmortem-driven improvement.

Key takeaways from this chapter:

  1. Tiered response: P0~P4 classification ensures the appropriate level of effort for each level of issue
  2. Clear timeline: Detection → Response → Mitigation → Resolution → Postmortem, with clear objectives at each stage
  3. Command system: IC + Communications Lead + Tech Lead, with divided responsibilities to avoid chaos
  4. Alert escalation: Tiered alerts + automatic escalation to ensure critical issues are not missed
  5. Blameless postmortem: Use the "Five Whys" to dig into root causes, focus on system improvement rather than individual blame

Further Reading