Production Debugging Playbook

A step-by-step framework for investigating production issues quickly and systematically.

The mindset

Production debugging is not about guessing. It’s about systematically narrowing down possibilities until only the root cause remains. Every step should either confirm or eliminate a hypothesis.

Rule #1: Don’t change anything until you understand what’s broken.
Rule #2: Gather evidence before forming theories.
Rule #3: Document as you go; your future self (and teammates) will thank you.

The 5-step framework

Step 1: Understand the symptom

Before investigating, get the facts:

  • What is the user experiencing? (error message, missing data, slow response)
  • When did it start? (timestamp, deployment, config change)
  • Who is affected? (all users, specific region, specific account)
  • How often? (every request, intermittent, time-based)
Example incident report:
- What: Users report "Payment confirmation not received"
- When: Started at 14:30 UTC; no deployment preceded it
- Who: ~5% of users making payments
- How often: Intermittent, no pattern in user type

Step 2: Check the obvious first

Before deep investigation, rule out common causes:

  1. Recent deployments — Did anything ship in the last 24 hours?
  2. Infrastructure — Are CPU/memory/disk within normal range?
  3. Dependencies — Are external services (payment gateway, database, cache) responding?
  4. Error rates — Did the error rate spike at a specific time?
  5. Logs — Are there new error types appearing?
# Quick health checks
- Dashboard: Are error rates elevated?
- Status pages: Is the payment provider reporting issues?
- Metrics: CPU, memory, connection pool utilization
- Recent deploys: git log --since="24 hours ago"

A large share of incidents (a common rule of thumb puts it around 80%) are caused by recent changes or dependency failures. Check these first.
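When no dashboard is at hand, the "did the error rate spike at a specific time?" check can be approximated from exported log timestamps. The following is a minimal sketch, assuming the timestamps are ISO 8601 strings with an explicit offset; the function name and the per-minute `threshold` are illustrative choices, not part of any particular tool:

```python
from collections import Counter
from datetime import datetime


def first_spike_minute(error_timestamps, threshold):
    """Return the earliest minute whose error count exceeds `threshold`, or None.

    `error_timestamps` is an iterable of ISO 8601 strings.
    `threshold` is a hypothetical per-minute baseline taken from normal traffic.
    """
    # Truncate each timestamp to the minute and count errors per minute.
    per_minute = Counter(
        datetime.fromisoformat(ts).replace(second=0, microsecond=0)
        for ts in error_timestamps
    )
    # Walk the minutes in chronological order and stop at the first spike.
    for minute in sorted(per_minute):
        if per_minute[minute] > threshold:
            return minute
    return None
```

The returned minute is a candidate start time to compare against deployments and config changes, not proof of cause.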

Step 3: Trace the request path

Follow the request from the user’s action to the final system that should have responded:

User → Load Balancer → API Gateway → Auth Service → Payment Service → Database
                                                           ↓
                                              Payment Provider (external)

At each hop, check:

  • Did the request arrive? (check access logs)
  • Did it succeed or fail? (check application logs)
  • How long did it take? (check latency)
  • What was the response? (check status code + body)

Using correlation IDs

If your system has correlation IDs (and it should), this is where they pay off:

# Find all logs for a specific failed transaction
jsonPayload.correlationId="550e8400-e29b-41d4-a716-446655440000"

This single query shows you the complete journey of one request across all services.
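If the matching log entries are exported for offline analysis, the same journey can be rebuilt in a few lines. This is a sketch under assumptions: each entry is a dict with a `correlationId` field (as in the query above) plus hypothetical `service` and `timestamp` fields, and all timestamps share one ISO 8601 format and timezone so that they sort correctly as strings:

```python
def trace_request(log_entries, correlation_id):
    """Return the log entries for one request, ordered by timestamp."""
    hops = [e for e in log_entries if e.get("correlationId") == correlation_id]
    # Same-format ISO 8601 strings sort chronologically as plain strings.
    return sorted(hops, key=lambda e: e["timestamp"])


def last_hop_reached(log_entries, correlation_id):
    """Return the last service that logged the request, or None if none did."""
    hops = trace_request(log_entries, correlation_id)
    return hops[-1]["service"] if hops else None
```

The last hop reached is usually where to look first: either that service failed, or the next one never received the request.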

Step 4: Isolate the failure point

You now know which service or integration failed. Dig deeper:

For application errors:

# Search for the specific error
severity=ERROR
resource.labels.service_name="payment-service"
timestamp>="2026-01-15T14:30:00Z"

For timeout/latency issues:

  • Check connection pool metrics (are all connections in use?)
  • Check query execution times (is the database slow?)
  • Check external API response times

For data inconsistency:

  • Compare records across systems (payment gateway vs. your database)
  • Check for race conditions (concurrent writes to the same record)
  • Look for partial failures (one step succeeded, the next failed)
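The record comparison in the data-inconsistency checks can be sketched as a set difference over transaction IDs. The example assumes both systems can be exported as `{transaction_id: status}` mappings; the function and key names are hypothetical:

```python
def reconcile(gateway_payments, db_payments):
    """Compare two {transaction_id: status} mappings from different systems."""
    gateway_ids = set(gateway_payments)
    db_ids = set(db_payments)
    return {
        # The gateway processed it, but our database has no record.
        "missing_in_db": sorted(gateway_ids - db_ids),
        # Our database has a record the gateway never saw.
        "missing_in_gateway": sorted(db_ids - gateway_ids),
        # Both have the record but disagree on its state (possible partial failure).
        "status_mismatch": sorted(
            tx for tx in gateway_ids & db_ids
            if gateway_payments[tx] != db_payments[tx]
        ),
    }
```

Each bucket points to a different failure mode, so it helps to investigate them separately rather than as one list of "bad" records.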

Step 5: Fix, verify, document

Fix:

  • Apply the minimum change that resolves the issue
  • Don’t refactor during an incident
  • If possible, fix forward (deploy a fix) rather than rolling back

Verify:

  • Confirm the fix resolves the original symptom
  • Check for side effects
  • Monitor error rates for the next hour

Document (post-mortem):

## Incident: Missing payment confirmations
**Duration:** 14:30 – 16:15 UTC (1h 45m)
**Impact:** ~5% of payments did not generate confirmation emails
**Root cause:** Database connection pool timeout during high traffic
**Fix:** Increased pool size from 10 to 25, added retry on timeout
**Prevention:** Add alerting on connection pool utilization > 80%

Common failure patterns

1. The silent failure

Symptom: No errors in logs, but data is missing.

Common causes:

  • Exception caught and swallowed (empty catch block)
  • Async operation failed but wasn’t awaited
  • Message dropped by queue (no acknowledgment)

How to find it: Add logging at the boundaries — before and after every external call. If you see “before” but not “after,” you found the gap.
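One way to add that boundary logging without repeating it at every call site is a small context manager. This is a minimal sketch using Python's standard `logging` module; the `boundary` helper and the logger name are illustrative:

```python
import logging
from contextlib import contextmanager

logger = logging.getLogger("boundary")


@contextmanager
def boundary(name):
    """Log before and after an external call, and log failures explicitly."""
    logger.info("before %s", name)
    try:
        yield
    except Exception:
        # Record the failure with its traceback, then re-raise so it is not swallowed.
        logger.exception("failed %s", name)
        raise
    logger.info("after %s", name)
```

Wrapping an external call as `with boundary("send_confirmation_email"): ...` (the call name is hypothetical) then yields either a before/after pair or a before/failed pair. A "before" with neither follow-up suggests the process died or hung inside the call.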

2. The intermittent timeout

Symptom: Requests randomly fail with timeout errors.

Common causes:

  • Connection pool exhaustion under load
  • DNS resolution delays
  • Garbage collection pauses
  • External service rate limiting

How to find it: Correlate timeout events with resource metrics (CPU, memory, connection count). If timeouts spike when connections hit a ceiling, it’s pool exhaustion.
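That correlation can be checked roughly with exported data. The sketch below assumes timeout events and pool metrics have already been bucketed by the same minute key; the function name, the input shapes, and the 0.9 saturation cutoff are all assumptions chosen for illustration:

```python
def timeouts_at_saturation(timeout_minutes, pool_usage, pool_size, saturation=0.9):
    """Return the fraction of timeouts that occurred while the pool was near its ceiling.

    timeout_minutes: list of minute keys (e.g. "14:31"), one entry per timeout event
    pool_usage: {minute key: peak connections in use during that minute}
    pool_size: configured maximum number of connections
    """
    if not timeout_minutes:
        return 0.0
    saturated = sum(
        1 for minute in timeout_minutes
        # Minutes with no metric sample count as not saturated.
        if pool_usage.get(minute, 0) >= saturation * pool_size
    )
    return saturated / len(timeout_minutes)
```

A result near 1.0 is consistent with pool exhaustion. A result near 0 suggests looking at the other causes in the list (DNS, garbage collection, rate limiting) instead.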

3. The data mismatch

Symptom: System A says “success” but System B has no record.

Common causes:

  • Network failure between acknowledgment and commit
  • Race condition in concurrent writes
  • Missing error handling in webhook processing

How to find it: Compare timestamps and transaction IDs across both systems. The gap between “sent” and “received” (or the absence of “received”) tells you where it broke.

Debugging tools by layer

Layer          | Tools
Browser/Client | DevTools Network tab, console errors
API Gateway    | Access logs, request/response headers
Application    | Structured logs, correlation IDs, APM traces
Database       | Slow query log, connection pool metrics
Infrastructure | CPU/memory/disk metrics, container logs
External APIs  | Status pages, response time monitoring

Key principles

  1. Reproduce before you fix — If you can’t reproduce it, you can’t verify the fix
  2. Logs are your best friend — Invest in structured, queryable logging
  3. Correlation IDs are non-negotiable — Without them, tracing distributed requests is guesswork
  4. Time-box your investigation — If you haven’t found the cause in 30 minutes, escalate
  5. Don’t fix in production — Unless it’s a hotfix; otherwise, go through the normal deploy pipeline
  6. Write the post-mortem — The incident is only resolved when the team has learned from it