Production Debugging Playbook

A step-by-step framework for investigating production issues quickly and systematically.

The mindset

Production debugging is not about guessing. It’s about systematically narrowing down possibilities until only the root cause remains. Every step should either confirm or eliminate a hypothesis.

Rule #1: Don’t change anything until you understand what’s broken.
Rule #2: Gather evidence before forming theories.
Rule #3: Document as you go; your future self (and teammates) will thank you.

The 5-step framework

Step 1: Understand the symptom

Before investigating, get the facts:

What is the user experiencing? (error message, missing data, slow response)
When did it start? (timestamp, deployment, config change)
Who is affected? (all users, specific region, specific account)
How often? (every request, intermittent, time-based)

Example incident report:
- What: Users report "Payment confirmation not received"
- When: Started at 14:30 UTC; no deployment preceded it
- Who: ~5% of users making payments
- How often: Intermittent, no pattern in user type

Step 2: Check the obvious first

Before deep investigation, rule out common causes:

Recent deployments — Did anything ship in the last 24 hours?
Infrastructure — Are CPU/memory/disk within normal range?
Dependencies — Are external services (payment gateway, database, cache) responding?
Error rates — Did the error rate spike at a specific time?
Logs — Are there new error types appearing?

# Quick health checks
- Dashboard: Are error rates elevated?
- Status pages: Is the payment provider reporting issues?
- Metrics: CPU, memory, connection pool utilization
- Recent deploys: git log --since="24 hours ago"

As a rule of thumb, about 80% of incidents are caused by recent changes or dependency failures, so check these first.
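
The "did the error rate spike at a specific time?" check can be approximated from exported metrics. The sketch below assumes per-minute error counts as (timestamp, count) pairs; the function name `first_spike`, the five-sample baseline, and the 3x threshold are illustrative choices, not values taken from any monitoring tool.

```python
from datetime import datetime, timezone
from statistics import mean

def first_spike(samples, baseline_points=5, factor=3.0):
    """Return the timestamp of the first sample whose error count exceeds
    factor x the mean of the first `baseline_points` samples, or None."""
    if len(samples) <= baseline_points:
        return None  # not enough data to establish a baseline
    baseline = mean(count for _, count in samples[:baseline_points])
    threshold = max(baseline * factor, 1.0)  # avoid a zero threshold on a quiet baseline
    for ts, count in samples[baseline_points:]:
        if count > threshold:
            return ts
    return None

def utc(hour, minute):
    # Hypothetical sample date, used only to build the demo data
    return datetime(2026, 1, 15, hour, minute, tzinfo=timezone.utc)

# Made-up per-minute error counts around the reported start time
samples = [(utc(14, m), c) for m, c in
           [(24, 2), (25, 3), (26, 2), (27, 1), (28, 2), (29, 3), (30, 14), (31, 17)]]
spike_at = first_spike(samples)
```

Comparing `spike_at` with the deploy history and dependency status pages is usually enough to confirm or eliminate the two most common hypotheses.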

Step 3: Trace the request path

Follow the request from the user’s action to the final system that should have responded:

User → Load Balancer → API Gateway → Auth Service → Payment Service → Database
                                                   ↓
                                            Payment Provider (external)

At each hop, check:

Did the request arrive? (check access logs)
Did it succeed or fail? (check application logs)
How long did it take? (check latency)
What was the response? (check status code + body)

Using correlation IDs

If your system has correlation IDs (and it should), this is where they pay off:

# Find all logs for a specific failed transaction
jsonPayload.correlationId="550e8400-e29b-41d4-a716-446655440000"

This single query shows you the complete journey of one request across all services.
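
If the logs have been exported as JSON lines rather than queried in a log viewer, the same trace can be rebuilt offline. This is a minimal sketch that assumes each entry carries `timestamp`, `service`, and `correlationId` fields; those field names and the sample log lines are assumptions for illustration.

```python
import json

def trace_request(log_lines, correlation_id):
    """Return the log entries for one correlation ID, oldest first."""
    entries = []
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not structured JSON
        if isinstance(entry, dict) and entry.get("correlationId") == correlation_id:
            entries.append(entry)
    # ISO 8601 UTC timestamps in one consistent format sort correctly as strings
    return sorted(entries, key=lambda e: e.get("timestamp", ""))

# Hypothetical exported log lines, deliberately out of order
log_lines = [
    '{"timestamp": "2026-01-15T14:30:02.310Z", "service": "payment-service", '
    '"correlationId": "550e8400-e29b-41d4-a716-446655440000", "message": "charge requested"}',
    '{"timestamp": "2026-01-15T14:30:02.120Z", "service": "api-gateway", '
    '"correlationId": "550e8400-e29b-41d4-a716-446655440000", "message": "POST /payments"}',
    '{"timestamp": "2026-01-15T14:30:02.200Z", "service": "auth-service", '
    '"correlationId": "another-id", "message": "token ok"}',
    'plain text line that is not JSON',
]

journey = trace_request(log_lines, "550e8400-e29b-41d4-a716-446655440000")
hops = [entry["service"] for entry in journey]
```

The last service in `hops` is the last hop that saw the request, which is where Step 4 picks up.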

Step 4: Isolate the failure point

You now know which service or integration failed. Dig deeper:

For application errors:

# Search for the specific error
severity=ERROR
resource.labels.service_name="payment-service"
timestamp>="2026-01-15T14:30:00Z"

For timeout/latency issues:

Check connection pool metrics (are all connections in use?)
Check query execution times (is the database slow?)
Check external API response times

For data inconsistency:

Compare records across systems (payment gateway vs. your database)
Check for race conditions (concurrent writes to the same record)
Look for partial failures (one step succeeded, the next failed)
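
Partial failures can be surfaced by checking which step each transaction stopped at. The sketch below assumes every service emits a (transaction ID, step) event; the step names in `EXPECTED_STEPS` are hypothetical and would need to match your own pipeline.

```python
# Hypothetical step names for a payment flow, in the order they should occur
EXPECTED_STEPS = ["charge_captured", "order_updated", "confirmation_sent"]

def first_missing_step(completed_steps, expected=EXPECTED_STEPS):
    """Return the first expected step absent from completed_steps, or None."""
    done = set(completed_steps)
    for step in expected:
        if step not in done:
            return step
    return None

def partial_failures(events, expected=EXPECTED_STEPS):
    """events: iterable of (transaction_id, step).
    Returns {transaction_id: first missing step} for incomplete transactions."""
    steps_by_tx = {}
    for tx, step in events:
        steps_by_tx.setdefault(tx, []).append(step)
    incomplete = {}
    for tx, steps in steps_by_tx.items():
        missing = first_missing_step(steps, expected)
        if missing is not None:
            incomplete[tx] = missing
    return incomplete

# Made-up events: tx-1 finished, tx-2 and tx-3 stopped partway
events = [
    ("tx-1", "charge_captured"), ("tx-1", "order_updated"), ("tx-1", "confirmation_sent"),
    ("tx-2", "charge_captured"), ("tx-2", "order_updated"),
    ("tx-3", "charge_captured"),
]
stuck = partial_failures(events)
```

If most stuck transactions stop at the same step, that step's service or dependency is the likely failure point.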

Step 5: Fix, verify, document

Fix:

Apply the minimum change that resolves the issue
Don’t refactor during an incident
If possible, fix forward (deploy a fix) rather than rolling back

Verify:

Confirm the fix resolves the original symptom
Check for side effects
Monitor error rates for the next hour

Document (post-mortem):

## Incident: Missing payment confirmations
**Duration:** 14:30 – 16:15 UTC (1h 45m)
**Impact:** ~5% of payments did not generate confirmation emails
**Root cause:** Database connection pool timeout during high traffic
**Fix:** Increased pool size from 10 to 25, added retry on timeout
**Prevention:** Add alerting on connection pool utilization > 80%
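
The "added retry on timeout" part of a fix like this could look roughly as follows. This is a sketch under stated assumptions, not the implementation from the incident: `call_with_retry`, the attempt count, and the backoff delays are all illustrative, and the `sleep` parameter exists only so the behavior can be tested without waiting.

```python
import time

def call_with_retry(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation(); on TimeoutError, retry with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TimeoutError:
            if attempt == attempts:
                raise  # out of attempts: surface the timeout instead of hiding it
            sleep(base_delay * 2 ** (attempt - 1))

# Demo with a fake operation that times out twice, then succeeds
calls = {"count": 0}

def flaky_acquire_connection():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("pool timeout")
    return "connected"

delays = []  # records the backoff delays instead of actually sleeping
result = call_with_retry(flaky_acquire_connection, sleep=delays.append)
```

Retrying is only safe when the operation is idempotent; retrying a payment charge without an idempotency key can turn a timeout into a double charge.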

Common failure patterns

1. The silent failure

Symptom: No errors in logs, but data is missing.

Common causes:

Exception caught and swallowed (empty catch block)
Async operation failed but wasn’t awaited
Message dropped by queue (no acknowledgment)

How to find it: Add logging at the boundaries — before and after every external call. If you see “before” but not “after,” you found the gap.
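
One way to add boundary logging without scattering log statements is a small context manager around each external call. The sketch below uses Python's standard `logging` module; the `boundary` helper, the logger name, and the `send_confirmation_email` label are hypothetical.

```python
import logging
from contextlib import contextmanager

logger = logging.getLogger("boundary")

@contextmanager
def boundary(name, **context):
    """Log before and after an external call; log the exception if it fails."""
    logger.info("before %s %s", name, context)
    try:
        yield
    except Exception:
        logger.exception("failed %s %s", name, context)
        raise  # never swallow the error here
    logger.info("after %s %s", name, context)

class ListHandler(logging.Handler):
    """Collects formatted messages in memory so the demo can inspect them."""
    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())

handler = ListHandler()
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False  # keep demo output out of the root logger

with boundary("send_confirmation_email", payment_id="p-1"):
    pass  # the external call would go here

try:
    with boundary("send_confirmation_email", payment_id="p-2"):
        raise ConnectionError("mail server unreachable")
except ConnectionError:
    pass

markers = [message.split()[0] for message in handler.messages]
```

A "before" with neither an "after" nor a "failed" entry points to a call that never returned, for example a hung connection or a process that was killed mid-request.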

2. The intermittent timeout

Symptom: Requests randomly fail with timeout errors.

Common causes:

Connection pool exhaustion under load
DNS resolution delays
Garbage collection pauses
External service rate limiting

How to find it: Correlate timeout events with resource metrics (CPU, memory, connection count). If timeouts spike when connections hit a ceiling, it’s pool exhaustion.
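
A rough version of that correlation can be computed from two exported metric series. The sketch assumes one sample per interval of (connections in use, timeout count) and a known pool size; the function name and the sample numbers are made up for illustration.

```python
def timeout_rate_by_saturation(samples, pool_size):
    """samples: iterable of (connections_in_use, timeouts) per interval.
    Returns the average timeouts per interval, split by whether the pool was full."""
    buckets = {"saturated": [0, 0], "below_ceiling": [0, 0]}  # [intervals, timeouts]
    for in_use, timeouts in samples:
        key = "saturated" if in_use >= pool_size else "below_ceiling"
        buckets[key][0] += 1
        buckets[key][1] += timeouts
    return {
        key: (timeouts / intervals if intervals else 0.0)
        for key, (intervals, timeouts) in buckets.items()
    }

# Made-up samples for a pool of 10 connections
samples = [(4, 0), (6, 0), (10, 3), (7, 1), (10, 5), (5, 0)]
rates = timeout_rate_by_saturation(samples, pool_size=10)
```

A much higher rate in the `saturated` bucket supports the pool exhaustion hypothesis; similar rates in both buckets suggest looking at DNS, garbage collection, or rate limiting instead.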

3. The data mismatch

Symptom: System A says “success” but System B has no record.

Common causes:

Network failure between acknowledgment and commit
Race condition in concurrent writes
Missing error handling in webhook processing

How to find it: Compare timestamps and transaction IDs across both systems. The gap between “sent” and “received” (or the absence of “received”) tells you where it broke.
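
The comparison can be scripted once both systems' records are exported. This sketch assumes each side provides a mapping of transaction ID to an ISO 8601 timestamp with an explicit UTC offset; the function name and the sample records are hypothetical.

```python
from datetime import datetime

def delivery_gaps(sent, received):
    """sent, received: {transaction_id: ISO 8601 timestamp}.
    Returns {transaction_id: seconds from send to receipt, or None if never received}."""
    gaps = {}
    for tx, sent_at in sent.items():
        received_at = received.get(tx)
        if received_at is None:
            gaps[tx] = None  # System B has no record at all
        else:
            delta = datetime.fromisoformat(received_at) - datetime.fromisoformat(sent_at)
            gaps[tx] = delta.total_seconds()
    return gaps

# Made-up records: System A's "sent" log and System B's "received" log
sent = {
    "tx-1": "2026-01-15T14:30:02+00:00",
    "tx-2": "2026-01-15T14:31:10+00:00",
    "tx-3": "2026-01-15T14:32:00+00:00",
}
received = {
    "tx-1": "2026-01-15T14:30:03+00:00",
    "tx-3": "2026-01-15T14:32:45+00:00",
}

gaps = delivery_gaps(sent, received)
never_received = sorted(tx for tx, gap in gaps.items() if gap is None)
```

Entries in `never_received` point to a lost message or unhandled webhook, while unusually large gaps point to retries or queue backlog.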

Debugging tools by layer

Layer	Tools
Browser/Client	DevTools Network tab, console errors
API Gateway	Access logs, request/response headers
Application	Structured logs, correlation IDs, APM traces
Database	Slow query log, connection pool metrics
Infrastructure	CPU/memory/disk metrics, container logs
External APIs	Status pages, response time monitoring

Key principles

Reproduce before you fix — If you can’t reproduce it, you can’t verify the fix
Logs are your best friend — Invest in structured, queryable logging
Correlation IDs are non-negotiable — Without them, tracing distributed requests is guesswork
Time-box your investigation — If you haven’t found the cause in 30 minutes, escalate
Don’t fix in production — Unless it’s a hotfix; otherwise, go through the normal deploy pipeline
Write the post-mortem — The incident is only resolved when the team has learned from it