Production Debugging Playbook
A step-by-step framework for investigating production issues quickly and systematically.
The mindset
Production debugging is not about guessing. It’s about systematically narrowing down possibilities until only the root cause remains. Every step should either confirm or eliminate a hypothesis.
- Rule #1: Don’t change anything until you understand what’s broken.
- Rule #2: Gather evidence before forming theories.
- Rule #3: Document as you go; your future self (and teammates) will thank you.
The 5-step framework
Step 1: Understand the symptom
Before investigating, get the facts:
- What is the user experiencing? (error message, missing data, slow response)
- When did it start? (timestamp, deployment, config change)
- Who is affected? (all users, specific region, specific account)
- How often? (every request, intermittent, time-based)
Example incident report:
- What: Users report "Payment confirmation not received"
- When: Started at 14:30 UTC; no deployment preceded it
- Who: ~5% of users making payments
- How often: Intermittent, no pattern in user type
Step 2: Check the obvious first
Before deep investigation, rule out common causes:
- Recent deployments — Did anything ship in the last 24 hours?
- Infrastructure — Are CPU/memory/disk within normal range?
- Dependencies — Are external services (payment gateway, database, cache) responding?
- Error rates — Did the error rate spike at a specific time?
- Logs — Are there new error types appearing?
# Quick health checks
- Dashboard: Are error rates elevated?
- Status pages: Is the payment provider reporting issues?
- Metrics: CPU, memory, connection pool utilization
- Recent deploys: git log --since="24 hours ago"
As a rule of thumb, about 80% of incidents are caused by recent changes or dependency failures, so check these first.
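To pin down when the error rate spiked rather than eyeballing a dashboard, a rough sketch like the following can help. It assumes you can export per-minute request and error counts. The `MinuteBucket` type, the `first_spike` function, and the thresholds (three times the baseline, at least 1%) are illustrative choices, not values from any monitoring tool.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class MinuteBucket:
    """One minute of traffic (hypothetical export format)."""
    minute: str      # label such as "14:30"
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def first_spike(buckets: Sequence[MinuteBucket],
                baseline_size: int = 5,
                factor: float = 3.0,
                min_rate: float = 0.01) -> Optional[str]:
    """Return the first minute whose error rate exceeds the threshold.

    The threshold is `factor` times the average rate of the first
    `baseline_size` buckets, but never lower than `min_rate`.
    Both defaults are assumptions; tune them to your own traffic.
    """
    baseline = buckets[:baseline_size]
    if not baseline:
        return None
    baseline_rate = sum(b.error_rate for b in baseline) / len(baseline)
    threshold = max(baseline_rate * factor, min_rate)
    for bucket in buckets[baseline_size:]:
        if bucket.error_rate > threshold:
            return bucket.minute
    return None
```

The returned minute is a starting point for the timeline: compare it against deploy times, config changes, and dependency status updates.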
Step 3: Trace the request path
Follow the request from the user’s action to the final system that should have responded:
User → Load Balancer → API Gateway → Auth Service → Payment Service → Database
Payment Service → Payment Provider (external)
At each hop, check:
- Did the request arrive? (check access logs)
- Did it succeed or fail? (check application logs)
- How long did it take? (check latency)
- What was the response? (check status code + body)
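The four per-hop questions can be recorded and evaluated in order, as in the sketch below. It assumes you fill in one record per hop by hand or from your own log queries. `HopResult`, `first_failing_hop`, and the 2000 ms slowness threshold are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class HopResult:
    hop: str                            # service name along the request path
    arrived: bool                       # seen in the access log?
    status: Optional[int] = None        # HTTP status returned, if any
    latency_ms: Optional[float] = None  # time spent at this hop


def first_failing_hop(path: Sequence[HopResult],
                      slow_ms: float = 2000.0) -> Optional[str]:
    """Walk the path in order and describe the first hop that looks wrong.

    Any status >= 400 counts as a failure here; that is an assumption,
    since some 4xx responses are expected in normal operation.
    """
    for result in path:
        if not result.arrived:
            return f"{result.hop}: request never arrived"
        if result.status is None:
            return f"{result.hop}: no response recorded"
        if result.status >= 400:
            return f"{result.hop}: failed with status {result.status}"
        if result.latency_ms is not None and result.latency_ms > slow_ms:
            return f"{result.hop}: slow ({result.latency_ms:.0f} ms)"
    return None
```

Checking hops in request order matters: a failure downstream is often just a consequence of the first hop that misbehaved.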
Using correlation IDs
If your system has correlation IDs (and it should), this is where they pay off:
-- Find all logs for a specific failed transaction
jsonPayload.correlationId="550e8400-e29b-41d4-a716-446655440000"
This single query shows you the complete journey of one request across all services.
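If the logs are exported to a file instead of queried in place, the same idea can be sketched in a few lines of Python. This assumes newline-delimited JSON with top-level `correlationId` and `timestamp` fields. Those field names and the `trace_request` function are assumptions for illustration, so adjust them to your own log schema.

```python
import json
from typing import Iterable, List


def trace_request(log_lines: Iterable[str], correlation_id: str) -> List[dict]:
    """Return the parsed entries for one correlation ID, oldest first."""
    entries = []
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines rather than abort the search
        if isinstance(entry, dict) and entry.get("correlationId") == correlation_id:
            entries.append(entry)
    # Assumes ISO 8601 UTC timestamps in one consistent format,
    # which sort correctly as plain strings.
    return sorted(entries, key=lambda e: e.get("timestamp", ""))
```

Reading the result top to bottom gives the request's journey; the last service that logged anything is usually where to look next.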
Step 4: Isolate the failure point
You now know which service or integration failed. Dig deeper:
For application errors:
-- Search for the specific error
severity=ERROR
resource.labels.service_name="payment-service"
timestamp>="2026-01-15T14:30:00Z"
For timeout/latency issues:
- Check connection pool metrics (are all connections in use?)
- Check query execution times (is the database slow?)
- Check external API response times
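Averages tend to hide the slow tail that causes timeouts, so it can help to compare a high percentile per dependency. The sketch below assumes you have latency samples as `(dependency, latency_ms)` pairs. The function names and the nearest-rank percentile method are illustrative choices, not the output of any particular APM tool.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least
    pct% of all samples at or below it."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no samples")
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def p95_by_dependency(samples: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Group latency samples by dependency name and return each group's p95."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for dependency, latency_ms in samples:
        grouped[dependency].append(latency_ms)
    return {name: percentile(vals, 95) for name, vals in grouped.items()}
```

A dependency whose p95 sits close to your client timeout is a likely source of intermittent failures even if its average looks healthy.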
For data inconsistency:
- Compare records across systems (payment gateway vs. your database)
- Check for race conditions (concurrent writes to the same record)
- Look for partial failures (one step succeeded, the next failed)
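Partial failures in particular are easy to detect mechanically once each step emits an event. The following sketch assumes events arrive as `(transaction_id, step)` pairs. The step names in `EXPECTED_STEPS` and the `find_partial_failures` function are hypothetical; substitute the steps of your own workflow.

```python
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Set, Tuple

# Hypothetical step names for a payment flow, in the order they should occur.
EXPECTED_STEPS = ("charge_created", "charge_confirmed", "email_sent")


def find_partial_failures(events: Iterable[Tuple[str, str]],
                          expected: Sequence[str] = EXPECTED_STEPS) -> Dict[str, str]:
    """Map each incomplete transaction ID to the first expected step it lacks."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for transaction_id, step in events:
        seen[transaction_id].add(step)

    incomplete = {}
    for transaction_id, steps in seen.items():
        for step in expected:
            if step not in steps:
                incomplete[transaction_id] = step
                break
    return incomplete
```

If most incomplete transactions stop at the same step, that step (or the call it makes) is the failure point to isolate.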
Step 5: Fix, verify, document
Fix:
- Apply the minimum change that resolves the issue
- Don’t refactor during an incident
- If possible, fix forward (deploy a fix) rather than rolling back
Verify:
- Confirm the fix resolves the original symptom
- Check for side effects
- Monitor error rates for the next hour
Document (post-mortem):
## Incident: Missing payment confirmations
**Duration:** 14:30 – 16:15 UTC (1h 45m)
**Impact:** ~5% of payments did not generate confirmation emails
**Root cause:** Database connection pool timeout during high traffic
**Fix:** Increased pool size from 10 to 25, added retry on timeout
**Prevention:** Add alerting on connection pool utilization > 80%
Common failure patterns
1. The silent failure
Symptom: No errors in logs, but data is missing.
Common causes:
- Exception caught and swallowed (empty catch block)
- Async operation failed but wasn’t awaited
- Message dropped by queue (no acknowledgment)
How to find it: Add logging at the boundaries — before and after every external call. If you see “before” but not “after,” you found the gap.
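One way to add boundary logging without scattering log statements is a small context manager, sketched below with the standard `logging` module. The names `log_boundary` and `send_confirmation`, and the `send` callable passed in, are hypothetical and stand in for whatever external call you are wrapping.

```python
import logging
from contextlib import contextmanager

logger = logging.getLogger("boundary")


@contextmanager
def log_boundary(name: str, **context):
    """Log immediately before and after an external call, including on failure."""
    logger.info("before %s %s", name, context)
    try:
        yield
    except Exception:
        # Log and re-raise: the goal is visibility, not swallowing the error.
        logger.exception("failed %s %s", name, context)
        raise
    else:
        logger.info("after %s %s", name, context)


def send_confirmation(send, order_id):
    """Example use: `send` is any callable that talks to an external system."""
    with log_boundary("email.send", order_id=order_id):
        return send(order_id)
```

A "before" entry with neither "after" nor "failed" points to a call that hung or a process that died mid-call, which is exactly the gap this pattern is meant to expose.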
2. The intermittent timeout
Symptom: Requests randomly fail with timeout errors.
Common causes:
- Connection pool exhaustion under load
- DNS resolution delays
- Garbage collection pauses
- External service rate limiting
How to find it: Correlate timeout events with resource metrics (CPU, memory, connection count). If timeouts spike when connections hit a ceiling, it’s pool exhaustion.
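The correlation can be made concrete with a simple ratio: what share of all timeouts happened while the pool was at or near its limit? The sketch below assumes you have one sample per time interval with the active connection count and the number of timeouts. The `Sample` type, the function name, and the 95% saturation cutoff are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Sample:
    """Metrics for one time interval (hypothetical export format)."""
    active_connections: int
    timeouts: int


def timeout_share_at_ceiling(samples: Sequence[Sample],
                             pool_size: int,
                             saturation: float = 0.95) -> float:
    """Fraction of all timeouts that occurred while the pool was nearly full.

    An interval counts as saturated when active connections reach
    `saturation` * `pool_size`; the 0.95 default is an assumption.
    """
    total = sum(s.timeouts for s in samples)
    if total == 0:
        return 0.0
    saturated = sum(s.timeouts for s in samples
                    if s.active_connections >= saturation * pool_size)
    return saturated / total
```

A share close to 1.0 supports the pool-exhaustion hypothesis; a low share suggests looking at the other causes in the list, such as DNS delays or rate limiting.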
3. The data mismatch
Symptom: System A says “success” but System B has no record.
Common causes:
- Network failure between acknowledgment and commit
- Race condition in concurrent writes
- Missing error handling in webhook processing
How to find it: Compare timestamps and transaction IDs across both systems. The gap between “sent” and “received” (or the absence of “received”) tells you where it broke.
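A minimal reconciliation sketch follows. It assumes each system can export a mapping of transaction ID to an ISO 8601 timestamp with an explicit offset such as `+00:00`. `Mismatch`, `find_unreceived`, and `delivery_delays` are hypothetical names, not part of any gateway's API.

```python
from datetime import datetime
from typing import Dict, List, NamedTuple


class Mismatch(NamedTuple):
    transaction_id: str
    sent_at: str


def find_unreceived(sent: Dict[str, str], received: Dict[str, str]) -> List[Mismatch]:
    """Transactions System A marked as sent that System B never recorded, oldest first."""
    missing = [Mismatch(tx, ts) for tx, ts in sent.items() if tx not in received]
    # Assumes one consistent timestamp format, so string order is time order.
    return sorted(missing, key=lambda m: m.sent_at)


def delivery_delays(sent: Dict[str, str], received: Dict[str, str]) -> Dict[str, float]:
    """Seconds between 'sent' and 'received' for transactions present in both systems."""
    return {
        tx: (datetime.fromisoformat(received[tx])
             - datetime.fromisoformat(sent[tx])).total_seconds()
        for tx in sent
        if tx in received
    }
```

The oldest unreceived transaction approximates when the problem began, and unusually long delays among the delivered ones can show that the path was degrading before it failed outright.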
Debugging tools by layer
| Layer | Tools |
|---|---|
| Browser/Client | DevTools Network tab, console errors |
| API Gateway | Access logs, request/response headers |
| Application | Structured logs, correlation IDs, APM traces |
| Database | Slow query log, connection pool metrics |
| Infrastructure | CPU/memory/disk metrics, container logs |
| External APIs | Status pages, response time monitoring |
Key principles
- Reproduce before you fix — If you can’t reproduce it, you can’t verify the fix
- Logs are your best friend — Invest in structured, queryable logging
- Correlation IDs are non-negotiable — Without them, tracing distributed requests is guesswork
- Time-box your investigation — If you haven’t found the cause in 30 minutes, escalate
- Don’t fix in production: apart from a genuine hotfix, every change goes through the normal deploy pipeline
- Write the post-mortem — The incident is only resolved when the team has learned from it