Skip to content
S sufi.my
Back to Projects

Case Study

Payment Transaction Debugging System

Investigated missing financial transactions by tracing logs across distributed services — identified a silent database write failure in the webhook processing layer and restored affected records.

JavaSpring BootGCP LoggingSQLRazorpay

Context

We operate a fintech payment flow where customers complete payments via Razorpay, which then fires a webhook to our backend. The backend processes the webhook, records the transaction in the database, and triggers invoice generation.

A production incident was raised: users were reporting successful payments with no invoice appearing and no record in their account dashboard.


Problem

The failure was silent — no alerts fired, no error emails, no exception logs at the surface level. From the user’s perspective:

  • Razorpay showed the payment as successful
  • Our dashboard showed no transaction
  • Customer support had no way to verify or recover the payment manually

This is the worst category of production bug: a discrepancy between external system state and internal system state, with no obvious failure signal.


Investigation

I had access to Google Cloud Logging and the Spring Boot application logs. My investigation followed the transaction ID across each system boundary.

Step 1 — Confirm the webhook was received

Filtered GCP logs by the razorpay_payment_id from the affected transaction:

resource.type="gce_instance"
jsonPayload.razorpay_payment_id="pay_XXXXXXXXXXXX"

The webhook hit our endpoint and returned HTTP 200. So the issue was not at the network boundary — Razorpay delivered the event successfully.

Step 2 — Trace the processing chain

The webhook handler passed the payload to a service layer, which was responsible for:

  1. Validating the signature
  2. Parsing the payment data
  3. Writing to the transactions table
  4. Publishing an event for invoice generation

I traced log entries with the same razorpay_payment_id through each step. All four steps showed log entries — but the database write step had a 28-second timestamp gap before the next entry, which was unusual. Normal writes took under 50ms.

Step 3 — Identify the timeout

The 28-second gap matched exactly the default JDBC connection timeout configured for our database pool. Further investigation showed the database had experienced a connection spike at that timestamp — the connection pool was saturated, the write timed out, and the exception was caught and swallowed by the service layer without retrying or logging the failure explicitly.

The webhook handler returned 200 regardless of whether the downstream write succeeded, because it only checked for a successful parse — not a successful persistence.


Root Cause

A JDBC connection pool timeout during a database write in the webhook processing service. The exception was caught silently — the webhook returned HTTP 200 to Razorpay (preventing retries), but the transaction was never persisted.

The failure was masked by two design gaps:

  1. No explicit persistence check before returning 200 to the payment gateway
  2. No retry mechanism for failed transaction writes
  3. Insufficient logging in the catch block — the exception was caught but only a generic warning was emitted without the original payment ID

Resolution

Immediate actions:

  • Queried the Razorpay dashboard API to retrieve all payments in the affected time window
  • Cross-referenced against the transactions table to identify the 7 missing records
  • Manually reprocessed the missing transactions using the payment data from Razorpay

Fixes implemented:

  • Changed the webhook handler to only return HTTP 200 after confirming successful database persistence
  • Added explicit error logging in the catch block including razorpay_payment_id and full stack trace
  • Added a database health check to the service startup sequence

Improvements recommended:

  • Implement idempotent retry logic for the transaction write with exponential backoff
  • Add a reconciliation job that periodically checks for payment events received from Razorpay but missing from the database
  • Set up a GCP Logging alert for any catch block that includes a payment ID without a corresponding successful write

Impact

  • 7 transactions recovered and invoices regenerated for affected users
  • Reduced future incident mean time to identify (MTTI) by introducing explicit payment ID logging in all error paths
  • The debugging procedure was documented and shared with the team as a runbook for similar payment incidents
  • The fix prevented Razorpay from treating the webhook as failed (which would have triggered duplicate payment retries)

What This Demonstrates

  • Log-driven root cause analysis — following a transaction ID across system boundaries using GCP Logging filter queries
  • Understanding distributed failure modes — silent failures caused by mismatched success signals between services
  • Production incident ownership — from identification through data recovery to preventive documentation
  • Practical debugging under pressure — no staging reproduction, working from production logs only