Skip to content

feat(bridge): crash-safe idempotency for withdraws and refunds#1096

Merged
sameh-farouk merged 1 commit into
developmentfrom
fix/bridge-crash-safe-idempotency
Jun 3, 2026
Merged

feat(bridge): crash-safe idempotency for withdraws and refunds#1096
sameh-farouk merged 1 commit into
developmentfrom
fix/bridge-crash-safe-idempotency

Conversation

@sameh-farouk
Copy link
Copy Markdown
Member

@sameh-farouk sameh-farouk commented Jun 1, 2026

Summary

Adds a persistent idempotency store + startup reconciliation so a crash between submitting a Stellar payment and confirming it on TFChain is recovered without double-paying or double-confirming. Per-transaction submit is retained — batching is intentionally excluded/deferred (see notes). Bridge-only, no runtime upgrade.

This is the crash-safety slice carved out of the larger #1079 draft (batching dropped; the 6 bug fixes split into #1094/#1095; #1092's quarantine preserved).

What it does

  • pkg/idempotency.go — bbolt-backed store tracking PROCESSING/COMPLETED per withdraw (by burn tx id) and refund (by tx hash). Refuses to downgrade COMPLETEDPROCESSING. Reset() wipes it.
  • handleWithdrawReady / handleRefundReady — check the store first (skip if COMPLETED); if PROCESSING, look up Horizon for an already-submitted payment and only complete the TFChain confirmation if found; otherwise mark PROCESSING → submit → confirm → mark COMPLETED. The fix(bridge): quarantine undeliverable refunds instead of crashlooping #1092 undeliverable-refund quarantine is preserved and now also marks the refund COMPLETED.
  • Recovery keys: withdraws are tagged with the burn tx id as a Stellar text memo (traceability + recovery by memo), with the account sequence number as a fallback for pre-memo submissions. Refunds recover by the existing MemoReturn hash, sequence as fallback.
  • reconcilePendingTransactions — runs once at startup to recover entries left PROCESSING by a previous run, using one Horizon page for all lookups.
  • Event loop — Ready events processed before Created/Expired (Ready submits time-sensitive Stellar signatures).
  • Chain-reset safety — the store is chain-scoped (burn tx ids restart after a reset and would collide with stale COMPLETED entries → wrongly skipped withdraws), so it is wiped via Reset() when started with RescanBridgeAccount (the same flag that zeroes the Stellar cursor).

Safety reasoning / regression verification (adversarially reviewed)

  • No double Stellar payment. PROCESSING is persisted before submit. Crash-after-submit → restart finds the existing tx (memo/sequence) and only confirms on TFChain. Crash-before-submit → no tx exists → safe re-submit. Even if a submitted tx scrolled out of the 200-record Horizon window, a re-submit reuses the already-consumed sequence → Stellar rejects with tx_bad_seq → postpone. No second payment possible.
  • No false COMPLETED. Set only after a successful on-chain confirm (or intentional quarantine forfeit). A submit failure leaves the entry PROCESSING.
  • Deposit-memo non-conflict. The deposit→mint path (mint.go) reads incoming memos as the mint destination; withdraws are outgoing (account_debited, From==bridge) and the recovery page only queries source_account=bridge, so a withdraw memo can never be read as a deposit. Withdraw (text) vs refund (return) memo types are disjoint.
  • Reset() safety. Atomic DeleteBucket+CreateBucket for both buckets; mints are independently guarded on-chain by IsMintedAlready, so they don't need store tracking.
  • MemoText 28-byte limit — a uint64 is ≤ 20 digits. Safe.

⚠️ Deployment note (breaking — coordinate validators)

The withdraw memo is part of the signed transaction, so all validators must run this version together. During a mixed-version rollout, a withdraw whose collected signatures span memo + no-memo versions is rejected by Stellar with tx_bad_auth and postponed/retried — withdraws stall but do not crash or double-pay, and self-heal once every validator is upgraded. Roll out to the whole validator set together (or in a window where transient withdraw delays are acceptable).

Testing

  • go build ./..., go vet ./..., gofmt -l — all clean.
  • Adversarial logic review of double-pay/false-completion/quarantine/reset/mixed-rollout vectors — no findings at/above NIT.
  • The bridge has no Go unit harness for these handlers; idempotency.go is a good follow-up unit-test target.

Relationship to other PRs

🤖 Generated with Claude Code

Closes #1054, #1053

Adds a persistent idempotency store and startup reconciliation so a crash
between submitting a Stellar payment and confirming it on TFChain is
recovered without double-paying or double-confirming. Per-transaction
submit is retained (batching is intentionally excluded/deferred).

- pkg/idempotency.go: bbolt-backed store tracking PROCESSING/COMPLETED
  state per withdraw (by burn tx id) and refund (by tx hash); refuses to
  downgrade COMPLETED back to PROCESSING. Reset() wipes the store.
- handleWithdrawReady / handleRefundReady: check the store first (skip if
  COMPLETED); on PROCESSING, look up Horizon for an existing outgoing tx
  and only complete the TFChain confirmation if found, otherwise mark
  PROCESSING, submit, confirm, then mark COMPLETED. The existing #1092
  undeliverable-refund quarantine is preserved and now also marks the
  refund COMPLETED.
- Withdraw recovery: withdraw payments are now tagged with the burn tx id
  as a Stellar text memo (traceability + recovery by memo), with the
  account sequence number as a fallback for pre-memo submissions. The memo
  is part of the signed tx and is set identically at both build sites
  (CreatePaymentAndReturnSignature + CreatePaymentWithSignaturesAndSubmit).
  Refund recovery uses the existing MemoReturn hash, sequence as fallback.
- reconcilePendingTransactions: runs once at startup to recover entries
  left PROCESSING by a previous run, using one Horizon page for all lookups.
- Event loop: process Ready events before Created/Expired (Ready submits
  time-sensitive Stellar signatures).
- Chain-reset safety: the store is chain-scoped (burn tx ids restart after
  a reset), so it is wiped via Reset() when started with RescanBridgeAccount
  (the same flag that zeroes the Stellar cursor).

Deployment note: the withdraw memo is part of the signed transaction, so
all validators must run this version together. During a mixed-version
rollout, withdraw submissions whose signature set spans both versions are
rejected (tx_bad_auth) and postponed/retried — withdraws stall but do not
crash or double-pay, and self-heal once all validators are upgraded.

No runtime upgrade.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@sameh-farouk sameh-farouk force-pushed the fix/bridge-crash-safe-idempotency branch from 06e8e11 to 70b133f Compare June 3, 2026 00:44
@sameh-farouk sameh-farouk marked this pull request as ready for review June 3, 2026 11:16
@sameh-farouk sameh-farouk merged commit 86618a8 into development Jun 3, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bridge: Lack of Atomicity

1 participant