Skip to content

fix: nonce cache desync recovery and Loop partial-success preservation#512

Merged
chrisli30 merged 4 commits into
mainfrom
staging
Apr 9, 2026
Merged

fix: nonce cache desync recovery and Loop partial-success preservation#512
chrisli30 merged 4 commits into
mainfrom
staging

Conversation

@chrisli30
Copy link
Copy Markdown
Member

Summary

Test plan

🤖 Generated with Claude Code

will-dz added 3 commits April 8, 2026 12:33
Loop runner was discarding its partial-success results array on the
error return path: any per-iteration runner failure caused the step to
return a non-nil error, which the response builder treats as "drop
OutputData", so the SDK saw data: null instead of [null, {...}].

Distinguish infrastructure failures (queue submit errors, iteration
timeouts) from per-iteration runner failures. Only the former fail the
loop step; the latter leave results[i] = nil as a partial-success
placeholder and the loop returns success with the array preserved.

Fixes #511.
…ssor (#510)

When on_chain_nonce < userOp.Nonce and the bundler returned AA25, the
conflict handler always assumed a predecessor UserOp would eventually
mine. If the predecessor had been dropped from the bundler mempool, the
cache stayed stale, the rebuild loop kept reusing the same nonce, and
every subsequent paymaster-sponsored UserOp for that wallet failed until
the aggregator was restarted.

Now inspect the bundler mempool: if no pending UserOp exists in
[on_chain, userOp.Nonce, reset the nonce cache and rewind to the
on-chain nonce so the rebuild self-heals.

EOF
)
@claude
Copy link
Copy Markdown

claude Bot commented Apr 9, 2026

Code Review

Overall this is a well-motivated pair of fixes with clear comments and good separation of infra vs. per-iteration failures. A few items worth addressing before merge.


1. No test coverage for the nonce desync recovery path (builder.go)

The most significant new logic — the mempool inspection + conditional ResetNonce call — has no automated test. The test plan says go test ./pkg/erc4337/preset/... but that would only catch regressions in existing tests; the three distinct cases in the new branch are all untested:

  • (a) mempool has a predecessor → keep current nonce, sleep + retry
  • (b) mempool is empty → rewind to on-chain nonce via ResetNonce
  • (c) mempool API returns an error → conservative fallback (predecessorPending = true)

Since BundlerClient is a concrete struct (not an interface), you'd need either an interface + mock or an integration test to cover this. Given that this path is exercised only on a live bundler, at minimum a comment explaining what test infrastructure is missing and tracking it as a follow-up would be appropriate.


2. context.Background() with no timeout for mempool inspection (builder.go:1169)

if pendingOps, mempoolErr := bundlerClient.GetPendingUserOpsForSender(context.Background(), entrypoint, userOp.Sender); ...

All other bundlerClient calls in this function already use context.Background(), so this is consistent with the existing pattern — but it's still a concern. If the bundler's RPC hangs on GetPendingUserOpsForSender, the entire retry loop stalls indefinitely on this new call. The existing calls (gas estimation, send) are at least on the hot path where a hang is obvious; this one is in a recovery branch that's harder to observe.

Suggestion: use a short-lived derived context:

mempoolCtx, mempoolCancel := context.WithTimeout(context.Background(), 5*time.Second)
defer mempoolCancel()
pendingOps, mempoolErr := bundlerClient.GetPendingUserOpsForSender(mempoolCtx, entrypoint, userOp.Sender)

3. Silent skip on unparseable nonce with no debug log (builder.go:1172-1179)

if _, ok := opNonce.SetString(strings.TrimPrefix(op.Nonce, "0x"), 16); !ok {
    if _, ok := opNonce.SetString(op.Nonce, 10); !ok {
        continue  // silently skipped
    }
}

If a bundler returns a nonce in an unexpected format, it's silently ignored and predecessorPending stays false, which would trigger a spurious ResetNonce. Add a debug log before continue:

l.Debug("failed to parse pending op nonce, skipping", "nonce", op.Nonce, "sender", userOp.Sender.Hex())

4. Nonce parse ambiguity: decimal-looking strings are silently reinterpreted as hex

big.Int.SetString(strings.TrimPrefix("10", "0x"), 16) succeeds — it parses "10" as hex 16, not decimal 10. The decimal fallback is never reached for bare digit strings. In practice EIP-4337 bundlers return "0x..." hex strings so this path won't be hit, but the comment on the fallback block (// Fall back to decimal parse) implies it handles decimal — it only handles strings that are invalid hex. Worth a clarifying comment or just removing the fallback if the only realistic input is "0x...".


5. success at finalizeStep is always true (vm.go:4853)

After the change, success is set to false only in the same places that also set infraFailure = true. The early-return at line 4848 handles every case where success == false. So finalizeStep(s, success, ...) at line 4853 is always equivalent to finalizeStep(s, true, ...).

This isn't wrong, but it's subtly misleading — it looks like success could be false here. Consider making the invariant explicit:

// infraFailure cases already returned above; success is always true here.
finalizeStep(s, true, nil, "", log.String())

6. Test assertion gap in TestLoopNode_ContractWrite_InvalidAddress_PartialFailure

The test now asserts step.Success == true (correct) but drops the previous step.Error assertion without adding a replacement. Worth explicitly asserting step.Error == "" to lock in the expected behavior that a partial-success loop does not surface an error string:

assert.Empty(t, step.Error, "Loop step error should be empty when loop ran to completion")

Nit: target branch

CLAUDE.md says feature branches target staging, never main directly — but this PR is the staging → main release PR, which is the documented release flow. No action needed; just noting it's expected.


Summary

# Severity File Issue
1 Medium builder.go No test for nonce desync recovery (3 branches uncovered)
2 Medium builder.go:1169 Unbounded context.Background() for mempool inspection
3 Low builder.go:1179 Silent skip on unparseable nonce, no debug log
4 Low builder.go:1172 Decimal-looking nonces misinterpreted as hex
5 Low vm.go:4853 success is always true at final finalizeStep; misleading
6 Low manual_trigger_loop_test.go Missing step.Error == "" assertion

The logic is sound — items 1 and 2 are the ones I'd prioritize before shipping.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses two reliability issues: (1) recovering from ERC-4337 paymaster nonce-cache desyncs that previously caused repeated rebuild loops, and (2) preserving Loop node partial-success output when individual iterations fail, instead of failing the entire Loop step.

Changes:

  • Add bundler-mempool inspection during AA25 nonce-conflict handling to detect missing predecessor UserOps and rewind the nonce cache to the on-chain nonce (#510 recovery path).
  • Update executeLoopWithQueue to only hard-fail the Loop step on infrastructure failures (queue submit errors / timeouts), while keeping per-iteration runner failures as nil entries in the result array.
  • Update Loop test expectations to match the new “partial success preserved” semantics; document Conventional Commits requirement in CLAUDE.md.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
pkg/erc4337/preset/builder.go Detects nonce-cache desync via bundler mempool and rewinds to on-chain nonce to avoid stuck rebuild loops.
core/taskengine/vm.go Changes Loop failure semantics to preserve partial results unless infrastructure fails.
core/taskengine/manual_trigger_loop_test.go Updates test assertions to reflect partial-success Loop behavior.
CLAUDE.md Adds repo guidance requiring Conventional Commits PR titles for semantic-release.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/erc4337/preset/builder.go Outdated
// UserOp exists for this sender at a nonce in [on_chain, userOp.Nonce),
// then case (b) holds — rewind the cache to the on-chain nonce.
predecessorPending := false
if pendingOps, mempoolErr := bundlerClient.GetPendingUserOpsForSender(context.Background(), entrypoint, userOp.Sender); mempoolErr == nil {
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetPendingUserOpsForSender is called with context.Background() inside the retry loop. If the bundler RPC hangs (or debug methods are slow), this can stall sendUserOpCore indefinitely and block user requests. Consider using a context with a short timeout (or a request-scoped context passed into sendUserOpCore) for mempool inspection so retries remain bounded.

Suggested change
if pendingOps, mempoolErr := bundlerClient.GetPendingUserOpsForSender(context.Background(), entrypoint, userOp.Sender); mempoolErr == nil {
mempoolCtx, cancelMempool := context.WithTimeout(context.Background(), 3*time.Second)
pendingOps, mempoolErr := bundlerClient.GetPendingUserOpsForSender(mempoolCtx, entrypoint, userOp.Sender)
cancelMempool()
if mempoolErr == nil {

Copilot uses AI. Check for mistakes.
Comment on lines +1183 to +1187
} else {
l.Warn("Failed to inspect bundler mempool while diagnosing nonce conflict; assuming predecessor pending",
"error", mempoolErr)
predecessorPending = true
}
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If mempool inspection fails (mempoolErr != nil), the code assumes a predecessor is pending and falls back to the old “wait for prior UserOps” behavior. This can reintroduce the stuck loop from #510 when the bundler doesn’t expose debug_bundler_dumpMempool (or it’s temporarily unavailable). Consider a safer fallback (e.g., after N retries, rewind to onChainNonce, or at least log that recovery is disabled without debug RPC).

Copilot uses AI. Check for mistakes.
Comment on lines +1158 to +1167
// On-chain nonce is behind our attempted nonce. Two possibilities:
// (a) Prior UserOps at lower nonces are genuinely pending in the bundler
// mempool and just haven't mined yet → wait for them.
// (b) The cache is stale: a previous UserOp was dropped from the bundler
// mempool without the cache being invalidated, so there is nothing
// at on_chain..userOp.Nonce-1 to ever mine. Waiting is futile and
// leads to the stuck loop described in issue #510.
// Distinguish the two by inspecting the bundler mempool. If no predecessor
// UserOp exists for this sender at a nonce in [on_chain, userOp.Nonce),
// then case (b) holds — rewind the cache to the on-chain nonce.
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new desync-recovery branch (detecting no predecessor UserOp in the bundler mempool and rewinding the nonce cache) isn’t covered by existing tests in the preset package. Given how failure-prone nonce-handling is, please add a focused test that exercises both cases: (1) predecessor pending → wait, (2) predecessor absent → ResetNonce + use onChainNonce for rebuild (ideally with a stubbed bundler client / extracted helper).

Copilot uses AI. Check for mistakes.
Comment on lines +430 to 436
// Per AvaProtocol/EigenLayer-AVS#511, per-iteration runner failures do not
// fail the loop step. The loop ran to completion, so step.Success is true
// and the per-iteration outcomes are reflected in the data array (nil for
// failed iterations).
require.NotNil(t, step, "Execution step should not be nil")
assert.False(t, step.Success, "Loop should report failure when an iteration fails")
assert.Contains(t, step.Error, "invalid address", "Error should mention invalid address")
assert.True(t, step.Success, "Loop step should succeed when it ran to completion, even if an iteration failed")

Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test no longer asserts that the per-iteration failure is observable anywhere (e.g., in step.Log or nested iteration step metadata). Since the loop now reports Success=true on per-iteration runner errors, it’s important to still validate that the failure details are surfaced for debugging/clients (for example: step.Log contains the invalid address error and/or the failed iteration step is marked unsuccessful).

Copilot uses AI. Check for mistakes.
- Bound mempool inspection RPC with a 5s timeout so a hung bundler cannot stall the retry loop.
- Drop misleading decimal-fallback nonce parser; bundlers return hex per EIP-4337. Log skipped unparseable nonces at debug level.
- Assert step.Error is empty in the Loop partial-success test so the no-error contract is locked in.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants