fix: nonce cache desync recovery and Loop partial-success preservation by chrisli30 · Pull Request #512 · AvaProtocol/EigenLayer-AVS

chrisli30 · 2026-04-09T02:01:40Z

Summary

Aggregator paymaster nonce cache desyncs from on-chain state, blocking all subsequent UserOps for the affected wallet until restart #510 fix: paymaster nonce cache rewinds to on-chain nonce when the bundler mempool has no predecessor UserOp, instead of waiting forever for a dropped UserOp to mine. Eliminates the stuck loop that previously required an aggregator restart.
Loop fix: preserves partial-success results when a per-iteration failure occurs in a Loop node.

Test plan

`go test ./pkg/erc4337/preset/...`
`go run scripts/compare_storage_structure.go main` — no breaking storage changes
Reproduce Aggregator paymaster nonce cache desyncs from on-chain state, blocking all subsequent UserOps for the affected wallet until restart #510 against dev aggregator after deploy

🤖 Generated with Claude Code

Loop runner was discarding its partial-success results array on the error return path: any per-iteration runner failure caused the step to return a non-nil error, which the response builder treats as "drop OutputData", so the SDK saw data: null instead of [null, {...}]. Distinguish infrastructure failures (queue submit errors, iteration timeouts) from per-iteration runner failures. Only the former fail the loop step; the latter leave results[i] = nil as a partial-success placeholder and the loop returns success with the array preserved. Fixes #511.

…ssor (#510) When on_chain_nonce < userOp.Nonce and the bundler returned AA25, the conflict handler always assumed a predecessor UserOp would eventually mine. If the predecessor had been dropped from the bundler mempool, the cache stayed stale, the rebuild loop kept reusing the same nonce, and every subsequent paymaster-sponsored UserOp for that wallet failed until the aggregator was restarted. Now inspect the bundler mempool: if no pending UserOp exists in [on_chain, userOp.Nonce, reset the nonce cache and rewind to the on-chain nonce so the rebuild self-heals. EOF )

claude · 2026-04-09T02:06:18Z

Code Review

Overall this is a well-motivated pair of fixes with clear comments and good separation of infra vs. per-iteration failures. A few items worth addressing before merge.

1. No test coverage for the nonce desync recovery path (`builder.go`)

The most significant new logic — the mempool inspection + conditional ResetNonce call — has no automated test. The test plan says go test ./pkg/erc4337/preset/... but that would only catch regressions in existing tests; the three distinct cases in the new branch are all untested:

(a) mempool has a predecessor → keep current nonce, sleep + retry
(b) mempool is empty → rewind to on-chain nonce via ResetNonce
(c) mempool API returns an error → conservative fallback (predecessorPending = true)

Since BundlerClient is a concrete struct (not an interface), you'd need either an interface + mock or an integration test to cover this. Given that this path is exercised only on a live bundler, at minimum a comment explaining what test infrastructure is missing and tracking it as a follow-up would be appropriate.

2. `context.Background()` with no timeout for mempool inspection (`builder.go:1169`)

if pendingOps, mempoolErr := bundlerClient.GetPendingUserOpsForSender(context.Background(), entrypoint, userOp.Sender); ...

All other bundlerClient calls in this function already use context.Background(), so this is consistent with the existing pattern — but it's still a concern. If the bundler's RPC hangs on GetPendingUserOpsForSender, the entire retry loop stalls indefinitely on this new call. The existing calls (gas estimation, send) are at least on the hot path where a hang is obvious; this one is in a recovery branch that's harder to observe.

Suggestion: use a short-lived derived context:

mempoolCtx, mempoolCancel := context.WithTimeout(context.Background(), 5*time.Second)
defer mempoolCancel()
pendingOps, mempoolErr := bundlerClient.GetPendingUserOpsForSender(mempoolCtx, entrypoint, userOp.Sender)

3. Silent skip on unparseable nonce with no debug log (`builder.go:1172-1179`)

if _, ok := opNonce.SetString(strings.TrimPrefix(op.Nonce, "0x"), 16); !ok {
    if _, ok := opNonce.SetString(op.Nonce, 10); !ok {
        continue  // silently skipped
    }
}

If a bundler returns a nonce in an unexpected format, it's silently ignored and predecessorPending stays false, which would trigger a spurious ResetNonce. Add a debug log before continue:

l.Debug("failed to parse pending op nonce, skipping", "nonce", op.Nonce, "sender", userOp.Sender.Hex())

4. Nonce parse ambiguity: decimal-looking strings are silently reinterpreted as hex

big.Int.SetString(strings.TrimPrefix("10", "0x"), 16) succeeds — it parses "10" as hex 16, not decimal 10. The decimal fallback is never reached for bare digit strings. In practice EIP-4337 bundlers return "0x..." hex strings so this path won't be hit, but the comment on the fallback block (// Fall back to decimal parse) implies it handles decimal — it only handles strings that are invalid hex. Worth a clarifying comment or just removing the fallback if the only realistic input is "0x...".

5. `success` at `finalizeStep` is always `true` (`vm.go:4853`)

After the change, success is set to false only in the same places that also set infraFailure = true. The early-return at line 4848 handles every case where success == false. So finalizeStep(s, success, ...) at line 4853 is always equivalent to finalizeStep(s, true, ...).

This isn't wrong, but it's subtly misleading — it looks like success could be false here. Consider making the invariant explicit:

// infraFailure cases already returned above; success is always true here.
finalizeStep(s, true, nil, "", log.String())

6. Test assertion gap in `TestLoopNode_ContractWrite_InvalidAddress_PartialFailure`

The test now asserts step.Success == true (correct) but drops the previous step.Error assertion without adding a replacement. Worth explicitly asserting step.Error == "" to lock in the expected behavior that a partial-success loop does not surface an error string:

assert.Empty(t, step.Error, "Loop step error should be empty when loop ran to completion")

Nit: target branch

CLAUDE.md says feature branches target staging, never main directly — but this PR is the staging → main release PR, which is the documented release flow. No action needed; just noting it's expected.

Summary

#	Severity	File	Issue
1	Medium	`builder.go`	No test for nonce desync recovery (3 branches uncovered)
2	Medium	`builder.go:1169`	Unbounded `context.Background()` for mempool inspection
3	Low	`builder.go:1179`	Silent skip on unparseable nonce, no debug log
4	Low	`builder.go:1172`	Decimal-looking nonces misinterpreted as hex
5	Low	`vm.go:4853`	`success` is always `true` at final `finalizeStep`; misleading
6	Low	`manual_trigger_loop_test.go`	Missing `step.Error == ""` assertion

The logic is sound — items 1 and 2 are the ones I'd prioritize before shipping.

Copilot

Pull request overview

This PR addresses two reliability issues: (1) recovering from ERC-4337 paymaster nonce-cache desyncs that previously caused repeated rebuild loops, and (2) preserving Loop node partial-success output when individual iterations fail, instead of failing the entire Loop step.

Changes:

Add bundler-mempool inspection during AA25 nonce-conflict handling to detect missing predecessor UserOps and rewind the nonce cache to the on-chain nonce (#510 recovery path).
Update executeLoopWithQueue to only hard-fail the Loop step on infrastructure failures (queue submit errors / timeouts), while keeping per-iteration runner failures as nil entries in the result array.
Update Loop test expectations to match the new “partial success preserved” semantics; document Conventional Commits requirement in CLAUDE.md.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File	Description
pkg/erc4337/preset/builder.go	Detects nonce-cache desync via bundler mempool and rewinds to on-chain nonce to avoid stuck rebuild loops.
core/taskengine/vm.go	Changes Loop failure semantics to preserve partial results unless infrastructure fails.
core/taskengine/manual_trigger_loop_test.go	Updates test assertions to reflect partial-success Loop behavior.
CLAUDE.md	Adds repo guidance requiring Conventional Commits PR titles for semantic-release.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-09T02:08:19Z

+					// UserOp exists for this sender at a nonce in [on_chain, userOp.Nonce),
+					// then case (b) holds — rewind the cache to the on-chain nonce.
+					predecessorPending := false
+					if pendingOps, mempoolErr := bundlerClient.GetPendingUserOpsForSender(context.Background(), entrypoint, userOp.Sender); mempoolErr == nil {


GetPendingUserOpsForSender is called with context.Background() inside the retry loop. If the bundler RPC hangs (or debug methods are slow), this can stall sendUserOpCore indefinitely and block user requests. Consider using a context with a short timeout (or a request-scoped context passed into sendUserOpCore) for mempool inspection so retries remain bounded.

Suggested change

if pendingOps, mempoolErr := bundlerClient.GetPendingUserOpsForSender(context.Background(), entrypoint, userOp.Sender); mempoolErr == nil {

mempoolCtx, cancelMempool := context.WithTimeout(context.Background(), 3*time.Second)

pendingOps, mempoolErr := bundlerClient.GetPendingUserOpsForSender(mempoolCtx, entrypoint, userOp.Sender)

cancelMempool()

if mempoolErr == nil {

Copilot · 2026-04-09T02:08:20Z

+					} else {
+						l.Warn("Failed to inspect bundler mempool while diagnosing nonce conflict; assuming predecessor pending",
+							"error", mempoolErr)
+						predecessorPending = true
+					}


If mempool inspection fails (mempoolErr != nil), the code assumes a predecessor is pending and falls back to the old “wait for prior UserOps” behavior. This can reintroduce the stuck loop from #510 when the bundler doesn’t expose debug_bundler_dumpMempool (or it’s temporarily unavailable). Consider a safer fallback (e.g., after N retries, rewind to onChainNonce, or at least log that recovery is disabled without debug RPC).

Copilot · 2026-04-09T02:08:20Z

+					// On-chain nonce is behind our attempted nonce. Two possibilities:
+					//   (a) Prior UserOps at lower nonces are genuinely pending in the bundler
+					//       mempool and just haven't mined yet → wait for them.
+					//   (b) The cache is stale: a previous UserOp was dropped from the bundler
+					//       mempool without the cache being invalidated, so there is nothing
+					//       at on_chain..userOp.Nonce-1 to ever mine. Waiting is futile and
+					//       leads to the stuck loop described in issue #510.
+					// Distinguish the two by inspecting the bundler mempool. If no predecessor
+					// UserOp exists for this sender at a nonce in [on_chain, userOp.Nonce),
+					// then case (b) holds — rewind the cache to the on-chain nonce.


This new desync-recovery branch (detecting no predecessor UserOp in the bundler mempool and rewinding the nonce cache) isn’t covered by existing tests in the preset package. Given how failure-prone nonce-handling is, please add a focused test that exercises both cases: (1) predecessor pending → wait, (2) predecessor absent → ResetNonce + use onChainNonce for rebuild (ideally with a stubbed bundler client / extracted helper).

Copilot · 2026-04-09T02:08:20Z

+	// Per AvaProtocol/EigenLayer-AVS#511, per-iteration runner failures do not
+	// fail the loop step. The loop ran to completion, so step.Success is true
+	// and the per-iteration outcomes are reflected in the data array (nil for
+	// failed iterations).
 	require.NotNil(t, step, "Execution step should not be nil")
-	assert.False(t, step.Success, "Loop should report failure when an iteration fails")
-	assert.Contains(t, step.Error, "invalid address", "Error should mention invalid address")
+	assert.True(t, step.Success, "Loop step should succeed when it ran to completion, even if an iteration failed")



The test no longer asserts that the per-iteration failure is observable anywhere (e.g., in step.Log or nested iteration step metadata). Since the loop now reports Success=true on per-iteration runner errors, it’s important to still validate that the failure details are surfaced for debugging/clients (for example: step.Log contains the invalid address error and/or the failed iteration step is marked unsuccessful).

- Bound mempool inspection RPC with a 5s timeout so a hung bundler cannot stall the retry loop. - Drop misleading decimal-fallback nonce parser; bundlers return hex per EIP-4337. Log skipped unparseable nonces at debug level. - Assert step.Error is empty in the Loop partial-success test so the no-error contract is locked in.

will-dz added 3 commits April 8, 2026 12:33

docs: update CLAUDE.md for PR title instructions

2730583

chrisli30 temporarily deployed to Test April 9, 2026 02:02 — with GitHub Actions Inactive

chrisli30 requested a review from Copilot April 9, 2026 02:02

Copilot started reviewing on behalf of chrisli30 April 9, 2026 02:03 View session

Copilot AI reviewed Apr 9, 2026

View reviewed changes

chrisli30 had a problem deploying to Test April 9, 2026 02:12 — with GitHub Actions Failure

chrisli30 merged commit b724f1e into main Apr 9, 2026
4 of 18 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: nonce cache desync recovery and Loop partial-success preservation#512

fix: nonce cache desync recovery and Loop partial-success preservation#512
chrisli30 merged 4 commits into
mainfrom
staging

chrisli30 commented Apr 9, 2026

Uh oh!

claude Bot commented Apr 9, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 9, 2026

Uh oh!

Copilot AI Apr 9, 2026

Uh oh!

Copilot AI Apr 9, 2026

Uh oh!

Copilot AI Apr 9, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

-					if pendingOps, mempoolErr := bundlerClient.GetPendingUserOpsForSender(context.Background(), entrypoint, userOp.Sender); mempoolErr == nil {
+					mempoolCtx, cancelMempool := context.WithTimeout(context.Background(), 3*time.Second)
+					pendingOps, mempoolErr := bundlerClient.GetPendingUserOpsForSender(mempoolCtx, entrypoint, userOp.Sender)
+					cancelMempool()
+					if mempoolErr == nil {

Conversation

chrisli30 commented Apr 9, 2026

Summary

Test plan

Uh oh!

claude Bot commented Apr 9, 2026

Code Review

1. No test coverage for the nonce desync recovery path (builder.go)

2. context.Background() with no timeout for mempool inspection (builder.go:1169)

3. Silent skip on unparseable nonce with no debug log (builder.go:1172-1179)

4. Nonce parse ambiguity: decimal-looking strings are silently reinterpreted as hex

5. success at finalizeStep is always true (vm.go:4853)

6. Test assertion gap in TestLoopNode_ContractWrite_InvalidAddress_PartialFailure

Nit: target branch

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

1. No test coverage for the nonce desync recovery path (`builder.go`)

2. `context.Background()` with no timeout for mempool inspection (`builder.go:1169`)

3. Silent skip on unparseable nonce with no debug log (`builder.go:1172-1179`)

5. `success` at `finalizeStep` is always `true` (`vm.go:4853`)

6. Test assertion gap in `TestLoopNode_ContractWrite_InvalidAddress_PartialFailure`