fix(op-node): stall consolidate on NotFound during EL sync #21004

Draft
ajsutton wants to merge 2 commits into develop from aj/fix/el-sync-consolidate-no-reset

ajsutton (Contributor) commented May 25, 2026

When op-node receives a future unsafe payload via gossip during EL sync, the EL accepts it as a sync target and returns SYNCING. Derivation then tries to consolidate the next safe block against the unsafe chain, but the EL doesn't have pending_safe+1 yet, so PayloadByNumber returns NotFound.

Previously this triggered a derivation pipeline reset. The reset's forkchoiceUpdated re-targets the EL away from its in-flight sync target, and the next gossiped payload moves it back — producing a thrash loop where EL sync never finishes and local safe never advances.

When the engine reports IsEngineInitialELSyncing, this PR treats NotFound at pending_safe+1 as a transient condition and emits EngineTemporaryErrorEvent instead of ResetEvent. The attributes stay queued; consolidation retries on the next pending-safe poke, by which point EL sync should have filled in the block. Outside EL sync, NotFound still triggers the existing reset.

Risk

One thing I'm still worried about: suppose we are the sequencer and a reset affects the network as a whole (e.g. an invalid interop message). We could then receive a gossip message from before our reset echoed back to us, but EL sync would be unable to complete because all nodes have reorged and discarded the chain we're trying to sync. Since we are the sequencer and stuck in sync mode, we would never produce a block on the new chain that lets EL sync switch to it and complete. If EL sync can't complete, we stay in a temporary error forever.

Notes

This is an alternative to #20990, which addressed the same incident by changing insertUnsafePayload to not promote EL-sync targets into unsafeHead. That approach removed the bogus future unsafeHead that caused consolidate to enter this branch, but left a separate risk: once consolidate's pre-state aligned with reality, derivation would freely issue a new-payload FCU with the derived block as Head, again interrupting EL sync. Fixing the reset directly addresses the root cause and avoids that risk.

ajsutton added 2 commits May 25, 2026 11:17