fix(op-node): stall consolidate on NotFound during EL sync#21004
Draft
ajsutton wants to merge 2 commits into
Draft
fix(op-node): stall consolidate on NotFound during EL sync#21004ajsutton wants to merge 2 commits into
ajsutton wants to merge 2 commits into
Conversation
When op-node receives a future unsafe payload via gossip during EL sync, the EL accepts it as a sync target and returns SYNCING. Derivation then tries to consolidate the next safe block against the unsafe chain, but the EL doesn't have pending_safe+1 yet, so PayloadByNumber returns NotFound. Previously this triggered a derivation pipeline reset. The reset's forkchoiceUpdated re-targets the EL away from its in-flight sync target, and the next gossiped payload moves it back — producing a thrash loop where EL sync never finishes and local safe never advances. When the engine reports IsEngineInitialELSyncing, treat NotFound at pending_safe+1 as a transient condition and emit EngineTemporaryErrorEvent instead of ResetEvent. The attributes stay queued and consolidation retries on the next pending-safe poke, by which point EL sync should have filled in the block. Outside EL sync, NotFound still triggers the existing reset.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
When op-node receives a future unsafe payload via gossip during EL sync, the EL accepts it as a sync target and returns
SYNCING. Derivation then tries to consolidate the next safe block against the unsafe chain, but the EL doesn't havepending_safe+1yet, soPayloadByNumberreturnsNotFound.Previously this triggered a derivation pipeline reset. The reset's
forkchoiceUpdatedre-targets the EL away from its in-flight sync target, and the next gossiped payload moves it back — producing a thrash loop where EL sync never finishes and local safe never advances.When the engine reports
IsEngineInitialELSyncing, this PR treatsNotFoundatpending_safe+1as a transient condition and emitsEngineTemporaryErrorEventinstead ofResetEvent. The attributes stay queued; consolidation retries on the next pending-safe poke, by which point EL sync should have filled in the block. Outside EL sync,NotFoundstill triggers the existing reset.Risk
One thing I'm still worried about with this is that if we are the sequencer and there's been a reset that affects the network as a whole (e.g. invalid interop message) then it's possible that we got a gossip message from before we reset echo'd back to us but the EL sync won't be able to complete because all nodes have reorged and discarded the chain we're trying to sync. And since we are the sequencer and are stuck in sync mode, we'll never produce a block on the new chain that lets EL sync switch to it and complete. If EL sync can't complete we wind up in a temporary error forever.
Notes
This is an alternative to #20990, which addressed the same incident by changing
insertUnsafePayloadto not promote EL-sync targets intounsafeHead. That approach removed the bogus futureunsafeHeadthat was causing consolidate to enter this branch — but left a separate risk: once consolidate's pre-state aligned with reality, derivation would freely issue a new-payload FCU with the derived block as Head, again interrupting EL sync. Fixing the reset directly is the root cause and avoids that risk.