fix: swallow lighthouse head 404 noise on transient REST API race#713
Closed
barnabasbusa wants to merge 1 commit into
Closed
fix: swallow lighthouse head 404 noise on transient REST API race#713barnabasbusa wants to merge 1 commit into
barnabasbusa wants to merge 1 commit into
Conversation
Lighthouse occasionally emits SSE head/block events for a root that is not yet queryable via the REST API. The block always becomes available a moment later (verified: previously-404'd roots return canonical/finalized on the same node), but the surfaced "failed processing head ...: 404 NOT_FOUND" was logged at error level and showed up frequently in multi-client devnets. Match the 404-swallow pattern already used in GetBlockHeaderBySlot and GetBlockBodyByBlockroot: detect the 404 in GetBlockHeaderByBlockroot and wrap it in a typed rpc.ErrBlockNotFound sentinel. Demote the log line in processBlockEvent / processHeadEvent to debug for that sentinel; real errors still log at error. A sentinel is used (rather than a plain nil-return) because EnsureHeader's contract is "non-nil header iff nil error" — the block cache would otherwise be left in a corrupt state.
🤖 qu0b-reviewerSummaryThe PR introduces a sentinel error Issues
Suggestions
Reviewed @ |
Collaborator
Author
|
sigp/lighthouse#9338 should fix this, and then we won't need to deal with this on dora's side. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Lighthouse occasionally emits SSE
head(and sometimesblock) events for a block root that is not yet queryable via its own REST API. The block always lands on the REST endpoint a moment later — I verified this on a running kurtosis devnet: every previously-404'd root I checked (slots 53, 54, 57, 63, 70, 71, 78) returnsHTTP 200withcanonical: true, finalized: truefrom the same Lighthouse instance that 404'd it.But the indexer was surfacing every one of these as:
which is misleading (looks operational, isn't) and frequent on multi-client devnets with multiple Lighthouse nodes.
What this PR does
GetBlockHeaderByBlockrootusing the samestrings.HasPrefix(err.Error(), "GET failed with status 404")check thatGetBlockHeaderBySlotandGetBlockBodyByBlockrootalready use, so the lookup path matches the rest of the 404 swallows in this file.rpc.ErrBlockNotFoundsentinel so callers can distinguish "transient race, will retry" from a genuine RPC failure.processBlockEvent/processHeadEvent(indexer/beacon/client.go), demotes the log line toDebugfwhen the error isErrBlockNotFound. Anything else still logs atErrorf.Why a typed error and not bare
(nil, nil)The sibling methods return
(nil, nil)directly, but their callers explicitly check for nil.EnsureHeader's contract is "non-nil header iff nil error" — returning(nil, nil)would leave the block-cache entry in a corrupt state (header channel closed with a nil header). The sentinel keeps that invariant and just lets the nextblockorheadevent for the same root re-trigger the fetch (which always succeeds, your logs prove it).Test plan
go build ./...is cleanfailed processing head ...: 404lines no longer surface at error level (they should appear at debug only)canonical head computation completeevents continue arriving in the seconds following a debug-level "not yet queryable" log)Out of scope
GetBlockBodyByBlockrootalready swallows 404 to(nil, nil), andEnsureBlock/setBlockIndexwill happily store that nil — almost certainly a latent bug but pre-existing, and worth a separate PR.