Skip to content

fix: swallow lighthouse head 404 noise on transient REST API race#713

Closed
barnabasbusa wants to merge 1 commit into
masterfrom
bbusa/fix-lighthouse-head-404-noise
Closed

fix: swallow lighthouse head 404 noise on transient REST API race#713
barnabasbusa wants to merge 1 commit into
masterfrom
bbusa/fix-lighthouse-head-404-noise

Conversation

@barnabasbusa
Copy link
Copy Markdown
Collaborator

Summary

Lighthouse occasionally emits SSE head (and sometimes block) events for a block root that is not yet queryable via its own REST API. The block always lands on the REST endpoint a moment later — I verified this on a running kurtosis devnet: every previously-404'd root I checked (slots 53, 54, 57, 63, 70, 71, 78) returns HTTP 200 with canonical: true, finalized: true from the same Lighthouse instance that 404'd it.

But the indexer was surfacing every one of these as:

ERROR  failed processing head 70 (0xb9b4…): GET failed with status 404: NOT_FOUND: beacon block with root 0xb9b4…  client=cl-3-lighthouse-ethrex

which is misleading (looks operational, isn't) and frequent on multi-client devnets with multiple Lighthouse nodes.

What this PR does

  • Detects the 404 in GetBlockHeaderByBlockroot using the same strings.HasPrefix(err.Error(), "GET failed with status 404") check that GetBlockHeaderBySlot and GetBlockBodyByBlockroot already use, so the lookup path matches the rest of the 404 swallows in this file.
  • Wraps the 404 in a typed rpc.ErrBlockNotFound sentinel so callers can distinguish "transient race, will retry" from a genuine RPC failure.
  • In processBlockEvent / processHeadEvent (indexer/beacon/client.go), demotes the log line to Debugf when the error is ErrBlockNotFound. Anything else still logs at Errorf.

Why a typed error and not bare (nil, nil)

The sibling methods return (nil, nil) directly, but their callers explicitly check for nil. EnsureHeader's contract is "non-nil header iff nil error" — returning (nil, nil) would leave the block-cache entry in a corrupt state (header channel closed with a nil header). The sentinel keeps that invariant and just lets the next block or head event for the same root re-trigger the fetch (which always succeeds, your logs prove it).

Test plan

  • go build ./... is clean
  • Run dora against the current kurtosis devnet with multiple Lighthouse nodes and confirm the failed processing head ...: 404 lines no longer surface at error level (they should appear at debug only)
  • Confirm head/block tracking still keeps up after the change (canonical head computation complete events continue arriving in the seconds following a debug-level "not yet queryable" log)

Out of scope

GetBlockBodyByBlockroot already swallows 404 to (nil, nil), and EnsureBlock / setBlockIndex will happily store that nil — almost certainly a latent bug but pre-existing, and worth a separate PR.

Lighthouse occasionally emits SSE head/block events for a root that
is not yet queryable via the REST API. The block always becomes
available a moment later (verified: previously-404'd roots return
canonical/finalized on the same node), but the surfaced "failed
processing head ...: 404 NOT_FOUND" was logged at error level and
showed up frequently in multi-client devnets.

Match the 404-swallow pattern already used in GetBlockHeaderBySlot
and GetBlockBodyByBlockroot: detect the 404 in
GetBlockHeaderByBlockroot and wrap it in a typed rpc.ErrBlockNotFound
sentinel. Demote the log line in processBlockEvent / processHeadEvent
to debug for that sentinel; real errors still log at error.

A sentinel is used (rather than a plain nil-return) because
EnsureHeader's contract is "non-nil header iff nil error" — the
block cache would otherwise be left in a corrupt state.
@qu0b-reviewer
Copy link
Copy Markdown

qu0b-reviewer Bot commented May 22, 2026

🤖 qu0b-reviewer

Summary

The PR introduces a sentinel error ErrBlockNotFound from GetBlockHeaderByBlockroot and demotes it from error to debug log at the SSE event handlers, allowing the event stream to self-heal when Lighthouse's SSE and REST APIs race. The approach is sound for the intended fix, but there are some call-sites with subtle error-handling gaps worth noting.

Issues

  • 🟡 indexer/beacon/requests.go:36LoadBeaconHeader dereferences header.Header without a nil guard. Before this PR, GetBlockHeaderByBlockroot never returned (nil, non-nil err), so this was safe. After this PR it can: the next Lighthouse race would cause a nil dereference panic. Fix: add if header == nil { return nil, ErrBlockNotFound } (or wrap with sentinel) before line 36, then update callers at client.go:570 and blockcache.go:442,477 accordingly.
  • 🟡 indexer/beacon/client.go:570-571 — the parentHead == nil branch calls LoadBeaconHeader and only checks headerRsp == nil after. If the 404-path now returns an error (per the linked issue above), this caller will propagate it upward instead of falling through to the nil path. Confirmed: LoadBeaconHeader currently can't propagate ErrBlockNotFound because GetBlockHeaderByBlockroot never returned it — but once you add it (fixing the nil deref), this caller's error-gate stops working.
  • 🟡 clienst/consensus/rpc/beaconapi.goGetBlockHeaderByBlockroot now returns ErrBlockNotFound wrapped, while the adjacent GetBlockHeaderBySlot still silently returns (nil, nil) on 404. This asymmetry isn't motivated in the PR and makes it hard to reason about which callers need to handle the 404 case. Consider applying the same sentinel to GetBlockHeaderBySlot for consistency.

Suggestions

  • 🟢 clients/consensus/rpc/beaconapi.go:364strings.HasPrefix on err.Error() is fragile; if the error string format changes upstream (in the HTTP client or eth2-client library), this silently stops working. A better check would validate the response type in the underlying provider or use a status-code-aware wrapper. Acceptable as a pragmatic workaround, but add a test or at minimum a comment listing the expected error string format.

Reviewed @ 94cbb207
"Cheaper to fly to Vietnam than to fix it here."

@barnabasbusa
Copy link
Copy Markdown
Collaborator Author

sigp/lighthouse#9338 should fix this, and then we won't need to deal with this on dora's side.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant