fix(l1): retry transient chain ID errors instead of shutting down the node#3771
fix(l1): retry transient chain ID errors instead of shutting down the node#3771Ehsan-saradar wants to merge 2 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR changes the L1 client’s startup chain-ID verification so transient eth_chainId failures (rate limits/timeouts/unresponsive RPC) are retried instead of returning an error that shuts down the entire node, while keeping genuine L1/L2 network mismatches fatal.
Changes:
- Add
verifyChainIDwith an unbounded retry loop for transient chain-ID probe failures and a typed fatalchainIDMismatchError. - Make
checkChainIDsingle-attempt and updateRunto callverifyChainID(whileCatchUpL1Headcontinues to usecheckChainIDfor fail-fast behavior). - Update/add tests to validate retry-on-transient behavior and keep fail-fast assertions on the
CatchUpL1Headpath.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| l1/l1.go | Introduces retrying chain-ID verification in Run, keeps mismatches fatal via a typed error, and makes checkChainID single-shot for migration path. |
| l1/l1_test.go | Re-targets existing chain-ID failure tests to CatchUpL1Head and adds a regression test ensuring transient chain-ID errors don’t shut down the node. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| c.logger.Warn("Failed to verify L1 chain ID; retrying", | ||
| zap.Duration("tryAgainIn", c.resubscribeDelay), | ||
| zap.Error(err), | ||
| ) |
There was a problem hiding this comment.
Good catch — fixed. The Warn now logs only the high-level message; the detailed error (which can include the L1 URL/API key) is moved to Debug, matching the existing convention in the resubscribe loop a few lines below.
… node The L1 client ran as a node service, and node.StartService brings the whole node down whenever a service's Run returns. checkChainID returned transient RPC failures (rate limits, timeouts, an unresponsive endpoint) straight out of Run, so a single eth_chainId hiccup on a free-tier provider — e.g. "daily request count exceeded, request rate limited" — killed the node. This also violated the service contract, which asks services to log and retry non-critical errors rather than return them. Add verifyChainID, used only by Run, which retries transient failures with the client's existing resubscribe-delay loop (matching subscribeToUpdates and finalisedHeight), demoting them to warnings until the check succeeds or the context is cancelled. Retries are unbounded by design: a cap would just reintroduce the shutdown. A genuine chain-ID mismatch is a misconfiguration and stays fatal, now modelled as a typed chainIDMismatchError so the retry loop can tell it apart from transient errors. checkChainID stays single-attempt because the one-shot CatchUpL1Head migration path must keep failing fast. Fixes NethermindEth#1385
de6ed75 to
d0210c0
Compare
| func (c *Client) Run(ctx context.Context) error { | ||
| defer c.l1.Close() | ||
| if err := c.checkChainID(ctx); err != nil { | ||
| if err := c.verifyChainID(ctx); err != nil { | ||
| return err | ||
| } | ||
|
|
There was a problem hiding this comment.
Good catch — added an early ctx.Err() check after verifyChainID so we return cleanly on shutdown instead of starting catch-up.
What & why
A transient L1 RPC error during the startup chain-ID check would take the whole node down. Per #1385:
eth_chainIdreturnsdaily request count exceeded, request rate limited, the L1 client'sRunreturns that error, and sincenode.StartServiceruns each service withdefer cancel(), any service'sRunreturning shuts the node down. This also violated theservice.Servicecontract, which asks services to log and retry non-critical errors rather than return them.What this changes
verifyChainID(used only byRun) retries transient chain-ID failures — rate limits, timeouts, an unresponsive node — as warnings, reusing the client's existing resubscribe-delay loop (same pattern assubscribeToUpdatesandfinalisedHeight), until the check passes orctxis cancelled. Retries are unbounded on purpose: a cap would just return an error and re-trigger the shutdown this fixes.chainIDMismatchError, same user-facing message).checkChainIDstays single-attempt — the one-shotCatchUpL1Headmigration path relies on it failing fast.After this change no transient L1 RPC error can shut the node down:
catchUpL1HeadUpdatesis already best-effort,subscribeToUpdates/finalisedHeightalready retry forever, and the only remaining fatal paths are a genuine mismatch and a local DB-write failure insetL1Head— both legitimately critical.Relationship to #3685
Complementary, not overlapping: #3685 makes
StartServicetolerate a service returningnil, but an error return still stops the node. Transient chain-ID failures are errors, so the L1 client must stop emitting them regardless. No dependency on #3685.Tests
go test ./l1/passes. NewTestTransientChainIDErrorDoesNotShutDownNodereproduces the issue's rate-limit scenario and asserts the node retries instead of shutting down (fails on the old code).TestChainIDCheckTimeout/TestChainIDFetchErrorwere re-pointed at the fail-fastCatchUpL1Headpath (same error-message assertions); the mismatch tests are unchanged.Fixes #1385