Skip to content

fix(l1): retry transient chain ID errors instead of shutting down the node#3771

Open
Ehsan-saradar wants to merge 2 commits into
NethermindEth:mainfrom
Ehsan-saradar:fix/1385-l1-no-shutdown-on-error
Open

fix(l1): retry transient chain ID errors instead of shutting down the node#3771
Ehsan-saradar wants to merge 2 commits into
NethermindEth:mainfrom
Ehsan-saradar:fix/1385-l1-no-shutdown-on-error

Conversation

@Ehsan-saradar

@Ehsan-saradar Ehsan-saradar commented Jun 28, 2026

Copy link
Copy Markdown
Contributor

What & why

A transient L1 RPC error during the startup chain-ID check would take the whole node down. Per #1385: eth_chainId returns daily request count exceeded, request rate limited, the L1 client's Run returns that error, and since node.StartService runs each service with defer cancel(), any service's Run returning shuts the node down. This also violated the service.Service contract, which asks services to log and retry non-critical errors rather than return them.

What this changes

  • Retry transient failures. A new verifyChainID (used only by Run) retries transient chain-ID failures — rate limits, timeouts, an unresponsive node — as warnings, reusing the client's existing resubscribe-delay loop (same pattern as subscribeToUpdates and finalisedHeight), until the check passes or ctx is cancelled. Retries are unbounded on purpose: a cap would just return an error and re-trigger the shutdown this fixes.
  • Keep genuine mismatches fatal. A real L1/L2 network mismatch is a misconfiguration retrying can't fix, so it stays fatal (now a typed chainIDMismatchError, same user-facing message).
  • checkChainID stays single-attempt — the one-shot CatchUpL1Head migration path relies on it failing fast.

After this change no transient L1 RPC error can shut the node down: catchUpL1HeadUpdates is already best-effort, subscribeToUpdates/finalisedHeight already retry forever, and the only remaining fatal paths are a genuine mismatch and a local DB-write failure in setL1Head — both legitimately critical.

Relationship to #3685

Complementary, not overlapping: #3685 makes StartService tolerate a service returning nil, but an error return still stops the node. Transient chain-ID failures are errors, so the L1 client must stop emitting them regardless. No dependency on #3685.

Tests

go test ./l1/ passes. New TestTransientChainIDErrorDoesNotShutDownNode reproduces the issue's rate-limit scenario and asserts the node retries instead of shutting down (fails on the old code). TestChainIDCheckTimeout / TestChainIDFetchError were re-pointed at the fail-fast CatchUpL1Head path (same error-message assertions); the mismatch tests are unchanged.

Fixes #1385

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR changes the L1 client’s startup chain-ID verification so transient eth_chainId failures (rate limits/timeouts/unresponsive RPC) are retried instead of returning an error that shuts down the entire node, while keeping genuine L1/L2 network mismatches fatal.

Changes:

  • Add verifyChainID with an unbounded retry loop for transient chain-ID probe failures and a typed fatal chainIDMismatchError.
  • Make checkChainID single-attempt and update Run to call verifyChainID (while CatchUpL1Head continues to use checkChainID for fail-fast behavior).
  • Update/add tests to validate retry-on-transient behavior and keep fail-fast assertions on the CatchUpL1Head path.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
l1/l1.go Introduces retrying chain-ID verification in Run, keeps mismatches fatal via a typed error, and makes checkChainID single-shot for migration path.
l1/l1_test.go Re-targets existing chain-ID failure tests to CatchUpL1Head and adds a regression test ensuring transient chain-ID errors don’t shut down the node.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread l1/l1.go
Comment on lines +188 to +191
c.logger.Warn("Failed to verify L1 chain ID; retrying",
zap.Duration("tryAgainIn", c.resubscribeDelay),
zap.Error(err),
)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — fixed. The Warn now logs only the high-level message; the detailed error (which can include the L1 URL/API key) is moved to Debug, matching the existing convention in the resubscribe loop a few lines below.

… node

The L1 client ran as a node service, and node.StartService brings the whole
node down whenever a service's Run returns. checkChainID returned transient
RPC failures (rate limits, timeouts, an unresponsive endpoint) straight out of
Run, so a single eth_chainId hiccup on a free-tier provider — e.g. "daily
request count exceeded, request rate limited" — killed the node. This also
violated the service contract, which asks services to log and retry
non-critical errors rather than return them.

Add verifyChainID, used only by Run, which retries transient failures with the
client's existing resubscribe-delay loop (matching subscribeToUpdates and
finalisedHeight), demoting them to warnings until the check succeeds or the
context is cancelled. Retries are unbounded by design: a cap would just
reintroduce the shutdown. A genuine chain-ID mismatch is a misconfiguration and
stays fatal, now modelled as a typed chainIDMismatchError so the retry loop can
tell it apart from transient errors.

checkChainID stays single-attempt because the one-shot CatchUpL1Head migration
path must keep failing fast.

Fixes NethermindEth#1385

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comment thread l1/l1.go
Comment on lines 224 to 229
func (c *Client) Run(ctx context.Context) error {
defer c.l1.Close()
if err := c.checkChainID(ctx); err != nil {
if err := c.verifyChainID(ctx); err != nil {
return err
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — added an early ctx.Err() check after verifyChainID so we return cleanly on shutdown instead of starting catch-up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

L1 Verifier shouldn't shutdown the node when it encounters an error

2 participants