fix(l1): retry transient chain ID errors instead of shutting down the node by Ehsan-saradar · Pull Request #3771 · NethermindEth/juno

Ehsan-saradar · 2026-06-28T12:29:58Z

What & why

A transient L1 RPC error during the startup chain-ID check would take the whole node down. Per #1385: eth_chainId returns daily request count exceeded, request rate limited, the L1 client's Run returns that error, and since node.StartService runs each service with defer cancel(), any service's Run returning shuts the node down. This also violated the service.Service contract, which asks services to log and retry non-critical errors rather than return them.

What this changes

Retry transient failures. A new verifyChainID (used only by Run) retries transient chain-ID failures — rate limits, timeouts, an unresponsive node — as warnings, reusing the client's existing resubscribe-delay loop (same pattern as subscribeToUpdates and finalisedHeight), until the check passes or ctx is cancelled. Retries are unbounded on purpose: a cap would just return an error and re-trigger the shutdown this fixes.
Keep genuine mismatches fatal. A real L1/L2 network mismatch is a misconfiguration retrying can't fix, so it stays fatal (now a typed chainIDMismatchError, same user-facing message).
checkChainID stays single-attempt — the one-shot CatchUpL1Head migration path relies on it failing fast.

After this change no transient L1 RPC error can shut the node down: catchUpL1HeadUpdates is already best-effort, subscribeToUpdates/finalisedHeight already retry forever, and the only remaining fatal paths are a genuine mismatch and a local DB-write failure in setL1Head — both legitimately critical.

Relationship to #3685

Complementary, not overlapping: #3685 makes StartService tolerate a service returning nil, but an error return still stops the node. Transient chain-ID failures are errors, so the L1 client must stop emitting them regardless. No dependency on #3685.

Tests

go test ./l1/ passes. New TestTransientChainIDErrorDoesNotShutDownNode reproduces the issue's rate-limit scenario and asserts the node retries instead of shutting down (fails on the old code). TestChainIDCheckTimeout / TestChainIDFetchError were re-pointed at the fail-fast CatchUpL1Head path (same error-message assertions); the mismatch tests are unchanged.

Fixes #1385

Copilot

Pull request overview

This PR changes the L1 client’s startup chain-ID verification so transient eth_chainId failures (rate limits/timeouts/unresponsive RPC) are retried instead of returning an error that shuts down the entire node, while keeping genuine L1/L2 network mismatches fatal.

Changes:

Add verifyChainID with an unbounded retry loop for transient chain-ID probe failures and a typed fatal chainIDMismatchError.
Make checkChainID single-attempt and update Run to call verifyChainID (while CatchUpL1Head continues to use checkChainID for fail-fast behavior).
Update/add tests to validate retry-on-transient behavior and keep fail-fast assertions on the CatchUpL1Head path.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File	Description
l1/l1.go	Introduces retrying chain-ID verification in `Run`, keeps mismatches fatal via a typed error, and makes `checkChainID` single-shot for migration path.
l1/l1_test.go	Re-targets existing chain-ID failure tests to `CatchUpL1Head` and adds a regression test ensuring transient chain-ID errors don’t shut down the node.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Ehsan-saradar · 2026-06-28T12:42:13Z

+			c.logger.Warn("Failed to verify L1 chain ID; retrying",
+				zap.Duration("tryAgainIn", c.resubscribeDelay),
+				zap.Error(err),
+			)


Good catch — fixed. The Warn now logs only the high-level message; the detailed error (which can include the L1 URL/API key) is moved to Debug, matching the existing convention in the resubscribe loop a few lines below.

… node The L1 client ran as a node service, and node.StartService brings the whole node down whenever a service's Run returns. checkChainID returned transient RPC failures (rate limits, timeouts, an unresponsive endpoint) straight out of Run, so a single eth_chainId hiccup on a free-tier provider — e.g. "daily request count exceeded, request rate limited" — killed the node. This also violated the service contract, which asks services to log and retry non-critical errors rather than return them. Add verifyChainID, used only by Run, which retries transient failures with the client's existing resubscribe-delay loop (matching subscribeToUpdates and finalisedHeight), demoting them to warnings until the check succeeds or the context is cancelled. Retries are unbounded by design: a cap would just reintroduce the shutdown. A genuine chain-ID mismatch is a misconfiguration and stays fatal, now modelled as a typed chainIDMismatchError so the retry loop can tell it apart from transient errors. checkChainID stays single-attempt because the one-shot CatchUpL1Head migration path must keep failing fast. Fixes NethermindEth#1385

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Ehsan-saradar · 2026-06-29T13:12:11Z

 func (c *Client) Run(ctx context.Context) error {
 	defer c.l1.Close()
-	if err := c.checkChainID(ctx); err != nil {
+	if err := c.verifyChainID(ctx); err != nil {
 		return err
 	}



Good catch — added an early ctx.Err() check after verifyChainID so we return cleanly on shutdown instead of starting catch-up.

Copilot AI review requested due to automatic review settings June 28, 2026 12:29

Ehsan-saradar mentioned this pull request Jun 28, 2026

L1 Verifier shouldn't shutdown the node when it encounters an error #1385

Open

Copilot started reviewing on behalf of Ehsan-saradar June 28, 2026 12:30 View session

Copilot AI reviewed Jun 28, 2026

View reviewed changes

Ehsan-saradar force-pushed the fix/1385-l1-no-shutdown-on-error branch from de6ed75 to d0210c0 Compare June 28, 2026 12:42

Ehsan-saradar requested a review from Copilot June 29, 2026 10:14

Copilot started reviewing on behalf of Ehsan-saradar June 29, 2026 10:14 View session

Copilot AI reviewed Jun 29, 2026

View reviewed changes

fix(l1): skip catch-up when ctx cancelled during chain ID verify

1527478

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(l1): retry transient chain ID errors instead of shutting down the node#3771

fix(l1): retry transient chain ID errors instead of shutting down the node#3771
Ehsan-saradar wants to merge 2 commits into
NethermindEth:mainfrom
Ehsan-saradar:fix/1385-l1-no-shutdown-on-error

Ehsan-saradar commented Jun 28, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Ehsan-saradar Jun 28, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Ehsan-saradar Jun 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

Ehsan-saradar commented Jun 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What & why

What this changes

Relationship to #3685

Tests

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Ehsan-saradar Jun 28, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Ehsan-saradar Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Ehsan-saradar commented Jun 28, 2026 •

edited

Loading