Skip to content

fix(node): keep node running when a service exits without error#3770

Open
Ehsan-saradar wants to merge 7 commits into
NethermindEth:mainfrom
Ehsan-saradar:fix/3685-clean-service-exit
Open

fix(node): keep node running when a service exits without error#3770
Ehsan-saradar wants to merge 7 commits into
NethermindEth:mainfrom
Ehsan-saradar:fix/3685-clean-service-exit

Conversation

@Ehsan-saradar

Copy link
Copy Markdown
Contributor

Closes #3685.

StartService deferred cancel() on every service return, so a service that finished its work and returned nil would tear the whole node down. That forced short-lived services to artificially block on <-ctx.Done() just to stay alive.

Now the node is only shut down when a service returns an error or panics. A clean return lets the service exit on its own; if it happens before shutdown was requested (unexpected for a long-running service) we log a warning instead of stopping the node.

Added tests covering the three cases: error, panic, and clean exit.

StartService deferred cancel() on every service return, so a service that
finished its work and returned nil would shut the whole node down. That forced
short-lived services to artificially block on <-ctx.Done() just to stay alive.

Only tear the node down when a service returns an error or panics. A clean
return now lets the service exit on its own; if it returns before shutdown was
requested (unexpected for a long-running service) we log a warning.

Closes NethermindEth#3685
Copilot AI review requested due to automatic review settings June 28, 2026 07:02

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR changes node service lifecycle handling so a service that returns nil no longer cancels the node-wide context (preventing unintended full node shutdown). Instead, the node is only canceled on service error or panic; clean early exits are logged as warnings.

Changes:

  • Update Node.StartService to cancel the node context only on service error or panic (and re-panic to propagate panics).
  • Add logging for clean service exits (warn if the node wasn’t already shutting down; debug otherwise).
  • Add unit tests covering error, panic, and clean-exit behaviors.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
node/node.go Adjusts StartService cancellation/panic behavior and adds clean-exit logging.
node/start_service_test.go Adds tests validating shutdown behavior for error, panic, and clean service exits.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread node/node.go Outdated
// re-panic so the WaitGroup can propagate it to the caller.
defer func() {
if r := recover(); r != nil {
n.logger.Error("Service panicked, shutting down node", zap.String("name", name), zap.Any("panic", r))

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. The stack is now captured with debug.Stack() inside the recover, before the re-panic — at that point gopanic is still running our deferred func on top of the original panicking frames, so it points at the real panic site rather than the wg.Wait() re-panic. Added as a stack field on the log entry in 242f665.

Comment thread node/node.go
Comment on lines +759 to +763
// The service returned without an error. Let it exit on its own without
// taking the rest of the node down. A clean return before shutdown was
// requested is unexpected for a long-running service, so flag it.
if ctx.Err() == nil {
n.logger.Warn("Service stopped before node shutdown was requested", zap.String("name", name))

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Updated the service.Service doc to drop "Services should be blocking" and spell out the contract instead: a long-running service blocks until ctx is cancelled, a short-lived service may return nil once its work is done without shutting the node down, and an error/panic triggers a node-wide shutdown. Done in 242f665.

- Capture debug.Stack() at the recover site so the panic log points at the
  original panic location instead of the WaitGroup re-panic.
- Update the service.Service doc to reflect that a short-lived service may
  return nil cleanly without shutting the node down.

@rodrodros rodrodros left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution

Comment thread node/start_service_test.go Outdated
@@ -0,0 +1,97 @@
package node

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use package node_test. It is important test node behaviour. The _test suffix forces you to only have acess to the public API when writing tests.

Note that this doesn't mean you should change the public / private API and keep the tests the same but you need to update the tests so they are relevant but only through the use of public API.

I believe after changing the package to _test the modifications to the test file should be minimal.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — moved to package node_test. Used a small export_test.go seam (NewWithLogger) so the external test can build a logger-only Node without widening the public API.

Comment thread service/service.go
Comment thread node/node.go Outdated
Comment thread node/node.go Outdated
Comment thread node/node.go
Comment thread node/node_service_test.go
Comment thread node/node_service_test.go
Comment on lines +20 to +41
// TestStartService covers the three ways a service's Run can return and the
// expected effect on the node-wide context:
// - error -> shut the node down
// - panic -> shut the node down and propagate the panic
// - nil -> let the service exit without taking the node down
func TestStartService(t *testing.T) {
newNode := func() *Node { return &Node{logger: log.NewNopZapLogger()} }

t.Run("error shuts the node down", func(t *testing.T) {
n := newNode()
wg := conc.NewWaitGroup()
ctx, cancel := context.WithCancel(t.Context())
defer cancel()

failing := &fakeService{run: func(context.Context) error {
return errors.New("boom")
}}
n.StartService(wg, ctx, cancel, failing)
wg.Wait()

require.Error(t, ctx.Err(), "a service error should shut the node down")
})

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This tests interact with the services but they don't check anything, just that they node stopped.

Please use the zap.Observer utility and then check that the right final error logs where logged, otherwise you've no way of checking node behaviour.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — now asserting on the emitted logs via zaptest/observer: Service error on the error path, and the early-exit warning on a clean return.

Comment thread node/start_service_test.go Outdated
@codecov

codecov Bot commented Jun 29, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 66.66667% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.23%. Comparing base (62d2222) to head (39f6e6e).
⚠️ Report is 16 commits behind head on main.

Files with missing lines Patch % Lines
node/node.go 66.66% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3770      +/-   ##
==========================================
+ Coverage   75.07%   75.23%   +0.16%     
==========================================
  Files         427      433       +6     
  Lines       38532    38675     +143     
==========================================
+ Hits        28928    29099     +171     
+ Misses       7608     7574      -34     
- Partials     1996     2002       +6     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Ehsan-saradar Ehsan-saradar marked this pull request as draft June 29, 2026 10:22

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Comment thread node/node.go Outdated
Comment on lines +750 to +751
n.logger.Error("Service panicked, shutting down node",
zap.String("name", name), zap.Any("panic", r), zap.ByteString("stack", debug.Stack()))

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is moot now — the panic log was dropped entirely in a later commit (we agreed it shouldn't ever fire), so there's no more debug.Stack() to encode. The recover path just cancels and re-panics, and the WaitGroup surfaces the original stack to the caller.

Drop the panic log and manual debug.Stack() capture: the WaitGroup
re-panics with the original stack, so logging it here is redundant. Trim
the "shutting down node" suffix from the error log (the node already
announces its own shutdown), and shorten the Service.Run contract doc.
Move the test to package node_test and assert the emitted logs with
zaptest/observer instead of only checking the cancelled context. Add an
export_test.go seam so the external test can build a logger-only Node
without widening the public API.
@Ehsan-saradar Ehsan-saradar marked this pull request as ready for review June 29, 2026 12:50
Comment thread node/export_test.go

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why do you need this?

Comment thread node/node_service_test.go
// messages StartService emits.
newNode := func() (*node.Node, *observer.ObservedLogs) {
core, logs := observer.New(zapcore.DebugLevel)
return node.NewWithLogger(log.NewZapLoggerWithCore(core)), logs

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cannot you initialize the log here directly with the right logger?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the catch is logger is unexported, so from package node_test I can't set it on the struct directly. The only public way to get a Node with a logger is New(), but that spins up the DB, VM, mempool and all the services — way too heavy for a lifecycle test.

That's all NewWithLogger is for — it lives in export_test.go, so it's compiled only during tests and never becomes part of the package's actual API. It's the standard Go trick for testing from an external _test package while reaching one internal bit, and it's what keeps the test in node_test without changing the public/private API like you asked.

Happy to go the New() route instead if you'd rather, but it'd pull a lot of unrelated setup into the test. Let me know what you prefer.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes no NewWithLogger please, use the available public API to do your tests

Comment thread node/node_service_test.go Outdated

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Comment thread node/node.go
Comment on lines 738 to +740
wg.Go(func() {
// Immediately acknowledge panicing services by shutting down the node
// Without the deffered cancel(), we would have to wait for user to hit Ctrl+C
defer cancel()
name := reflect.TypeOf(s).String()

Comment thread service/service.go
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Ehsan-saradar <ehsan.saradar@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

node: service closing without error shouldn't shut down the whole node

3 participants