fix(node): keep node running when a service exits without error#3770
fix(node): keep node running when a service exits without error#3770Ehsan-saradar wants to merge 7 commits into
Conversation
StartService deferred cancel() on every service return, so a service that finished its work and returned nil would shut the whole node down. That forced short-lived services to artificially block on <-ctx.Done() just to stay alive. Only tear the node down when a service returns an error or panics. A clean return now lets the service exit on its own; if it returns before shutdown was requested (unexpected for a long-running service) we log a warning. Closes NethermindEth#3685
There was a problem hiding this comment.
Pull request overview
This PR changes node service lifecycle handling so a service that returns nil no longer cancels the node-wide context (preventing unintended full node shutdown). Instead, the node is only canceled on service error or panic; clean early exits are logged as warnings.
Changes:
- Update
Node.StartServiceto cancel the node context only on service error or panic (and re-panic to propagate panics). - Add logging for clean service exits (warn if the node wasn’t already shutting down; debug otherwise).
- Add unit tests covering error, panic, and clean-exit behaviors.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
node/node.go |
Adjusts StartService cancellation/panic behavior and adds clean-exit logging. |
node/start_service_test.go |
Adds tests validating shutdown behavior for error, panic, and clean service exits. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // re-panic so the WaitGroup can propagate it to the caller. | ||
| defer func() { | ||
| if r := recover(); r != nil { | ||
| n.logger.Error("Service panicked, shutting down node", zap.String("name", name), zap.Any("panic", r)) |
There was a problem hiding this comment.
Good catch. The stack is now captured with debug.Stack() inside the recover, before the re-panic — at that point gopanic is still running our deferred func on top of the original panicking frames, so it points at the real panic site rather than the wg.Wait() re-panic. Added as a stack field on the log entry in 242f665.
| // The service returned without an error. Let it exit on its own without | ||
| // taking the rest of the node down. A clean return before shutdown was | ||
| // requested is unexpected for a long-running service, so flag it. | ||
| if ctx.Err() == nil { | ||
| n.logger.Warn("Service stopped before node shutdown was requested", zap.String("name", name)) |
There was a problem hiding this comment.
Agreed. Updated the service.Service doc to drop "Services should be blocking" and spell out the contract instead: a long-running service blocks until ctx is cancelled, a short-lived service may return nil once its work is done without shutting the node down, and an error/panic triggers a node-wide shutdown. Done in 242f665.
- Capture debug.Stack() at the recover site so the panic log points at the original panic location instead of the WaitGroup re-panic. - Update the service.Service doc to reflect that a short-lived service may return nil cleanly without shutting the node down.
rodrodros
left a comment
There was a problem hiding this comment.
Thanks for your contribution
| @@ -0,0 +1,97 @@ | |||
| package node | |||
There was a problem hiding this comment.
Please use package node_test. It is important test node behaviour. The _test suffix forces you to only have acess to the public API when writing tests.
Note that this doesn't mean you should change the public / private API and keep the tests the same but you need to update the tests so they are relevant but only through the use of public API.
I believe after changing the package to _test the modifications to the test file should be minimal.
There was a problem hiding this comment.
Done — moved to package node_test. Used a small export_test.go seam (NewWithLogger) so the external test can build a logger-only Node without widening the public API.
| // TestStartService covers the three ways a service's Run can return and the | ||
| // expected effect on the node-wide context: | ||
| // - error -> shut the node down | ||
| // - panic -> shut the node down and propagate the panic | ||
| // - nil -> let the service exit without taking the node down | ||
| func TestStartService(t *testing.T) { | ||
| newNode := func() *Node { return &Node{logger: log.NewNopZapLogger()} } | ||
|
|
||
| t.Run("error shuts the node down", func(t *testing.T) { | ||
| n := newNode() | ||
| wg := conc.NewWaitGroup() | ||
| ctx, cancel := context.WithCancel(t.Context()) | ||
| defer cancel() | ||
|
|
||
| failing := &fakeService{run: func(context.Context) error { | ||
| return errors.New("boom") | ||
| }} | ||
| n.StartService(wg, ctx, cancel, failing) | ||
| wg.Wait() | ||
|
|
||
| require.Error(t, ctx.Err(), "a service error should shut the node down") | ||
| }) |
There was a problem hiding this comment.
This tests interact with the services but they don't check anything, just that they node stopped.
Please use the zap.Observer utility and then check that the right final error logs where logged, otherwise you've no way of checking node behaviour.
There was a problem hiding this comment.
Done — now asserting on the emitted logs via zaptest/observer: Service error on the error path, and the early-exit warning on a clean return.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3770 +/- ##
==========================================
+ Coverage 75.07% 75.23% +0.16%
==========================================
Files 427 433 +6
Lines 38532 38675 +143
==========================================
+ Hits 28928 29099 +171
+ Misses 7608 7574 -34
- Partials 1996 2002 +6 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
| n.logger.Error("Service panicked, shutting down node", | ||
| zap.String("name", name), zap.Any("panic", r), zap.ByteString("stack", debug.Stack())) |
There was a problem hiding this comment.
This is moot now — the panic log was dropped entirely in a later commit (we agreed it shouldn't ever fire), so there's no more debug.Stack() to encode. The recover path just cancels and re-panics, and the WaitGroup surfaces the original stack to the caller.
Drop the panic log and manual debug.Stack() capture: the WaitGroup re-panics with the original stack, so logging it here is redundant. Trim the "shutting down node" suffix from the error log (the node already announces its own shutdown), and shorten the Service.Run contract doc.
Move the test to package node_test and assert the emitted logs with zaptest/observer instead of only checking the cancelled context. Add an export_test.go seam so the external test can build a logger-only Node without widening the public API.
There was a problem hiding this comment.
I don't understand why do you need this?
| // messages StartService emits. | ||
| newNode := func() (*node.Node, *observer.ObservedLogs) { | ||
| core, logs := observer.New(zapcore.DebugLevel) | ||
| return node.NewWithLogger(log.NewZapLoggerWithCore(core)), logs |
There was a problem hiding this comment.
Cannot you initialize the log here directly with the right logger?
There was a problem hiding this comment.
Yeah, the catch is logger is unexported, so from package node_test I can't set it on the struct directly. The only public way to get a Node with a logger is New(), but that spins up the DB, VM, mempool and all the services — way too heavy for a lifecycle test.
That's all NewWithLogger is for — it lives in export_test.go, so it's compiled only during tests and never becomes part of the package's actual API. It's the standard Go trick for testing from an external _test package while reaching one internal bit, and it's what keeps the test in node_test without changing the public/private API like you asked.
Happy to go the New() route instead if you'd rather, but it'd pull a lot of unrelated setup into the test. Let me know what you prefer.
There was a problem hiding this comment.
Yes no NewWithLogger please, use the available public API to do your tests
| wg.Go(func() { | ||
| // Immediately acknowledge panicing services by shutting down the node | ||
| // Without the deffered cancel(), we would have to wait for user to hit Ctrl+C | ||
| defer cancel() | ||
| name := reflect.TypeOf(s).String() | ||
|
|
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Ehsan-saradar <ehsan.saradar@gmail.com>
Closes #3685.
StartServicedeferredcancel()on every service return, so a service that finished its work and returnednilwould tear the whole node down. That forced short-lived services to artificially block on<-ctx.Done()just to stay alive.Now the node is only shut down when a service returns an error or panics. A clean return lets the service exit on its own; if it happens before shutdown was requested (unexpected for a long-running service) we log a warning instead of stopping the node.
Added tests covering the three cases: error, panic, and clean exit.