Skip to content

fix(core/service): defer tracing exporter termination until graceful shutdown completes#5531

Open
leevec wants to merge 1 commit into
zeromicro:masterfrom
leevec:fix/tracing-shutdown-loss
Open

fix(core/service): defer tracing exporter termination until graceful shutdown completes#5531
leevec wants to merge 1 commit into
zeromicro:masterfrom
leevec:fix/tracing-shutdown-loss

Conversation

@leevec
Copy link
Copy Markdown

@leevec leevec commented Apr 5, 2026

Currently, the tracing provider is torn down prior to the completion of the service’s graceful shutdown sequence. Consequently, spans emitted during this phase—such as those from in-flight requests or cleanup routines—are dropped, as the exporter terminates before it can flush pending data to the backend.

@kevwan kevwan added kind/bug Something isn't working area/core labels Apr 30, 2026
@kevwan
Copy link
Copy Markdown
Contributor

kevwan commented Apr 30, 2026

Review

Good fix for a real observability gap — spans emitted during graceful shutdown were being silently dropped.

Problem: The previous code registered trace.StopAgent() as a shutdown listener via proc.AddShutdownListener. This ran during signal handling (SIGTERM/SIGINT), which happens concurrently with or before the server finishes draining in-flight requests. As a result, spans from the draining phase were dropped.

The fix:

  1. Removes trace.StopAgent() from proc.AddShutdownListener
  2. Adds TearDown() to ServiceConf that calls trace.StopAgent()
  3. Calls TearDown() from rest.Server.Stop() and zrpc.RpcServer.Stop() — i.e., after the server has finished draining

Concerns:

  1. Missing gateway servergateway.Server has its own Stop() method. This PR updates rest and zrpc servers but not the gateway. Should gateway.Server.Stop() also call TearDown()?

  2. Signal-triggered shutdown (not through Stop()) — If the process is killed via signal and Stop() is never called explicitly (e.g., in tests or unusual shutdown flows), trace.StopAgent() won't be called. The previous behavior ensured it ran on shutdown signal. Consider whether a belt-and-suspenders approach (both shutdown listener + Stop()) with deduplication would be safer.

  3. RpcServerConf stores the conf struct by value — This means TearDown() is called on a copy of ServiceConf. Since trace.StopAgent() likely uses global state, this should be fine in practice, but it's worth noting.

  4. Tests — The test changes (defer c.TearDown()) are correct. However, tests for rest.Server.Stop() and zrpc.RpcServer.Stop() calling TearDown() would increase confidence.

Overall: The direction is correct and the fix improves trace completeness during graceful shutdown. The gateway concern is worth addressing before merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/core kind/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants