network: terminate live CDP connections on shutdown#2511
Conversation
The CDP server ignored a single SIGTERM while a connection was live: the process only exited if the socket was closed before the signal, or after a third signal. A conventional one-shot graceful stop (SIGTERM then waitpid) hung. On shutdown the sighandler runs Network.stop (which sets `shutdown` and lets the run loop exit) before Server.shutdown. A live CDP worker parks in curl_multi_poll and is woken ONLY by the Network thread via dropCdp -> handles.wakeup(). Once the run loop exits with links still live, nothing can wake those workers, so Server.deinit()'s `while (active_threads > 0)` loop spins forever. Drop every still-live CDP link from the run loop when `shutdown` is set, reusing the existing peer-EOF path: dropCdp(notify=true) pushes a .disconnect into the worker's inbox and wakes it, so cdp.tick() returns false and the worker exits before the loop breaks.
b058716 to
bdf28c5
Compare
Currently, if a disconnect/close is captured in a worker during a syncRequest, that specific request is terminated, but the error doesn't bubble up. The worker remains alive and will subsequently block in a perform, with no connection alive to wake it up. In this commit, when disconnect/close is received, inbox.terminate is set to true. This flag is checked (in syncRequest and http_client.tick) and error.ClientDisconnected is returned. (Also, on network shutdown, always broadcast the cdp_unregister since there's no harm in sending an extra signal even if nothing was removed).
|
@krichprollsch mind reviewing my changes? They are somewhat independent but serve the same purpose. The original commit looks good, but, for me, it wasn't enough to cleanly stop under all circumstances. Only the two sets of changes combined were enough. |
…-io#2510) Drives a .disconnect through the worker's inbox and asserts a second tick still returns error.ClientDisconnected once the inbox is empty. Fails against the pre-latch HttpClient (the worker would re-park in perform with no producer to wake it); passes with the latch.
…onnect syncRequest's `terminated` early return (added in 88b98e7) exits before request() takes ownership of req.headers, so the curl_slist leaked after a latched disconnect (any nested fetch / importScripts / external stylesheet). Free it in the early return, mirroring request()'s own failure paths. Also adds a behavioral syncRequest-after-disconnect test. The leak itself isn't assertable here (curl_slist is C-allocated through the tracking allocator, outside the per-test leak check), so it's verified by the ownership contract rather than the suite.
|
Thanks @karlseguin — good catch, I'd missed the
So: mine delivers + wakes, yours makes the signal survive a swallowed error. Both needed. (Your half also fixes this outside shutdown — a client disconnecting mid-synchronous-fetch would leak its worker thread the same way, and on a single-connection server wedge the next connection.) I checked the latch's safety since it's never cleared: One real issue in if (self.inbox.terminated) {
req.headers.deinit();
return error.ClientDisconnected;
}Tests: I pushed a One thing, take or leave: the now-unconditional @krichprollsch over to you for the combined diff. |
krichprollsch
left a comment
There was a problem hiding this comment.
One non-blocking comment
|
|
||
| var it = self.cdp_links.first; | ||
| while (it) |node| { | ||
| const next = node.next; |
There was a problem hiding this comment.
maybe it's silly and I see we do this way on other loops, but why not directly having it = node.next;?
There was a problem hiding this comment.
Not silly at all. The direct form isn't safe here because this loop removes nodes while it iterates, so reading node.next after dropCdp breaks two ways:
dropCdpcallsself.cdp_links.remove(&link.node)(Network.zig:535). Zig'sDoublyLinkedList.removedoesn't clear the removed node's ownnext(it just relinks the neighbors), sonode.nextstill points at the old successor and we'd keep walking a node that's already out of the list.- With
notify=true,dropCdpalso wakes the worker, which can then exit and free theCdpLinkthatnodelives inside. By the time the iteration resumes,nodeitself may be gone, so it's a real use-after-free and not just a stale read.
Grabbing next before the dropCdp call avoids both. The loops that advance with : (it = node.next) directly (like prepareCdpPollFds, Network.zig:571) only read the list and never drop nodes, so they're fine. This one matches the processCdpEvents first pass (Network.zig:601), which removes nodes for the same reason.
There was a problem hiding this comment.
Yes, but you can grab next BEFORE the drop INTO it
while (it) |node| {
it = node.next;
const link: *CdpLink = @fieldParentPtr("node", node);
if (link.state == .live) {
self.dropCdp(link, null, true);
}
}
What
lightpanda serveignored a singleSIGTERMwhile a CDP connection was live — the process only exited if the socket was closed before the signal, or after a third signal. A conventional one-shot graceful stop (kill -TERM <pid>thenwaitpid) hung after the client had driven a session.Closes #2510.
Root cause
On shutdown the sighandler runs
Network.stop(setsshutdown, lets the run loop exit) beforeServer.shutdown. A live CDP worker parks incurl_multi_polland is woken only by the Network thread viadropCdp -> handles.wakeup(). Once the run loop exits with links still live, nothing can wake those workers, soServer.deinit()'swhile (active_threads > 0)loop spins forever.flowchart TD A1[SIGTERM] --> A2["Network.stop: shutdown = true"] A2 --> A3{"run loop: shutdown and running_handles == 0"} A3 -->|break| A4["Network thread exits<br/>(live CDP links abandoned)"] A4 --> A5["Server.shutdown: conn.shutdown()"] A5 --> A6["worker still parked in curl_multi_poll<br/>(nothing left to wake it)"] A6 --> A7["Server.deinit: while active_threads > 0<br/>spins forever"]Change
Network.run, on observingshutdown, now calls a newshutdownCdpLinks()that drops every still-live CDP link via the existing peer-EOF path (dropCdp(notify=true)— pushes.disconnectinto the worker's inbox and wakes it out ofcurl_multi_poll). The worker'scdp.tick()then returns false and it exits before the loop breaks. The helper is idempotent (a no-op once the links are drained).flowchart TD B1[SIGTERM] --> B2["Network.stop: shutdown = true"] B2 --> B3["run loop: shutdownCdpLinks()<br/>dropCdp(notify=true) per live link"] B3 --> B4["worker woken: .disconnect in inbox<br/>+ curl_multi_wakeup"] B4 --> B5["cdp.tick() returns false<br/>worker exits, active_threads -> 0"] B3 --> B6[run loop breaks, Network exits] B5 --> B7["Server.deinit returns cleanly"]Test plan
zig build test— 717/717 pass.SIGTERMwith the connection still open — now exits cleanly; before the fix it stayed alive 12s+ untilSIGKILL. Closing the socket beforeSIGTERMalready worked and still does.No in-process unit test: the shutdown path is multi-threaded (sighandler thread → Network run loop → per-connection worker), and the without-fix failure mode is a
Server.deinit()spin rather than an assertion. The link's lifetime is guarded byunregisterCdp's synchronous handoff on the Network thread, so dropping links from any other thread (e.g. a test) risks a use-after-free oncdp_poll_snapshot. A race-free test would need a dedicatedServer/Networkinstance plus a watchdog — happy to add that if you'd like it in the suite.Reproducer
Self-contained (Lightpanda + Python stdlib, no extra deps), telemetry disabled to show this is independent of #2509:
Before this PR the second case prints
SIGTERM IGNORED; after it, both printexited on SIGTERM (ok).