Skip to content

network: terminate live CDP connections on shutdown#2511

Open
navidemad wants to merge 5 commits into
lightpanda-io:mainfrom
navidemad:fix-sigterm-live-cdp-connection
Open

network: terminate live CDP connections on shutdown#2511
navidemad wants to merge 5 commits into
lightpanda-io:mainfrom
navidemad:fix-sigterm-live-cdp-connection

Conversation

@navidemad
Copy link
Copy Markdown
Contributor

@navidemad navidemad commented May 20, 2026

What

lightpanda serve ignored a single SIGTERM while a CDP connection was live — the process only exited if the socket was closed before the signal, or after a third signal. A conventional one-shot graceful stop (kill -TERM <pid> then waitpid) hung after the client had driven a session.

Closes #2510.

Root cause

On shutdown the sighandler runs Network.stop (sets shutdown, lets the run loop exit) before Server.shutdown. A live CDP worker parks in curl_multi_poll and is woken only by the Network thread via dropCdp -> handles.wakeup(). Once the run loop exits with links still live, nothing can wake those workers, so Server.deinit()'s while (active_threads > 0) loop spins forever.

flowchart TD
    A1[SIGTERM] --> A2["Network.stop: shutdown = true"]
    A2 --> A3{"run loop: shutdown and running_handles == 0"}
    A3 -->|break| A4["Network thread exits<br/>(live CDP links abandoned)"]
    A4 --> A5["Server.shutdown: conn.shutdown()"]
    A5 --> A6["worker still parked in curl_multi_poll<br/>(nothing left to wake it)"]
    A6 --> A7["Server.deinit: while active_threads &gt; 0<br/>spins forever"]
Loading

Change

Network.run, on observing shutdown, now calls a new shutdownCdpLinks() that drops every still-live CDP link via the existing peer-EOF path (dropCdp(notify=true) — pushes .disconnect into the worker's inbox and wakes it out of curl_multi_poll). The worker's cdp.tick() then returns false and it exits before the loop breaks. The helper is idempotent (a no-op once the links are drained).

flowchart TD
    B1[SIGTERM] --> B2["Network.stop: shutdown = true"]
    B2 --> B3["run loop: shutdownCdpLinks()<br/>dropCdp(notify=true) per live link"]
    B3 --> B4["worker woken: .disconnect in inbox<br/>+ curl_multi_wakeup"]
    B4 --> B5["cdp.tick() returns false<br/>worker exits, active_threads -&gt; 0"]
    B3 --> B6[run loop breaks, Network exits]
    B5 --> B7["Server.deinit returns cleanly"]
Loading

Test plan

  • zig build test — 717/717 pass.
  • End-to-end (ReleaseFast build): the reproducer below — a raw CDP client that opens a WebSocket and sends one SIGTERM with the connection still open — now exits cleanly; before the fix it stayed alive 12s+ until SIGKILL. Closing the socket before SIGTERM already worked and still does.

No in-process unit test: the shutdown path is multi-threaded (sighandler thread → Network run loop → per-connection worker), and the without-fix failure mode is a Server.deinit() spin rather than an assertion. The link's lifetime is guarded by unregisterCdp's synchronous handoff on the Network thread, so dropping links from any other thread (e.g. a test) risks a use-after-free on cdp_poll_snapshot. A race-free test would need a dedicated Server/Network instance plus a watchdog — happy to add that if you'd like it in the suite.

Reproducer

Self-contained (Lightpanda + Python stdlib, no extra deps), telemetry disabled to show this is independent of #2509:

import socket, base64, os, sys, subprocess, time, signal

BIN = sys.argv[1]
print("binary:", subprocess.run([BIN, "version"], capture_output=True, text=True).stdout.strip())

def ws_connect(port):
    s = socket.create_connection(("127.0.0.1", port), timeout=5)
    key = base64.b64encode(os.urandom(16)).decode()
    s.sendall(("GET / HTTP/1.1\r\nHost: x\r\nUpgrade: websocket\r\nConnection: Upgrade\r\n"
               "Sec-WebSocket-Key: %s\r\nSec-WebSocket-Version: 13\r\n\r\n" % key).encode())
    buf = b""
    while b"\r\n\r\n" not in buf:
        buf += s.recv(1)
    assert buf.startswith(b"HTTP/1.1 101"), buf[:40]
    return s

def run(label, close_before_sigterm, port):
    env = dict(os.environ); env["LIGHTPANDA_DISABLE_TELEMETRY"] = "true"
    p = subprocess.Popen([BIN, "serve", "--host", "127.0.0.1", "--port", str(port), "--log_level", "error"],
                         stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, env=env, start_new_session=True)
    time.sleep(2)
    s = ws_connect(port)               # one CDP WebSocket; no commands sent
    if close_before_sigterm:
        s.close(); time.sleep(0.3)
    p.send_signal(signal.SIGTERM)      # exactly ONE SIGTERM
    deadline = time.time() + 12
    while time.time() < deadline:
        if p.poll() is not None:
            print("[%s] exited on SIGTERM (ok)" % label); break
        time.sleep(0.25)
    else:
        print("[%s] *** SIGTERM IGNORED — still alive 12s after one SIGTERM ***" % label)
        os.killpg(os.getpgid(p.pid), signal.SIGKILL)
    try: s.close()
    except OSError: pass
    try: p.wait(timeout=3)
    except Exception:
        try: os.killpg(os.getpgid(p.pid), signal.SIGKILL)
        except Exception: pass

run("WS closed before SIGTERM", True, 39920)
run("WS still open at SIGTERM", False, 39921)

Before this PR the second case prints SIGTERM IGNORED; after it, both print exited on SIGTERM (ok).

The CDP server ignored a single SIGTERM while a connection was live: the
process only exited if the socket was closed before the signal, or after a
third signal. A conventional one-shot graceful stop (SIGTERM then waitpid)
hung.

On shutdown the sighandler runs Network.stop (which sets `shutdown` and lets
the run loop exit) before Server.shutdown. A live CDP worker parks in
curl_multi_poll and is woken ONLY by the Network thread via
dropCdp -> handles.wakeup(). Once the run loop exits with links still live,
nothing can wake those workers, so Server.deinit()'s
`while (active_threads > 0)` loop spins forever.

Drop every still-live CDP link from the run loop when `shutdown` is set,
reusing the existing peer-EOF path: dropCdp(notify=true) pushes a .disconnect
into the worker's inbox and wakes it, so cdp.tick() returns false and the
worker exits before the loop breaks.
@navidemad navidemad force-pushed the fix-sigterm-live-cdp-connection branch from b058716 to bdf28c5 Compare May 20, 2026 19:02
Currently, if a disconnect/close is captured in a worker during a syncRequest,
that specific request is terminated, but the error doesn't bubble up. The worker
remains alive and will subsequently block in a perform, with no connection alive
to wake it up.

In this commit, when disconnect/close is received, inbox.terminate is set to
true. This flag is checked (in syncRequest and http_client.tick) and
error.ClientDisconnected is returned.

(Also, on network shutdown, always broadcast the cdp_unregister since there's
no harm in sending an extra signal even if nothing was removed).
@karlseguin
Copy link
Copy Markdown
Collaborator

@krichprollsch mind reviewing my changes? They are somewhat independent but serve the same purpose. The original commit looks good, but, for me, it wasn't enough to cleanly stop under all circumstances. Only the two sets of changes combined were enough.

navidemad added 3 commits May 21, 2026 12:18
…-io#2510)

Drives a .disconnect through the worker's inbox and asserts a second tick still returns error.ClientDisconnected once the inbox is empty. Fails against the pre-latch HttpClient (the worker would re-park in perform with no producer to wake it); passes with the latch.
…onnect

syncRequest's `terminated` early return (added in 88b98e7) exits before request() takes ownership of req.headers, so the curl_slist leaked after a latched disconnect (any nested fetch / importScripts / external stylesheet). Free it in the early return, mirroring request()'s own failure paths.

Also adds a behavioral syncRequest-after-disconnect test. The leak itself isn't assertable here (curl_slist is C-allocated through the tracking allocator, outside the per-test leak check), so it's verified by the ownership contract rather than the suite.
@navidemad
Copy link
Copy Markdown
Contributor Author

Thanks @karlseguin — good catch, I'd missed the syncRequest path. Agreed the two changes are complementary, not redundant. Writing out the split for @krichprollsch's review:

  • My commit handles the idle worker — parked at the top-level perform/poll. shutdownCdpLinksdropCdp(notify=true) pushes a .disconnect and wakes it, so drainInbox returns error.ClientDisconnected and the worker exits.
  • Your commit handles the busy worker — mid-syncRequest. The .disconnect is drained inside the .sync_wait loop and error.ClientDisconnected propagates out of syncRequest, but the JS caller treats it as a failed fetch rather than a fatal disconnect. The worker resumes, re-parks with the link already dropped and nothing left to wake it. The terminated latch makes that one-shot error sticky so the next tick/syncRequest exits cleanly.

So: mine delivers + wakes, yours makes the signal survive a swallowed error. Both needed. (Your half also fixes this outside shutdown — a client disconnecting mid-synchronous-fetch would leak its worker thread the same way, and on a single-connection server wedge the next connection.)

I checked the latch's safety since it's never cleared: terminated is only touched on the worker thread (the Network thread only touches inbox.queue under the mutex, so no atomic needed), and it can't poison the next pooled connection because cdp.initbrowser.inithttp_client.init rebuilds the inbox fresh (.inbox = .{}).

One real issue in Capture disconnect/close in Worker: the early return in syncRequest (HttpClient.zig:543) returns before self.request(r, null), which takes ownership of req.headers and frees the curl_slist on every path. Callers (Frame.zig external stylesheet, ScriptManager blocking scripts, worker importScripts) rely on request() to free them — so after a latched disconnect, any nested fetch leaks its header list. I pushed a one-line fix mirroring request()'s own failure paths:

if (self.inbox.terminated) {
    req.headers.deinit();
    return error.ClientDisconnected;
}

Tests: I pushed a tick-latch regression test (fails against the pre-latch HttpClient.zig and passes with it) and a behavioral syncRequest-after-disconnect test. The header leak above isn't directly assertable — curl_slist is C-allocated through the tracking allocator, outside the per-test leak check — so that one's verified by review. Full suite green (719/719).

One thing, take or leave: the now-unconditional cdp_unregister.broadcast() — is that closing a specific missed-wakeup race, or defensive? Harmless either way; a one-line comment naming the race would help future readers if there is one.

@krichprollsch over to you for the combined diff.

Copy link
Copy Markdown
Member

@krichprollsch krichprollsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One non-blocking comment

Comment thread src/network/Network.zig

var it = self.cdp_links.first;
while (it) |node| {
const next = node.next;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe it's silly and I see we do this way on other loops, but why not directly having it = node.next;?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not silly at all. The direct form isn't safe here because this loop removes nodes while it iterates, so reading node.next after dropCdp breaks two ways:

  1. dropCdp calls self.cdp_links.remove(&link.node) (Network.zig:535). Zig's DoublyLinkedList.remove doesn't clear the removed node's own next (it just relinks the neighbors), so node.next still points at the old successor and we'd keep walking a node that's already out of the list.
  2. With notify=true, dropCdp also wakes the worker, which can then exit and free the CdpLink that node lives inside. By the time the iteration resumes, node itself may be gone, so it's a real use-after-free and not just a stale read.

Grabbing next before the dropCdp call avoids both. The loops that advance with : (it = node.next) directly (like prepareCdpPollFds, Network.zig:571) only read the list and never drop nodes, so they're fine. This one matches the processCdpEvents first pass (Network.zig:601), which removes nodes for the same reason.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but you can grab next BEFORE the drop INTO it

while (it) |node| {
    it = node.next;
    const link: *CdpLink = @fieldParentPtr("node", node);
    if (link.state == .live) {
        self.dropCdp(link, null, true);
    }
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CDP server ignores a single SIGTERM while a connection is live (distinct from #2507)

3 participants