CDP server ignores a single SIGTERM while a connection is live (distinct from #2507) #2510

@navidemad

Summary

lightpanda serve ignores a single SIGTERM whenever a CDP WebSocket
connection is still open. The process only terminates if the CDP socket is
closed (client EOF) before the signal, or if SIGTERM is sent three times.
A conventional one-shot graceful stop (kill -TERM <pid> then waitpid) hangs.

This is separate from #2507 / #2509 (the telemetry curl-multi poll-fd bug):

Environment

Reproduction (Lightpanda + raw CDP only)

#!/usr/bin/env python3
# usage: repro.py /path/to/lightpanda
import socket, base64, os, sys, subprocess, time, signal

BIN = sys.argv[1]
print("binary:", subprocess.run([BIN, "version"], capture_output=True, text=True).stdout.strip())

def ws_connect(port):
    s = socket.create_connection(("127.0.0.1", port), timeout=5)
    key = base64.b64encode(os.urandom(16)).decode()
    s.sendall(("GET / HTTP/1.1\r\nHost: x\r\nUpgrade: websocket\r\nConnection: Upgrade\r\n"
               "Sec-WebSocket-Key: %s\r\nSec-WebSocket-Version: 13\r\n\r\n" % key).encode())
    buf = b""
    while b"\r\n\r\n" not in buf:
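        chunk = s.recv(1)
        if not chunk:                  # server closed the socket mid-handshake
            break                      # let the assert below report what was received
        buf += chunk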
    assert buf.startswith(b"HTTP/1.1 101"), buf[:40]
    return s

def run(label, close_before_sigterm, port):
    env = dict(os.environ); env["LIGHTPANDA_DISABLE_TELEMETRY"] = "true"
    p = subprocess.Popen([BIN, "serve", "--host", "127.0.0.1", "--port", str(port), "--log_level", "error"],
                         stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, env=env, start_new_session=True)
    time.sleep(2)
    s = ws_connect(port)               # one CDP WebSocket; no commands sent
    if close_before_sigterm:
        s.close(); time.sleep(0.3)
    p.send_signal(signal.SIGTERM)      # exactly ONE SIGTERM
    deadline = time.time() + 12
    while time.time() < deadline:
        if p.poll() is not None:
            print("[%s] exited on SIGTERM (ok)" % label); break
        time.sleep(0.25)
    else:
        print("[%s] *** SIGTERM IGNORED — still alive 12s after one SIGTERM ***" % label)
        os.killpg(os.getpgid(p.pid), signal.SIGKILL)
    try: s.close()
    except OSError: pass
    try: p.wait(timeout=3)
    except Exception:
        try: os.killpg(os.getpgid(p.pid), signal.SIGKILL)
        except Exception: pass

run("WS closed before SIGTERM", True, 39920)
run("WS still open at SIGTERM", False, 39921)

Output (identical on build 6353 and on build 6354, i.e. main + #2509)

binary: 1.0.0-dev.6354+e1e49c8a
[WS closed before SIGTERM] exited on SIGTERM (ok)
[WS still open at SIGTERM] *** SIGTERM IGNORED — still alive 12s after one SIGTERM ***

Broken flow

sequenceDiagram
    participant C as CDP client
    participant S as Sighandler thread (sigwait)
    participant M as Server/Network loop
    participant W as Connection worker (handleConnection)
    C->>M: WS upgrade → spawnWorker (active_threads=1)
    Note over C: connection left OPEN
    C->>S: SIGTERM (1st)
    S->>M: attempt=1 → run graceful listeners → Server.deinit()
    M->>W: Server.shutdown(): conn.shutdown()
    Note over W: not unwound while the CDP socket is open
    loop forever
        M->>M: while active_threads>0 { sleep 10ms }
    end
    Note over S: a 2nd SIGTERM re-runs listeners, only a 3rd does process.exit(1)
    Note over C,M: client sent ONE SIGTERM ⇒ HANG

Expected flow

sequenceDiagram
    participant C as CDP client
    participant S as Sighandler thread (sigwait)
    participant M as Server/Network loop
    participant W as Connection worker
    C->>M: WS upgrade → spawnWorker (active_threads=1)
    C->>S: SIGTERM (1st)
    S->>M: run graceful listeners → Server.deinit()
    M->>W: Server.shutdown(): conn.shutdown() (interrupts the worker)
    W-->>M: worker returns (active_threads→0)
    M-->>C: process exits cleanly

Root cause (pointers)

  • src/Sighandler.zig sighandle: the 1st/2nd SIGTERM only run the graceful
    listeners; only the 3rd (attempt > 1) calls std.process.exit(1).
  • src/Server.zig deinit: after shutdown() it spins
    while (active_threads.load() > 0) sleep(10ms), waiting for each
    handleConnection worker to return.
  • Server.shutdown() calls conn.shutdown() per live CDP connection, but that
    does not unwind a worker whose connection socket is still open, so
    active_threads never reaches 0 and the graceful path blocks indefinitely.

The single SIGTERM is therefore consumed by the graceful path that can't
complete, instead of terminating the process.
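The escalation described in the first bullet can be modeled in a few lines of Python. This is only a toy model of the observed behavior, not the Zig code in src/Sighandler.zig: the class name, the callbacks, and the zero-based counter are assumptions made for illustration.

```python
from typing import Callable, List


class TermEscalation:
    """Toy model: the first two SIGTERMs run the graceful path, the third exits."""

    def __init__(self, run_graceful: Callable[[], None],
                 force_exit: Callable[[int], None]) -> None:
        self.attempt = 0  # assumed: number of termination signals already seen
        self.run_graceful = run_graceful
        self.force_exit = force_exit

    def on_sigterm(self) -> None:
        if self.attempt > 1:
            self.force_exit(1)      # only reached on the third signal
        else:
            self.run_graceful()     # may block forever if a worker never unwinds
        self.attempt += 1


events: List[str] = []
esc = TermEscalation(lambda: events.append("graceful"),
                     lambda code: events.append("exit(%d)" % code))
for _ in range(3):
    esc.on_sigterm()
print(events)  # ['graceful', 'graceful', 'exit(1)']
```

In this model a client that sends exactly one SIGTERM only ever reaches the graceful branch, which is the branch that cannot finish while a connection is open.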

Expected vs actual

  • Expected: one SIGTERM terminates lightpanda serve whether or not a CDP
    connection is open (as already happens when the socket is closed first).
  • Actual: with a live CDP connection, the first (and second) SIGTERM are
    absorbed; only SIGKILL or a third SIGTERM stops it.

Suggested direction

On the first termination signal, force-close live CDP connections (and/or make
conn.shutdown() interrupt the worker's blocking wait) so Server.deinit's
drain loop can complete — i.e. the first SIGTERM should always lead to exit.
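As an illustration of the second option (having the shutdown call interrupt the worker's blocking wait), here is a small Python model. It is not Lightpanda's Zig code: the function name and the socketpair standing in for a CDP connection are assumptions, and it relies on the Linux behavior that shutting down a socket wakes a thread blocked in recv on it.

```python
import socket
import threading


def worker_unwinds(force_shutdown: bool, timeout: float = 1.0) -> bool:
    """Return True if the blocked worker returned within `timeout` seconds."""
    # client_side stays open and silent, like an idle CDP client.
    server_side, client_side = socket.socketpair()
    done = threading.Event()

    def worker() -> None:
        # Stand-in for the connection worker: block until data, EOF, or shutdown.
        try:
            server_side.recv(1024)
        except OSError:
            pass
        done.set()

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    if force_shutdown:
        # Shut down the server's own end; the blocked recv returns immediately.
        server_side.shutdown(socket.SHUT_RDWR)
    unwound = done.wait(timeout)

    # Cleanup: closing the client end delivers EOF and releases the worker
    # in the case where it was left blocked.
    client_side.close()
    t.join(timeout)
    server_side.close()
    return unwound


print("with forced shutdown:", worker_unwinds(True))
print("without:", worker_unwinds(False, timeout=0.3))
```

The second call models the reported hang: with the client end still open and nothing forcing the server end shut, the worker stays blocked, so a drain loop waiting on it would not finish.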

Why it matters

Any client that stops the browser with a single graceful SIGTERM (the
conventional path — Puppeteer/Playwright/chromedp/custom drivers) hangs after
driving a session. In capybara-lightpanda this surfaced as a multi-minute /
indefinite hang at suite teardown when process cleanup fell to a GC finalizer
that SIGTERMs without first closing the CDP socket.
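Until the server handles this itself, a driver can sidestep the hang by closing the CDP socket before signalling, which the reproduction shows is enough for a single SIGTERM to work. The sketch below is a hypothetical client-side helper (the name `stop_browser` and its parameters are not part of any driver's API); the short settle delay mirrors the 0.3 s pause used in the repro script.

```python
import signal
import socket
import subprocess
import time


def stop_browser(proc: subprocess.Popen, cdp_sock: socket.socket,
                 timeout: float = 5.0, settle: float = 0.3) -> int:
    """Close the CDP socket, send one SIGTERM, and fall back to SIGKILL."""
    try:
        cdp_sock.close()            # client EOF lets the server's worker unwind
    except OSError:
        pass
    time.sleep(settle)              # give the server a moment to notice the EOF
    proc.send_signal(signal.SIGTERM)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()                 # last resort so teardown never hangs
        return proc.wait()
```

A caller would pass the `Popen` handle for `lightpanda serve` and the socket carrying the CDP WebSocket. The SIGKILL fallback bounds teardown time even if the graceful path stalls for another reason.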
