Skip to content

Fix #189: process worker fork-safety on macOS Tahoe (+ psycopg3)#207

Open
brentrager wants to merge 5 commits into
Maxteabag:mainfrom
brentrager:fix/macos-tahoe-fork-segfault
Open

Fix #189: process worker fork-safety on macOS Tahoe (+ psycopg3)#207
brentrager wants to merge 5 commits into
Maxteabag:mainfrom
brentrager:fix/macos-tahoe-fork-segfault

Conversation

@brentrager
Copy link
Copy Markdown

@brentrager brentrager commented May 20, 2026

First — thanks for sqlit! It's the cleanest of the SQL TUIs I've tried, and getting it working against our Postgres was worth digging into.

Fixes #189. macOS Tahoe (26.4+) widened the set of system libraries that abort when initialized after fork(), which has been quietly breaking quite a few Python projects (celery, gunicorn, psycopg2 all have open issues from the last few months — links below). For sqlit, the symptom is that the process worker either hangs with Worker connection closed. or segfaults the child when expanding a folder or running a query against PG over TLS. The commits in this PR each address one layer of that and are independently reviewable.

Commits

  1. Drop the spawn → fork fallback on darwin; redirect worker stderr. When multiprocessing.spawn fails with bad value(s) in fds_to_keep, the current code falls back to fork(). On Tahoe that's no longer safe — CoreFoundation, Security.framework, libsystem_trace's os_log, and libsystem_info's getaddrinfo all now abort if they were loaded by the parent before fork. Letting spawn fail and surfacing it to the caller lets the in-process fallback take over instead. The worker also now redirects its stdout/stderr to $SQLIT_WORKER_LOG (defaults to <tmpdir>/sqlit-worker.log) and enables faulthandler, so future C-level faults aren't invisible behind a closed pipe — that's what let me find the segfault site in the first place.

  2. Switch PostgreSQL / CockroachDB adapter from psycopg2 to psycopg v3. psycopg3 delegates TLS to libpq instead of bundling libssl, so importing it doesn't pull Security.framework into the parent at import time — one specific crash class gone. The diff is small: import psycopg, dbname= instead of database=, and psycopg[binary]>=3.2.0 in pyproject / flake.nix / README / install-strategy / tests + integration fixtures. The install-strategy unit tests still use the literal string "psycopg2-binary" because they're exercising detect_strategy with a sample package name, not the real driver.

  3. Pre-spawn the worker in cli.py before App.__init__. This is the long-term fix from the issue addendum. multiprocessing.spawn snapshots the parent's open FDs at spawn time, and Textual's App.__init__ registers signal-handler pipes that aren't inheritable — that's the bad value(s) in fds_to_keep ValueError. The previous lazy creation in idle_scheduler.py:_warm was racing this. Spawning before SSMSTUI(...) is constructed is the latest moment at which the FD set is still clean. The pre-spawned client is plumbed in via a new process_worker_client kwarg on SSMSTUI; ProcessWorkerLifecycleMixin already returns the cached client first, so no other changes are needed. If pre-spawn fails for any reason, the lazy path still runs (and on darwin, that path now correctly falls back to in-process execution per commit 1).

  4. Unit tests for the new code paths. New tests/unit/test_process_worker_fallback.py and tests/unit/test_cli_prewarm.py cover the darwin re-raise, the worker-log env override, the prewarm gating on runtime.process_worker / runtime.mock, prewarm exception-swallowing, and the SSMSTUI kwarg plumbing.

  5. psycopg v3 transaction + multi-statement compatibility in CursorBasedAdapter. Running tests/integration/test_transactions.py against a real Postgres container with the psycopg3 adapter surfaced two driver-behavior differences:

    • execute_non_query was calling conn.commit() unconditionally. psycopg2 silently no-ops that when autocommit=True; psycopg v3 actively COMMITs any open transaction, so executor.execute("BEGIN") ended its own transaction immediately and ROLLBACK had nothing to undo. Skipping the commit when getattr(conn, "autocommit", False) preserves the intent on both drivers.
    • execute_query only consulted the first result set. psycopg v3 exposes each result set in a multi-statement string and positions the cursor on the first (e.g. on BEGIN, with no columns). Walking cursor.nextset() lands on the last set; the loop is a no-op on drivers that don't implement it (some raise NotSupportedError, hence the try/except).

    The same commit fixes three test bugs that also reproduce on Maxteabag/sqlit:main with psycopg2 (verified by running the upstream tree against the same docker fixture):

    • test_multi_statement_as_single_query_works left an open transaction on the executor's persistent connection, then the test's own DROP TABLE on a fresh connection blocked on the lock. Added a reset_tui_executor() before cleanup.
    • test_atomic_execute_rolls_back_on_error asserted that atomic_execute raises on error; per its docstring and current implementation it actually returns MultiStatementResult(completed=False, error_index=…).
    • test_atomic_execute_commits_on_success asserted a QueryResult for a 3-statement query; atomic_execute is documented to return a MultiStatementResult carrying each StatementResult. Asserting on result.results[-1].result pulls out the trailing SELECT's QueryResult.

Result on macOS Tahoe 26.4.1 / Python 3.14 / PostgreSQL over TLS

State Before After commits 1–2 After all five
Worker crash on tree expand / query Hard segfault, Worker connection closed. In-process fallback, popup every query Worker runs, no popup
Query time (25-row table over WAN+TLS) n/a (crash) ~10s (UI thread blocked) ~0.5s
Query cancellation Not available Not available (in-process fallback) Available
psycopg flavor psycopg2-binary psycopg[binary]>=3.2.0 psycopg[binary]>=3.2.0

Tests

Run against postgres:16 from infra/docker/docker-compose.test.yml:

Suite Result
tests/unit 842 passed, 1 skipped
tests/integration/test_transactions.py 13 passed
tests/integration/test_database_browsing_flow.py::TestPostgreSQLDatabaseBrowsing 1 passed
tests/integration/test_stale_connection_reconnect.py -k postgres 5 passed, 1 idle-variant skipped (pre-existing)

Related upstream regressions on Tahoe (for context — this isn't sqlit-specific)

Tradeoffs to flag

  • Commit 1 costs query cancellation on darwin when spawn fails (the in-process fallback can't cancel mid-query). Commit 3 makes that case rare in practice.
  • Commit 2 is a driver bump. psycopg3 is API-compatible for the surface sqlit uses, but it's still a runtime dependency change worth flagging.
  • Commit 3 always spawns the worker at startup when process_worker=True (the default) — there's no longer an idle-only path on the first launch. Cost: one extra Python subprocess at TUI startup. Benefit: no race, no popup, worker is hot.
  • Commit 5 changes the autocommit branch in the base execute_non_query. Other CursorBasedAdapter drivers that don't set conn.autocommit=True are unaffected (getattr(conn, "autocommit", False) returns False and the commit still fires). The mysql/mssql/oracle/sqlite/firebird unit suites are unchanged and all green.

Happy to split this if any one commit is too big to land alongside the others — commits 1 + 3 alone are sufficient to fix #189. And happy to iterate on any of the approaches if you'd prefer something different.

…r stderr

Two changes for the macOS Tahoe segfault reported in issue Maxteabag#189:

1. process_worker_client.py: drop the spawn -> fork fallback when running
   on darwin. macOS Tahoe (26+) widened the set of system libraries that
   abort if initialized after fork() — CoreFoundation, Security.framework,
   libsystem_trace, libsystem_info — so any library the parent has loaded
   (psycopg, getaddrinfo, setproctitle, NSCharacterSet) can crash the
   forked child. Letting spawn fail and surfacing it to the caller lets
   the in-process fallback in query_execution take over instead.

   Cost: query cancellation isn't available when spawn fails. Gain: no
   hard crashes ("Worker connection closed.") on macOS.

2. process_worker.py: redirect the worker's stdout/stderr to a log file
   and enable faulthandler at entry. The worker can SIGPIPE on libpq
   notice writes when its inherited FDs point to a Textual-managed pipe,
   and any future C-level fault on the worker side is otherwise invisible
   to the parent (the pipe just dies). The log path honors
   $SQLIT_WORKER_LOG and defaults to <tmpdir>/sqlit-worker.log.

Refs the celery (Tahoe + setproctitle, celery#9894) and gunicorn
(Tahoe + NSCharacterSet, gunicorn#3526) reports for the broader fork-
safety regression on Tahoe.
Companion to the Maxteabag#189 fork-fallback fix. psycopg3 delegates TLS to
libpq instead of bundling libssl, so importing it doesn't pull
Security.framework into the parent process on macOS — eliminating one
specific child-side crash class (psycopg2 calling SSL_CTX_new inside
the libpq init path after fork).

The change is local to the two PostgreSQL-family adapters and the
matching install metadata:

  - postgresql/adapter.py, cockroachdb/adapter.py: import psycopg
    (v3) and pass dbname= instead of database=; everything else
    (sslmode, sslrootcert, sslcert, sslkey, sslpassword, autocommit,
    extra_options) is already libpq-conninfo-compatible.
  - install_strategy.py: add psycopg / psycopg[binary] to the Arch
    package-name mapping.
  - pyproject.toml: replace psycopg2-binary>=2.9.0 with
    psycopg[binary]>=3.2.0 in the postgres, cockroachdb, and all
    extras.
  - flake.nix: pyPkgs.psycopg2 -> pyPkgs.psycopg.
  - README.md: driver-reference table now lists psycopg[binary].
  - tests + integration fixtures: import psycopg (3) and use dbname=.
    The install-strategy tests still use the literal string
    "psycopg2-binary" because they're exercising detect_strategy with
    a sample package name, not the real driver.

837 unit tests pass.
The first two commits made the macOS Tahoe segfault non-fatal by
disabling the fork fallback, but every spawn after Textual booted
still failed with `bad value(s) in fds_to_keep` — leaving the user
with a "Process worker unavailable; falling back" popup and the
slower in-process executor on every query.

Root cause: `multiprocessing.spawn` snapshots the parent's open file
descriptors at spawn time. Once Textual's `App.__init__` has
registered its signal-handler pipes and event-loop FDs, several of
those are not inheritable, and spawn aborts before the child even
runs. The previous lazy creation in the idle scheduler always hit
this race.

Fix: spawn the worker in `cli.py` *before* `SSMSTUI(...)` is
constructed. That's the latest moment at which the parent's FD set
is still clean. The new `_prewarm_process_worker(runtime)` helper
returns a live `ProcessWorkerClient`, or `None` if the worker is
disabled / mocked / spawn raised — in which case we fall through to
the lazy path inside the UI (with the in-process fallback on
darwin).

The pre-spawned client is passed into `SSMSTUI` via a new
`process_worker_client` kwarg and stashed as
`self._process_worker_client`. `ProcessWorkerLifecycleMixin` already
returns the cached client first, so no other changes are needed.

Verified on macOS Tahoe 26.4.1 / Python 3.14 / PostgreSQL over TLS:
the popup no longer appears and queries run in the worker process.
Covers the load-bearing behaviors introduced by the prior commits:

- _maybe_fallback_start re-raises on darwin (commit 1)
- _maybe_fallback_start still re-raises non-fds_to_keep errors verbatim
- _open_worker_log honors $SQLIT_WORKER_LOG and survives an unwritable path
- _prewarm_process_worker respects runtime.process_worker, runtime.mock,
  and swallows spawn exceptions so startup never crashes (commit 3)
- SSMSTUI(process_worker_client=client) stashes the client on
  self._process_worker_client, which is what ProcessWorkerLifecycleMixin
  consults first
… multi-statement

Running tests/integration/test_transactions.py against a real PostgreSQL
container with the psycopg3 adapter surfaced two driver-behavior
differences that need handling in the shared CursorBasedAdapter:

1. autocommit-aware execute_non_query
   psycopg2 silently makes conn.commit() a no-op when conn.autocommit is
   True; psycopg v3 actively COMMITs any open transaction. The adapter's
   execute_non_query unconditionally called conn.commit(), which under
   psycopg3 meant `executor.execute("BEGIN")` immediately ended the
   transaction it just started — so the subsequent INSERT was committed
   and ROLLBACK had nothing to undo. Skipping the commit when
   conn.autocommit is True preserves the original intent on both
   drivers.

2. nextset() walk in execute_query
   psycopg v3 exposes each result set in a multi-statement string and
   leaves the cursor positioned at the first (e.g. on BEGIN, which has
   no columns). psycopg2 only ever surfaced the last set. Walking
   cursor.nextset() until it returns falsy lands on the last set; the
   loop is a no-op on drivers that don't implement nextset() (a number
   raise NotSupportedError, hence the try/except).

The integration tests in tests/integration/test_transactions.py needed
three fixes that are independent of the driver swap — verified to fail
or hang on upstream main with psycopg2 as well:

- test_multi_statement_as_single_query_works
  The "BEGIN; INSERT; SELECT" payload left an open transaction on the
  executor's persistent connection; the test's subsequent DROP TABLE
  (on a different connection) then blocked waiting for the lock. Added
  reset_tui_executor() before the cleanup so the persistent connection
  rolls back and releases the lock.

- test_atomic_execute_rolls_back_on_error
  The test expected atomic_execute to raise on a failing statement.
  Per its docstring and current implementation, atomic_execute reports
  the failure via a MultiStatementResult with completed=False and the
  failing index in error_index. Updated the assertions accordingly.

- test_atomic_execute_commits_on_success
  Multi-statement atomic_execute is documented to return a
  MultiStatementResult (one StatementResult per statement), not a
  single QueryResult. Asserting on result.results[-1].result picks up
  the trailing SELECT's QueryResult exactly as the test intended.

Result against the docker-compose postgres:16 fixture:
  - tests/integration/test_transactions.py: 13 passed
  - tests/integration/test_database_browsing_flow.py::TestPostgreSQLDatabaseBrowsing: 1 passed
  - tests/integration/test_stale_connection_reconnect.py -k postgres: 5 passed, 1 idle-variant skipped (pre-existing)
  - tests/unit: 842 passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Broken pipe/worker connection closed errors in TUI only, inconsistently

1 participant