Intermittent hang after all tests complete: execnet _thread_receiver blocked on dead worker pipes (Python 3.14t, --dist loadfile) #1313

@clemlesne

Description

Summary

pytest-xdist 3.8.0 intermittently hangs after all tests have passed when using `--dist loadfile` with 10 workers on Python 3.14t (free-threaded). The main thread loops forever in `dsession.loop_once()`, cycling `queue.get(timeout=2.0)` → `Empty` → retry, because `_active_nodes` is never emptied. Meanwhile, 10 execnet `_thread_receiver` threads are stuck in `gateway_base.read()` on dead worker pipes, so they never report the worker-death events that would unregister the nodes.

Environment

  • Python: 3.14.2 free-threading build (cpython-3.14.2+freethreaded-macos-aarch64-none)
  • pytest: 8.4.2
  • pytest-xdist: 3.8.0
  • execnet: 2.1.2
  • OS: macOS 26.3 (Darwin 25.3.0, Apple Silicon)
  • GIL: Irrelevant — reproduces with both PYTHON_GIL=0 (default) and PYTHON_GIL=1

Configuration

```toml
# pyproject.toml
[tool.pytest.ini_options]
addopts = ["-n", "auto", "--dist", "loadfile"]
```

Reproduction

```bash
# ~50% reproduction rate on a 710-test suite.
# Individual test files never hang; only the full suite does.
uv run pytest tests/ -m "unit and not integration" --no-cov -q

# Workaround: serial mode always passes.
uv run pytest tests/ -m "unit and not integration" --no-cov -q -n 0
```

The hang occurs at ~96% completion (after ~680/710 tests have passed). Progress output stops and pytest never exits. CPU usage drops to 0%.

Thread dump (faulthandler)

Captured via `faulthandler.dump_traceback_later(120)`:
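For reference, the dump was produced with a watchdog along these lines (placement in `conftest.py` is illustrative; only the `faulthandler` calls are what was actually used):

```python
import faulthandler

# Arm a watchdog: if the process is still running after 120 s,
# dump every thread's stack to stderr without killing the process.
faulthandler.dump_traceback_later(120, exit=False)

# ... pytest session runs here ...

# Disarm if the session finishes normally.
faulthandler.cancel_dump_traceback_later()
```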

Main thread — caught during `queue.get(timeout=2.0)` wait inside the infinite retry loop. The loop never exits because `_active_nodes` is never emptied (workers aren't unregistered since their receiver threads are stuck):

```
Thread 0x0000000200b07080 (most recent call first):
File "threading.py", line 373 in wait
File "queue.py", line 210 in get
File "xdist/dsession.py", line 154 in loop_once
File "xdist/dsession.py", line 138 in pytest_runtestloop
File "pluggy/_callers.py", line 121 in _multicall
File "_pytest/main.py", line 343 in _main
```

10 receiver threads — all identical, stuck in blocking `read()` on dead worker pipes:

```
Thread 0x000000017654b000 (most recent call first):
File "execnet/gateway_base.py", line 534 in read
File "execnet/gateway_base.py", line 567 in from_io
File "execnet/gateway_base.py", line 1160 in _thread_receiver
File "execnet/gateway_base.py", line 341 in run
File "execnet/gateway_base.py", line 411 in _perform_spawn
```

(All 10 receiver threads show the same trace: `_thread_receiver` → `from_io` → `read`.)

macOS `sample` trace

Confirms the same via native profiling:

```
lock_PyThread_acquire_lock (in libpython3.14t.dylib) + 60
_PyMutex_LockTimed (in libpython3.14t.dylib) + 880
_pthread_cond_wait (in libsystem_pthread.dylib) + 1028
__psynch_cvwait (in libsystem_kernel.dylib) + 8
```

Analysis

The worker subprocesses have finished and exited, but execnet's `_thread_receiver` threads remain blocked on `gateway_base.py:534 read()` — a blocking read from the worker's pipe that never returns EOF. Since these threads never detect worker death, they never fire the shutdown event that would remove the node from `dsession._active_nodes`. The main thread's `loop_once()` keeps retrying `queue.get(timeout=2.0)` → `Empty` → checks `_active_nodes` (still populated) → loops forever.

This is a race condition in worker cleanup: if a worker subprocess exits and closes its pipe in a way that the OS doesn't deliver EOF to the parent process's `read()` call, the receiver thread blocks indefinitely.

The `dsession.loop_once()` while-loop at line 148 cannot break out because:

  1. `_active_nodes` is never emptied (workers not unregistered)
  2. `queue.get(timeout=2.0)` always raises `Empty` (no events arriving)
  3. The `continue` restarts the loop
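A minimal model of the stuck state (this is not the actual dsession code; `loop_once_model` and the node names are illustrative):

```python
import queue

def loop_once_model(event_queue, active_nodes):
    """One iteration of the stuck loop: no events arrive, nodes stay registered."""
    try:
        event_queue.get(timeout=2.0)
    except queue.Empty:
        pass  # the real loop simply retries on timeout
    return bool(active_nodes)  # True means "keep looping"

events = queue.Queue()   # receiver threads are blocked, so nothing is ever enqueued
nodes = {"gw0", "gw1"}   # never unregistered, because worker death is never observed
assert loop_once_model(events, nodes)  # every iteration says "keep going"
```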

Related issues

  • execnet #43 — "Test process hanging forever" — same class of bug (`waitall()` without timeout on dead workers). Partially fixed with timeout support but `_thread_receiver`'s blocking `read()` was not addressed.
  • pytest-xdist #884 — "worksteal + high core counts leads to hangs" — similar symptom (hang after tests complete) but different root cause (queue replacement race in worksteal scheduler). Fixed in 3.2.1.
  • pytest-xdist #1071 — "concurrent remote_exec deadlock for main_thread_only execmodel" — different deadlock in execnet's execmodel, fixed in 3.6.1.
  • scikit-learn #30007 — "Upgrade free-threading CI to run with pytest-freethreaded instead of pytest-xdist" — suggests pytest-xdist is not fully compatible with free-threaded Python.

Suggested fix

The root cause is in execnet (`gateway_base.py:534`). The `_thread_receiver` loop's `read()` call blocks indefinitely when a worker pipe doesn't deliver EOF on process exit. Options:

  1. execnet fix: Use non-blocking or timeout-based reads in `_thread_receiver`, or poll the worker process liveness.
  2. xdist mitigation: In `loop_once()`, add a worker liveness check after N consecutive `Empty` timeouts — if all workers' subprocesses have exited (via `os.waitpid` or similar), force-unregister them from `_active_nodes`.
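A sketch of the xdist-side mitigation (option 2). Everything here is hypothetical: `worker_alive` stands in for an `os.waitpid(..., os.WNOHANG)` or `Popen.poll()` probe, and the threshold of 5 consecutive timeouts is arbitrary:

```python
import queue

def drain_with_liveness_check(event_queue, active_nodes, worker_alive,
                              timeout=2.0, max_empty=5):
    """Force-unregister nodes whose subprocess is gone after repeated timeouts.

    worker_alive(node) is a hypothetical callback returning True while the
    worker subprocess is still running.
    """
    empty_count = 0
    while active_nodes:
        try:
            event = event_queue.get(timeout=timeout)
        except queue.Empty:
            empty_count += 1
            if empty_count >= max_empty:
                # No events for max_empty * timeout seconds: probe liveness.
                dead = {n for n in active_nodes if not worker_alive(n)}
                active_nodes -= dead  # force-unregister so the loop can exit
                empty_count = 0
            continue
        empty_count = 0
        # ... dispatch `event` as dsession normally would ...
```

With such a probe wired to the spawned subprocesses, the main loop could recover even when execnet never delivers the worker-death event.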
