Summary
pytest-xdist 3.8.0 intermittently hangs after all tests have passed when using --dist loadfile with 10 workers on Python 3.14t (free-threaded). The main thread loops forever in dsession.loop_once() → queue.get(timeout=2.0) → Empty → retry, because _active_nodes is never emptied. Meanwhile, 10 execnet _thread_receiver threads are stuck in gateway_base.read() on dead worker pipes, so they never report worker death events to unregister nodes.
Environment
- Python: 3.14.2 free-threading build (cpython-3.14.2+freethreaded-macos-aarch64-none)
- pytest: 8.4.2
- pytest-xdist: 3.8.0
- execnet: 2.1.2
- OS: macOS 26.3 (Darwin 25.3.0, Apple Silicon)
- GIL: Irrelevant; reproduces with both `PYTHON_GIL=0` (default) and `PYTHON_GIL=1`
Configuration
```toml
# pyproject.toml
[tool.pytest.ini_options]
addopts = ["-n", "auto", "--dist", "loadfile"]
```
Reproduction
```bash
# ~50% reproduction rate on a 710-test suite;
# individual test files never hang, only the full suite
uv run pytest tests/ -m "unit and not integration" --no-cov -q

# Workaround: serial mode always passes
uv run pytest tests/ -m "unit and not integration" --no-cov -q -n 0
```
The hang occurs at ~96% completion (after ~680/710 tests have passed). Progress output stops and pytest never exits. CPU usage drops to 0%.
Thread dump (faulthandler)
Captured via `faulthandler.dump_traceback_later(120)`, armed before the test session starts.
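For reference, a minimal way to arm such a watchdog from `conftest.py` (the 120 s timeout matches the call above; `repeat=True` is an added choice so long hangs keep emitting dumps):

```python
# conftest.py -- dump every thread's stack if the run outlives the timer
import faulthandler
import sys

# If the process is still alive after 120 s, write all thread tracebacks to
# stderr; exit=False (the default) leaves the process running for inspection.
faulthandler.dump_traceback_later(120, repeat=True, file=sys.stderr)
```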
Main thread — caught during `queue.get(timeout=2.0)` wait inside the infinite retry loop. The loop never exits because `_active_nodes` is never emptied (workers aren't unregistered since their receiver threads are stuck):
```
Thread 0x0000000200b07080 (most recent call first):
File "threading.py", line 373 in wait
File "queue.py", line 210 in get
File "xdist/dsession.py", line 154 in loop_once
File "xdist/dsession.py", line 138 in pytest_runtestloop
File "pluggy/_callers.py", line 121 in _multicall
File "_pytest/main.py", line 343 in _main
```
10 receiver threads — all identical, stuck in blocking `read()` on dead worker pipes:
```
Thread 0x000000017654b000 (most recent call first):
File "execnet/gateway_base.py", line 534 in read
File "execnet/gateway_base.py", line 567 in from_io
File "execnet/gateway_base.py", line 1160 in _thread_receiver
File "execnet/gateway_base.py", line 341 in run
File "execnet/gateway_base.py", line 411 in _perform_spawn
```
(All 10 receiver threads show the same trace: `_thread_receiver` → `from_io` → `read`.)
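For context, the receiver thread is essentially a blocking read loop. A simplified sketch of its shape (illustrative only; the names and header size here are placeholders, not execnet's actual code):

```python
# Simplified shape of a pipe receiver loop (illustrative, not execnet source).
def receiver_loop(io, dispatch, notify_shutdown):
    HEADER_SIZE = 16  # placeholder; execnet's real wire format differs
    while True:
        header = io.read(HEADER_SIZE)  # blocks until bytes arrive or EOF
        if not header:                 # EOF: peer closed the pipe
            notify_shutdown()          # would unregister the worker node
            break
        dispatch(header)               # decode and route the message
# In the hang, read() returns neither data nor EOF, so notify_shutdown()
# never fires and the node stays registered forever.
```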
macOS `sample` trace
Native profiling confirms the same picture: every sampled thread is parked in a lock or condition-variable wait:
```
lock_PyThread_acquire_lock (in libpython3.14t.dylib) + 60
_PyMutex_LockTimed (in libpython3.14t.dylib) + 880
_pthread_cond_wait (in libsystem_pthread.dylib) + 1028
__psynch_cvwait (in libsystem_kernel.dylib) + 8
```
Analysis
The worker subprocesses have finished and exited, but execnet's `_thread_receiver` threads remain blocked on `gateway_base.py:534 read()` — a blocking read from the worker's pipe that never returns EOF. Since these threads never detect worker death, they never fire the shutdown event that would remove the node from `dsession._active_nodes`. The main thread's `loop_once()` keeps retrying `queue.get(timeout=2.0)` → `Empty` → checks `_active_nodes` (still populated) → loops forever.
This is a race condition in worker cleanup: if a worker subprocess exits and closes its pipe in a way that the OS doesn't deliver EOF to the parent process's `read()` call, the receiver thread blocks indefinitely.
The `dsession.loop_once()` while-loop at line 148 cannot break out (see the sketch after this list) because:
- `_active_nodes` is never emptied (workers not unregistered)
- `queue.get(timeout=2.0)` always raises `Empty` (no events arriving)
- The `continue` restarts the loop
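In control-flow terms, the stuck loop reduces to the following (a schematic sketch under assumed names, not xdist's actual code):

```python
import queue

def run_loop(event_queue, active_nodes, handle_event):
    # Schematic of the hang: active_nodes never empties, get() always times
    # out, and the loop spins at ~0% CPU (it sleeps 2 s per iteration).
    while active_nodes:
        try:
            event = event_queue.get(timeout=2.0)
        except queue.Empty:
            continue  # no events arrive: the receiver threads are stuck
        handle_event(event)  # would remove finished workers from active_nodes
```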
Related issues
- execnet #43 — "Test process hanging forever" — the same class of bug (`waitall()` without a timeout on dead workers). Partially fixed by adding timeout support, but `_thread_receiver`'s blocking `read()` was not addressed.
- pytest-xdist #884 — "worksteal + high core counts leads to hangs" — similar symptom (hang after tests complete) but different root cause (queue replacement race in worksteal scheduler). Fixed in 3.2.1.
- pytest-xdist #1071 — "concurrent remote_exec deadlock for main_thread_only execmodel" — different deadlock in execnet's execmodel, fixed in 3.6.1.
- scikit-learn #30007 — "Upgrade free-threading CI to run with pytest-freethreaded instead of pytest-xdist" — suggests pytest-xdist is not fully compatible with free-threaded Python.
Suggested fix
The root cause is in execnet (`gateway_base.py:534`). The `_thread_receiver` loop's `read()` call blocks indefinitely when a worker pipe doesn't deliver EOF on process exit. Options:
- execnet fix: Use non-blocking or timeout-based reads in `_thread_receiver`, or poll the worker process liveness.
- xdist mitigation: In `loop_once()`, add a worker liveness check after N consecutive `Empty` timeouts; if a worker's subprocess has exited (detected via `os.waitpid` or similar) but its node is still registered, force-unregister it from `_active_nodes`. A rough sketch follows.
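A sketch of that mitigation under assumed names: `node.proc` stands in for a handle to the worker's `subprocess.Popen`, and `active_nodes` is treated as a set; neither matches xdist's real internals:

```python
import queue

EMPTY_LIMIT = 5  # consecutive empty timeouts before probing liveness (tunable)

def run_loop_with_liveness(event_queue, active_nodes, handle_event):
    empty_count = 0
    while active_nodes:
        try:
            event = event_queue.get(timeout=2.0)
        except queue.Empty:
            empty_count += 1
            if empty_count >= EMPTY_LIMIT:
                for node in list(active_nodes):
                    # Hypothetical probe: poll() returns the exit code once
                    # the worker subprocess has died, else None.
                    if node.proc.poll() is not None:
                        active_nodes.discard(node)  # force-unregister
                empty_count = 0
            continue
        empty_count = 0
        handle_event(event)
```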