Commit 9571043
Fix parallel-coordinator hang when summary pipe saturates (#248)
* Fix parallel-coordinator hang when summary pipe saturates
The coordinator drained the 'summary' queue only after joining all worker
processes. With enough queued data (or a single large testsFailed dict),
the summary-pipe buffer (~64 KiB on Linux) saturates and worker feeder
threads block in pipe_write, both inside on_timeout's join_thread() and
during Python's end-of-process queue finalization. This in turn hangs the
coordinator's p.join() indefinitely.
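To make the ~64 KiB figure above concrete, here is a minimal sketch that measures how many bytes an anonymous pipe accepts before a writer would block. It is not part of this commit; the helper name pipe_capacity is hypothetical, it assumes a POSIX platform, and the number it reports varies by OS and kernel settings.

```python
import os


def pipe_capacity(chunk=4096):
    """Return how many bytes fit into a fresh pipe before a write would block."""
    r, w = os.pipe()
    os.set_blocking(w, False)  # a full pipe now raises instead of blocking
    total = 0
    try:
        while True:
            # os.write returns the number of bytes the kernel accepted.
            total += os.write(w, b"\0" * chunk)
    except BlockingIOError:
        # The pipe is full: a blocking writer (such as a Queue feeder
        # thread) would be stuck here until a reader drains the pipe.
        pass
    finally:
        os.close(r)
        os.close(w)
    return total


if __name__ == "__main__":
    print(f"pipe accepts {pipe_capacity()} bytes before blocking")
```

Once the unread data exceeds this capacity and nobody is reading, the writer stalls, which is the state the worker feeder threads ended up in.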
Introduce a module-level helper _join_workers_with_summary_drain that
joins workers while continuously draining 'summary' from a background
thread, and use it in execute(). Also correct the stale comment in the
on_timeout closure to describe the actual watcher-thread os._exit(1)
flow.
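The join-while-draining idea described above might look roughly like the sketch below. This is not the body of _join_workers_with_summary_drain from the commit; the function names, message shape, polling interval, and the use of the "fork" start method (POSIX only) are all assumptions made to keep the example self-contained.

```python
import multiprocessing as mp
import queue
import threading

# Assumption: "fork" keeps the sketch runnable without a __main__ guard.
_CTX = mp.get_context("fork")


def _worker(summary, payload_size):
    # Hypothetical worker: ships one aggregate larger than a typical pipe buffer.
    summary.put({"done": 1, "testsFailed": {"t": "x" * payload_size}})


def join_workers_with_drain(procs, summary):
    """Join every process while a background thread keeps `summary` drained."""
    collected = []
    stop = threading.Event()

    def drain():
        # Poll with a short timeout so the thread can notice `stop`.
        while not stop.is_set():
            try:
                collected.append(summary.get(timeout=0.05))
            except queue.Empty:
                pass

    t = threading.Thread(target=drain, daemon=True)
    t.start()
    for p in procs:
        p.join()  # cannot hang on a full pipe: the drain thread is reading
    stop.set()
    t.join()
    # Collect anything that arrived after the drain thread's last poll.
    while True:
        try:
            collected.append(summary.get(timeout=0.1))
        except queue.Empty:
            break
    return collected


def run_demo(n_workers=3, payload_size=200_000):
    summary = _CTX.Queue()
    procs = [_CTX.Process(target=_worker, args=(summary, payload_size))
             for _ in range(n_workers)]
    for p in procs:
        p.start()
    return join_workers_with_drain(procs, summary)
```

Calling run_demo() should return one message per worker even though each payload exceeds the pipe buffer; joining first and draining afterwards is the ordering that can hang.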
* Use SimpleQueue for summary in parallel coordinator
SimpleQueue.put is synchronous (no feeder thread, no internal buffer),
so a successful put() implies the bytes are already in the kernel pipe.
That removes the need for the summary.close() + summary.join_thread()
dance in on_timeout before the watcher's os._exit(1), and the comment
that explained it.
The coordinator-side drain thread is updated to a blocking get() driven
by a sentinel on shutdown, eliminating its busy-loop timeout too.
results stays a Queue because the progressbar liveness loop relies on
get(timeout=...), which SimpleQueue does not expose publicly.
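A sentinel-driven drain over a SimpleQueue, as described above, could be sketched as follows. The sentinel's value, the message shape, and the function names are assumptions (only the name _SUMMARY_DRAIN_STOP appears in this commit message), and the "fork" start method is used purely to keep the example self-contained on POSIX.

```python
import multiprocessing as mp
import threading

# Assumption: "fork" keeps the sketch runnable without a __main__ guard.
_CTX = mp.get_context("fork")

# Assumed value; a plain string survives pickling through the pipe.
_SUMMARY_DRAIN_STOP = "__summary_drain_stop__"


def _worker(summary, worker_id):
    # SimpleQueue.put writes to the pipe directly: when it returns,
    # the message is already in the kernel buffer (no feeder thread).
    summary.put({"worker": worker_id, "done": 1})


def drain_until_sentinel(summary, out):
    while True:
        item = summary.get()  # blocking get, no timeout and no busy loop
        if item == _SUMMARY_DRAIN_STOP:
            return
        out.append(item)


def run_demo(n_workers=3):
    summary = _CTX.SimpleQueue()
    procs = [_CTX.Process(target=_worker, args=(summary, i))
             for i in range(n_workers)]
    for p in procs:
        p.start()
    out = []
    t = threading.Thread(target=drain_until_sentinel, args=(summary, out))
    t.start()
    for p in procs:
        p.join()
    # Every worker put has completed by now, so the sentinel is
    # guaranteed to be the last item in the pipe.
    summary.put(_SUMMARY_DRAIN_STOP)
    t.join()
    return out
```

Because the puts are synchronous, a worker that has exited has nothing left in flight, which is what lets the coordinator post the sentinel right after the joins.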
* Address review: reorder puts in on_timeout, fix stale comment
In the on_timeout closure, put 'results' before 'summary'. The summary
queue is a SimpleQueue with a synchronous put(), and the coordinator
only starts draining it after every result is in. Putting summary first
risked blocking on a full summary pipe while the coordinator was still
waiting on this worker's result, which would have stalled the whole
results-collection loop. Putting results first guarantees the worker's
output reaches the coordinator unconditionally; the subsequent summary
put may briefly block but always unblocks once the coordinator moves to
the drain phase.
Also drop the stale 'feeder threads' wording near the call site: the
summary queue no longer has a feeder thread.
* Collapse parallel results+summary into a single queue
The parallel coordinator used two queues: 'results' for per-test output
(read by the progressbar loop) and 'summary' for per-worker aggregates
('done' count and the worker's full testsFailed dict, read after the
loop). Workers' summary.put could block on a full pipe because the
coordinator only drained summary in a second phase, after every results
message had been received.
Collapse to a single 'results' queue carrying one self-contained message
per test: { test_name, output, done, failures }. The worker resets
self.testsFailed = {} before each test so addFailure() writes into a
fresh dict that ships verbatim; the worker keeps no cumulative state.
The coordinator owns the canonical testsFailed via update() per message.
This eliminates the deadlock by construction: the only queue is drained
continuously by the coordinator's progressbar loop for the entire
lifetime of the workers, so worker put()s can never block on a full
pipe. Removes the SimpleQueue import, the _SUMMARY_DRAIN_STOP sentinel,
and the _join_workers_with_summary_drain helper. on_timeout shrinks to
a single put + close + join_thread.
The unit test is updated to exercise the new pattern: workers push many
large per-test messages while the main thread drains them live.
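The single-queue pattern described above can be sketched as follows. The message keys (test_name, output, done, failures) come from this commit message; the worker body, the failure rule, and the use of the "fork" start method (POSIX only) are illustrative assumptions, not the project's actual code.

```python
import multiprocessing as mp

# Assumption: "fork" keeps the sketch runnable without a __main__ guard.
_CTX = mp.get_context("fork")


def _worker(results, test_names):
    for name in test_names:
        tests_failed = {}  # fresh dict per test; no cumulative worker state
        if name.startswith("bad"):
            # Hypothetical failure record, large enough to exceed a pipe buffer.
            tests_failed[name] = "x" * 100_000
        results.put({
            "test_name": name,
            "output": f"ran {name}",
            "done": 1,
            "failures": tests_failed,
        })


def run_demo(batches):
    results = _CTX.Queue()
    procs = [_CTX.Process(target=_worker, args=(results, batch))
             for batch in batches]
    for p in procs:
        p.start()

    expected = sum(len(batch) for batch in batches)
    done = 0
    tests_failed = {}  # the coordinator owns the canonical dict
    for _ in range(expected):
        msg = results.get(timeout=30)  # drained live while workers run
        done += msg["done"]
        tests_failed.update(msg["failures"])

    for p in procs:
        p.join()  # safe: every message has already been consumed
    return done, tests_failed
```

For example, run_demo([["ok_a", "bad_b"], ["bad_c", "ok_d"]]) counts four completed tests and records failures for bad_b and bad_c only.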
* Enhance parallel test execution: add shutdown messages and improve result handling
* Address review: ship shutdown sentinel from on_timeout; misc fixups
- on_timeout now ships both the per-test result (with shutdown=False)
and the shutdown sentinel before the watcher thread calls os._exit(1),
keeping the coordinator's bounded count of n_jobs + parallelism
accurate when a worker dies on timeout. Also fixes a KeyError on the
missing 'shutdown' key in the timeout payload.
- Grammar: 'no more processors is alive' -> 'are alive'.
- test_parallel_drain: use time.monotonic() for elapsed measurement.
1 parent b2c5fd3 commit 9571043
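The bounded count of n_jobs + parallelism mentioned in the review notes above can be illustrated with a small coordinator loop. This is a sketch under assumptions: an in-process queue.Queue stands in for the multiprocessing queue, and the helper names and message contents other than the 'shutdown' key are hypothetical.

```python
import queue


def on_timeout_messages(test_name):
    """Messages a timed-out worker ships before exiting (assumed shapes)."""
    return [
        # The per-test result carries shutdown=False, so the coordinator
        # can read msg["shutdown"] on every message without a KeyError.
        {"test_name": test_name, "output": "TIMEOUT", "shutdown": False},
        {"shutdown": True},
    ]


def collect(results, n_jobs, parallelism):
    """Read exactly one result per job plus one shutdown message per worker."""
    per_test, shutdowns = [], 0
    for _ in range(n_jobs + parallelism):
        msg = results.get(timeout=5)
        if msg["shutdown"]:
            shutdowns += 1
        else:
            per_test.append(msg["test_name"])
    return per_test, shutdowns


def demo():
    results = queue.Queue()
    # Worker 0 finishes normally: one result, then its shutdown message.
    results.put({"test_name": "t0", "output": "ok", "shutdown": False})
    results.put({"shutdown": True})
    # Worker 1 times out on t1 and ships both messages before exiting.
    for msg in on_timeout_messages("t1"):
        results.put(msg)
    return collect(results, n_jobs=2, parallelism=2)
```

If the timed-out worker skipped its shutdown message, the loop would wait for a message that never arrives, which is why on_timeout has to ship both.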
2 files changed
Lines changed: 142 additions & 39 deletions