Commit ad77074
fix env server deadlock (#921)
asyncio.wait_for wraps recv_multipart in a Task and cancels it on
timeout. There is a race in CPython's Task.__step: when the recv
completes (consuming data from the ZMQ buffer) but _must_cancel is
already set by the timeout, the result is silently discarded via
super().cancel() instead of super().set_result(). The message is
consumed from the socket but never processed — gone forever.
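
The effect can be reproduced with the standard library alone. The sketch below is a simplified stand-in, not the server code: a plain asyncio future plays the role of the pending recv_multipart, and an explicit task.cancel() plays the role of the timeout firing. All names in it (demo_lost_result, recv, fut) are hypothetical.

```python
import asyncio


async def demo_lost_result():
    loop = asyncio.get_running_loop()
    fut = loop.create_future()  # stands in for the in-flight recv

    async def recv():
        return await fut

    task = asyncio.ensure_future(recv())
    await asyncio.sleep(0)       # let the task start and block on fut

    fut.set_result(b"msg")       # the "recv" completes: data is consumed
    task.cancel()                # the "timeout" fires before the task resumes

    try:
        delivered = await task
    except asyncio.CancelledError:
        delivered = None         # the caller never sees the result
    # The future still holds the message, but the task was cancelled.
    return fut.result(), delivered
```

Calling asyncio.run(demo_lost_result()) shows the future holding the message while the awaiting caller only observes a cancellation, which is the same shape of loss described above.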
This caused a deadlock when training with rescheduling + validation:
the client hangs waiting for responses to requests the server silently
dropped. Observed as 2/450 messages lost in production.
Replace it with zmq.asyncio.Poller, which is non-destructive: poll only
checks socket readability without consuming any data. recv_multipart
is called only when data is guaranteed to be available, so it completes
immediately with no cancellation window.
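
A minimal sketch of the poll-then-receive pattern follows. It is an illustration under assumptions, not the code from this commit: the helper name recv_with_timeout, the demo function, and the inproc://demo endpoint are all hypothetical. The helper expects a zmq.asyncio.Socket that is already registered for zmq.POLLIN on a zmq.asyncio.Poller.

```python
import asyncio


async def recv_with_timeout(socket, poller, timeout_ms):
    """Return the next multipart message, or None if none arrives in time.

    Assumes `socket` is registered on `poller` for zmq.POLLIN only.
    """
    # poll() only reports readiness and takes nothing off the socket,
    # so a timeout here cannot lose a message.
    ready = dict(await poller.poll(timeout_ms))
    if socket not in ready:
        return None
    # Data is already queued; this recv is not wrapped in any timeout
    # that could cancel it.
    return await socket.recv_multipart()


async def demo():
    # Requires pyzmq; imported lazily so the helper has no hard dependency.
    import zmq
    import zmq.asyncio

    ctx = zmq.asyncio.Context()
    a, b = ctx.socket(zmq.PAIR), ctx.socket(zmq.PAIR)
    a.bind("inproc://demo")
    b.connect("inproc://demo")

    poller = zmq.asyncio.Poller()
    poller.register(b, zmq.POLLIN)

    print(await recv_with_timeout(b, poller, 50))    # nothing sent yet
    await a.send_multipart([b"id", b"payload"])
    print(await recv_with_timeout(b, poller, 1000))  # message is waiting

    a.close()
    b.close()
    ctx.term()

# asyncio.run(demo())  # run manually; needs pyzmq installed
```

The timeout applies only to the poll, so the one step that can expire is the step that never touches the message.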
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

1 parent 92f0ae6 commit ad77074
1 file changed
Lines changed: 15 additions & 12 deletions
@@ -32,22 +32,28 @@
@@ -64,15 +70,12 @@