Fix silent drop of return payloads under master backpressure#69197
Open
dwoz wants to merge 1 commit into
Open
Conversation
When RequestClient._send_recv pulled a (future, message) pair off the queue whose future had already been marked done — i.e. the per-message timeout fired while the entry waited behind a slow master — the pre-send `if future.done(): continue` short-circuit discarded the message without ever sending it. The minion's caller saw the same SaltReqTimeoutError it sees in 3007.x, but the master never received the frame, so _return / event_return / external job cache side effects never fired and result stores were left silently incomplete. The reproducer in TRANSPORT_BUG.md shows ~4x delivery loss in 3008.0rc3 vs 3007.14 under the same workload. The short-circuit was added during the worker-pool routing rewrite (#68532) as a load-shedding optimization. That intent is sound for genuine request/response traffic, but wrong for fire-and-await-ACK payloads like _return: the master persists the side effects from the message itself, regardless of whether the originating future is still alive. Remove the pre-send check from both _send_recv implementations. The remaining post-send `future.done()` handling and the recv-loop's `elif future.done(): break` preserve the back-pressure guarantees (POLLOUT/POLLIN poll budgets, reconnect on stale futures) — same control flow as 3007.x. Add a regression test that pre-stages 20 stale-future entries on the queue and asserts every one reaches the mocked wire. Fails on the unpatched source with 0/20 sends recorded. Also fix test_client_timeout_msg to use a guaranteed-unused local port instead of the default 4506 — the test assumes nothing is listening there, which collides with any host running a real salt-master.
twangboy
approved these changes
May 21, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
When RequestClient._send_recv pulled a (future, message) pair off the queue whose future had already been marked done — i.e. the per-message timeout fired while the entry waited behind a slow master — the pre-send
if future.done(): continueshort-circuit discarded the message without ever sending it. The minion's caller saw the same SaltReqTimeoutError it sees in 3007.x, but the master never received the frame, so _return / event_return / external job cache side effects never fired and result stores were left silently incomplete. The reproducer in TRANSPORT_BUG.md shows ~4x delivery loss in 3008.0rc3 vs 3007.14 under the same workload.The short-circuit was added during the worker-pool routing rewrite (#68532) as a load-shedding optimization. That intent is sound for genuine request/response traffic, but wrong for fire-and-await-ACK payloads like _return: the master persists the side effects from the message itself, regardless of whether the originating future is still alive. Remove the pre-send check from both _send_recv implementations. The remaining post-send
future.done()handling and the recv-loop'selif future.done(): breakpreserve the back-pressure guarantees (POLLOUT/POLLIN poll budgets, reconnect on stale futures) — same control flow as 3007.x.Add a regression test that pre-stages 20 stale-future entries on the queue and asserts every one reaches the mocked wire. Fails on the unpatched source with 0/20 sends recorded.
Also fix test_client_timeout_msg to use a guaranteed-unused local port instead of the default 4506 — the test assumes nothing is listening there, which collides with any host running a real salt-master.