ggml-rpc: serialize send_rpc_cmd per socket to fix concurrent-send race #22530
Open
dingleberry61 wants to merge 1 commit into ggml-org:master from
Conversation
When multiple threads invoke send_rpc_cmd on the same cached socket (as llama-server's tensor loader does during model load), bytes from their respective messages can interleave mid-frame and corrupt the wire protocol. The race window is too narrow to fire reliably on localhost loopback but opens predictably under WAN latency.

Add a per-socket recursive_mutex (recursive because the response variant of send_rpc_cmd calls the no-response variant from inside its own already-locked section) and lock it at the top of each send_rpc_cmd body.

Tested over a Cloudflare WS relay between hosts on different networks: 1.2 GB of Phi-3-mini weights now load cleanly in 122s, versus hanging or producing 'Remote RPC server crashed or returned malformed response' errors before the patch.
Hi @dingleberry61, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
Author
Disclosure: I used Claude (Anthropic's coding assistant) to help me draft this PR. I'm rewriting the description below in my own voice to be clear about what's mine.
What this fixes
I was building a thing that uses llama.cpp's RPC backend across
networks (tunneling rpc-server through a WebSocket relay so two
machines on different ISPs can split a model's layers). Hit a bug
where weight loading over WAN would either hang or trip "Remote RPC
server crashed or returned malformed response." Took a few days to
find.
The race only fires under WAN latency. On localhost, the OS write
buffer drains so fast that one thread is done before the next starts.
Over WAN, the buffer takes longer to drain — the window opens, and
bytes from concurrent SET_TENSOR writes interleave mid-frame. The
receiver reads what it thinks is a length field but is actually the
middle of another command's payload. Protocol broken from there on.
llama-server's tensor loader fires those concurrent SET_TENSOR calls
from multiple threads on the same cached socket. There's no
synchronization on the write side.
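To make the failure mode concrete, here's a minimal sketch of an unsynchronized framed write path. send_frame and the exact frame layout are illustrative, not the real ggml-rpc symbols, but the shape (command byte, length field, payload, written as separate send() calls) matches the problem described above:

```cpp
// Hypothetical helper mirroring the shape of the vulnerable write path
// (POSIX sockets; send_frame is not a real ggml-rpc function).
#include <cstdint>
#include <sys/socket.h>
#include <sys/types.h>

static bool send_frame(int fd, uint8_t cmd, const void * payload, uint64_t size) {
    // Frame = command byte, 64-bit length, payload: three separate sends.
    if (send(fd, &cmd,  sizeof(cmd),  0) != (ssize_t) sizeof(cmd))  return false;
    if (send(fd, &size, sizeof(size), 0) != (ssize_t) sizeof(size)) return false;
    // <-- race window: with no lock, another thread's send_frame() can
    //     inject its own bytes here. Under WAN latency the kernel buffer
    //     backs up and the window stays open much longer.
    const uint8_t * p = static_cast<const uint8_t *>(payload);
    uint64_t sent = 0;
    while (sent < size) {
        ssize_t n = send(fd, p + sent, size - sent, 0);
        if (n <= 0) return false;     // peer gone or fatal error
        sent += (uint64_t) n;         // partial write: loop continues,
                                      // widening the interleave window
    }
    return true;
}
```

If thread B's command byte lands between thread A's length field and payload, the receiver parses A's payload bytes as B's frame header, which is exactly the malformed-response symptom above.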
The fix
A per-socket std::recursive_mutex keyed on socket_t *, locked at the top of
each send_rpc_cmd body. Recursive because the response variant calls the
no-response variant from inside its own locked section; a non-recursive
mutex would deadlock there.
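Roughly, the scheme looks like this. A sketch only: socket_t stands in for the ggml-rpc socket wrapper and socket_send_mutex is an illustrative name, not necessarily what the patch calls it:

```cpp
#include <map>
#include <mutex>

struct socket_t;  // opaque here

// One recursive_mutex per live socket, keyed on the socket pointer.
static std::recursive_mutex & socket_send_mutex(socket_t * sock) {
    static std::mutex map_mutex;  // guards the map itself, not the sockets
    static std::map<socket_t *, std::recursive_mutex> mutexes;
    std::lock_guard<std::mutex> lock(map_mutex);
    // operator[] default-constructs the mutex on first use; std::map
    // never invalidates references to existing elements on insert.
    return mutexes[sock];
}

// Then at the top of each send_rpc_cmd body:
//   std::lock_guard<std::recursive_mutex> lock(socket_send_mutex(sock));
// The with-response variant can re-lock while calling the no-response
// variant because the mutex is recursive.
```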
Testing
Tunneled rpc-server traffic over a WebSocket relay between two Macs
on different ISPs (RTT ~30-80ms). Before: load fails within the first
few hundred MB. After: 1.2 GB Phi-3-mini weights load in 122s and
chat completes correctly.
LAN behavior unchanged.
Disclosure
I used Claude to help draft this description (which is what triggered
the AI-content flag — fair). The patch itself, the bug investigation,
and the WAN testing are mine.
Open to
If you want different scoping (cleanup of the mutex map on socket
destruction, or refactoring to avoid recursive_mutex), happy to
adjust.
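For the recursive_mutex-free option, one shape that would work is splitting each variant into a locked public wrapper over unlocked internals, along these lines (all names and signatures here are hypothetical, and the stubs stand in for the real wire I/O):

```cpp
#include <cstddef>
#include <cstdint>
#include <mutex>

struct socket_t;

static std::mutex send_mutex;  // per-socket in practice; one shown for brevity

// Wire I/O bodies elided; stubs keep the sketch self-contained.
static bool send_rpc_cmd_unlocked(socket_t *, uint8_t, const void *, size_t) {
    return true;  // real version writes cmd byte, length field, payload
}
static bool recv_rpc_response_unlocked(socket_t *, void *, size_t) {
    return true;  // real version reads the reply frame
}

// No-response variant: take the lock, send, done.
static bool send_rpc_cmd(socket_t * sock, uint8_t cmd,
                         const void * input, size_t input_size) {
    std::lock_guard<std::mutex> lock(send_mutex);
    return send_rpc_cmd_unlocked(sock, cmd, input, input_size);
}

// With-response variant: takes the lock exactly once and calls only
// unlocked internals, so there is no re-entry and a plain std::mutex
// suffices.
static bool send_rpc_cmd(socket_t * sock, uint8_t cmd,
                         const void * input, size_t input_size,
                         void * output, size_t output_size) {
    std::lock_guard<std::mutex> lock(send_mutex);
    return send_rpc_cmd_unlocked(sock, cmd, input, input_size) &&
           recv_rpc_response_unlocked(sock, output, output_size);
}
```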