
ggml-rpc: serialize send_rpc_cmd per socket to fix concurrent-send race #22530

Open

dingleberry61 wants to merge 1 commit into ggml-org:master from dingleberry61:ggml-rpc/per-socket-mutex

Conversation


dingleberry61 commented Apr 29, 2026

What this fixes

I was building a setup that uses llama.cpp's RPC backend across
networks (tunneling rpc-server through a WebSocket relay so two
machines on different ISPs can split a model's layers). I hit a bug
where weight loading over WAN would either hang or trip "Remote RPC
server crashed or returned malformed response." It took a few days
to track down.

The race only fires under WAN latency. On localhost, the OS write
buffer drains so fast that one thread is done before the next starts.
Over WAN, the buffer takes longer to drain — the window opens, and
bytes from concurrent SET_TENSOR writes interleave mid-frame. The
receiver reads what it thinks is a length field but is actually the
middle of another command's payload. Protocol broken from there on.

llama-server's tensor loader fires those concurrent SET_TENSOR calls
from multiple threads on the same cached socket. There's no
synchronization on the write side.
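
To make the interleaving concrete, here is a minimal sketch of the
write side. It assumes a length-prefixed frame along the lines of
[cmd byte][u64 size][payload]; the helper name is made up and this is
not the actual ggml-rpc.cpp code.

```cpp
// Minimal sketch of the failure mode, not the actual ggml-rpc code.
// Assumption: each command goes out as [cmd byte][u64 payload size][payload],
// sent as separate writes on the shared socket; send_exact is a made-up helper.
#include <cstdint>
#include <cstddef>
#include <sys/socket.h>

static void send_exact(int fd, const void * buf, size_t len) {
    ::send(fd, buf, len, 0);  // simplified; real code loops on partial writes
}

static void send_one_cmd(int fd, uint8_t cmd, const void * payload, uint64_t size) {
    send_exact(fd, &cmd,  sizeof(cmd));   // thread A writes its header...
    send_exact(fd, &size, sizeof(size));  // ...thread B's writes can land between
    send_exact(fd, payload, size);        //    A's calls, so the receiver counts
                                          //    off A's declared size but consumes
                                          //    B's bytes, then parses leftover
                                          //    payload bytes as the next header.
}
// On loopback the socket buffer drains fast enough between calls that two
// threads rarely overlap; under 30-80 ms WAN RTT they overlap routinely.
```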

The fix

A per-socket std::recursive_mutex keyed on socket_t *, locked at
the top of each send_rpc_cmd body. It has to be recursive because
the response variant calls the no-response variant from inside its
own locked section; a plain mutex would deadlock there.
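
A rough sketch of the approach. Names and signatures below are
illustrative, not the exact ones in ggml-rpc.cpp; the real
send_rpc_cmd overloads take the command id plus input/output buffers.

```cpp
// Sketch only: one recursive_mutex per socket, looked up by pointer identity,
// so every send on a given socket is serialized.
#include <map>
#include <mutex>

struct socket_t;  // the RPC backend's cached connection object, opaque here

static std::recursive_mutex & socket_send_mutex(socket_t * sock) {
    static std::mutex map_mutex;  // guards the map itself
    static std::map<socket_t *, std::recursive_mutex> mutexes;
    std::lock_guard<std::mutex> lock(map_mutex);
    return mutexes[sock];  // default-constructed on first use
}

// No-response variant: serialize the raw send on this socket.
static bool send_rpc_cmd_no_reply(socket_t * sock /* , cmd, payload... */) {
    std::lock_guard<std::recursive_mutex> lock(socket_send_mutex(sock));
    // ... write [cmd][size][payload] to the socket ...
    return true;
}

// Response variant: takes the same lock, then calls the no-response variant
// from inside the locked section. A plain std::mutex would deadlock on that
// nested acquisition; recursive_mutex makes the re-lock a no-op.
static bool send_rpc_cmd_with_reply(socket_t * sock /* , cmd, payload, out... */) {
    std::lock_guard<std::recursive_mutex> lock(socket_send_mutex(sock));
    if (!send_rpc_cmd_no_reply(sock)) {
        return false;
    }
    // ... read the response from the socket ...
    return true;
}
```

One trade-off with the map approach: entries are never erased, so a
mutex outlives its socket unless cleanup is added on socket
destruction, which is exactly the scoping question raised under
"Open to" below.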

Testing

Tunneled rpc-server traffic over a WebSocket relay between two Macs
on different ISPs (RTT ~30-80ms). Before: load fails within the first
few hundred MB. After: 1.2 GB Phi-3-mini weights load in 122s and
chat completes correctly.

LAN behavior unchanged.

Disclosure

I used Claude to help draft this description (which is what triggered
the AI-content flag — fair). The patch itself, the bug investigation,
and the WAN testing are mine.

Open to

If you want different scoping (cleanup of the mutex map on socket
destruction, or refactoring to avoid recursive_mutex), happy to
adjust.

When multiple threads invoke send_rpc_cmd on the same cached socket
(as llama-server's tensor loader does during model load), bytes from
their respective messages can interleave mid-frame and corrupt the
wire protocol. The race window is too narrow to fire reliably on
localhost loopback but opens predictably under WAN latency.

Add a per-socket recursive_mutex (recursive because the response
variant of send_rpc_cmd calls the no-response variant from inside
its own already-locked section) and lock it at the top of each
send_rpc_cmd body. Tested over a Cloudflare WS relay between hosts
on different networks: 1.2 GB Phi-3-mini weights now load cleanly
in 122s vs hanging or producing 'Remote RPC server crashed or
returned malformed response' errors before the patch.
@dingleberry61 dingleberry61 requested a review from a team as a code owner April 29, 2026 18:56

ggml-gh-bot Bot commented Apr 29, 2026

Hi @dingleberry61, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

dingleberry61 (Author) commented

Disclosure: I used Claude (Anthropic's coding assistant) to help me
draft the commit message and PR description. The bug investigation,
the fix, and the WAN testing are mine — but the prose is AI-polished
and that's a fair flag.

Rewriting the description below in my own voice to be clear about
what I actually did vs what was AI-assisted writing.

@github-actions Bot added the ggml label (changes relating to the ggml tensor library for machine learning) Apr 29, 2026
