feat(server): add native mixed-backend draft placement #246
1 issue found across 12 files
b3fc073 to 6d2b9de
howard0su left a comment
Is this remote daemon a workaround for the problem that CUDA and HIP cannot be loaded into a single process?
}

bool arch_supports_remote_draft(const std::string & arch) {
    return arch == "qwen35";
Why can't gemma4 support this?
I'll add support in a separate PR once the gemma4 and other model backend feature fixes are merged into main.
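A minimal sketch of how the single-architecture check above could grow into an allow-list once more backends gain support. Only `arch_supports_remote_draft` and the `"qwen35"` entry come from the diff; the `kRemoteDraftArchs` table is a hypothetical name introduced here for illustration.

```cpp
#include <algorithm>
#include <array>
#include <string>
#include <string_view>

// Hypothetical allow-list. Only "qwen35" is taken from the diff; further
// entries would be appended as other backends gain remote draft support.
constexpr std::array<std::string_view, 1> kRemoteDraftArchs = {"qwen35"};

bool arch_supports_remote_draft(const std::string & arch) {
    const std::string_view needle(arch);
    return std::find(kRemoteDraftArchs.begin(), kRemoteDraftArchs.end(), needle)
           != kRemoteDraftArchs.end();
}
```

With a table like this, adding an architecture becomes a one-line change, and anything not listed is still rejected explicitly.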
@@ -0,0 +1,45 @@
// Standalone DFlash draft IPC daemon entry point.
This is not the proper place for this binary's main. Put it in the src/ipc folder.
Agreed, let me move it.
" --draft-gpu <N> Draft GPU device (default: 0)\n"
" --target-device <backend:gpu> Target device (default: auto:0)\n"
" --draft-device <backend:gpu> Draft device (default: auto:0)\n"
" --draft-ipc-bin <path> Remote draft IPC daemon for mixed backends\n"
If this is IPC on a single machine, why do we need a separate process to handle it? Why do we need a daemon?
For a single backend, IPC won't be used.
            sizeof(uint16_t) * (size_t)feat_hidden);
float * dst = slice.data() + (size_t)t * feat_hidden;
for (int h = 0; h < feat_hidden; ++h) {
    dst[h] = bf16_bits_to_f32(bf16[h]);
Why convert to f32? If we really want to, use the ggml convert functions.
That's in my PR plan for the later VRAM optimization. For now we keep a unified F32 format to maximize compatibility across all hardware setups (including legacy ones like Turing/Volta/GCN).
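For context, the widening step under discussion can be sketched as below. The diff only shows the call site, so the body of `bf16_bits_to_f32` here is an assumption about what such a helper typically does, and `widen_bf16_row` is a hypothetical wrapper added for illustration. It relies on bf16 being the upper 16 bits of an IEEE-754 binary32 value.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Assumed implementation: bf16 keeps the sign, exponent, and top 7 mantissa
// bits of a binary32 float, so widening is a 16-bit left shift plus bit-cast.
inline float bf16_bits_to_f32(uint16_t bits) {
    const uint32_t wide = static_cast<uint32_t>(bits) << 16;
    float out;
    std::memcpy(&out, &wide, sizeof(out));  // bit-cast without aliasing issues
    return out;
}

// Hypothetical helper: widen one row of feat_hidden bf16 values to F32.
std::vector<float> widen_bf16_row(const uint16_t * src, size_t feat_hidden) {
    std::vector<float> dst(feat_hidden);
    for (size_t h = 0; h < feat_hidden; ++h) {
        dst[h] = bf16_bits_to_f32(src[h]);
    }
    return dst;
}
```

The conversion is lossless in this direction, which is why a unified F32 buffer is a safe lowest common denominator; the cost is twice the host memory per feature value.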
} else if (std::strncmp(argv[i], "--stream-fd=", 12) == 0) {
    stream_fd = std::atoi(argv[i] + 12);
} else if (std::strcmp(argv[i], "--stream-fd") == 0) {
    if (i + 1 < argc) stream_fd = std::atoi(argv[++i]);
Why not use mmap? A stream fd seems slow for sending big chunks of data.
The stream fd is only used for small control/status messages, not for the large feature payload. The current first-pass transport writes feature/noise tensors to temporary files and sends paths over the control channel.
But yes, I agree mmap/shared memory is the better next step for reducing host-copy and filesystem overhead. I'll add it to the follow-up VRAM optimization plan.
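The file-based handoff described in this reply can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's transport: the function names and the `FEATURES <count> <path>` message format are hypothetical, and the real control protocol is not shown in this thread.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Write a float tensor to `path` as raw bytes. Returns false on I/O failure.
bool write_tensor_file(const std::string & path, const std::vector<float> & data) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) return false;
    out.write(reinterpret_cast<const char *>(data.data()),
              static_cast<std::streamsize>(data.size() * sizeof(float)));
    out.flush();
    return static_cast<bool>(out);
}

// Read back exactly `count` floats; returns an empty vector on a short read.
std::vector<float> read_tensor_file(const std::string & path, size_t count) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return {};
    std::vector<float> data(count);
    in.read(reinterpret_cast<char *>(data.data()),
            static_cast<std::streamsize>(count * sizeof(float)));
    if (static_cast<size_t>(in.gcount()) != count * sizeof(float)) return {};
    return data;
}

// Hypothetical control message: a short text line that names the payload file.
// Only this line would travel over the stream fd, never the tensor itself.
std::string make_control_message(const std::string & path, size_t count) {
    return "FEATURES " + std::to_string(count) + " " + path + "\n";
}
```

The payload crosses the filesystem twice in this scheme (one write, one read), which is the overhead a shared-memory mapping would remove while leaving the small control message unchanged.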
Yes, they cannot be loaded into one single process, and we'll keep it that way to avoid possible crashes. The IPC daemon is only introduced for communication when mixed backends are loaded (for a single backend, the IPC path won't be loaded).
6d2b9de to f077a63
@weicj can you rebase and resolve the conflicts, please?
I've seen that Gemma4 was merged. I'll make a separate HIP runtime fix first, then update this one.
f077a63 to 6271ade
@davide221 updated. Please make sure #258 is merged before this one, once the review has passed :)
Summary
This PR adds the native C++ mixed backend foundation for draft/target placement.
It builds on the backend-device placement shape from #236 and is aligned with the dflash::common namespace rename from #241. The native server can now describe:

The lower-level rule is unchanged: one native server process still owns one compiled ggml backend. When target and draft use different backends, draft execution moves behind the existing remote draft IPC transport instead of loading both CUDA and HIP runtimes into the same process.
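The placement rule in the paragraph above can be sketched as a small decision over `<backend:gpu>` specs. Everything named here is hypothetical: `DeviceSpec` is a stand-in rather than the PR's `DevicePlacement` type, and resolution of the `auto` backend is not modelled.

```cpp
#include <climits>
#include <cstddef>
#include <cstdlib>
#include <optional>
#include <string>

struct DeviceSpec {       // hypothetical stand-in for a parsed device argument
    std::string backend;  // e.g. "cuda", "hip", or "auto"
    int gpu = 0;
};

// Parse "<backend>:<gpu>"; returns nullopt on malformed input.
std::optional<DeviceSpec> parse_device_spec(const std::string & s) {
    const size_t colon = s.find(':');
    if (colon == std::string::npos || colon == 0 || colon + 1 >= s.size()) {
        return std::nullopt;
    }
    const std::string idx = s.substr(colon + 1);
    char * end = nullptr;
    const long gpu = std::strtol(idx.c_str(), &end, 10);
    if (end == idx.c_str() || *end != '\0' || gpu < 0 || gpu > INT_MAX) {
        return std::nullopt;
    }
    return DeviceSpec{s.substr(0, colon), static_cast<int>(gpu)};
}

// One process owns one backend, so the draft has to run out of process
// whenever the two sides name different backends.
bool needs_remote_draft(const DeviceSpec & target, const DeviceSpec & draft) {
    return target.backend != draft.backend;
}
```

For example, `cuda:0` as target with `hip:1` as draft would route the draft through the IPC transport, while `cuda:0` with `cuda:1` stays in a single process.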
Problem
After #236, the native server had a clean DevicePlacement surface, but mixed target/draft placement was still rejected before backend creation.

That blocked heterogeneous setups where the target model runs on one backend/runtime and the draft model runs on another, even though the DFlash draft side already had an IPC boundary that can carry the draft execution out of process.
Changes
Added shared remote draft configuration
- RemoteDraftConfig under placement/.
- BackendArgs with remote_draft.

Added native server mixed-backend CLI plumbing
- --draft-ipc-bin.
- --draft-ipc-work-dir.
- --draft-ipc-ring-cap.
- --draft-device <backend:gpu>.

Updated server validation
- --draft <path> and --draft-ipc-bin.
- --draft-ipc-bin is rejected for same-backend placement.

Added backend capability surface
- ModelBackend::supports_remote_draft().
- arch_supports_remote_draft() for early native server validation.
- false capability and fail explicitly instead of silently falling back.

Connected the first adapter
- RemoteDraftConfig.
- DFlashDraftIpcClient instead of loading draft weights into the target process.

Added config logging
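The validation and capability items above might combine along these lines. This is a sketch under assumptions: `ServerArgs` and `validate_remote_draft` are hypothetical names, the rule that mixed placement needs both `--draft <path>` and `--draft-ipc-bin` is inferred from the list rather than stated in full, and the architecture check is inlined as a comparison against `"qwen35"`.

```cpp
#include <string>

struct ServerArgs {              // hypothetical subset of the server options
    std::string target_backend;  // e.g. "cuda"
    std::string draft_backend;   // e.g. "hip"
    std::string draft_path;      // --draft <path>
    std::string draft_ipc_bin;   // --draft-ipc-bin <path>
    std::string arch;            // model architecture name
};

// Returns an empty string when the arguments are acceptable, otherwise an
// error message. Nothing falls back silently.
std::string validate_remote_draft(const ServerArgs & a) {
    const bool mixed = a.target_backend != a.draft_backend;
    if (!mixed) {
        if (!a.draft_ipc_bin.empty()) {
            return "--draft-ipc-bin is only valid for mixed-backend placement";
        }
        return "";
    }
    // Assumption: mixed placement needs both the draft model and the daemon.
    if (a.draft_path.empty() || a.draft_ipc_bin.empty()) {
        return "mixed-backend placement requires --draft <path> and --draft-ipc-bin";
    }
    // Early capability check; only "qwen35" is listed in the diff.
    if (a.arch != "qwen35") {
        return "architecture '" + a.arch + "' does not support remote draft";
    }
    return "";
}
```

Returning a message rather than a bool keeps the failure explicit at startup, which matches the stated goal of failing instead of silently falling back.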
Notes
- placement/ directory now owns both device placement and the remote draft transport config.