feat(server): add native mixed-backend draft placement by weicj · Pull Request #246 · Luce-Org/lucebox-hub

weicj · 2026-05-21T18:20:12Z

Summary

This PR adds the native C++ mixed backend foundation for draft/target placement.

It builds on the backend-device placement shape from #236 and is aligned with the dflash::common namespace rename from #241. The native server can now describe:

dflash_server target.gguf \
  --draft draft.gguf \
  --target-device cuda:0 \
  --draft-device hip:0 \
  --draft-ipc-bin /path/to/hip/dflash_draft_ipc_daemon

The lower-level rule is unchanged: one native server process still owns one compiled ggml backend. When target and draft use different backends, draft execution moves behind the existing remote draft IPC transport instead of loading both CUDA and HIP runtimes into the same process.
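The placement rule above can be sketched as a small decision helper. This is a minimal illustration and not the PR's code: `DeviceSpec`, `parse_device_spec`, and `needs_remote_draft` are hypothetical names, and comparing backend names as plain strings (including `auto`) is an assumption made for the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical parsed form of a "<backend:gpu>" argument such as "cuda:0".
struct DeviceSpec {
    std::string backend;  // e.g. "cuda", "hip", "auto"
    int gpu;
};

// Parse "<backend>:<gpu>" into `out`. Returns false on malformed input.
bool parse_device_spec(const std::string & text, DeviceSpec & out) {
    const std::size_t colon = text.find(':');
    if (colon == std::string::npos || colon == 0 || colon + 1 >= text.size()) {
        return false;  // missing backend name or missing GPU index
    }
    const std::string index = text.substr(colon + 1);
    if (index.size() > 9) return false;  // keep std::stoi within int range
    for (char c : index) {
        if (c < '0' || c > '9') return false;  // digits only
    }
    out.backend = text.substr(0, colon);
    out.gpu = std::stoi(index);
    return true;
}

// Assumed rule: draft execution goes behind IPC only when the backends differ.
bool needs_remote_draft(const DeviceSpec & target, const DeviceSpec & draft) {
    assert(!target.backend.empty() && !draft.backend.empty());
    return target.backend != draft.backend;
}
```

With this sketch, `cuda:0` paired with `hip:0` selects the remote draft path, while `cuda:0` paired with `cuda:1` stays in-process.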

Problem

After #236, the native server had a clean DevicePlacement surface, but mixed target/draft placement was still rejected before backend creation.

That blocked heterogeneous setups where the target model runs on one backend/runtime and the draft model runs on another, even though the DFlash draft side already had an IPC boundary that can carry the draft execution out of process.

Changes

Added shared remote draft configuration
- Added RemoteDraftConfig under placement/.
- Extended BackendArgs with remote_draft.
- Kept this model-family neutral at the placement/config layer so server/factory code does not need Qwen-specific placement logic.
Added native server mixed-backend CLI plumbing
- Added --draft-ipc-bin.
- Added --draft-ipc-work-dir.
- Added --draft-ipc-ring-cap.
- Kept draft GPU selection sourced from --draft-device <backend:gpu>.
Updated server validation
- Same-backend draft/target keeps the local in-process path.
- Mixed target/draft backend now requires both --draft <path> and --draft-ipc-bin.
- --draft-ipc-bin is rejected for same-backend placement.
- Target layer split across backends remains rejected in this PR.
- Unsupported model architectures return a clear remote-draft capability error before backend startup.
Added backend capability surface
- Added ModelBackend::supports_remote_draft().
- Added arch_supports_remote_draft() for early native server validation.
- Unsupported model families keep the default false capability and fail explicitly instead of silently falling back.
Connected the first adapter
- Wired Qwen35/Qwen36 backend config to the shared RemoteDraftConfig.
- When remote draft is enabled, Qwen35 starts DFlashDraftIpcClient instead of loading draft weights into the target process.
- Prefill and replay feature captures are synchronized to the remote draft process.
- Draft park/unpark/shutdown now treat the remote draft daemon as its own lifecycle and do not free local draft weights that were never loaded.
- Local same-backend draft execution keeps the existing feature mirror path.
Added config logging
- Server startup now prints target device, draft device, local vs remote draft execution, and remote draft IPC settings when used.
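
The validation rules listed under "Updated server validation" could look roughly like the following. This is a hedged sketch, not the code in this PR: `PlacementArgs` and `validate_placement` are hypothetical, the error strings are invented for illustration, and `arch_supports_remote_draft` only mirrors the one-line body quoted in the review thread.

```cpp
#include <cassert>
#include <string>

// Hypothetical flattened view of the CLI options named in this PR.
struct PlacementArgs {
    std::string target_backend;  // backend part of --target-device
    std::string draft_backend;   // backend part of --draft-device
    std::string draft_path;      // --draft <path>, empty if not given
    std::string draft_ipc_bin;   // --draft-ipc-bin, empty if not given
    std::string arch;            // model architecture name
};

// Mirrors the snippet quoted in review: only "qwen35" opts in for now.
bool arch_supports_remote_draft(const std::string & arch) {
    return arch == "qwen35";
}

// Returns an empty string when the placement is accepted,
// otherwise an illustrative error message.
std::string validate_placement(const PlacementArgs & a) {
    const bool mixed = a.target_backend != a.draft_backend;
    if (!mixed) {
        // Same backend: keep the local in-process path, reject the IPC flag.
        if (!a.draft_ipc_bin.empty()) {
            return "--draft-ipc-bin is only valid for mixed target/draft backends";
        }
        return "";
    }
    // Mixed backends: both the draft model and the daemon binary are required.
    if (a.draft_path.empty()) {
        return "mixed target/draft backends require --draft <path>";
    }
    if (a.draft_ipc_bin.empty()) {
        return "mixed target/draft backends require --draft-ipc-bin";
    }
    // Fail before backend startup instead of silently falling back.
    if (!arch_supports_remote_draft(a.arch)) {
        return "architecture '" + a.arch + "' does not support remote draft";
    }
    return "";
}
```

Returning a message rather than a boolean keeps the sketch close to the stated goal of a clear capability error before any backend is created.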

Notes

The placement/ directory now owns both device placement and the remote draft transport config.
The implementation keeps backend/runtime placement generic; Qwen35/Qwen36 is only the first adapter because it already has the DFlash feature boundary.
Other model families, including Gemma, can opt in through the same backend capability once their adapter exposes a compatible draft/target feature boundary.
Local hardware validation completed end-to-end generation in both mixed directions on a Tesla P4 CUDA card and an AMD Pro VII HIP card. The target used for this smoke validation was a Q1 21B target pruned from the 27B family, with a little over 5GB of target weights; it is only used to validate the mixed-backend execution path, not as a quality or performance reference.

cubic-dev-ai

1 issue found across 12 files

howard0su

Is this remote daemon a workaround for the problem that CUDA and HIP cannot be loaded into a single process?

howard0su · 2026-05-21T23:47:59Z

 }

+bool arch_supports_remote_draft(const std::string & arch) {
+    return arch == "qwen35";


Why can't gemma4 support this?

I'll add support in a separate PR once the gemma4 and other model backend features are fixed and merged into main.

howard0su · 2026-05-21T23:50:03Z

@@ -0,0 +1,45 @@
+// Standalone DFlash draft IPC daemon entry point.


This is not the proper place for this binary's main. Put it in the src/ipc folder.

Agreed, let me move it.

howard0su · 2026-05-21T23:53:44Z

-        "  --draft-gpu <N>      Draft GPU device (default: 0)\n"
+        "  --target-device <backend:gpu>  Target device (default: auto:0)\n"
+        "  --draft-device <backend:gpu>   Draft device (default: auto:0)\n"
+        "  --draft-ipc-bin <path>         Remote draft IPC daemon for mixed backends\n"


If this is IPC on a single machine, why do we need a process to handle IPC? Why do we need a daemon?

With a single backend, the IPC path is never invoked.

howard0su · 2026-05-21T23:55:16Z

+                                    sizeof(uint16_t) * (size_t)feat_hidden);
+            float * dst = slice.data() + (size_t)t * feat_hidden;
+            for (int h = 0; h < feat_hidden; ++h) {
+                dst[h] = bf16_bits_to_f32(bf16[h]);


Why convert to F32? If we really want to, use the ggml conversion functions.

It's on my PR plan for a later VRAM optimization pass. For now we keep a unified F32 path to maximize compatibility across hardware setups (including legacy ones like Turing/Volta/GCN).
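
For reference, the widening in the quoted loop can be sketched as below. The diff shows the name `bf16_bits_to_f32` but not its body, so this body is an assumption based on the standard layout: a bf16 value is the upper 16 bits of an IEEE-754 binary32 float, so widening is a 16-bit left shift. `widen_bf16_row` is a hypothetical helper added only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Assumed implementation: place the 16 bf16 bits in the high half of a
// 32-bit word and reinterpret that word as a float.
float bf16_bits_to_f32(uint16_t bits) {
    const uint32_t wide = static_cast<uint32_t>(bits) << 16;
    float out;
    std::memcpy(&out, &wide, sizeof(out));  // bit copy avoids aliasing issues
    return out;
}

// Hypothetical row helper: widen `n` bf16 values into a new F32 buffer.
std::vector<float> widen_bf16_row(const uint16_t * src, std::size_t n) {
    assert(src != nullptr || n == 0);
    std::vector<float> dst(n);
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = bf16_bits_to_f32(src[i]);
    }
    return dst;
}
```

The conversion is exact because every bf16 value is representable in F32; the cost is the doubled buffer size that the follow-up VRAM work is meant to address.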

howard0su · 2026-05-21T23:57:34Z

+        } else if (std::strncmp(argv[i], "--stream-fd=", 12) == 0) {
+            stream_fd = std::atoi(argv[i] + 12);
+        } else if (std::strcmp(argv[i], "--stream-fd") == 0) {
+            if (i + 1 < argc) stream_fd = std::atoi(argv[++i]);


Why not use mmap? A stream fd seems slow for sending big chunks of data.

The stream fd is only used for small control/status messages, not for the large feature payload. The current first-pass transport writes feature/noise tensors to temporary files and sends paths over the control channel.

But yes, I agree mmap/shared memory is the better next step for reducing host-copy and filesystem overhead. I'll add it to the follow-up VRAM optimization plan.
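
To make the split between control channel and payload concrete, here is a minimal sketch of the pattern described above: the tensor goes to a file and only a short message naming that file crosses the control channel. Everything in it is an assumption for illustration, including the `FEATURES <count> <path>` message format and the function names; the real wire format is not shown in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Sender side: write raw F32 values to `path`. Returns false on I/O failure.
bool write_payload(const std::string & path, const std::vector<float> & values) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) return false;
    out.write(reinterpret_cast<const char *>(values.data()),
              static_cast<std::streamsize>(values.size() * sizeof(float)));
    out.flush();
    return out.good();
}

// Hypothetical one-line control message: "FEATURES <count> <path>\n".
// The sketch does not support paths containing spaces.
std::string make_control_message(const std::string & path, std::size_t count) {
    assert(path.find(' ') == std::string::npos);
    std::ostringstream msg;
    msg << "FEATURES " << count << ' ' << path << '\n';
    return msg.str();
}

// Receiver side: parse the message, then load the payload it points at.
bool read_payload(const std::string & message, std::vector<float> & values) {
    std::istringstream in(message);
    std::string tag, path;
    std::size_t count = 0;
    if (!(in >> tag >> count >> path) || tag != "FEATURES") return false;
    std::ifstream file(path, std::ios::binary);
    if (!file) return false;
    values.assign(count, 0.0f);
    const std::streamsize bytes = static_cast<std::streamsize>(count * sizeof(float));
    file.read(reinterpret_cast<char *>(values.data()), bytes);
    return file.gcount() == bytes;  // reject short files
}

// Remove the temporary file once the receiver has consumed it.
void discard_payload(const std::string & path) {
    std::remove(path.c_str());
}
```

The control message stays a few dozen bytes regardless of tensor size, which is why the stream fd is not the bottleneck here; the file write and read are, and that is the part mmap or shared memory would replace.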

weicj · 2026-05-22T09:07:36Z

Is this remote daemon a workaround for the problem that CUDA and HIP cannot be loaded into a single process?

Yes, they cannot be loaded into a single process, and we'll keep it that way to avoid possible crashes. The IPC daemon is only started when mixed backends are in use; with a single backend, IPC is never loaded.

davide221 · 2026-05-22T12:19:43Z

@weicj can you rebase and solve conflicts please

weicj · 2026-05-22T12:36:58Z

@weicj can you rebase and solve conflicts please

I see Gemma4 has been merged. I'll make a separate HIP runtime fix first, then update this one.

weicj · 2026-05-22T14:41:12Z

@davide221 updated. Please make sure #258 is merged before this one once review passes :)

cubic-dev-ai Bot reviewed May 21, 2026

Comment thread dflash/src/server/server_main.cpp

weicj force-pushed the feat-cpp-server-mixed-backend-support-clean branch from b3fc073 to 6d2b9de Compare May 21, 2026 18:31

howard0su suggested changes May 21, 2026

weicj force-pushed the feat-cpp-server-mixed-backend-support-clean branch from 6d2b9de to f077a63 Compare May 22, 2026 10:03

feat(server): add native mixed-backend draft placement

6271ade

weicj force-pushed the feat-cpp-server-mixed-backend-support-clean branch from f077a63 to 6271ade Compare May 22, 2026 14:03

weicj requested a review from howard0su May 22, 2026 14:37
