feat(server): add native mixed-backend draft placement#246

Open
weicj wants to merge 1 commit into Luce-Org:main from weicj:feat-cpp-server-mixed-backend-support-clean

@weicj weicj commented May 21, 2026

Summary

This PR adds the native C++ mixed-backend foundation for draft/target placement.

It builds on the backend-device placement shape from #236 and is aligned with the dflash::common namespace rename from #241. The native server can now describe:

dflash_server target.gguf \
  --draft draft.gguf \
  --target-device cuda:0 \
  --draft-device hip:0 \
  --draft-ipc-bin /path/to/hip/dflash_draft_ipc_daemon

The lower-level rule is unchanged: one native server process still owns one compiled ggml backend. When target and draft use different backends, draft execution moves behind the existing remote draft IPC transport instead of loading both CUDA and HIP runtimes into the same process.
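
As a rough illustration of the rule above (not the PR's actual code), the local-versus-remote decision can be sketched as a comparison of the backend part of the two `<backend:gpu>` strings. `DeviceSpec`, `parse_device`, and `needs_remote_draft` are hypothetical names introduced for this sketch, and resolution of the `auto` backend is deliberately left out.

```cpp
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical stand-in for a parsed "--target-device cuda:0" value.
struct DeviceSpec {
    std::string backend;  // e.g. "cuda" or "hip"
    int gpu = 0;
};

// Parse "<backend>:<gpu>"; returns nullopt on malformed input.
std::optional<DeviceSpec> parse_device(const std::string & s) {
    const std::size_t colon = s.find(':');
    if (colon == std::string::npos || colon == 0 || colon + 1 >= s.size()) {
        return std::nullopt;
    }
    const std::string digits = s.substr(colon + 1);
    if (digits.size() > 9) return std::nullopt;  // keep std::stoi in range
    for (char c : digits) {
        if (c < '0' || c > '9') return std::nullopt;
    }
    DeviceSpec d;
    d.backend = s.substr(0, colon);
    d.gpu = std::stoi(digits);
    return d;
}

// Assumption: draft execution goes remote exactly when the backends differ.
// Two GPUs on the same backend stay in-process.
bool needs_remote_draft(const DeviceSpec & target, const DeviceSpec & draft) {
    return target.backend != draft.backend;
}
```

With the command shown earlier, `cuda:0` versus `hip:0` would select the remote path, while `cuda:0` versus `cuda:1` would stay local.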

Problem

After #236, the native server had a clean DevicePlacement surface, but mixed target/draft placement was still rejected before backend creation.

That blocked heterogeneous setups where the target model runs on one backend/runtime and the draft model runs on another, even though the DFlash draft side already had an IPC boundary that can carry the draft execution out of process.

Changes

  1. Added shared remote draft configuration

    • Added RemoteDraftConfig under placement/.
    • Extended BackendArgs with remote_draft.
    • Kept this model-family neutral at the placement/config layer so server/factory code does not need Qwen-specific placement logic.
  2. Added native server mixed-backend CLI plumbing

    • Added --draft-ipc-bin.
    • Added --draft-ipc-work-dir.
    • Added --draft-ipc-ring-cap.
    • Kept draft GPU selection sourced from --draft-device <backend:gpu>.
  3. Updated server validation

    • Same-backend draft/target keeps the local in-process path.
    • Mixed target/draft backend now requires both --draft <path> and --draft-ipc-bin.
    • --draft-ipc-bin is rejected for same-backend placement.
    • Target layer split across backends remains rejected in this PR.
    • Unsupported model architectures return a clear remote-draft capability error before backend startup.
  4. Added backend capability surface

    • Added ModelBackend::supports_remote_draft().
    • Added arch_supports_remote_draft() for early native server validation.
    • Unsupported model families keep the default false capability and fail explicitly instead of silently falling back.
  5. Connected the first adapter

    • Wired Qwen35/Qwen36 backend config to the shared RemoteDraftConfig.
    • When remote draft is enabled, Qwen35 starts DFlashDraftIpcClient instead of loading draft weights into the target process.
    • Prefill and replay feature captures are synchronized to the remote draft process.
    • Draft park/unpark/shutdown now treat the remote draft daemon as its own lifecycle and do not free local draft weights that were never loaded.
    • Local same-backend draft execution keeps the existing feature mirror path.
  6. Added config logging

    • Server startup now prints target device, draft device, local vs remote draft execution, and remote draft IPC settings when used.
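
The validation rules listed under item 3 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `PlacementArgs` and `validate_placement` are hypothetical names, the error strings are made up, and the order of the checks is a guess. The body of `arch_supports_remote_draft` mirrors the snippet quoted in the review thread.

```cpp
#include <string>

// Hypothetical flattened view of the relevant CLI arguments.
struct PlacementArgs {
    std::string arch;            // model architecture, e.g. "qwen35"
    std::string target_backend;  // backend part of --target-device
    std::string draft_backend;   // backend part of --draft-device
    std::string draft_path;      // --draft <path>, empty if absent
    std::string draft_ipc_bin;   // --draft-ipc-bin, empty if absent
};

// Mirrors the snippet quoted in the review thread.
bool arch_supports_remote_draft(const std::string & arch) {
    return arch == "qwen35";
}

// Returns an empty string when the placement is accepted,
// otherwise a human-readable error (wording is illustrative only).
std::string validate_placement(const PlacementArgs & a) {
    const bool mixed = a.target_backend != a.draft_backend;
    if (!mixed) {
        // Same backend: local in-process path, no IPC daemon allowed.
        if (!a.draft_ipc_bin.empty()) {
            return "--draft-ipc-bin is only valid for mixed target/draft backends";
        }
        return "";
    }
    if (a.draft_path.empty()) {
        return "mixed target/draft backends require --draft <path>";
    }
    if (a.draft_ipc_bin.empty()) {
        return "mixed target/draft backends require --draft-ipc-bin";
    }
    if (!arch_supports_remote_draft(a.arch)) {
        return "architecture '" + a.arch + "' does not support remote draft";
    }
    return "";
}
```

Returning an error string before any backend is created matches the stated goal of failing early rather than silently falling back.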

Notes

  • The placement/ directory now owns both device placement and the remote draft transport config.
  • The implementation keeps backend/runtime placement generic; Qwen35/Qwen36 is only the first adapter because it already has the DFlash feature boundary.
  • Other model families, including Gemma, can opt in through the same backend capability once their adapter exposes a compatible draft/target feature boundary.
  • Local hardware validation completed end-to-end generation in both mixed directions on a Tesla P4 CUDA card and an AMD Pro VII HIP card. The target used for this smoke validation was a Q1 21B target pruned from the 27B family, with a little over 5GB of target weights; it is only used to validate the mixed-backend execution path, not as a quality or performance reference.
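
To make the opt-in idea from the notes concrete, here is a small sketch of a default-false capability hook. The struct below is a simplified stand-in for the real `ModelBackend` interface, whose exact signature is not shown in this description; `Qwen35BackendSketch`, `UnportedBackendSketch`, and `check_remote_draft` are hypothetical names.

```cpp
#include <string>

// Simplified stand-in for the backend interface (assumed shape).
struct ModelBackend {
    virtual ~ModelBackend() = default;
    // Default is false, so a model family must opt in explicitly.
    virtual bool supports_remote_draft() const { return false; }
};

// An adapter that exposes a compatible draft/target feature boundary opts in.
struct Qwen35BackendSketch : ModelBackend {
    bool supports_remote_draft() const override { return true; }
};

// A family without an adapter simply inherits the default.
struct UnportedBackendSketch : ModelBackend {};

// Returns an empty string when the request can proceed; the message text
// is illustrative only.
std::string check_remote_draft(const ModelBackend & backend, bool remote_requested) {
    if (remote_requested && !backend.supports_remote_draft()) {
        return "remote draft requested, but this model backend does not support it";
    }
    return "";
}
```

Under this shape, adding another family later would only require overriding the one hook in its adapter.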

@cubic-dev-ai cubic-dev-ai Bot left a comment

1 issue found across 12 files

Comment thread dflash/src/server/server_main.cpp
@weicj weicj force-pushed the feat-cpp-server-mixed-backend-support-clean branch from b3fc073 to 6d2b9de Compare May 21, 2026 18:31
@howard0su howard0su left a comment

Does this remote daemon work around the problem that CUDA and HIP cannot be loaded into a single process?

}

bool arch_supports_remote_draft(const std::string & arch) {
    return arch == "qwen35";

Contributor

Why can't gemma4 support this?

Contributor Author

I'll add support in a separate PR once the backend feature fixes for gemma4 and the other models are merged into main.

@@ -0,0 +1,45 @@
// Standalone DFlash draft IPC daemon entry point.

Contributor

This is not the proper place for this binary's main. Put it in the src/ipc folder.

Contributor Author

Agreed, let me move it.

" --draft-gpu <N> Draft GPU device (default: 0)\n"
" --target-device <backend:gpu> Target device (default: auto:0)\n"
" --draft-device <backend:gpu> Draft device (default: auto:0)\n"
" --draft-ipc-bin <path> Remote draft IPC daemon for mixed backends\n"

Contributor

If this is IPC on a single machine, why do we need a separate process to handle it? Why do we need a daemon?

Contributor Author

With a single backend, the IPC path is never invoked.

        sizeof(uint16_t) * (size_t)feat_hidden);
float * dst = slice.data() + (size_t)t * feat_hidden;
for (int h = 0; h < feat_hidden; ++h) {
    dst[h] = bf16_bits_to_f32(bf16[h]);

Contributor

Why convert to F32? If we really want to, use the ggml convert functions.

Contributor Author

That is on my PR plan for a later VRAM optimization pass. For now we keep a uniform F32 format to maximize compatibility across hardware setups (including legacy ones like Turing/Volta/GCN).
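
For readers unfamiliar with the conversion under discussion: BF16 is the upper 16 bits of an IEEE-754 binary32 value, so widening it to F32 is a shift plus a bit copy. The sketch below shows the usual definition; the body of `bf16_bits_to_f32` in this PR is not visible in the quoted hunk, so treat this as an assumption. `widen_bf16_row` is a hypothetical helper. ggml also provides row converters such as `ggml_bf16_to_fp32_row`, which is presumably what the reviewer has in mind.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Widen one BF16 bit pattern to F32: place the 16 bits in the high half of
// a 32-bit word and reinterpret. Lossless, since BF16 is truncated F32.
inline float bf16_bits_to_f32(uint16_t bits) {
    const uint32_t u = static_cast<uint32_t>(bits) << 16;
    float f;
    std::memcpy(&f, &u, sizeof(f));  // avoids strict-aliasing problems
    return f;
}

// Hypothetical helper: widen a whole row of BF16 values.
std::vector<float> widen_bf16_row(const uint16_t * bf16, std::size_t n) {
    std::vector<float> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = bf16_bits_to_f32(bf16[i]);
    }
    return out;
}
```

The cost of the F32 choice is that each feature value takes twice the memory and transfer size of the BF16 original, which is why it is framed here as a compatibility-first default.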

} else if (std::strncmp(argv[i], "--stream-fd=", 12) == 0) {
    stream_fd = std::atoi(argv[i] + 12);
} else if (std::strcmp(argv[i], "--stream-fd") == 0) {
    if (i + 1 < argc) stream_fd = std::atoi(argv[++i]);

Contributor

Why not use mmap? A stream fd seems slow for sending big chunks of data.

Contributor Author

The stream fd is only used for small control/status messages, not for the large feature payload. The current first-pass transport writes feature/noise tensors to temporary files and sends paths over the control channel.

But yes, I agree mmap/shared memory is the better next step for reducing host-copy and filesystem overhead. I'll add it to the follow-up VRAM optimization plan.
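
The split described in this reply (bulk tensors in temporary files, only paths on the control channel) can be sketched as below. Everything here is hypothetical: the function names, the raw-float file layout, and the `features <path> <count>` line format are invented for illustration and are not the PR's actual wire format.

```cpp
#include <cstddef>
#include <filesystem>
#include <fstream>
#include <string>
#include <vector>

// Write the bulk payload (raw floats) to a file; true on success.
bool write_payload(const std::filesystem::path & path, const std::vector<float> & data) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) return false;
    out.write(reinterpret_cast<const char *>(data.data()),
              static_cast<std::streamsize>(data.size() * sizeof(float)));
    out.close();  // flush before the peer is told about the file
    return !out.fail();
}

// Build the small control message that would travel over the stream fd.
std::string make_control_line(const std::filesystem::path & path, std::size_t n_floats) {
    return "features " + path.string() + " " + std::to_string(n_floats) + "\n";
}

// Peer side: read the payload back; returns an empty vector on a short read.
std::vector<float> read_payload(const std::filesystem::path & path, std::size_t n_floats) {
    std::vector<float> data(n_floats);
    const std::size_t n_bytes = n_floats * sizeof(float);
    std::ifstream in(path, std::ios::binary);
    in.read(reinterpret_cast<char *>(data.data()), static_cast<std::streamsize>(n_bytes));
    if (!in || static_cast<std::size_t>(in.gcount()) != n_bytes) data.clear();
    return data;
}
```

A shared-memory or mmap transport would keep the same control message but replace the file write and read with a mapping both processes can see, which is the overhead reduction mentioned above.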

weicj commented May 22, 2026

> Does this remote daemon work around the problem that CUDA and HIP cannot be loaded into a single process?

Yes, they cannot be loaded into a single process, and we'll keep it that way to avoid possible crashes. The IPC daemon is only started when mixed backends are in use (with a single backend, no IPC is loaded).

@weicj weicj force-pushed the feat-cpp-server-mixed-backend-support-clean branch from 6d2b9de to f077a63 Compare May 22, 2026 10:03
@davide221

@weicj can you rebase and resolve the conflicts, please?

weicj commented May 22, 2026

> @weicj can you rebase and resolve the conflicts, please?

I've seen that Gemma4 was merged. I'll make a separate HIP runtime fix first, then update this one.

@weicj weicj force-pushed the feat-cpp-server-mixed-backend-support-clean branch from f077a63 to 6271ade Compare May 22, 2026 14:03
@weicj weicj requested a review from howard0su May 22, 2026 14:37
weicj commented May 22, 2026

@davide221 updated. Please make sure #258 is merged before this one once review passes :)
