feat(dflash): VRAM management improvements for C++ server by howard0su · Pull Request #245 · Luce-Org/lucebox-hub

howard0su · 2026-05-21T16:21:13Z

Two changes to reduce VRAM pressure in the C++ server, especially on 22–24 GB GPUs:

--lazy-draft: park decode draft when idle

Park the decode draft model (~3.3 GB) between requests to free VRAM for pflash compression. Mirrors the existing Python server.py --lazy-draft behavior.

Flow: startup → park draft | request → compress → free pflash drafter → unpark draft → generate → park draft

Adds --no-lazy-draft flag to disable (enabled by default).
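
The ordering in the flow above can be sketched as a small state model. This is only an illustration under stated assumptions: `VramState`, `startup_lazy`, and `handle_request_lazy` are hypothetical names, not identifiers from the PR, and booleans stand in for the real model loads and unloads.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for what is resident in VRAM; not the PR's actual types.
struct VramState {
    bool draft_resident = true;    // decode draft model (~3.3 GB per the PR description)
    bool pflash_resident = false;  // pflash drafter used for prompt compression
    std::vector<std::string> log;  // ordered trace of the steps taken

    void park_draft()   { draft_resident = false; log.push_back("park draft"); }
    void unpark_draft() { draft_resident = true;  log.push_back("unpark draft"); }
    void compress() {
        // In the lazy flow, compression runs while the decode draft is parked.
        assert(!draft_resident);
        pflash_resident = true;
        log.push_back("compress");
    }
    void free_pflash()  { pflash_resident = false; log.push_back("free pflash drafter"); }
    void generate() {
        // Generation needs the decode draft back and the pflash drafter gone.
        assert(draft_resident && !pflash_resident);
        log.push_back("generate");
    }
};

// startup -> park draft
inline void startup_lazy(VramState& s) { s.park_draft(); }

// request -> compress -> free pflash drafter -> unpark draft -> generate -> park draft
inline void handle_request_lazy(VramState& s) {
    s.compress();
    s.free_pflash();
    s.unpark_draft();
    s.generate();
    s.park_draft();
}
```

In this model the decode draft and the pflash drafter are never resident at the same time, which is the property the flow relies on to save VRAM.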

Release scratch VRAM buffers between requests

The target gallocr, LM-head projection gallocr, and BSA persistent CUDA buffers grow monotonically but never shrink. After a large-prompt request, subsequent smaller requests suffer VRAM pressure causing KV cache spill to system RAM and ~2× decode slowdown.

Adds ModelBackend::release_scratch() called after each HTTP request completes. Freed buffers are lazily recreated at the exact size needed on the next request.
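
To illustrate why a grow-only buffer hurts later small requests, here is a minimal sketch. It assumes a hypothetical `ScratchBuffer` class, with host memory standing in for the CUDA allocations; it is not the code from this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scratch buffer: grows on demand, shrinks only when released.
class ScratchBuffer {
public:
    // Grow to at least `bytes`; never shrinks on its own.
    void reserve(std::size_t bytes) {
        if (bytes > storage_.size()) storage_.resize(bytes);
    }
    // Drop the allocation entirely; the next reserve() recreates it at the needed size.
    void release() { std::vector<unsigned char>().swap(storage_); }
    std::size_t size_bytes() const { return storage_.size(); }

private:
    std::vector<unsigned char> storage_;
};

// Returns how many bytes the buffer holds while the LAST request is running.
inline std::size_t held_during_last_request(const std::vector<std::size_t>& request_sizes,
                                            bool release_between) {
    ScratchBuffer buf;
    std::size_t held = 0;
    for (std::size_t need : request_sizes) {
        buf.reserve(need);
        assert(buf.size_bytes() >= need);
        held = buf.size_bytes();
        if (release_between) buf.release();  // mirrors calling release_scratch() per request
    }
    return held;
}
```

With request sizes `{4096, 256}`, the grow-only variant still holds 4096 bytes during the second request, while the releasing variant holds only the 256 bytes that request needs.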

cubic-dev-ai

No issues found across 6 files


davide221 · 2026-05-21T17:28:08Z

On by default seems like a bit of a stretch, considering that pflash is not production-ready. What do you think?

Park the decode draft model (~3.3 GB) when idle to free VRAM for pflash compression. Before generate, free the pflash drafter and unpark the decode draft; after generate, park draft again.

Flow: startup → park draft | request → compress → free pflash drafter → unpark draft → generate → park draft

Saves ~3.3 GB VRAM on idle, enabling longer context on 22 GB GPUs.

Port of Python server.py --lazy-draft behavior to the C++ in-process server.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The target gallocr, LM-head projection gallocr, and BSA persistent CUDA buffers grow monotonically with request size but never shrink. After a large-prompt request (e.g. agent 2k tokens), subsequent smaller requests suffer VRAM pressure causing KV cache spill to system RAM and ~2x decode slowdown.

Add ModelBackend::release_scratch() called after each HTTP request completes. Qwen35Backend implementation frees:
- sg_.alloc (target graph allocator)
- proj_sg_.alloc (LM-head projection allocator)
- BSA persistent device buffers (blockmask, head_mask_type, softmax_lse)

All are lazily recreated at the exact size needed on the next request.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
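
One way to guarantee the release runs after every request, including early exits and exceptions, is a scope guard. The sketch below is an assumption about how such a hook could be wired, not the PR's implementation: `FakeQwen35Backend`, `ScratchReleaseGuard`, and `handle_http_request` are hypothetical, the `ModelBackend` shown is a simplified stand-in, and plain vectors replace the real allocators and device buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified stand-in for the backend interface named in the commit message.
struct ModelBackend {
    virtual ~ModelBackend() = default;
    virtual void release_scratch() {}  // default: backend has nothing to free
};

// Hypothetical backend; each vector models one of the three scratch areas.
struct FakeQwen35Backend : ModelBackend {
    std::vector<char> target_graph;  // stands in for sg_.alloc
    std::vector<char> lm_head_proj;  // stands in for proj_sg_.alloc
    std::vector<char> bsa_buffers;   // stands in for blockmask, head_mask_type, softmax_lse

    // Sizes are arbitrary multiples, chosen only so that scratch scales with the prompt.
    void run(std::size_t prompt_tokens) {
        assert(prompt_tokens > 0);
        target_graph.assign(prompt_tokens * 4, 0);
        lm_head_proj.assign(prompt_tokens * 2, 0);
        bsa_buffers.assign(prompt_tokens, 0);
    }
    void release_scratch() override {
        std::vector<char>().swap(target_graph);
        std::vector<char>().swap(lm_head_proj);
        std::vector<char>().swap(bsa_buffers);
    }
    std::size_t scratch_bytes() const {
        return target_graph.size() + lm_head_proj.size() + bsa_buffers.size();
    }
};

// Calls release_scratch() when the enclosing scope exits.
class ScratchReleaseGuard {
public:
    explicit ScratchReleaseGuard(ModelBackend& backend) : backend_(backend) {}
    ~ScratchReleaseGuard() { backend_.release_scratch(); }
    ScratchReleaseGuard(const ScratchReleaseGuard&) = delete;
    ScratchReleaseGuard& operator=(const ScratchReleaseGuard&) = delete;

private:
    ModelBackend& backend_;
};

// Returns the scratch bytes in use while the request was active.
inline std::size_t handle_http_request(FakeQwen35Backend& backend, std::size_t prompt_tokens) {
    ScratchReleaseGuard guard(backend);  // release happens after the return value is computed
    backend.run(prompt_tokens);
    return backend.scratch_bytes();
}
```

The default no-op in the base class lets backends without scratch state ignore the hook.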

cubic-dev-ai Bot reviewed May 21, 2026


howard0su and others added 2 commits May 22, 2026 08:32

howard0su force-pushed the lazy branch from 05c7498 to 4ab8114 Compare May 22, 2026 01:02

Make lazy-draft default to off

88d5b62

howard0su force-pushed the lazy branch from 4ab8114 to 88d5b62 Compare May 22, 2026 03:39

davide221 merged commit efb7ff0 into Luce-Org:main May 22, 2026
2 checks passed

davide221 merged 3 commits into Luce-Org:main from howard0su:lazy
