
feat(dflash): VRAM management improvements for C++ server #245

Merged: davide221 merged 3 commits into Luce-Org:main from howard0su:lazy on May 22, 2026


@howard0su
Contributor

Two changes to reduce VRAM pressure in the C++ server, especially on 22–24 GB GPUs:

  1. --lazy-draft: park decode draft when idle

Park the decode draft model (~3.3 GB) between requests to free VRAM for pflash compression. Mirrors the existing Python server.py --lazy-draft behavior.

Flow: startup → park draft | request → compress → free pflash drafter → unpark draft → generate → park draft

Lazy parking is enabled by default; this PR adds a --no-lazy-draft flag to disable it.
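The flow above can be sketched as a small state machine. This is only an illustration, not the server's actual code: the class `LazyDraftSim`, its methods, and the 2000 MB pflash drafter size used in the usage note are all hypothetical, and VRAM is modeled as plain integers in MB rather than real device allocations.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model of the lazy-draft lifecycle (not the server's real types).
// VRAM is tracked as integer MB so the arithmetic stays exact.
class LazyDraftSim {
public:
    LazyDraftSim(std::size_t draft_mb, bool lazy) : draft_mb_(draft_mb), lazy_(lazy) {
        // Startup: the draft is loaded, then parked immediately in lazy mode.
        log_.push_back("startup");
        if (lazy_) park();
    }

    // One request: compress -> free pflash drafter -> unpark -> generate -> park.
    // Returns the peak VRAM (MB) held by the two draft models during the request.
    std::size_t handle_request(bool use_pflash, std::size_t pflash_drafter_mb) {
        std::size_t peak = vram_in_use_mb();
        if (use_pflash) {
            pflash_mb_ = pflash_drafter_mb;  // pflash drafter is resident while compressing
            log_.push_back("compress");
            peak = std::max(peak, vram_in_use_mb());
            pflash_mb_ = 0;                  // freed before the decode draft comes back
            log_.push_back("free pflash drafter");
        }
        if (lazy_) unpark();
        peak = std::max(peak, vram_in_use_mb());
        log_.push_back("generate");
        if (lazy_) park();
        return peak;
    }

    bool draft_resident() const { return resident_; }
    std::size_t vram_in_use_mb() const { return (resident_ ? draft_mb_ : 0) + pflash_mb_; }
    const std::vector<std::string>& log() const { return log_; }

private:
    void park()   { assert(resident_);  resident_ = false; log_.push_back("park draft"); }
    void unpark() { assert(!resident_); resident_ = true;  log_.push_back("unpark draft"); }

    std::size_t draft_mb_;
    bool lazy_;
    std::size_t pflash_mb_ = 0;
    bool resident_ = true;
    std::vector<std::string> log_;
};
```

In this model, with a 3300 MB draft and an assumed 2000 MB pflash drafter, lazy mode never holds both models at once, while the non-lazy path holds both during compression.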

  2. Release scratch VRAM buffers between requests

The target gallocr, LM-head projection gallocr, and BSA persistent CUDA buffers grow monotonically with request size and never shrink. After a large-prompt request, subsequent smaller requests run under VRAM pressure, which causes the KV cache to spill to system RAM and slows decode by ~2×.

Adds ModelBackend::release_scratch() called after each HTTP request completes. Freed buffers are lazily recreated at the exact size needed on the next request.
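The difference between grow-only buffers and release-after-request can be illustrated with a minimal sketch. It assumes a hypothetical `ScratchBuffer` class backed by host memory; the real buffers are ggml graph allocators and CUDA device buffers, whose APIs are not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical stand-in for a grow-only scratch allocation. Host memory is
// used here for portability; the buffers in the PR live in CUDA device memory.
class ScratchBuffer {
public:
    // Grow-only policy: reallocate only when more space is needed than is held.
    void ensure(std::size_t bytes) {
        if (bytes > capacity_) {
            data_ = std::make_unique<unsigned char[]>(bytes);
            capacity_ = bytes;
        }
    }

    // Drop the allocation; the next ensure() recreates it at the requested size.
    void release() {
        data_.reset();
        capacity_ = 0;
    }

    std::size_t capacity() const { return capacity_; }

private:
    std::unique_ptr<unsigned char[]> data_;
    std::size_t capacity_ = 0;
};

// Serves one request and returns the capacity held while it ran.
// release_after selects between the old (false) and new (true) behavior.
std::size_t serve_request(ScratchBuffer& buf, std::size_t needed, bool release_after) {
    buf.ensure(needed);
    const std::size_t held = buf.capacity();
    if (release_after) buf.release();
    return held;
}
```

Without the release, a small request that follows a large one still holds the large request's high-water mark; with it, each request holds only what it asked for.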

Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment


No issues found across 6 files


@davide221
Contributor

Enabling this by default seems like a bit of a stretch, considering that pflash is not production-ready. What do you think?

howard0su and others added 2 commits May 22, 2026 08:32
Park the decode draft model (~3.3 GB) when idle to free VRAM for pflash
compression. Before generate, free the pflash drafter and unpark the decode
draft; after generate, park draft again.

Flow: startup → park draft | request → compress → free pflash drafter →
unpark draft → generate → park draft

Saves ~3.3 GB VRAM on idle, enabling longer context on 22 GB GPUs.
Port of Python server.py --lazy-draft behavior to the C++ in-process server.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The target gallocr, LM-head projection gallocr, and BSA persistent
CUDA buffers grow monotonically with request size but never shrink.
After a large-prompt request (e.g. agent 2k tokens), subsequent
smaller requests suffer VRAM pressure causing KV cache spill to
system RAM and ~2x decode slowdown.

Add ModelBackend::release_scratch() called after each HTTP request
completes. Qwen35Backend implementation frees:
- sg_.alloc (target graph allocator)
- proj_sg_.alloc (LM-head projection allocator)
- BSA persistent device buffers (blockmask, head_mask_type, softmax_lse)

All are lazily recreated at the exact size needed on the next request.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
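
One way to guarantee that the release runs after every request, including requests that fail, is a scope guard around the handler. The sketch below is an assumption about how such a hook could be wired, not the merged implementation: `BackendSketch`, `ToyBackend`, `ScratchReleaseGuard`, `handle_http_request`, and the per-token byte counts are all hypothetical, and only the `release_scratch()` name comes from the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical backend interface; release_scratch() defaults to a no-op so
// backends without scratch buffers need no changes.
struct BackendSketch {
    virtual ~BackendSketch() = default;
    virtual void generate(std::size_t prompt_tokens) = 0;
    virtual void release_scratch() {}
};

// Toy backend with three scratch areas, loosely mirroring the three items
// listed in the commit message. Sizes are made-up per-token byte counts.
class ToyBackend : public BackendSketch {
public:
    void generate(std::size_t prompt_tokens) override {
        if (prompt_tokens == 0) throw std::invalid_argument("empty prompt");
        // Lazily (re)create each area at the size this request needs.
        graph_bytes_ = prompt_tokens * 64;
        proj_bytes_  = prompt_tokens * 16;
        bsa_bytes_   = prompt_tokens * 4;
    }

    void release_scratch() override {
        graph_bytes_ = proj_bytes_ = bsa_bytes_ = 0;
        ++releases_;
    }

    std::size_t scratch_bytes() const { return graph_bytes_ + proj_bytes_ + bsa_bytes_; }
    int releases() const { return releases_; }

private:
    std::size_t graph_bytes_ = 0, proj_bytes_ = 0, bsa_bytes_ = 0;
    int releases_ = 0;
};

// Scope guard: calls release_scratch() when the request scope exits.
class ScratchReleaseGuard {
public:
    explicit ScratchReleaseGuard(BackendSketch& backend) : backend_(backend) {}
    ~ScratchReleaseGuard() { backend_.release_scratch(); }
    ScratchReleaseGuard(const ScratchReleaseGuard&) = delete;
    ScratchReleaseGuard& operator=(const ScratchReleaseGuard&) = delete;

private:
    BackendSketch& backend_;
};

// Returns true on success; scratch is released on both success and failure.
bool handle_http_request(BackendSketch& backend, std::size_t prompt_tokens) {
    ScratchReleaseGuard guard(backend);
    try {
        backend.generate(prompt_tokens);
        return true;
    } catch (const std::exception&) {
        return false;
    }
}
```

The guard keeps the release next to the request scope rather than at every return path, which matters most when generation can throw.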
@davide221 davide221 merged commit efb7ff0 into Luce-Org:main May 22, 2026
2 checks passed