feat(dflash): VRAM management improvements for C++ server #245
Merged
Contributor
On by default seems a bit of a stretch considering that pflash is not production ready. What do you think?
Park the decode draft model (~3.3 GB) when idle to free VRAM for pflash compression. Before generate, free the pflash drafter and unpark the decode draft; after generate, park the draft again.

Flow: startup → park draft | request → compress → free pflash drafter → unpark draft → generate → park draft

Saves ~3.3 GB VRAM on idle, enabling longer context on 22 GB GPUs. Port of Python server.py --lazy-draft behavior to the C++ in-process server.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The target gallocr, LM-head projection gallocr, and BSA persistent CUDA buffers grow monotonically with request size but never shrink. After a large-prompt request (e.g. agent 2k tokens), subsequent smaller requests suffer VRAM pressure causing KV cache spill to system RAM and ~2x decode slowdown.

Add ModelBackend::release_scratch() called after each HTTP request completes. Qwen35Backend implementation frees:
- sg_.alloc (target graph allocator)
- proj_sg_.alloc (LM-head projection allocator)
- BSA persistent device buffers (blockmask, head_mask_type, softmax_lse)

All are lazily recreated at the exact size needed on the next request.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two changes to reduce VRAM pressure in the C++ server, especially on 22–24 GB GPUs:
Park the decode draft model (~3.3 GB) between requests to free VRAM for pflash compression. Mirrors the existing Python server.py --lazy-draft behavior.
Flow: startup → park draft | request → compress → free pflash drafter → unpark draft → generate → park draft
Adds a --no-lazy-draft flag to disable this behavior, which is enabled by default.
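The park/unpark flow above can be sketched as a small state machine. This is only an illustration, not the PR's code: `ParkableModel`, `LazyDraftServer`, and the `trace` log are hypothetical names, and residency is modeled with a boolean instead of real device allocations. What happens to the pflash drafter when lazy draft is disabled is not stated in the PR, so the sketch assumes it is freed either way.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a model whose weights can be moved off the GPU.
struct ParkableModel {
    bool resident = false;
    void park()   { resident = false; }  // real code would free device buffers here
    void unpark() { resident = true;  }  // real code would reload weights here
};

// Hypothetical server state; names are illustrative, not the PR's types.
struct LazyDraftServer {
    bool lazy_draft = true;              // false models --no-lazy-draft
    ParkableModel decode_draft;          // ~3.3 GB according to the PR description
    ParkableModel pflash_drafter;
    std::vector<std::string> trace;      // records the order of steps

    void startup() {
        decode_draft.unpark();
        trace.push_back("load draft");
        if (lazy_draft) {
            decode_draft.park();
            trace.push_back("park draft");
        }
    }

    void handle_request() {
        // Compression needs the pflash drafter in VRAM.
        pflash_drafter.unpark();
        trace.push_back("compress");

        // Assumption: the drafter is freed regardless of lazy_draft.
        pflash_drafter.park();
        trace.push_back("free pflash drafter");

        if (lazy_draft) {
            decode_draft.unpark();
            trace.push_back("unpark draft");
        }
        trace.push_back("generate");
        if (lazy_draft) {
            decode_draft.park();
            trace.push_back("park draft");
        }
    }
};
```

With `lazy_draft` left at its default, the decode draft is resident only between "unpark draft" and the final "park draft", so it never shares VRAM with the pflash drafter in this model.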
The target gallocr, LM-head projection gallocr, and BSA persistent CUDA buffers grow monotonically with request size and never shrink. After a large-prompt request, subsequent smaller requests suffer VRAM pressure, which causes the KV cache to spill to system RAM and a ~2× decode slowdown.
Adds ModelBackend::release_scratch(), called after each HTTP request completes. Freed buffers are lazily recreated at the exact size needed on the next request.
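The grow-only problem and the release-after-request fix can be illustrated with a toy buffer. This is a minimal sketch under stated assumptions: `ScratchBuffer`, `ToyBackend`, `handle_http_request`, and `kBytesPerToken` are hypothetical, host memory stands in for the CUDA allocations, and only the method name `release_scratch()` is taken from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical sizing constant, for illustration only.
constexpr std::size_t kBytesPerToken = 1024;

// Grow-only scratch buffer: reallocates when too small, otherwise keeps its size.
class ScratchBuffer {
public:
    void reserve(std::size_t bytes) {
        if (bytes > capacity_) {
            // Host allocation standing in for a device allocation.
            storage_ = std::make_unique<unsigned char[]>(bytes);
            capacity_ = bytes;
        }
    }
    void release() {
        storage_.reset();
        capacity_ = 0;
    }
    std::size_t capacity() const { return capacity_; }

private:
    std::unique_ptr<unsigned char[]> storage_;
    std::size_t capacity_ = 0;
};

// Hypothetical backend; the real interface in the PR may differ.
class ToyBackend {
public:
    void run(std::size_t prompt_tokens) {
        // Lazily (re)create scratch at the size this request needs.
        scratch_.reserve(prompt_tokens * kBytesPerToken);
    }
    void release_scratch() { scratch_.release(); }
    std::size_t scratch_bytes() const { return scratch_.capacity(); }

private:
    ScratchBuffer scratch_;
};

// Release scratch once the request has completed.
inline void handle_http_request(ToyBackend& backend, std::size_t prompt_tokens) {
    backend.run(prompt_tokens);
    backend.release_scratch();
}
```

Without the release step, a 2000-token request followed by a 100-token request leaves the buffer at the 2000-token size. Releasing after each request trades a reallocation per request for a smaller idle footprint.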