Commit 33e35a7
fix: release scratch VRAM buffers between requests
The target gallocr, LM-head projection gallocr, and BSA persistent
CUDA buffers grow monotonically with request size but never shrink.
After a large-prompt request (e.g. a 2k-token agent prompt), subsequent
smaller requests run under VRAM pressure, which spills the KV cache to
system RAM and slows decoding by roughly 2x.
Add ModelBackend::release_scratch(), called after each HTTP request
completes. The Qwen35Backend implementation frees:
- sg_.alloc (target graph allocator)
- proj_sg_.alloc (LM-head projection allocator)
- BSA persistent device buffers (blockmask, head_mask_type, softmax_lse)
All are lazily recreated at the exact size needed on the next request.
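The grow-only behaviour and the release hook described above can be sketched as follows. This is a minimal illustration and not the code from this commit: `ScratchBuffer`, `SketchBackend`, `simulate`, and the per-token byte sizes are hypothetical stand-ins, and host memory replaces the gallocr and CUDA buffers so the sketch runs without a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for one scratch allocation (a graph allocator or a
// persistent device buffer). Host memory is used so the sketch runs anywhere.
class ScratchBuffer {
public:
    // Grow-only: reallocate only when the request needs more than is held.
    void ensure(std::size_t bytes) {
        if (bytes > capacity_) {
            data_ = std::make_unique<unsigned char[]>(bytes);
            capacity_ = bytes;
        }
    }
    // Drop the allocation; the next ensure() recreates it at the size needed.
    void release() {
        data_.reset();
        capacity_ = 0;
    }
    std::size_t capacity() const { return capacity_; }

private:
    std::unique_ptr<unsigned char[]> data_;
    std::size_t capacity_ = 0;
};

// Hypothetical backend holding three scratch areas, mirroring the three
// items listed above. The per-token sizes are made up for illustration.
struct SketchBackend {
    ScratchBuffer graph_alloc;  // stands in for sg_.alloc
    ScratchBuffer proj_alloc;   // stands in for proj_sg_.alloc
    ScratchBuffer bsa_buffers;  // stands in for the BSA device buffers

    void handle_request(std::size_t n_tokens) {
        graph_alloc.ensure(n_tokens * 1024);
        proj_alloc.ensure(n_tokens * 256);
        bsa_buffers.ensure(n_tokens * 64);
    }
    void release_scratch() {
        graph_alloc.release();
        proj_alloc.release();
        bsa_buffers.release();
    }
    std::size_t scratch_bytes() const {
        return graph_alloc.capacity() + proj_alloc.capacity() +
               bsa_buffers.capacity();
    }
};

// Runs a sequence of requests and records the scratch bytes held while each
// one is being served, with or without a release after every request.
std::vector<std::size_t> simulate(const std::vector<std::size_t>& prompts,
                                  bool release_between) {
    SketchBackend backend;
    std::vector<std::size_t> held;
    for (std::size_t n_tokens : prompts) {
        backend.handle_request(n_tokens);
        held.push_back(backend.scratch_bytes());
        if (release_between) {
            backend.release_scratch();
            assert(backend.scratch_bytes() == 0);
        }
    }
    return held;
}
```

Calling `simulate({2048, 128}, false)` keeps the 2048-token footprint for the second, smaller request, while `simulate({2048, 128}, true)` sizes the second request's scratch for 128 tokens only. The trade-off in this sketch is one reallocation per request in exchange for returning the memory between requests.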
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1 parent 3f10692 commit 33e35a7
4 files changed
Lines changed: 40 additions & 0 deletions
File tree
- dflash/src
  - common
  - qwen35
  - server
Diff hunks (file names and line contents were not preserved; only hunk positions survive):

| File | Original lines | New lines | Added lines |
|---|---|---|---|
| 1 | 174-179 | 174-183 | 177-180 (4 lines) |
| 2 | 13-18 | 13-20 | 16-17 (2 lines) |
| 2 | 436-441 | 438-469 | 441-466 (26 lines) |
| 3 | 109-114 | 109-118 | 112-115 (4 lines) |
| 4 | 730-735 | 730-739 | 733-736 (4 lines) |