Commit 3f10692
feat(dflash): add --lazy-draft to C++ server
Park the decode draft model (~3.3 GB) when idle to free VRAM for pflash
compression. Before generate, free the pflash drafter and unpark the decode
draft; after generate, park draft again.
Flow: startup → park draft | request → compress → free pflash drafter →
unpark draft → generate → park draft
Saves ~3.3 GB VRAM on idle, enabling longer context on 22 GB GPUs.
Port of Python server.py --lazy-draft behavior to the C++ in-process server.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1 parent 538bf53 commit 3f10692
3 files changed
Lines changed: 24 additions & 0 deletions
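The flow described in the commit message can be sketched as a small state model. This is a minimal illustration, not the server's code: `FakeModel`, `LazyDraftServer`, and every method name below are hypothetical, and the VRAM-resident models are reduced to boolean flags. It also assumes that, without `--lazy-draft`, the decode draft simply stays resident.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a model that may or may not occupy VRAM.
struct FakeModel {
    bool on_gpu = false;
};

// Illustrative only: names and structure are assumptions, not the server's code.
class LazyDraftServer {
public:
    explicit LazyDraftServer(bool lazy_draft) : lazy_draft_(lazy_draft) {
        draft_.on_gpu = true;           // decode draft is loaded at startup
        if (lazy_draft_) park_draft();  // startup -> park draft
    }

    // request -> compress -> free pflash drafter -> unpark draft
    //         -> generate -> park draft
    void handle_request() {
        compress();
        if (lazy_draft_) {
            free_pflash_drafter();
            unpark_draft();
        }
        generate();
        if (lazy_draft_) park_draft();
    }

    bool draft_on_gpu() const { return draft_.on_gpu; }
    bool pflash_on_gpu() const { return pflash_.on_gpu; }
    const std::vector<std::string>& trace() const { return trace_; }

private:
    void park_draft()          { draft_.on_gpu = false;  trace_.push_back("park_draft"); }
    void unpark_draft()        { draft_.on_gpu = true;   trace_.push_back("unpark_draft"); }
    void free_pflash_drafter() { pflash_.on_gpu = false; trace_.push_back("free_pflash_drafter"); }

    // Assumption: compression brings the pflash drafter into VRAM.
    void compress() { pflash_.on_gpu = true; trace_.push_back("compress"); }

    // Decoding needs the draft model resident.
    void generate() { assert(draft_.on_gpu); trace_.push_back("generate"); }

    bool lazy_draft_;
    FakeModel draft_;
    FakeModel pflash_;
    std::vector<std::string> trace_;
};
```

In this sketch the draft and the pflash drafter are never resident at the same time when lazy mode is on, which is the property the commit message relies on to free VRAM for compression.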
Diff content and file names were not captured; only the positions of the added lines survive.

| File | Added lines (new-file numbering) | Lines added |
|---|---|---|
| 1 (name not captured) | 715-720, 728-732 | 11 |
| 2 (name not captured) | 59 | 1 |
| 3 (name not captured) | 71, 144-145, 275-277, 287-292 | 12 |