Commit be5ac1f
committed
server : free draft/MTP resources on sleep to fix VRAM leak
The destroy() function in server_context_impl only cleaned up the main
model and context (via llama_init.reset()) but did not free the speculative
decoder (spec), draft context (ctx_dft), or draft model (model_dft).
For MTP (Multi-Token Prediction) models, ctx_dft holds GPU-allocated
resources (KV cache, compute buffers) that are not freed when entering
the sleeping state. On each sleep/resume cycle, new resources are
allocated without the old ones being freed, leading to a VRAM leak
that eventually crashes the server with out-of-memory errors.
Fix by explicitly resetting spec, ctx_dft, and model_dft in destroy()
before resetting llama_init, ensuring proper cleanup order to avoid
use-after-free.
ref: ggml-org#23395
Assisted-by: llama.cpp:local pi1 parent a8681a0 commit be5ac1f
1 file changed
Lines changed: 4 additions & 0 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
701 | 701 | | |
702 | 702 | | |
703 | 703 | | |
| 704 | + | |
| 705 | + | |
| 706 | + | |
| 707 | + | |
704 | 708 | | |
705 | 709 | | |
706 | 710 | | |
| |||
0 commit comments