Commit be5ac1f

server : free draft/MTP resources on sleep to fix VRAM leak
The destroy() function in server_context_impl only cleaned up the main model and context (via llama_init.reset()) but did not free the speculative decoder (spec), draft context (ctx_dft), or draft model (model_dft).

For MTP (Multi-Token Prediction) models, ctx_dft holds GPU-allocated resources (KV cache, compute buffers) that are not freed when entering the sleeping state. On each sleep/resume cycle, new resources are allocated without the old ones being freed, leading to a VRAM leak that eventually crashes the server with out-of-memory errors.

Fix by explicitly resetting spec, ctx_dft, and model_dft in destroy() before resetting llama_init, ensuring proper cleanup order to avoid use-after-free.

ref: ggml-org#23395
Assisted-by: llama.cpp:local pi
1 parent a8681a0 commit be5ac1f

1 file changed

Lines changed: 4 additions & 0 deletions

tools/server/server-context.cpp

@@ -701,6 +701,10 @@ struct server_context_impl {
     bool sleeping = false;
 
     void destroy() {
+        spec.reset();
+        ctx_dft.reset();
+        model_dft.reset();
+
         llama_init.reset();
 
         ctx_tgt = nullptr;
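
The cleanup order in the hunk above can be illustrated with a small standalone sketch. It does not use llama.cpp: `fake_model`, `fake_context`, `fake_server`, `g_live` and `live_after_destroy` are hypothetical stand-ins, and the assumption that the draft context borrows a non-owning pointer to the draft model is made only for illustration. The sketch counts simulated allocations that are still alive after `destroy()`, with and without the draft resources being reset.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-ins for illustration; these are not llama.cpp types.
static int g_live = 0; // number of simulated GPU allocations currently alive

struct fake_model {
    int users = 0; // contexts currently borrowing this model
    fake_model() { ++g_live; }
    ~fake_model() {
        assert(users == 0 && "model freed while a context still uses it");
        --g_live;
    }
};

struct fake_context {
    fake_model * model; // non-owning pointer, must not outlive the model
    explicit fake_context(fake_model * m) : model(m) { ++model->users; ++g_live; }
    ~fake_context() { --model->users; --g_live; }
};

struct fake_server {
    // Declared so that default member destruction (reverse order) is also safe.
    std::unique_ptr<fake_model>   model_tgt; // plays the role of llama_init
    std::unique_ptr<fake_model>   model_dft;
    std::unique_ptr<fake_context> ctx_dft;

    void load() {
        model_tgt = std::make_unique<fake_model>();
        model_dft = std::make_unique<fake_model>();
        ctx_dft   = std::make_unique<fake_context>(model_dft.get());
    }

    // free_draft == false mimics the behaviour described before the fix.
    void destroy(bool free_draft) {
        if (free_draft) {
            ctx_dft.reset();   // dependent object first
            model_dft.reset(); // then the model it borrowed
        }
        model_tgt.reset();     // main resources last
    }
};

// Returns how many simulated allocations survive entering the sleeping state.
int live_after_destroy(bool free_draft) {
    fake_server srv;
    srv.load();
    srv.destroy(free_draft);
    return g_live;
}
```

Calling `live_after_destroy(false)` leaves the draft model and draft context alive after `destroy()`, while `live_after_destroy(true)` leaves nothing alive. The assertion in `~fake_model` would fire if the model were reset before the context that borrows it, which is the kind of ordering mistake the commit message refers to as use-after-free.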
