server: Adaptive VRAM Eviction for Multimodal Vision Encoders via Dry-Run Profiling #25074
Open
1506086927 wants to merge 17 commits into
Added mmproj swap pool functionality to manage LLM layers during vision encoding and improved memory management for mmproj.
Added function declaration for freeing vision buffer.
Added function to retrieve all tensors of the vision encoder model for memory management.
Added option to configure number of LLM layers to evict to host RAM when mmproj is active, with support for auto-detection based on free VRAM.
Added n_mmproj_swap variable for mmproj swap pool configuration.
Added functions to get and free vision encoder tensors in the multimodal model.
Updated comments from Chinese to English for better clarity and understanding.
Refactor memory management in llama_mmproj_pool to use pipelining for PCIe full-duplex parallelism, enhancing performance and preventing VRAM read/write pollution.
Overview
This PR introduces an Adaptive VRAM Eviction Mechanism (`mmproj-swap`) for multimodal models. It prevents the Vision Encoder's Compute Buffer from spilling into Shared System Memory, which causes catastrophic PCIe bus thrashing.
The Problem: The PCIe Thrashing Trap
In multimodal inference, VRAM is heavily contested by the LLM Weights, the LLM KV Cache, and the Vision Encoder (Weights + Compute Buffer). The Vision Compute Buffer is highly dynamic and scales quadratically with image resolution. When users maximize their KV Cache, VRAM becomes saturated. On a vision request, the CUDA driver then silently pages the Vision Compute Buffer into Host RAM. Running compute-intensive matrix multiplications ($O(N^3)$) over the PCIe bus causes extreme performance degradation (in real-world testing, the first image encoding is ~4x slower and subsequent encodings are ~10x slower).
The Solution: Dry-Run Profiling & Safe Eviction
Instead of hardcoding memory buffers, this PR uses a Dry-Run Profiling approach:
- … `dynamic_overhead_bytes` (Total Profiled Mem - Vision Weights).
- When `auto` mode (`--mmproj-swap-layers -1`) is used, the system calculates exactly how many LLM layers need to be evicted to Host RAM to clear enough Dedicated VRAM for the Vision Compute Buffer (plus a 5% safety margin).
This accepts a fast $O(N)$ PCIe memory copy (swapping inactive LLM weights) in exchange for avoiding a devastating $O(N^3)$ PCIe compute penalty.
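The auto-mode arithmetic described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the function name `layers_to_evict`, its parameters, and the choice to evict layers in index order are all assumptions made here. Only `dynamic_overhead_bytes`, the 5% margin, and the `0` / `N` / `-1` meanings of `--mmproj-swap-layers` come from the description.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the PR): decide how many LLM layers to evict.
// n_swap follows the --mmproj-swap-layers convention:
//   0 = disabled, N > 0 = fixed layer count, -1 = auto.
// layer_bytes[i] is the VRAM footprint of LLM layer i; layers are assumed to be
// evicted in index order, which is an assumption of this sketch.
int layers_to_evict(int n_swap,
                    uint64_t total_profiled_bytes,
                    uint64_t vision_weight_bytes,
                    uint64_t free_vram_bytes,
                    const std::vector<uint64_t> & layer_bytes) {
    const int n_layers = static_cast<int>(layer_bytes.size());
    if (n_swap == 0) {
        return 0;                            // feature disabled
    }
    if (n_swap > 0) {
        return std::min(n_swap, n_layers);   // fixed count, clamped to the model
    }

    // Auto mode: dynamic overhead = total profiled memory minus vision weights.
    const uint64_t dynamic_overhead_bytes =
        total_profiled_bytes > vision_weight_bytes
            ? total_profiled_bytes - vision_weight_bytes
            : 0;

    // Add the 5% safety margin using integer arithmetic.
    const uint64_t required_bytes = dynamic_overhead_bytes + dynamic_overhead_bytes / 20;
    if (required_bytes <= free_vram_bytes) {
        return 0;                            // the compute buffer already fits
    }

    // Evict layers until the freed bytes cover the deficit.
    const uint64_t deficit = required_bytes - free_vram_bytes;
    uint64_t freed = 0;
    int n = 0;
    while (n < n_layers && freed < deficit) {
        freed += layer_bytes[n];
        ++n;
    }
    return n;
}
```

With four 100-byte layers, 1000 profiled bytes, 400 bytes of vision weights, and 400 bytes free, the overhead is 600 bytes, the padded requirement is 630 bytes, and the 230-byte deficit is covered by evicting three layers.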
Additional information
🛠️ How to use
A new CLI argument, `--mmproj-swap-layers N`, is fully integrated:
- `N = 0`: Disabled (Default).
- `N > 0`: Hardcoded number of LLM layers to evict.
- `N = -1`: Auto-adaptive mode (Recommended). Dynamically calculates and evicts the exact number of layers needed based on actual memory profiling.
📂 File-by-File Breakdown
- `llama_cpp/common/llama_mmproj_pool.h` & `.cpp`: Updated `_init` signature to accept `dynamic_overhead_bytes`. Overhauled the `-1` (auto) logic to compute `target_eviction_size` safely.
- `llama_cpp/tools/server/server-context.cpp`: Executes `mtmd_get_memory_usage()` to dry-run the vision model in `load_model()`. Wraps the vision batch encoding with an RAII guard (`MmprojSwapGuard`) to guarantee safe swapping.
- `llama_cpp/tools/mtmd/mtmd.h` & `.cpp`: Exposes `mtmd_get_memory_usage()` for external profiling. Adds `mtmd_free_vision_buffer()`.
- `llama_cpp/tools/mtmd/clip.h` & `.cpp`: Implements the underlying `reserve_compute_meta` logic using a dummy `ggml_cgraph` to compute peak buffer requirements.
- `llama_cpp/common/arg.cpp` & `common.h`: Added parsing for `--mmproj-swap-layers`.
- `llama_cpp/src/llama-impl.h` & `llama-model.h`: Exposes internal tensor iterators to allow the pool to scan and target specific VRAM tensors for eviction.
🛡️ Safety & Stability
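The swap-safety guarantee attributed to the RAII guard (`MmprojSwapGuard`) can be illustrated with a minimal sketch. The types below (`SwapPoolSketch`, `SwapGuardSketch`, `encode_image_sketch`) are hypothetical stand-ins invented for this example; the real `llama_mmproj_pool` interface is not shown in this description and may differ. The idea is only that the constructor evicts, the destructor restores, and every exit path from the encoding scope therefore restores the LLM layers.

```cpp
// Hypothetical stand-in for the swap pool; counters exist only to make the
// behavior observable in this sketch.
struct SwapPoolSketch {
    bool evicted   = false;
    int  n_evict   = 0;
    int  n_restore = 0;
    void evict_to_host()   { evicted = true;  ++n_evict;   }  // VRAM -> host RAM
    void restore_to_vram() { evicted = false; ++n_restore; }  // host RAM -> VRAM
};

// RAII guard: evicts on construction, restores on destruction.
// A null pool models the bypass case (swap disabled), so the guard does nothing.
class SwapGuardSketch {
public:
    explicit SwapGuardSketch(SwapPoolSketch * pool) : pool_(pool) {
        if (pool_) {
            pool_->evict_to_host();
        }
    }
    ~SwapGuardSketch() {
        if (pool_) {
            pool_->restore_to_vram();
        }
    }
    // Non-copyable so the restore cannot run twice for one eviction.
    SwapGuardSketch(const SwapGuardSketch &) = delete;
    SwapGuardSketch & operator=(const SwapGuardSketch &) = delete;

private:
    SwapPoolSketch * pool_;
};

// Stand-in for the vision batch encoding; `fail` simulates an early error return.
bool encode_image_sketch(SwapPoolSketch * pool, bool fail) {
    SwapGuardSketch guard(pool);
    if (fail) {
        return false;  // the guard's destructor still restores the layers
    }
    return true;
}
```

Deleting the copy operations is a deliberate choice in this sketch: it keeps one eviction paired with exactly one restore, which is the kind of lifecycle error (double free, double restore) the section is concerned with.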
- … `ggml` compute buffers, avoiding lifecycle crashes (double free / segfaults).
- When `--no-mmproj-offload` is active, the swap is intelligently bypassed to prevent redundant Host-to-Host copying.
- … `1.05` to account for CUDA memory fragmentation. A robust memory defragmentation hook in `ggml-backend` would be a safer long-term solution.
- … `ggml_backend_dev_t`.
- … `llama_kv_cache` queries into `dynamic_overhead_bytes`.
Requirements