server: Adaptive VRAM Eviction for Multimodal Vision Encoders via Dry-Run Profiling by 1506086927 · Pull Request #25074 · ggml-org/llama.cpp

1506086927 · 2026-06-27T05:02:36Z

Overview

This PR introduces an adaptive VRAM eviction mechanism (mmproj-swap) for multimodal models. It prevents the vision encoder's compute buffer from spilling into shared system memory, which causes severe PCIe bus thrashing.

The Problem: The PCIe Thrashing Trap
In multimodal inference, VRAM is contested by the LLM weights, the LLM KV cache, and the vision encoder (weights + compute buffer). The vision compute buffer is highly dynamic and grows roughly quadratically with image resolution. When users maximize their KV cache, VRAM becomes saturated. On a vision request, the CUDA driver then silently pages the vision compute buffer into host RAM. Running compute-intensive matrix multiplications ($O(N^3)$) over the PCIe bus results in extreme performance degradation: in real-world testing, the first image encoding was ~4x slower and subsequent encodings were ~10x slower.

The Solution: Dry-Run Profiling & Safe Eviction
Instead of hardcoding memory buffers, this PR uses a Dry-Run Profiling approach:

Measure: During startup, we use a dummy graph to measure the exact peak memory (Compute Buffer) required by the vision encoder for the user's configured image resolution.
Calculate: We isolate the dynamic_overhead_bytes (Total Profiled Mem - Vision Weights).
Evict: When auto mode (--mmproj-swap-layers -1) is used, the system calculates exactly how many LLM layers need to be evicted to Host RAM to clear enough Dedicated VRAM for the Vision Compute Buffer (plus a 5% safety margin).
Swap in/out: When processing an image chunk, the designated LLM layers are swapped to host RAM, the vision encoder runs entirely inside dedicated VRAM, and the LLM layers are restored afterwards.
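
The Calculate and Evict steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `layers_to_evict`, the per-layer size vector, and the choice to evict from the end of the layer stack are assumptions made for the example. Only `dynamic_overhead_bytes`, `target_eviction_size`, and the 5% margin come from the description.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: how many LLM layers must move to host RAM so that the
// vision compute buffer (plus a 5% safety margin) fits in dedicated VRAM.
//   layer_bytes[i]          VRAM footprint of LLM layer i (illustrative input)
//   dynamic_overhead_bytes  total profiled memory minus vision weights
//   free_vram_bytes         dedicated VRAM currently unused
inline size_t layers_to_evict(const std::vector<uint64_t> & layer_bytes,
                              uint64_t dynamic_overhead_bytes,
                              uint64_t free_vram_bytes) {
    // required = overhead * 1.05, computed in integers and rounded up
    const uint64_t required =
        dynamic_overhead_bytes + (dynamic_overhead_bytes + 19) / 20;
    if (required <= free_vram_bytes) {
        return 0; // the compute buffer already fits, nothing to evict
    }
    const uint64_t target_eviction_size = required - free_vram_bytes;

    // Assumption: layers are evicted starting from the last one. The PR
    // description does not state which layers are chosen.
    uint64_t freed = 0;
    size_t   n     = 0;
    for (auto it = layer_bytes.rbegin();
         it != layer_bytes.rend() && freed < target_eviction_size; ++it) {
        freed += *it;
        ++n;
    }
    // If every layer is evicted and it is still not enough, n equals
    // layer_bytes.size(); the caller has to handle that case.
    return n;
}
```

Whole layers are counted because a layer is the smallest unit this sketch can move, so the freed amount usually overshoots `target_eviction_size` slightly.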

This accepts a fast $O(N)$ PCIe memory copy (swapping inactive LLM weights) in exchange for avoiding a devastating $O(N^3)$ compute penalty over PCIe.

Additional information

🛠️ How to use

This PR adds a new CLI argument:

--mmproj-swap-layers N

N = 0 : Disabled (Default).
N > 0 : Hardcoded number of LLM layers to evict.
N = -1 : Auto-adaptive mode (Recommended). Dynamically calculates and evicts the exact number of layers needed based on actual memory profiling.
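
As a rough sketch of how the three values of N could be interpreted once parsed, consider the following. The names `swap_mode`, `swap_plan`, and `resolve_swap` are hypothetical; only `n_mmproj_swap` and the meaning of 0, positive values, and -1 come from this PR. Clamping to the layer count and treating other negative values as disabled are assumptions of the example.

```cpp
#include <cstdint>

// Hypothetical representation of the three documented modes.
enum class swap_mode { disabled, fixed, adaptive };

struct swap_plan {
    swap_mode mode;
    int32_t   n_layers; // only meaningful when mode == swap_mode::fixed
};

// Maps the parsed value of --mmproj-swap-layers to a plan.
inline swap_plan resolve_swap(int32_t n_mmproj_swap, int32_t n_llm_layers) {
    if (n_mmproj_swap == -1) {
        // Auto-adaptive: the layer count is decided later from profiling.
        return {swap_mode::adaptive, 0};
    }
    if (n_mmproj_swap <= 0) {
        // 0 is the documented default (disabled). Other negative values are
        // not documented; treating them as disabled is an assumption.
        return {swap_mode::disabled, 0};
    }
    // Assumption: a request larger than the model is clamped, not rejected.
    const int32_t n = n_mmproj_swap < n_llm_layers ? n_mmproj_swap : n_llm_layers;
    return {swap_mode::fixed, n};
}
```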

📂 File-by-File Breakdown

llama_cpp/common/llama_mmproj_pool.h & .cpp: Updated _init signature to accept dynamic_overhead_bytes. Overhauled the -1 (auto) logic to compute target_eviction_size safely.
llama_cpp/tools/server/server-context.cpp: Executes mtmd_get_memory_usage() to dry-run the vision model in load_model(). Wraps the vision batch encoding with an RAII guard (MmprojSwapGuard) to guarantee safe swapping.
llama_cpp/tools/mtmd/mtmd.h & .cpp: Exposes mtmd_get_memory_usage() for external profiling. Adds mtmd_free_vision_buffer().
llama_cpp/tools/mtmd/clip.h & .cpp: Implements the underlying reserve_compute_meta logic using a dummy ggml_cgraph to compute peak buffer requirements.
llama_cpp/common/arg.cpp & common.h: Added parsing for --mmproj-swap-layers.
llama_cpp/src/llama-impl.h & llama-model.h: Exposes internal tensor iterators to allow the pool to scan and target specific VRAM tensors for eviction.
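
The RAII guard mentioned for server-context.cpp can be illustrated with a minimal sketch. The types `swap_pool` and `swap_guard` below are stand-ins invented for this example and do not reproduce the actual `MmprojSwapGuard` or `llama_mmproj_pool` interfaces. The point is only that restoring in the destructor covers every exit path.

```cpp
// Hypothetical stand-in for the swap pool; the real API may differ.
struct swap_pool {
    bool evicted     = false;
    int  n_evictions = 0;
    void evict()   { evicted = true; ++n_evictions; } // LLM layers -> host RAM
    void restore() { evicted = false; }               // LLM layers -> VRAM
};

// Evicts on construction and restores on destruction, so layers come back
// on normal return, early return, or exception.
class swap_guard {
public:
    explicit swap_guard(swap_pool & pool) : pool_(pool) { pool_.evict(); }
    ~swap_guard() { pool_.restore(); }

    // Non-copyable: exactly one restore per evict.
    swap_guard(const swap_guard &)             = delete;
    swap_guard & operator=(const swap_guard &) = delete;

private:
    swap_pool & pool_;
};

// Illustrative encode step. On success it returns whether the layers were
// evicted while the encoder ran; on simulated failure it returns false early.
inline bool encode_image_chunk(swap_pool & pool, bool simulate_failure) {
    swap_guard guard(pool);
    if (simulate_failure) {
        return false; // the guard still restores the layers here
    }
    return pool.evicted;
}
```

Deleting the copy operations keeps eviction and restoration paired even if the guard is passed around by mistake.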

🛡️ Safety & Stability

No compute buffer hijacking: We do not attempt to reuse or steal ggml compute buffers, avoiding lifecycle crashes (double free / segfaults).
Graceful fallback: If dynamic profiling fails, the system falls back to a safe, hardcoded 300MB buffer overhead and emits a warning.
Conflict resolution: If --no-mmproj-offload is active, the swap is intelligently bypassed to prevent redundant Host-to-Host copying.
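
The graceful fallback could look roughly like this. It is a sketch under stated assumptions: `resolve_dynamic_overhead` is a hypothetical name, a failed profile is modelled as an empty `std::optional`, and "300MB" is interpreted as 300 MiB. The subtraction follows the definition of `dynamic_overhead_bytes` given earlier (total profiled memory minus vision weights).

```cpp
#include <cstdint>
#include <cstdio>
#include <optional>

// Assumption: "300MB" is taken to mean 300 MiB in this sketch.
constexpr uint64_t FALLBACK_OVERHEAD_BYTES = 300ull * 1024 * 1024;

// Returns the dynamic overhead to plan for. Falls back to the fixed value
// when profiling produced no result or an implausible one.
inline uint64_t resolve_dynamic_overhead(std::optional<uint64_t> total_profiled_bytes,
                                         uint64_t vision_weight_bytes) {
    if (!total_profiled_bytes || *total_profiled_bytes < vision_weight_bytes) {
        std::fprintf(stderr,
                     "warning: vision profiling failed, assuming %llu bytes of overhead\n",
                     (unsigned long long) FALLBACK_OVERHEAD_BYTES);
        return FALLBACK_OVERHEAD_BYTES;
    }
    return *total_profiled_bytes - vision_weight_bytes;
}
```

Checking that the profiled total is at least the weight size avoids an unsigned underflow if the profiler ever reports a value that is too small.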

⚠️ Limitations & Future Work

Hardcoded 5% Fragmentation Margin: We multiply the theoretical requirement by 1.05 to account for CUDA memory fragmentation. A robust memory defragmentation hook in ggml-backend would be a safer long-term solution.
Multi-GPU (Tensor Split) Limitations: The current logic assumes a single contiguous VRAM space. Future iterations should map required vision VRAM per device and evict LLM layers from specific ggml_backend_dev_t.
Overhead of Swapping per Vision Request: Eviction happens for every image chunk. Implementing an "Eviction Hysteresis" (Sticky Swap) to keep LLM layers in Host RAM during heavy visual chat sessions could minimize redundant PCIe transfers.
KV Cache Allocation Ambiguity: The current logic relies on upfront static allocation of the KV cache. Future dynamic KV allocation strategies will require integrating llama_kv_cache queries into dynamic_overhead_bytes.

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES - AI was used to help structure the code logic for the dynamic overhead calculation, analyze the performance logs, and draft this PR description. All performance data, code architectures, and edge-case verifications are based on manual real-world testing.

1506086927 added 12 commits June 27, 2026 12:48

Implement mmproj swap pool for vision encoding

ac6cc47

Added mmproj swap pool functionality to manage LLM layers during vision encoding and improved memory management for mmproj.

Declare mtmd_free_vision_buffer function

506e267

Added function declaration for freeing vision buffer.

Add clip_get_all_tensors function

61ac223

Added function to retrieve all tensors of the vision encoder model for memory management.

Add mmproj swap layers option to common arguments

fcdc4fc

Added option to configure number of LLM layers to evict to host RAM when mmproj is active, with support for auto-detection based on free VRAM.

Update llama_log_internal to use LLAMA_API

51a536b

Add LLAMA_API to llama_internal_get_tensor_map

4fed028

Add n_mmproj_swap for mmproj swap pool

23d54a4

Added n_mmproj_swap variable for mmproj swap pool configuration.

Fix template syntax in clip.cpp

ac83867

Implement functions for vision tensor management

56e7ade

Added functions to get and free vision encoder tensors in the multimodal model.

Add llama_mmproj_pool header with pool management structs

5b39a2b

Add llama_mmproj_pool implementation

482318c

Merge branch 'ggml-org:master' into master

73df86f

1506086927 requested review from a team, CISC and ggerganov as code owners June 27, 2026 05:02

1506086927 added 2 commits June 27, 2026 13:13

Translate comments from Chinese to English

f183f3b

Update comments to English in llama_mmproj_pool.cpp

d436e4b

github-actions bot added the server and mtmd (Related to multimodal functionality (video/image/audio)) labels Jun 27, 2026

1506086927 added 3 commits June 27, 2026 13:24

Translate comments in llama_mmproj_pool.cpp to English

2b98b6b

Updated comments from Chinese to English for better clarity and understanding.

Improve memory management with pipelining strategy

6ae6f97

Refactor memory management in llama_mmproj_pool to use pipelining for PCIe full-duplex parallelism, enhancing performance and preventing VRAM read/write pollution.

Update llama_mmproj_pool.cpp

ce05379

1506086927 wants to merge 17 commits into ggml-org:master from 1506086927:master
