server: Adaptive VRAM Eviction for Multimodal Vision Encoders via Dry-Run Profiling#25074

Open: 1506086927 wants to merge 17 commits into ggml-org:master from 1506086927:master

@1506086927

Overview

This PR introduces an Adaptive VRAM Eviction Mechanism (mmproj-swap) for multimodal models. It prevents the Vision Encoder's Compute Buffer from spilling into Shared System Memory, which causes severe PCIe bus thrashing.

The Problem: The PCIe Thrashing Trap
In multimodal inference, VRAM is contested by the LLM Weights, the LLM KV Cache, and the Vision Encoder (Weights + Compute Buffer). The Vision Compute Buffer is dynamic and grows quadratically with image resolution. When users maximize their KV Cache, VRAM becomes saturated. On a vision request, the CUDA driver then silently pages the Vision Compute Buffer into Host RAM. Running compute-intensive Matrix Multiplications ($O(N^3)$) over the PCIe bus causes extreme performance degradation: in real-world testing, the first image encoding is ~4x slower and subsequent encodings are ~10x slower.

The Solution: Dry-Run Profiling & Safe Eviction
Instead of hardcoding memory buffers, this PR uses a Dry-Run Profiling approach:

  1. Measure: During startup, we use a dummy graph to measure the exact peak memory (Compute Buffer) required by the vision encoder for the user's configured image resolution.
  2. Calculate: We isolate the dynamic_overhead_bytes (Total Profiled Mem - Vision Weights).
  3. Evict: When auto mode (--mmproj-swap-layers -1) is used, the system calculates exactly how many LLM layers need to be evicted to Host RAM to clear enough Dedicated VRAM for the Vision Compute Buffer (plus a 5% safety margin).
  4. Swap in/out: When processing an image chunk, the designated LLM layers are swapped out to Host RAM, the vision encoder runs entirely inside Dedicated VRAM, and the LLM layers are restored afterwards.
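
The Calculate and Evict steps can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `layers_to_evict`, its parameters, and the per-layer size list are hypothetical names, and the sketch assumes layers are evicted in a fixed order until the freed bytes cover the deficit.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not taken from the PR).
// dynamic_overhead_bytes: profiled vision compute buffer size.
// free_vram_bytes:        dedicated VRAM currently free.
// layer_bytes:            VRAM size of each evictable LLM layer, in eviction order.
// Returns how many layers to evict so that free VRAM covers the overhead plus a
// 5% margin. Returns layer_bytes.size() if evicting every layer is still not enough.
inline std::size_t layers_to_evict(std::uint64_t dynamic_overhead_bytes,
                                   std::uint64_t free_vram_bytes,
                                   const std::vector<std::uint64_t> & layer_bytes) {
    // required = overhead * 1.05, computed in integers and rounded up
    const std::uint64_t required =
        dynamic_overhead_bytes + (dynamic_overhead_bytes + 19) / 20;
    if (free_vram_bytes >= required) {
        return 0; // enough room already, nothing to evict
    }
    const std::uint64_t deficit = required - free_vram_bytes;
    std::uint64_t freed = 0;
    std::size_t n = 0;
    while (n < layer_bytes.size() && freed < deficit) {
        freed += layer_bytes[n];
        ++n;
    }
    return n;
}
```

For example, with a 1000 MiB profiled overhead, 500 MiB of free VRAM, and 200 MiB layers, the requirement is 1050 MiB, the deficit is 550 MiB, and three layers are evicted.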

This accepts the cost of a fast $O(N)$ PCIe memory copy (swapping inactive LLM weights) in order to avoid a far more expensive $O(N^3)$ compute penalty over PCIe.

Additional information

🛠️ How to use

This PR adds one new CLI argument:

--mmproj-swap-layers N
  • N = 0 : Disabled (Default).
  • N > 0 : Hardcoded number of LLM layers to evict.
  • N = -1 : Auto-adaptive mode (Recommended). Dynamically calculates and evicts the exact number of layers needed based on actual memory profiling.
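
A minimal sketch of how the three modes might map to a layer count. `resolve_swap_layers` and its parameters are hypothetical. Clamping to the model's layer count and treating values below -1 as disabled are assumptions made for the sketch, not behavior confirmed by the PR.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical mapping from the --mmproj-swap-layers value to a layer count.
// n:          value passed on the command line.
// auto_count: layer count produced by the profiling-based calculation.
// n_layers:   total number of LLM layers in the model.
inline std::size_t resolve_swap_layers(int n, std::size_t auto_count, std::size_t n_layers) {
    if (n == -1) {
        return std::min(auto_count, n_layers);             // auto-adaptive mode
    }
    if (n <= 0) {
        return 0;                                          // 0 = disabled; other negatives treated as disabled (assumption)
    }
    return std::min(static_cast<std::size_t>(n), n_layers); // fixed count, clamped (assumption)
}
```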

📂 File-by-File Breakdown

  • llama_cpp/common/llama_mmproj_pool.h & .cpp: Updated _init signature to accept dynamic_overhead_bytes. Overhauled the -1 (auto) logic to compute target_eviction_size safely.
  • llama_cpp/tools/server/server-context.cpp: Executes mtmd_get_memory_usage() to dry-run the vision model in load_model(). Wraps the vision batch encoding with an RAII guard (MmprojSwapGuard) to guarantee safe swapping.
  • llama_cpp/tools/mtmd/mtmd.h & .cpp: Exposes mtmd_get_memory_usage() for external profiling. Adds mtmd_free_vision_buffer().
  • llama_cpp/tools/mtmd/clip.h & .cpp: Implements the underlying reserve_compute_meta logic using a dummy ggml_cgraph to compute peak buffer requirements.
  • llama_cpp/common/arg.cpp & common.h: Added parsing for --mmproj-swap-layers.
  • llama_cpp/src/llama-impl.h & llama-model.h: Exposes internal tensor iterators to allow the pool to scan and target specific VRAM tensors for eviction.
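
To illustrate why an RAII guard guarantees that evicted layers are restored, here is a generic sketch. It is not the PR's `MmprojSwapGuard`: the class name `SwapGuardSketch` and the callback-based interface are assumptions chosen to keep the example self-contained.

```cpp
#include <functional>
#include <utility>

// Generic RAII sketch (hypothetical, not the PR's MmprojSwapGuard).
// The constructor swaps LLM layers out; the destructor swaps them back in,
// including when the encode step exits early or throws.
class SwapGuardSketch {
public:
    SwapGuardSketch(std::function<void()> swap_out, std::function<void()> swap_in)
        : swap_in_(std::move(swap_in)) {
        swap_out(); // evict before the vision encoder runs
    }
    ~SwapGuardSketch() {
        if (swap_in_) {
            swap_in_(); // restore on every exit path
        }
    }
    // Non-copyable so the restore runs exactly once.
    SwapGuardSketch(const SwapGuardSketch &) = delete;
    SwapGuardSketch & operator=(const SwapGuardSketch &) = delete;

private:
    std::function<void()> swap_in_;
};
```

Placing the guard in the same scope as the vision batch encode keeps the swap-out and the restore paired without explicit cleanup code on each return path.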

🛡️ Safety & Stability

  • No compute buffer hijacking: We do not attempt to reuse or steal ggml compute buffers, avoiding lifecycle crashes (double free / segfaults).
  • Graceful fallback: If dynamic profiling fails, the system falls back to a safe, hardcoded 300MB buffer overhead and emits a warning.
  • Conflict resolution: If --no-mmproj-offload is active, the swap is bypassed to prevent redundant Host-to-Host copying.
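
The fallback and bypass rules above could look roughly like this. The function and constant names are hypothetical, and interpreting "300MB" as 300 MiB is an assumption.

```cpp
#include <cstdint>
#include <cstdio>

// Fallback overhead used when profiling fails. The PR states 300MB;
// treating that as 300 MiB is an assumption of this sketch.
constexpr std::uint64_t kFallbackOverheadBytes = 300ull * 1024ull * 1024ull;

// Hypothetical: pick the overhead to plan with, warning on fallback.
inline std::uint64_t effective_overhead(bool profiling_ok, std::uint64_t profiled_bytes) {
    if (profiling_ok && profiled_bytes > 0) {
        return profiled_bytes;
    }
    std::fprintf(stderr, "warning: vision profiling failed, using fallback overhead\n");
    return kFallbackOverheadBytes;
}

// Hypothetical: swapping only makes sense when it is requested and the
// vision encoder is actually offloaded to the GPU.
inline bool swap_enabled(int n_swap_layers, bool no_mmproj_offload) {
    return n_swap_layers != 0 && !no_mmproj_offload;
}
```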

⚠️ Limitations & Future Work

  1. Hardcoded 5% Fragmentation Margin: We multiply the theoretical requirement by 1.05 to account for CUDA memory fragmentation. A robust memory defragmentation hook in ggml-backend would be a safer long-term solution.
  2. Multi-GPU (Tensor Split) Limitations: The current logic assumes a single contiguous VRAM space. Future iterations should map required vision VRAM per device and evict LLM layers from specific ggml_backend_dev_t.
  3. Overhead of Swapping per Vision Request: Eviction happens for every image chunk. Implementing an "Eviction Hysteresis" (Sticky Swap) to keep LLM layers in Host RAM during heavy visual chat sessions could minimize redundant PCIe transfers.
  4. KV Cache Allocation Ambiguity: The calculation relies on the KV Cache being statically allocated up front. Future dynamic KV allocation strategies will require integrating llama_kv_cache queries into the dynamic_overhead_bytes calculation.
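
One possible reading of the "Sticky Swap" idea in item 3 is to leave layers evicted across consecutive image chunks and restore them lazily, just before the next LLM decode. The state machine below is only a sketch of that reading; the class and method names are hypothetical and nothing like it exists in this PR.

```cpp
// Hypothetical lazy-restore state machine for the "Sticky Swap" idea.
// Each method returns true when a PCIe transfer is actually needed.
class LazyRestoreSketch {
public:
    // Call before encoding an image chunk.
    // Returns true if layers must be swapped out now.
    bool before_vision_encode() {
        if (evicted_) {
            return false; // already evicted by a previous chunk, skip the copy
        }
        evicted_ = true;
        return true;
    }

    // Call before the LLM decodes.
    // Returns true if layers must be swapped back in now.
    bool before_llm_decode() {
        if (!evicted_) {
            return false;
        }
        evicted_ = false;
        return true;
    }

private:
    bool evicted_ = false;
};
```

With this scheme, a request containing several image chunks pays for one swap-out and one restore instead of one pair per chunk.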

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES - AI was used to help structure the code logic for the dynamic overhead calculation, analyze the performance logs, and draft this PR description. All performance data, code architectures, and edge-case verifications are based on manual real-world testing.

Added mmproj swap pool functionality to manage LLM layers during vision encoding and improved memory management for mmproj.
Added function declaration for freeing vision buffer.
Added function to retrieve all tensors of the vision encoder model for memory management.
Added option to configure number of LLM layers to evict to host RAM when mmproj is active, with support for auto-detection based on free VRAM.
Added n_mmproj_swap variable for mmproj swap pool configuration.
Added functions to get and free vision encoder tensors in the multimodal model.
@1506086927 1506086927 requested review from a team, CISC and ggerganov as code owners June 27, 2026 05:02
@github-actions github-actions Bot added server mtmd Related to multimodal functionality (video/image/audio) labels Jun 27, 2026
Updated comments from Chinese to English for better clarity and understanding.
Refactor memory management in llama_mmproj_pool to use pipelining for PCIe full-duplex parallelism, enhancing performance and preventing VRAM read/write pollution.