title KV Cache Offloading
subtitle CPU and disk offloading integrations for vLLM in Dynamo

KV Cache Offloading

Dynamo supports multiple KV cache offloading backends for vLLM, allowing you to extend effective KV cache capacity beyond GPU memory using CPU RAM and disk storage. Each backend integrates through vLLM's connector interface and works with both aggregated and disaggregated serving.
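To make the idea of extending capacity beyond GPU memory concrete, here is a minimal toy model of tiered offloading. It is a sketch only: the class `TieredBlockCache`, the tier names, and the LRU eviction policy are assumptions chosen for illustration, and none of them reflects the actual API or policy of Dynamo, vLLM, or any backend listed here.

```python
from collections import OrderedDict


class TieredBlockCache:
    """Toy model of tiered KV block offloading (hypothetical, for illustration)."""

    def __init__(self, capacities):
        # capacities: list of (tier_name, max_blocks), ordered fastest tier first.
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities]

    def _insert(self, block_id, data, level):
        if level >= len(self.tiers):
            return  # evicted past the slowest tier: the block is dropped
        _, cap, store = self.tiers[level]
        store[block_id] = data
        while len(store) > cap:
            # Demote the least recently used block to the next (slower) tier.
            victim_id, victim = store.popitem(last=False)
            self._insert(victim_id, victim, level + 1)

    def put(self, block_id, data):
        # Drop any stale copy, then place the block in the fastest tier.
        for _, _, store in self.tiers:
            store.pop(block_id, None)
        self._insert(block_id, data, 0)

    def get(self, block_id):
        """Return (tier_name, data) and promote the block, or (None, None) on a miss."""
        for name, _, store in self.tiers:
            if block_id in store:
                data = store.pop(block_id)
                self._insert(block_id, data, 0)
                return name, data
        return None, None

    def where(self, block_id):
        """Return the name of the tier currently holding the block, if any."""
        for name, _, store in self.tiers:
            if block_id in store:
                return name
        return None


cache = TieredBlockCache([("gpu", 2), ("cpu", 2), ("disk", 2)])
for i in range(5):
    cache.put(f"b{i}", f"kv-{i}")
```

After the five insertions, the two newest blocks sit in the `gpu` tier, the next two in `cpu`, and the oldest in `disk`; a later `cache.get("b0")` finds the block in `disk` and promotes it back to `gpu`. Real backends move tensors asynchronously and use their own eviction policies, so treat this only as a mental model.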

| Backend | Source |
|---------|--------|
| KVBM | Dynamo |
| LMCache | GitHub |
| FlexKV | GitHub |

KVBM

KVBM (KV Block Manager) is Dynamo's built-in KV cache offloading system. It provides a three-layer architecture (LLM runtime, logical block management, NIXL transport) with support for CPU and disk cache tiers, and integrates natively with Dynamo's KV-aware routing and disaggregated serving.

| Deployment | Launch Script |
|------------|---------------|
| Aggregated | agg_kvbm.sh |
| Aggregated + KV routing | agg_kvbm_router.sh |
| Disaggregated (1P1D) | disagg_kvbm.sh |
| Disaggregated (2P2D) | disagg_kvbm_2p2d.sh |
| Disaggregated + KV routing | disagg_kvbm_router.sh |

For configuration details, see the KVBM Guide.

LMCache

LMCache is an open-source KV cache engine that provides prefill-once, reuse-everywhere caching with multi-level storage backends (CPU RAM, local storage, Redis, GDS, InfiniStore/Mooncake).
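The "prefill-once, reuse-everywhere" idea can be sketched as a prefix lookup: the KV data for a block of tokens is keyed by a hash of that block chained with everything before it, so any request sharing the same prefix can find the cached blocks. The sketch below is an assumption-laden illustration; the functions `block_keys` and `cached_prefix_len`, the block size, and the hashing scheme are hypothetical and are not LMCache's actual key format or API.

```python
import hashlib

BLOCK_SIZE = 4  # illustrative value; real systems use larger, configurable blocks


def block_keys(token_ids, block_size=BLOCK_SIZE):
    """Return one chained hash per full block of tokens."""
    keys = []
    parent = b""
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        chunk = list(token_ids[start:start + block_size])
        # Chaining the parent hash makes each key depend on the whole prefix.
        digest = hashlib.sha256(parent + repr(chunk).encode()).digest()
        keys.append(digest.hex())
        parent = digest
    return keys


def cached_prefix_len(token_ids, store, block_size=BLOCK_SIZE):
    """Return how many leading tokens already have cached KV blocks in `store`."""
    hits = 0
    for key in block_keys(token_ids, block_size):
        if key not in store:
            break  # the first miss ends the reusable prefix
        hits += 1
    return hits * block_size


# A first request "prefills" eight tokens and publishes its block keys.
store = set(block_keys(list(range(8))))

# A second request sharing that prefix only needs to prefill its new tokens.
reusable = cached_prefix_len(list(range(8)) + [99, 100, 101, 102], store)
```

Here `reusable` covers the eight shared tokens, so only the four new tokens would need prefill. Because each key chains the previous hash, a request that differs in its first block shares nothing, even if later tokens match.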

| Deployment | Launch Script |
|------------|---------------|
| Aggregated | agg_lmcache.sh |
| Aggregated (multiprocess metrics) | agg_lmcache_multiproc.sh |
| Disaggregated | disagg_lmcache.sh |

For configuration details, see the LMCache Integration Guide.

FlexKV

FlexKV is a scalable, distributed KV cache runtime developed by Tencent Cloud's TACO team. It supports multi-level caching (GPU, CPU, SSD), distributed KV cache reuse across nodes, and high-performance I/O via io_uring and GPUDirect Storage.

| Deployment | Launch Script |
|------------|---------------|
| Aggregated | agg_flexkv.sh |
| Aggregated + KV routing | agg_flexkv_router.sh |
| Disaggregated | disagg_flexkv.sh |

For configuration details, see the FlexKV Integration Guide.

See Also