| title | FlexKV |
|---|
FlexKV is a scalable, distributed runtime for KV cache offloading developed by Tencent Cloud's TACO team and NVIDIA in collaboration with the community. It acts as a unified KV caching layer for inference engines such as SGLang, TensorRT-LLM, and vLLM.
- Multi-level caching: CPU memory, local SSD, and scalable storage (cloud storage) for KV cache offloading
- Distributed KV cache reuse: Share KV cache across multiple nodes using distributed RadixTree
- High-performance I/O: Supports io_uring and GPU Direct Storage (GDS) for accelerated data transfer
- Asynchronous operations: Get and put operations can overlap with computation through prefetching
- Dynamo installed with vLLM support
- Infrastructure services running:
docker compose -f dev/docker-compose.yml up -d
- FlexKV installed:
git clone https://github.com/taco-project/FlexKV.git
cd FlexKV
./build.sh
- Optional: SSD offloading dependencies (only required for CPU + SSD tiered offloading):
apt install liburing-dev libxxhash-dev
Set the DYNAMO_USE_FLEXKV environment variable and use the --kv-transfer-config flag:
export DYNAMO_USE_FLEXKV=1
python -m dynamo.vllm --model Qwen/Qwen3-0.6B --kv-transfer-config '{"kv_connector":"FlexKVConnectorV1","kv_role":"kv_both"}'
# Terminal 1: Start frontend
python -m dynamo.frontend &
# Terminal 2: Start vLLM worker with FlexKV
DYNAMO_USE_FLEXKV=1 \
FLEXKV_CPU_CACHE_GB=32 \
python -m dynamo.vllm --model Qwen/Qwen3-0.6B --kv-transfer-config '{"kv_connector":"FlexKVConnectorV1","kv_role":"kv_both"}'
For multi-worker deployments with KV-aware routing to maximize cache reuse:
# Terminal 1: Start frontend with KV router
python -m dynamo.frontend \
--router-mode kv \
--router-reset-states &
# Terminal 2: Worker 1
DYNAMO_USE_FLEXKV=1 \
FLEXKV_CPU_CACHE_GB=32 \
FLEXKV_SERVER_RECV_PORT="ipc:///tmp/flexkv_server_0" \
CUDA_VISIBLE_DEVICES=0 \
python -m dynamo.vllm \
--model Qwen/Qwen3-0.6B \
--kv-transfer-config '{"kv_connector":"FlexKVConnectorV1","kv_role":"kv_both"}' \
--gpu-memory-utilization 0.2 \
--kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20080","enable_kv_cache_events":true}' &
# Terminal 3: Worker 2
DYNAMO_USE_FLEXKV=1 \
FLEXKV_CPU_CACHE_GB=32 \
FLEXKV_SERVER_RECV_PORT="ipc:///tmp/flexkv_server_1" \
CUDA_VISIBLE_DEVICES=1 \
python -m dynamo.vllm \
--model Qwen/Qwen3-0.6B \
--kv-transfer-config '{"kv_connector":"FlexKVConnectorV1","kv_role":"kv_both"}' \
--gpu-memory-utilization 0.2 \
--kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20081","enable_kv_cache_events":true}'
Note: Disaggregated FlexKV serving is experimental. The prefill worker must use PdConnector with two sub-connectors: FlexKVConnectorV1 (KV cache offloading) and NixlConnector (P/D KV transfer). Using FlexKVConnectorV1 alone as the top-level connector in disaggregated mode is not supported and will result in a TypeError.
FlexKV can be used with disaggregated prefill/decode serving. The prefill worker uses FlexKV for KV cache offloading, while NIXL handles KV transfer between prefill and decode workers. The PdConnector wraps both connectors so they work together.
| Role | Connector | Description |
|---|---|---|
| Decode worker | NixlConnector | Pulls KV blocks from prefill worker via NIXL |
| Prefill worker | PdConnector wrapping [FlexKVConnectorV1, NixlConnector] | FlexKV offloads/onboards KV blocks; NIXL serves them to decode |
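
The connector layout in the table maps onto the JSON passed to --kv-transfer-config, and the nested prefill value is easy to mistype inside shell quotes. The following is a minimal sketch that builds both values in Python so they can be pasted into the worker commands. The helper names `connector` and `pd_connector_config` are hypothetical; the keys, connector names, and module path are the ones used in the commands on this page.

```python
import json


def connector(name, role="kv_both"):
    """Return one connector entry in the shape used by --kv-transfer-config."""
    return {"kv_connector": name, "kv_role": role}


def pd_connector_config():
    """Prefill-side config: PdConnector wrapping the FlexKV and NIXL sub-connectors."""
    cfg = connector("PdConnector")
    cfg["kv_connector_extra_config"] = {
        "connectors": [connector("FlexKVConnectorV1"), connector("NixlConnector")]
    }
    cfg["kv_connector_module_path"] = "kvbm.vllm_integration.connector"
    return cfg


# Compact separators keep the value free of spaces, like the commands on this page.
decode_arg = json.dumps(connector("NixlConnector"), separators=(",", ":"))
prefill_arg = json.dumps(pd_connector_config(), separators=(",", ":"))

print(decode_arg)
print(prefill_arg)
```

Wrap each printed value in single quotes when passing it on the command line so the shell leaves the double quotes intact.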
# Terminal 1: Start frontend
python -m dynamo.frontend &
# Terminal 2: Decode worker (without FlexKV)
CUDA_VISIBLE_DEVICES=0 python -m dynamo.vllm --model Qwen/Qwen3-0.6B \
--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}' &
# Terminal 3: Prefill worker (with FlexKV + NIXL via PdConnector)
DYN_VLLM_KV_EVENT_PORT=20081 \
VLLM_NIXL_SIDE_CHANNEL_PORT=20097 \
DYNAMO_USE_FLEXKV=1 \
FLEXKV_CPU_CACHE_GB=32 \
CUDA_VISIBLE_DEVICES=1 \
python -m dynamo.vllm \
--model Qwen/Qwen3-0.6B \
--is-prefill-worker \
--kv-transfer-config '{"kv_connector":"PdConnector","kv_role":"kv_both","kv_connector_extra_config":{"connectors":[{"kv_connector":"FlexKVConnectorV1","kv_role":"kv_both"},{"kv_connector":"NixlConnector","kv_role":"kv_both"}]},"kv_connector_module_path":"kvbm.vllm_integration.connector"}' \
--kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20081","enable_kv_cache_events":true}'
You can also use the provided launch script directly:
examples/backends/vllm/launch/disagg_flexkv.sh

| Variable | Description | Default |
|---|---|---|
| DYNAMO_USE_FLEXKV | Enable FlexKV integration | 0 (disabled) |
| FLEXKV_CPU_CACHE_GB | CPU memory cache size in GB | Required |
| FLEXKV_CONFIG_PATH | Path to FlexKV YAML config file | Not set |
| FLEXKV_SERVER_RECV_PORT | IPC port for FlexKV server | Auto |
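
When several workers run on one host, each needs its own FLEXKV_SERVER_RECV_PORT, as in the multi-worker commands earlier on this page. The sketch below assembles the per-worker environment in Python, for example to pass to `subprocess.Popen(..., env=...)`. The function name `flexkv_env`, the positivity check, and the one-GPU-per-worker mapping are assumptions made for illustration; only the variable names and the `ipc:///tmp/flexkv_server_N` pattern come from this page.

```python
import os


def flexkv_env(worker_index, cpu_cache_gb, config_path=None):
    """Build the FlexKV-related environment variables for one worker process."""
    # Assumption: a cache size must be positive to be meaningful.
    if cpu_cache_gb <= 0:
        raise ValueError("FLEXKV_CPU_CACHE_GB must be a positive number of GB")
    env = {
        "DYNAMO_USE_FLEXKV": "1",
        "FLEXKV_CPU_CACHE_GB": str(cpu_cache_gb),
        # One IPC endpoint per worker so FlexKV servers on the same host do not collide.
        "FLEXKV_SERVER_RECV_PORT": f"ipc:///tmp/flexkv_server_{worker_index}",
        # Assumption: worker i is pinned to GPU i.
        "CUDA_VISIBLE_DEVICES": str(worker_index),
    }
    if config_path is not None:
        env["FLEXKV_CONFIG_PATH"] = config_path
    return env


# Overlay the FlexKV settings on a copy of the current environment.
worker_envs = [{**os.environ, **flexkv_env(i, 32)} for i in range(2)]
for env in worker_envs:
    print(env["FLEXKV_SERVER_RECV_PORT"], env["CUDA_VISIBLE_DEVICES"])
```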
For simple CPU memory offloading:
unset FLEXKV_CONFIG_PATH
export FLEXKV_CPU_CACHE_GB=32
For multi-tier offloading with SSD storage, create a configuration file:
cat > ./flexkv_config.yml <<EOF
cpu_cache_gb: 32
ssd_cache_gb: 1024
ssd_cache_dir: /data0/flexkv_ssd/;/data1/flexkv_ssd/
enable_gds: false
EOF
export FLEXKV_CONFIG_PATH="./flexkv_config.yml"

| Option | Description |
|---|---|
| cpu_cache_gb | CPU memory cache size in GB |
| ssd_cache_gb | SSD cache size in GB |
| ssd_cache_dir | SSD cache directories (semicolon-separated for multiple SSDs) |
| enable_gds | Enable GPU Direct Storage for SSD I/O |
Note: For full configuration options, see the FlexKV Configuration Reference.
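
If the configuration file is generated by a deployment script rather than written by hand, the four options above can be rendered with plain string formatting. This is a minimal sketch using only the standard library; `render_flexkv_config` and `split_ssd_dirs` are hypothetical helper names, and the check for at least one SSD directory is an assumption rather than a documented FlexKV rule.

```python
def render_flexkv_config(cpu_cache_gb, ssd_cache_gb, ssd_dirs, enable_gds=False):
    """Render the four documented options as YAML text."""
    if not ssd_dirs:
        raise ValueError("at least one SSD cache directory is required")
    lines = [
        f"cpu_cache_gb: {cpu_cache_gb}",
        f"ssd_cache_gb: {ssd_cache_gb}",
        # Multiple SSD directories are joined with semicolons.
        f"ssd_cache_dir: {';'.join(ssd_dirs)}",
        f"enable_gds: {'true' if enable_gds else 'false'}",
    ]
    return "\n".join(lines) + "\n"


def split_ssd_dirs(value):
    """Split a semicolon-separated ssd_cache_dir value into individual paths."""
    return [path for path in value.split(";") if path]


text = render_flexkv_config(32, 1024, ["/data0/flexkv_ssd/", "/data1/flexkv_ssd/"])
print(text, end="")
print(split_ssd_dirs("/data0/flexkv_ssd/;/data1/flexkv_ssd/"))
```

Write `text` to the path that FLEXKV_CONFIG_PATH points to before starting the worker.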
FlexKV supports distributed KV cache reuse, which shares cached blocks across multiple nodes. It relies on:
- Distributed RadixTree: Each node maintains a local snapshot of the global index
- Lease Mechanism: Ensures data validity during cross-node transfers
- RDMA-based Transfer: Uses Mooncake Transfer Engine for high-performance KV cache transfer
For setup instructions, see the FlexKV Distributed Reuse Guide.
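
To make the lease idea concrete: a lease gives a remote reader a time window during which the owning node promises not to evict or overwrite a block. The toy model below illustrates that general technique only. It is not FlexKV's protocol; the class `LeasedBlock` and its methods are hypothetical, and time is passed in explicitly so the behavior is deterministic.

```python
import time


class LeasedBlock:
    """A cached block that must not be evicted while a lease is unexpired."""

    def __init__(self, block_id):
        self.block_id = block_id
        self.lease_expiry = 0.0  # monotonic timestamp; 0.0 means no lease granted

    def grant_lease(self, duration_s, now=None):
        """Grant or extend a lease and return its expiry time."""
        now = time.monotonic() if now is None else now
        # Extending a lease never shortens one that is already held.
        self.lease_expiry = max(self.lease_expiry, now + duration_s)
        return self.lease_expiry

    def can_evict(self, now=None):
        """Eviction is allowed only once every granted lease has expired."""
        now = time.monotonic() if now is None else now
        return now >= self.lease_expiry


block = LeasedBlock(block_id=7)
block.grant_lease(5.0, now=100.0)   # a remote node starts a transfer at t=100
print(block.can_evict(now=103.0))   # transfer window still open
print(block.can_evict(now=105.0))   # lease expired, block may be reclaimed
```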
FlexKV consists of three core modules:
The first module initializes the three-level cache (GPU → CPU → SSD/Cloud). It groups multiple tokens into blocks and stores the KV cache at block granularity, keeping the same KV shape as in GPU memory.
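
Block-level storage means a token sequence is cut into fixed-size groups before anything is cached. The sketch below shows that grouping on plain token IDs. The function name `group_into_blocks` is hypothetical, and treating a trailing partial block as not yet cacheable is an assumption, not a statement about FlexKV's behavior.

```python
def group_into_blocks(token_ids, tokens_per_block):
    """Split token IDs into full blocks plus a remainder shorter than one block."""
    if tokens_per_block <= 0:
        raise ValueError("tokens_per_block must be positive")
    full = len(token_ids) // tokens_per_block
    blocks = [
        tuple(token_ids[i * tokens_per_block:(i + 1) * tokens_per_block])
        for i in range(full)
    ]
    remainder = list(token_ids[full * tokens_per_block:])
    return blocks, remainder


blocks, remainder = group_into_blocks(list(range(10)), tokens_per_block=4)
print(blocks)      # full blocks that could be stored
print(remainder)   # tokens still waiting for a full block
```

Tuples are used for blocks so they can serve as dictionary keys in an index.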
The second module is the control plane. It determines the direction of each data transfer and identifies the source and destination block IDs. It includes:
- RadixTree for prefix matching
- Memory pool to track space usage and trigger eviction
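
The two control-plane pieces can be illustrated with a small model: a prefix tree that maps a sequence of blocks to cached block IDs, and a fixed-capacity pool that decides which block ID to evict. This is a simplified sketch, not FlexKV's data structure. It uses an uncompressed block-level prefix tree instead of a true radix tree, LRU eviction is an assumed policy, and the class names `PrefixIndex` and `BlockPool` are hypothetical.

```python
from collections import OrderedDict


class PrefixIndex:
    """Block-level prefix tree: each edge is one block (a tuple of token IDs)."""

    def __init__(self):
        self.root = {}

    def insert(self, blocks, block_ids):
        """Record the cache block ID for every block along a sequence."""
        node = self.root
        for block, block_id in zip(blocks, block_ids):
            entry = node.setdefault(block, {"id": block_id, "children": {}})
            node = entry["children"]

    def match(self, blocks):
        """Return cached block IDs for the longest matching prefix."""
        node, hits = self.root, []
        for block in blocks:
            entry = node.get(block)
            if entry is None:
                break
            hits.append(entry["id"])
            node = entry["children"]
        return hits


class BlockPool:
    """Fixed-capacity pool that evicts the least recently used block ID."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.in_use = OrderedDict()

    def touch(self, block_id):
        """Mark a block as used; return the evicted block ID, or None."""
        evicted = None
        if block_id in self.in_use:
            self.in_use.move_to_end(block_id)
        else:
            if len(self.in_use) >= self.capacity:
                evicted, _ = self.in_use.popitem(last=False)
            self.in_use[block_id] = True
        return evicted


index = PrefixIndex()
index.insert([(1, 2), (3, 4), (5, 6)], [10, 11, 12])
print(index.match([(1, 2), (3, 4), (9, 9)]))  # IDs of the two shared leading blocks

pool = BlockPool(capacity=2)
for block_id in (10, 11, 10, 12):
    print(block_id, "evicts", pool.touch(block_id))
```

A real implementation would also remove an evicted block from the index; the sketch keeps the two structures separate for brevity.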
The third module is the data plane, which executes the data transfers using:
- Multi-threading for parallel transfers
- High-performance I/O (io_uring, GDS)
- Asynchronous operations overlapping with computation
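
The overlap between transfers and computation can be sketched with a thread pool: while one block is being processed, the load of the next block is already in flight. The functions `load_block` and `compute` are stand-ins invented for this example; they do not correspond to FlexKV APIs, and a real data plane would use io_uring or GDS rather than Python threads.

```python
from concurrent.futures import ThreadPoolExecutor


def load_block(block_id):
    """Stand-in for a cache get (CPU or SSD read); returns fake block contents."""
    return f"kv-{block_id}"


def compute(block):
    """Stand-in for the model step that consumes a loaded block."""
    return block.upper()


def run_with_prefetch(block_ids, workers=2):
    """Process blocks in order while keeping the next load in flight."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pending = pool.submit(load_block, block_ids[0]) if block_ids else None
        for i in range(len(block_ids)):
            block = pending.result()  # wait for the block needed now
            if i + 1 < len(block_ids):
                # Start the next load before computing on the current block.
                pending = pool.submit(load_block, block_ids[i + 1])
            results.append(compute(block))
    return results


print(run_with_prefetch([0, 1, 2]))
```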
curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": false,
"max_tokens": 30
}'
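
The same request can be sent from Python using only the standard library, which is convenient for repeating a prompt from a script. This is a sketch that assumes the frontend listens on localhost:8000 as in the curl command; `build_chat_request` and `send` are hypothetical helper names, and the call to `send` is left commented out because it needs a running deployment.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000", model="Qwen/Qwen3-0.6B"):
    """Build the POST request equivalent to the curl command."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "max_tokens": 30,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30):
    """Send the request and return the decoded JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


req = build_chat_request("Hello!")
print(req.full_url)
# With the frontend and a worker running:
# print(send(req))
```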