Commit 50f61ba

Update Kvcache.md
1 parent 248f291 commit 50f61ba

1 file changed

Lines changed: 103 additions & 5 deletions

_articles/Kvcache.md

@@ -27,6 +27,7 @@ This article explains:
- [Three KV Cache Strategies](#three-kv-cache-strategies)
- [KV Quantization Backends](#kv-quantization-backends)
- [KV Offload Execution Timeline](#kv-offload-execution-timeline)
- [How to Enable KV Cache Strategies in Config](#how-to-enable-kv-cache-strategies-in-config)
- [Recommended Usage Strategies](#recommended-usage-strategies)
- [Conclusion](#conclusion)

@@ -249,6 +250,108 @@ Offload needs a clear synchronization relationship between load, compute, and wr

---

## How to Enable KV Cache Strategies in Config

In LightX2V, KV Cache settings are placed under `ar_config`, because they belong to the autoregressive inference path rather than the static model weights. Weight offload settings stay at the top level of the config, because they control where model weights are stored and when they are moved to GPU.

### Enable KV Quantization

KV quantization is configured through `ar_config.kv_quant`. For example, KIVI int4 KV Cache can be enabled as:

```json
{
  "ar_config": {
    "local_attn_size": 21,
    "num_frame_per_chunk": 3,
    "sink_size": 3,
    "kv_quant": {
      "quant_scheme": "kivi",
      "k_cache_type": "int4",
      "v_cache_type": "int4",
      "group_size": 64
    },
    "kv_offload": false
  }
}
```
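
To make the role of `group_size` concrete, here is a minimal pure-Python sketch of group-wise asymmetric 4-bit quantization. It is not LightX2V's kernel: the function names are hypothetical, and the choice to group along a flat row is an assumption made for illustration. It only shows that each group of 64 values carries its own scale and minimum, and that the round-trip error is bounded by half a quantization step.

```python
import random


def quantize_groupwise(values, group_size=64, bits=4):
    """Asymmetric quantization: each group gets its own scale and minimum."""
    if len(values) % group_size != 0:
        raise ValueError("length must be a multiple of group_size")
    levels = (1 << bits) - 1  # 15 for 4-bit codes
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        lo, hi = min(chunk), max(chunk)
        scale = (hi - lo) / levels or 1.0  # guard against constant groups
        codes = [min(levels, max(0, round((v - lo) / scale))) for v in chunk]
        groups.append((codes, scale, lo))
    return groups


def dequantize_groupwise(groups):
    """Map integer codes back to floats using each group's scale and minimum."""
    out = []
    for codes, scale, lo in groups:
        out.extend(c * scale + lo for c in codes)
    return out


random.seed(0)
kv_row = [random.gauss(0.0, 1.0) for _ in range(128)]  # stand-in for one KV row
groups = quantize_groupwise(kv_row, group_size=64)
restored = dequantize_groupwise(groups)

max_error = max(abs(a - b) for a, b in zip(kv_row, restored))
worst_half_step = max(scale for _, scale, _ in groups) / 2
```

A smaller group size stores more scale and minimum values per tensor but tracks local value ranges more closely, which is the usual trade-off behind a setting like `group_size`.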

For SageQuant, the attention backend must match: set `self_attn_1_type` to a SageAttention variant that consumes the quantized KV directly:

```json
{
  "self_attn_1_type": "sage_attn2_k_int8_v_fp8",
  "ar_config": {
    "kv_quant": {
      "calibrate": false,
      "calib_path": "/path/to/calib_kv.pt",
      "quant_scheme": "sage",
      "k_cache_type": "int8",
      "v_cache_type": "fp8"
    },
    "kv_offload": false
  }
}
```
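
Because the SageQuant setup spans two places in the config, a small consistency check can catch mismatches before launch. The sketch below is a hypothetical helper, not part of LightX2V. It assumes that the attention type name encodes the K and V cache dtypes as `_k_<dtype>` and `_v_<dtype>`, which is inferred only from the example value `sage_attn2_k_int8_v_fp8`.

```python
def check_sage_config(cfg):
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    kv_quant = cfg.get("ar_config", {}).get("kv_quant", {})
    if kv_quant.get("quant_scheme") != "sage":
        return problems  # nothing to check for other schemes

    attn_type = cfg.get("self_attn_1_type", "")
    if not attn_type.startswith("sage_attn"):
        problems.append("self_attn_1_type is not a SageAttention variant")

    # Assumption: the backend name spells out the cache dtypes it expects.
    for side, key in (("k", "k_cache_type"), ("v", "v_cache_type")):
        dtype = kv_quant.get(key)
        if f"_{side}_{dtype}" not in attn_type:
            problems.append(f"{key}={dtype!r} does not match {attn_type!r}")
    return problems


good = {
    "self_attn_1_type": "sage_attn2_k_int8_v_fp8",
    "ar_config": {
        "kv_quant": {
            "quant_scheme": "sage",
            "k_cache_type": "int8",
            "v_cache_type": "fp8",
        }
    },
}
bad = {
    "self_attn_1_type": "flash_attn3",  # placeholder name for a non-Sage backend
    "ar_config": {"kv_quant": {"quant_scheme": "sage",
                               "k_cache_type": "int8",
                               "v_cache_type": "fp8"}},
}
good_problems = check_sage_config(good)
bad_problems = check_sage_config(bad)
```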

### Enable KV Offload

KV offload is controlled by `ar_config.kv_offload`. It can be used without weight offload: model weights stay on the normal path, while part of the dynamic KV Cache moves through the KV offload path.

```json
{
  "cpu_offload": false,
  "ar_config": {
    "local_attn_size": 21,
    "num_frame_per_chunk": 3,
    "sink_size": 3,
    "kv_offload": true
  }
}
```
311+
312+
KV offload can also be combined with KV quantization:
313+
314+
```json
{
  "cpu_offload": false,
  "ar_config": {
    "kv_quant": {
      "quant_scheme": "kivi",
      "k_cache_type": "int4",
      "v_cache_type": "int4",
      "group_size": 64
    },
    "kv_offload": true
  }
}
```
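
Since these variants differ only in a few keys, one way to manage them is to keep a base config and layer small overrides on top. The following is a generic recursive merge written for illustration. `deep_merge` is a hypothetical helper, not a LightX2V function, and it assumes the config is loaded as plain nested dictionaries.

```python
import copy
import json


def deep_merge(base, overlay):
    """Return a new dict where overlay values override base, merging nested dicts."""
    merged = copy.deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


base = {
    "cpu_offload": False,
    "ar_config": {
        "local_attn_size": 21,
        "num_frame_per_chunk": 3,
        "sink_size": 3,
        "kv_offload": False,
    },
}
kivi_int4 = {
    "ar_config": {
        "kv_quant": {
            "quant_scheme": "kivi",
            "k_cache_type": "int4",
            "v_cache_type": "int4",
            "group_size": 64,
        }
    }
}
kv_offload_on = {"ar_config": {"kv_offload": True}}

# Layer both overrides on the base without mutating any of the inputs.
combined = deep_merge(deep_merge(base, kivi_int4), kv_offload_on)
print(json.dumps(combined, indent=2))
```

Merging at the `ar_config` level keeps shared fields such as `local_attn_size` in one place, so switching between the quantization-only, offload-only, and combined setups only changes the overlay.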

### Enable KV Offload + Weight Offload

When GPU memory is more constrained, KV offload can be combined with weight offload. In this case, `ar_config.kv_offload` controls KV Cache movement, while top-level `cpu_offload` and `offload_granularity` control model weight movement.

```json
{
  "cpu_offload": true,
  "offload_granularity": "block",
  "t5_cpu_offload": true,
  "vae_cpu_offload": true,
  "ar_config": {
    "kv_quant": {
      "quant_scheme": "kivi",
      "k_cache_type": "int4",
      "v_cache_type": "int4",
      "group_size": 64
    },
    "kv_offload": true
  }
}
```

This combination targets two different memory sources at the same time: weight offload reduces static model-weight residency on GPU, while KV offload reduces the dynamic historical-state residency created during autoregressive generation.
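
The split between top-level keys and `ar_config` keys can be summarized by a small reader that reports which memory strategies a config turns on. This is a hypothetical helper for illustration only. It assumes that a missing key means the feature is disabled, which the article does not state.

```python
def summarize_memory_strategy(cfg):
    """Report which weight-side and KV-side options a config dict enables."""
    ar_config = cfg.get("ar_config", {})
    kv_quant = ar_config.get("kv_quant")
    weight_offload = bool(cfg.get("cpu_offload", False))  # assumed default: off
    return {
        # Static memory: where model weights live.
        "weight_offload": weight_offload,
        "offload_granularity": cfg.get("offload_granularity") if weight_offload else None,
        # Dynamic memory: the KV Cache built up during generation.
        "kv_offload": bool(ar_config.get("kv_offload", False)),
        "kv_quant_scheme": kv_quant.get("quant_scheme") if kv_quant else None,
    }


cfg = {
    "cpu_offload": True,
    "offload_granularity": "block",
    "ar_config": {
        "kv_quant": {"quant_scheme": "kivi", "k_cache_type": "int4",
                     "v_cache_type": "int4", "group_size": 64},
        "kv_offload": True,
    },
}
summary = summarize_memory_strategy(cfg)
```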

---

## Recommended Usage Strategies

KV Cache strategies can be selected based on GPU memory and model size.
@@ -339,8 +442,3 @@ The Lingbot World Fast measurements show the same pattern in practice. On H200,

As autoregressive video generation and real-time world models continue to evolve, KV Cache will become an increasingly important part of inference systems. For consumer GPUs, weight offload addresses static weight memory pressure, while KV Cache management addresses dynamic historical-state memory pressure. Combining the two is what makes larger long-sequence video models practical on local devices.

-https://github.com/user-attachments/assets/67efded8-65d5-4d0b-9a64-71c369e96e9c
