@@ -249,6 +250,108 @@ Offload needs a clear synchronization relationship between load, compute, and wr
---

## How to Enable KV Cache Strategies in Config

In LightX2V, KV Cache settings live under `ar_config` because they belong to the autoregressive inference path rather than to the static model weights. Weight offload settings stay at the top level of the config because they control where model weights are stored and when they are moved to the GPU.

### Enable KV Quantization

KV quantization is configured through `ar_config.kv_quant`. For example, KIVI int4 KV Cache can be enabled as follows:

```json
{
    "ar_config": {
        "local_attn_size": 21,
        "num_frame_per_chunk": 3,
        "sink_size": 3,
        "kv_quant": {
            "quant_scheme": "kivi",
            "k_cache_type": "int4",
            "v_cache_type": "int4",
            "group_size": 64
        },
        "kv_offload": false
    }
}
```
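
To make the role of `group_size` concrete, the following is a minimal pure-Python sketch of generic group-wise asymmetric 4-bit quantization. It is an illustration under stated assumptions, not LightX2V's or KIVI's actual kernel: the function names are hypothetical, the input is a flat list of floats, and each group is assumed to store one fp16 scale and one fp16 zero point (32 bits of metadata). It also does not model how KIVI lays out groups across keys and values.

```python
def quantize_group(values, bits=4):
    """Asymmetric min/max quantization of one group to unsigned `bits`-bit codes."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1               # 15 distinct steps for 4 bits
    scale = (hi - lo) / levels or 1.0      # fall back to 1.0 for a constant group
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, zero):
    """Map integer codes back to approximate float values."""
    return [c * scale + zero for c in codes]


def quantize(values, group_size=64, bits=4):
    """Split a flat list into groups and quantize each group independently."""
    return [
        quantize_group(values[i:i + group_size], bits)
        for i in range(0, len(values), group_size)
    ]


def effective_bits(bits=4, group_size=64, meta_bits=32):
    """Storage per element, including per-group metadata (assumed: fp16 scale + fp16 zero point)."""
    return bits + meta_bits / group_size
```

Under these assumptions, `effective_bits(4, 64)` is 4.5 bits per element. A smaller `group_size` tracks local value ranges more closely but spends more bits on metadata; a larger one does the opposite.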

For SageQuant, the attention backend should also use the SageAttention path that directly consumes quantized KV:

```json
{
    "self_attn_1_type": "sage_attn2_k_int8_v_fp8",
    "ar_config": {
        "kv_quant": {
            "calibrate": false,
            "calib_path": "/path/to/calib_kv.pt",
            "quant_scheme": "sage",
            "k_cache_type": "int8",
            "v_cache_type": "fp8"
        },
        "kv_offload": false
    }
}
```
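
Because the quantization scheme and the attention backend are set in two different places, a mismatch is easy to miss. The sketch below shows one way such a pairing could be checked before launching inference. The helper `check_sage_config` is hypothetical and not part of LightX2V, and it assumes that SageAttention backends can be recognized by a `sage_attn` prefix in `self_attn_1_type`, which is inferred only from the example value above.

```python
def check_sage_config(config: dict) -> list[str]:
    """Return a list of problems found; an empty list means the check passed."""
    problems = []
    kv_quant = config.get("ar_config", {}).get("kv_quant") or {}

    # Only the "sage" scheme is checked here; other schemes are left alone.
    if kv_quant.get("quant_scheme") != "sage":
        return problems

    attn_type = str(config.get("self_attn_1_type", ""))
    # Assumption: SageAttention backends share the "sage_attn" name prefix.
    if not attn_type.startswith("sage_attn"):
        problems.append(
            f"quant_scheme is 'sage' but self_attn_1_type is {attn_type!r}; "
            "expected a SageAttention backend"
        )
    return problems
```

Calling `check_sage_config` on the config above returns an empty list; removing `self_attn_1_type` from it would produce one message.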

### Enable KV Offload

KV offload is controlled by `ar_config.kv_offload`. It can be used without weight offload: model weights remain managed by the normal path, while part of the dynamic KV Cache is moved through the KV offload path.

```json
{
    "cpu_offload": false,
    "ar_config": {
        "local_attn_size": 21,
        "num_frame_per_chunk": 3,
        "sink_size": 3,
        "kv_offload": true
    }
}
```

KV offload can also be combined with KV quantization:

```json
{
    "cpu_offload": false,
    "ar_config": {
        "kv_quant": {
            "quant_scheme": "kivi",
            "k_cache_type": "int4",
            "v_cache_type": "int4",
            "group_size": 64
        },
        "kv_offload": true
    }
}
```

### Enable KV Offload + Weight Offload

When GPU memory is more constrained, KV offload can be combined with weight offload. In this case, `ar_config.kv_offload` controls KV Cache movement, while the top-level `cpu_offload` and `offload_granularity` settings control model weight movement.

```json
{
    "cpu_offload": true,
    "offload_granularity": "block",
    "t5_cpu_offload": true,
    "vae_cpu_offload": true,
    "ar_config": {
        "kv_quant": {
            "quant_scheme": "kivi",
            "k_cache_type": "int4",
            "v_cache_type": "int4",
            "group_size": 64
        },
        "kv_offload": true
    }
}
```
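
The examples in this section differ only in a handful of keys, so the split between top-level weight settings and `ar_config` KV settings can be made explicit with a small builder. This is a sketch only: `build_config` and `KIVI_INT4` are hypothetical names, the key names and values are copied from the examples above, and attention-window fields such as `local_attn_size` are left out for brevity.

```python
import json

# Values copied from the KIVI int4 example in this section.
KIVI_INT4 = {
    "quant_scheme": "kivi",
    "k_cache_type": "int4",
    "v_cache_type": "int4",
    "group_size": 64,
}


def build_config(kv_quant=None, kv_offload=False, weight_offload=False):
    """Compose a config dict from the three independent strategy switches."""
    # Weight offload settings sit at the top level of the config.
    config = {"cpu_offload": weight_offload}
    if weight_offload:
        config["offload_granularity"] = "block"
        config["t5_cpu_offload"] = True
        config["vae_cpu_offload"] = True

    # KV Cache settings sit under ar_config.
    ar_config = {"kv_offload": kv_offload}
    if kv_quant is not None:
        ar_config["kv_quant"] = dict(kv_quant)  # copy so the preset is not shared
    config["ar_config"] = ar_config
    return config


# Example: KV quantization + KV offload + weight offload.
print(json.dumps(build_config(KIVI_INT4, kv_offload=True, weight_offload=True), indent=4))
```

Keeping the three switches as separate arguments mirrors the config layout: toggling `weight_offload` never touches `ar_config`, and toggling the KV options never touches the top-level keys.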

Combining KV offload with weight offload targets two different memory sources at the same time: weight offload reduces static model-weight residency on the GPU, while KV offload reduces the dynamic historical-state residency created during autoregressive generation.

---

## Recommended Usage Strategies
KV Cache strategies can be selected based on GPU memory and model size.
@@ -339,8 +442,3 @@ The Lingbot World Fast measurements show the same pattern in practice. On H200,
As autoregressive video generation and real-time world models continue to evolve, KV Cache will become an increasingly important part of inference systems. For consumer GPUs, weight offload addresses static weight memory pressure, while KV Cache management addresses dynamic historical-state memory pressure. Combining the two is what makes larger long-sequence video models practical on local devices.