Commit 248f291

Update Kvcache.md
1 parent ea42ebb commit 248f291

1 file changed

Lines changed: 27 additions & 18 deletions

_articles/Kvcache.md

@@ -297,30 +297,31 @@ Even on A100 / H100 / H200, KV Cache management remains useful. It can:
 
 The following tables are reserved for Lingbot World Fast measurements. The recommended setup is to keep the same input conditions, such as resolution, frame count, prompt, seed, GPU, and inference configuration, then record VRAM, latency, and generated video for each strategy. This makes it easier to compare the benefits and costs of different KV Cache strategies.
 
-### Baseline Comparison
 
-| Method | KV Quant | KV Offload | Weight Offload | Peak VRAM | Total Time | Avg Iter Time | Video / Result |
-|---|---|---|---|---:|---:|---:|---|
-| Original Lingbot implementation | - | - | - | | | | |
-| LightX2V | - | - | - | | | | |
+### Baseline and Optimization Comparison on a Single H200
 
-### KV Quantization Comparison
+The first comparison uses a single H200 GPU. It shows the difference between the original Lingbot implementation and LightX2V under the same generation setting, and then compares KV quantization and KV offload. The 161-frame case is especially useful for showing how KV Cache optimization changes the memory/speed trade-off.
 
-| Method | KV Quant | KV Offload | Weight Offload | Peak VRAM | Total Time | Avg Iter Time | Video / Result |
-|---|---|---|---|---:|---:|---:|---|
-| LightX2V + SageQuant | SageQuant | - | - | | | | |
-| LightX2V + KIVI | KIVI | - | - | | | | |
-| LightX2V + TurboQuant | TurboQuant | - | - | | | | |
+| Method | Frames | KV Quant (int4) | KV Offload | Peak VRAM | Inference Time | Video / Result |
+|---|---:|---|---|---:|---:|---|
+| Original | 81 | - | - | ~100G | ~92s | <video src="https://github.com/user-attachments/assets/dd774030-9696-4464-a458-0762edc43f27" width="200px"></video>|
+| LightX2V | 81 | - | - | ~100G | ~56s | <video src="https://github.com/user-attachments/assets/1ada7b44-28e5-4fda-9dc8-310a03d803ab" width="200px"></video>|
+| Original | 161 | - | - | OOM | - | - |
+| LightX2V | 161 | - | - | ~100G | ~110s |<video src="https://github.com/user-attachments/assets/e4bac46a-3fef-4165-9c22-e2b2466b147b" width="200px"></video>|
+| LightX2V | 161 | Enabled | - | ~70G | ~151s |<video src="https://github.com/user-attachments/assets/b38edff3-3912-4ee7-989b-e93f91e68e1e" width="200px"></video>|
+| LightX2V | 161 | Enabled | Enabled | ~54G | ~255s |<video src="https://github.com/user-attachments/assets/e669b938-1879-47b0-8da7-3e0d2d90b972" width="200px"></video>|
+
+### Long-Video Generation on a Consumer GPU
+
+The second comparison highlights one of the most practical goals of KV Cache optimization: generating a one-minute video on a consumer GPU. In this RTX 5090 case, the original Lingbot implementation cannot fit in memory, while LightX2V can run the one-minute generation on a single consumer GPU by combining KV quantization, KV offload, and weight offload.
+
+| Method | Frames | KV Quant (int4) | KV Offload | Weight Offload | Peak VRAM | Inference Time | Video / Result |
+|---|---:|---|---|---|---:|---:|---|
+| Original | 961 | - | - | - | OOM | - | - |
+| LightX2V | 961 | Enabled | Enabled | Enabled | | | |
 
-### Offload Combination Comparison
 
-| Method | KV Quant | KV Offload | Weight Offload | Peak VRAM | Total Time | Avg Iter Time | Video / Result |
-|---|---|---|---|---:|---:|---:|---|
-| LightX2V + KV Offload | - | Enabled | - | | | | |
-| LightX2V + KV Offload + Weight Offload | - | Enabled | Enabled | | | | |
-| LightX2V + SageQuant + KV Offload + Weight Offload | SageQuant | Enabled | Enabled | | | | |
 
----
 
 ## Conclusion
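As a reading aid for the 161-frame rows added to the H200 table in this hunk, the memory/speed trade-off can be worked out directly from the approximate table values. The sketch below is illustrative only: `ROWS` and `tradeoff` are hypothetical names rather than anything from LightX2V, and the inputs are the rounded figures from the table, so the results are only as precise as those figures.

```python
# Approximate figures copied from the single-H200, 161-frame LightX2V rows.
# Each entry: (configuration, peak VRAM in GB, inference time in seconds).
ROWS = [
    ("LightX2V", 100, 110),
    ("LightX2V + KV quant (int4)", 70, 151),
    ("LightX2V + KV quant (int4) + KV offload", 54, 255),
]


def tradeoff(rows):
    """Compare every row against the first row, which is treated as the baseline."""
    _, base_vram, base_time = rows[0]
    results = []
    for name, vram, seconds in rows[1:]:
        results.append({
            "name": name,
            # Share of the baseline peak VRAM that the strategy frees up.
            "vram_saved_pct": round(100 * (base_vram - vram) / base_vram, 1),
            # Inference time relative to the baseline (values above 1 are slower).
            "slowdown": round(seconds / base_time, 2),
        })
    return results


for row in tradeoff(ROWS):
    print(f'{row["name"]}: -{row["vram_saved_pct"]}% VRAM, {row["slowdown"]}x time')
```

With these rounded inputs, int4 KV quantization frees about 30% of peak VRAM for roughly 1.37x the inference time, and adding KV offload frees about 46% for roughly 2.32x.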

@@ -334,4 +335,12 @@ From an engineering perspective, LightX2V provides three layers of abstraction:
 2. **Rolling / Local Attention** controls the history window and prevents KV from growing without bound;
 3. **KV Quantization + KV Offload** reduce KV Cache VRAM usage, while double buffering and asynchronous streams reduce transfer overhead as much as possible.
 
+The Lingbot World Fast measurements show the same pattern in practice. On H200, LightX2V improves the baseline inference time, while KV Cache optimization can significantly reduce peak VRAM at the cost of extra transfer or quantization overhead. On RTX 5090, combining KV quantization, KV offload, and weight offload turns a one-minute generation case from OOM into a runnable single-GPU workload.
+
 As autoregressive video generation and real-time world models continue to evolve, KV Cache will become an increasingly important part of inference systems. For consumer GPUs, weight offload addresses static weight memory pressure, while KV Cache management addresses dynamic historical-state memory pressure. Combining the two is what makes larger long-sequence video models practical on local devices.
+
+
+https://github.com/user-attachments/assets/67efded8-65d5-4d0b-9a64-71c369e96e9c
+
+
+
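Item 2 in the list shown in this hunk (rolling / local attention bounding the history window) can be pictured with a fixed-size buffer. This is a toy sketch under stated assumptions, not LightX2V's implementation: `RollingKVCache`, its `window` parameter, and the string placeholders standing in for key/value tensors are all hypothetical.

```python
from collections import deque


class RollingKVCache:
    """Toy cache that keeps only the most recent `window` frames of K/V."""

    def __init__(self, window):
        # deque(maxlen=...) drops the oldest entry automatically once full,
        # so memory stays bounded however many frames are generated.
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def __len__(self):
        return len(self.keys)


cache = RollingKVCache(window=4)
for frame in range(10):
    # Strings stand in for the per-frame key/value tensors.
    cache.append(f"k{frame}", f"v{frame}")

print(len(cache), list(cache.keys))
```

After ten appended frames, only the last four remain in the cache. A real system would additionally decide which evicted entries, if any, to quantize or offload rather than discard.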