Commit cc922de
H100 K2.5: add --kv-cache-dtype fp8
The H100 INT4 launcher was running with default bf16 KV cache while the
B300 launcher already uses fp8. With max_num_seqs * max_model_len (256K)
KV pool sizing, bf16 KV pinned too much memory inside gpu-mem-util's
budget, leaving no runtime headroom for fused_marlin_moe's 896 MiB
intermediate buffer. fp8 halves the KV pool footprint, freeing tens of
GB per GPU for runtime allocations.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
1 parent b9497f1 commit cc922de
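
The arithmetic behind the message can be sketched as follows. This is a rough illustration, not the launcher's actual sizing code: it assumes a plain K-plus-V cache layout (which may not match this model's attention implementation), and every dimension in `SHAPE` is hypothetical, chosen only to show the scale. It follows the `max_num_seqs * max_model_len` pool sizing the message describes and compares 2-byte bf16 entries against 1-byte fp8 entries.

```python
def kv_pool_bytes(num_seqs, max_model_len, num_layers, num_kv_heads, head_dim, dtype_bytes):
    """Worst-case KV pool size if every sequence grows to max_model_len."""
    # Each token stores one K and one V vector per layer and per KV head.
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return num_seqs * max_model_len * per_token


# Hypothetical dimensions for illustration only; not taken from the commit.
SHAPE = dict(
    num_seqs=4,
    max_model_len=256 * 1024,  # the 256K context length named in the message
    num_layers=60,
    num_kv_heads=8,
    head_dim=128,
)

GIB = 1024 ** 3
bf16 = kv_pool_bytes(**SHAPE, dtype_bytes=2)  # bf16: 2 bytes per element
fp8 = kv_pool_bytes(**SHAPE, dtype_bytes=1)   # fp8: 1 byte per element

print(f"bf16 pool: {bf16 / GIB:.1f} GiB")
print(f"fp8 pool:  {fp8 / GIB:.1f} GiB")
print(f"freed:     {(bf16 - fp8) / GIB:.1f} GiB")
```

Whatever the real dimensions are, the pool size is linear in the element width, so moving from bf16 to fp8 halves it. The absolute savings per GPU depend on the model shape and on how the pool is sharded across devices, neither of which this sketch models.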
1 file changed
Lines changed: 1 addition & 0 deletions
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| 59 | 59 | |
| 60 | 60 | |
| 61 | 61 | |
| | 62 | + |
| 62 | 63 | |
| 63 | 64 | |
| 64 | 65 | |