Commit cc922de

cquil11 and claude committed
H100 K2.5: add --kv-cache-dtype fp8

The H100 INT4 launcher was running with the default bf16 KV cache, while the B300 launcher already uses fp8. With max_num_seqs * max_model_len (256K) KV pool sizing, the bf16 KV cache pinned too much memory inside gpu-mem-util's budget, leaving no runtime headroom for fused_marlin_moe's 896 MiB intermediate buffer. fp8 halves the KV pool footprint, freeing tens of GB per GPU for runtime allocations.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
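The halving claim in the message above can be checked with back-of-envelope arithmetic. The following is a minimal sketch, not the launcher's actual sizing logic: `kv_pool_bytes` is a hypothetical helper, and `MAX_NUM_SEQS` and `ELEMS_PER_TOKEN` are placeholder values rather than the real Kimi K2.5 or launcher settings. Only the 256K `max_model_len` comes from the commit message. It also ignores any extra metadata an fp8 cache might carry.

```shell
#!/bin/sh
# Illustrative KV pool arithmetic. Every value marked "hypothetical" is a
# placeholder and NOT taken from the model config or the launcher script.

# kv_pool_bytes SEQS LEN ELEMS WIDTH
# Bytes needed to hold KV entries for SEQS sequences of LEN tokens each,
# with ELEMS KV elements per token (summed over layers), WIDTH bytes apiece.
kv_pool_bytes() {
  echo $(( $1 * $2 * $3 * $4 ))
}

MAX_NUM_SEQS=4          # hypothetical
MAX_MODEL_LEN=262144    # 256K, as stated in the commit message
ELEMS_PER_TOKEN=32768   # hypothetical

# bf16 stores 2 bytes per element, fp8 stores 1.
bf16_bytes=$(kv_pool_bytes "$MAX_NUM_SEQS" "$MAX_MODEL_LEN" "$ELEMS_PER_TOKEN" 2)
fp8_bytes=$(kv_pool_bytes "$MAX_NUM_SEQS" "$MAX_MODEL_LEN" "$ELEMS_PER_TOKEN" 1)

gib=1073741824
echo "bf16 pool: $(( bf16_bytes / gib )) GiB"
echo "fp8 pool:  $(( fp8_bytes / gib )) GiB"
echo "freed:     $(( (bf16_bytes - fp8_bytes) / gib )) GiB"
```

Whatever the real per-token element count is, the element width is the only factor that changes between the two dtypes, so the fp8 pool is half the bf16 pool under this model.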
1 parent b9497f1 commit cc922de

1 file changed

Lines changed: 1 addition & 0 deletions

benchmarks/single_node/agentic/kimik2.5_int4_h100.sh

@@ -59,6 +59,7 @@ vllm serve $MODEL \
 --reasoning-parser kimi_k2 \
 --tool-call-parser kimi_k2 \
 --compilation_config.pass_config.fuse_allreduce_rms true \
+--kv-cache-dtype fp8 \
 --trust-remote-code \
 $OFFLOAD_ARGS > "$SERVER_LOG" 2>&1 &
 SERVER_PID=$!
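
The commit message notes that the H100 and B300 launchers had drifted apart on this setting. One possible way to make such drift easier to spot, sketched here as an assumption and not as part of this commit, is to derive the flag from a single overridable variable. `KV_CACHE_DTYPE` and `kv_cache_flag` are hypothetical names; only the `--kv-cache-dtype` flag itself appears in the diff.

```shell
#!/bin/sh
# Hypothetical sketch: build the KV cache flag from one variable so that
# every launcher sharing this snippet defaults to the same dtype.

# Default to fp8 unless the caller has exported a different value.
KV_CACHE_DTYPE="${KV_CACHE_DTYPE:-fp8}"

# Print the flag and its value as they would appear on the vllm command line.
kv_cache_flag() {
  echo "--kv-cache-dtype $KV_CACHE_DTYPE"
}

echo "would pass: $(kv_cache_flag)"
```

In a launcher, the unquoted expansion `$(kv_cache_flag)` would split into the flag and its value as two separate arguments, which is the intended behaviour here.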
