fix: Kimi B300 image + H100/H200 MAX_MODEL_LEN caps

cquil11 · claude · cquil11 · commit 53df78a41d1c · 2026-05-14T09:34:54.000-05:00
Round 1 of the marathon left 3 of 4 targets fully broken. Fixes:

- B300 (kimik2.5-fp4-b300-vllm): v0.20.2 (cu129) lacks flashinfer FP4
  MoE kernels for B300's reported SM: the trtllm_fp4_block_scale_moe
  path asserts "Only SM 10.x and 11.x are supported", and B300 reports
  sm_12x. Revert to v0.20.0-cu130, the Blackwell-targeted build that
  already works for the INT4 B300 sister.

- H100 (kimik2.5_int4_h100.sh): 80 GB HBM is too tight for Kimi K2.5
  INT4's native 1M-token context; vLLM bails with "No available memory
  for the cache blocks" at engine init. Add MAX_MODEL_LEN=32768 default.

- H200 (kimik2.5_int4_h200.sh): same failure mode with more headroom
  (141 GB HBM). Add MAX_MODEL_LEN=131072 default.
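
For reference, the defaulting rule both scripts now apply can be sketched
as a standalone function. This is only an illustration:
resolve_max_model_len is a hypothetical helper name, not something this
commit adds (the scripts inline the same check). An empty or "0" value
counts as "not set" and falls back to the per-GPU cap; any other value
passes through unchanged.

```shell
# Hypothetical helper (not part of this commit): mirrors the inline
# MAX_MODEL_LEN defaulting added to the H100 and H200 scripts.
#   $1 = caller-supplied value (may be empty or "0")
#   $2 = per-GPU fallback cap
resolve_max_model_len() {
    if [ -z "${1:-}" ] || [ "${1:-}" = "0" ]; then
        echo "$2"   # unset, empty, or "0": use the fallback cap
    else
        echo "$1"   # explicit override wins
    fi
}

resolve_max_model_len ""      32768    # prints 32768  (H100 default)
resolve_max_model_len "0"     131072   # prints 131072 (H200 default)
resolve_max_model_len "65536" 32768    # prints 65536  (override kept)
```

An override such as MAX_MODEL_LEN=65536 in the environment therefore
still takes precedence over either default.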

MI355X had 17/18 pass; just retrying the 1 transient failure (170s
short; it looked like a vLLM worker startup race, not a real bench
failure).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
diff --git a/.github/configs/nvidia-master.yaml b/.github/configs/nvidia-master.yaml
@@ -2612,7 +2612,12 @@ kimik2.5-fp4-b200-vllm:
 # does not have a B300-specific recipe, so this config reuses the existing
 # Kimi-K2.5 FP4 B200 vLLM recipe as-is until B300-specific tuning is available.
 kimik2.5-fp4-b300-vllm:
-  image: vllm/vllm-openai:v0.20.2
+  # v0.20.2 (cu129) lacks the flashinfer kernels for B300's reported SM
+  # (sm_12x); workers hit "Only SM 10.x and 11.x are supported" in the
+  # trtllm_fp4_block_scale_moe path. v0.20.0-cu130 is the Blackwell-targeted
+  # build that has the full sm_10x/sm_11x/sm_12x kernel set and is what the
+  # INT4 B300 sister already uses successfully.
+  image: vllm/vllm-openai:v0.20.0-cu130
   model: nvidia/Kimi-K2.5-NVFP4
   model-prefix: kimik2.5
   runner: b300
diff --git a/benchmarks/single_node/agentic/kimik2.5_int4_h100.sh b/benchmarks/single_node/agentic/kimik2.5_int4_h100.sh
@@ -16,6 +16,13 @@ DURATION=${DURATION:-1800}
 MAX_DELAY=${MAX_DELAY:-60}
 ADVANCE_MIN=${ADVANCE_MIN:-0.0}
 ADVANCE_MAX=${ADVANCE_MAX:-0.7}
+# H100 80 GB HBM is too small for Kimi K2.5 INT4's native 1M-token context;
+# vLLM bails at engine init with "No available memory for the cache blocks".
+# Cap at 32K so the engine can reserve KV blocks. Long-context trace turns
+# beyond this cap are truncated by vLLM at request time.
+if [ -z "${MAX_MODEL_LEN:-}" ] || [ "$MAX_MODEL_LEN" = "0" ]; then
+    MAX_MODEL_LEN=32768
+fi
 
 if [[ -n "${SLURM_JOB_ID:-}" ]]; then
     echo "JOB $SLURM_JOB_ID running on ${SLURMD_NODENAME:-unknown}"
@@ -55,6 +62,7 @@ vllm serve $MODEL \
 --port $PORT \
 --gpu-memory-utilization 0.95 \
 --tensor-parallel-size $TP \
+--max-model-len $MAX_MODEL_LEN \
 --max-num-seqs $CONC \
 --reasoning-parser kimi_k2 \
 --tool-call-parser kimi_k2 \
diff --git a/benchmarks/single_node/agentic/kimik2.5_int4_h200.sh b/benchmarks/single_node/agentic/kimik2.5_int4_h200.sh
@@ -16,6 +16,13 @@ DURATION=${DURATION:-1800}
 MAX_DELAY=${MAX_DELAY:-60}
 ADVANCE_MIN=${ADVANCE_MIN:-0.0}
 ADVANCE_MAX=${ADVANCE_MAX:-0.7}
+# H200 141 GB HBM is too tight for Kimi K2.5 INT4's native 1M-token context;
+# without a cap, vLLM either fails to allocate KV blocks at engine init or
+# aiperf's Configure-Profiling phase times out waiting for the slow KV
+# initialization. Cap at 128K for plenty of context with KV headroom.
+if [ -z "${MAX_MODEL_LEN:-}" ] || [ "$MAX_MODEL_LEN" = "0" ]; then
+    MAX_MODEL_LEN=131072
+fi
 
 if [[ -n "${SLURM_JOB_ID:-}" ]]; then
     echo "JOB $SLURM_JOB_ID running on ${SLURMD_NODENAME:-unknown}"
@@ -55,6 +62,7 @@ vllm serve $MODEL \
 --port $PORT \
 --gpu-memory-utilization 0.95 \
 --tensor-parallel-size $TP \
+--max-model-len $MAX_MODEL_LEN \
 --max-num-seqs $CONC \
 --reasoning-parser kimi_k2 \
 --tool-call-parser kimi_k2 \