Commit 53df78a

cquil11 and claude committed
fix: Kimi B300 image + H100/H200 MAX_MODEL_LEN caps
Round 1 of the marathon left 3 of 4 targets fully broken. Fixes:

- B300 (kimik2.5-fp4-b300-vllm): v0.20.2 (cu129) lacks flashinfer FP4 MoE kernels for B300's reported SM (the trtllm_fp4_block_scale_moe path asserts "Only SM 10.x and 11.x are supported"; B300 reports sm_12x). Revert to v0.20.0-cu130, the Blackwell-targeted build that already works for the INT4 B300 sister.
- H100 (kimik2.5_int4_h100.sh): 80 GB HBM is too tight for Kimi K2.5 INT4's native 1M-token context; vLLM bails with "No available memory for the cache blocks" at engine init. Add MAX_MODEL_LEN=32768 default.
- H200 (kimik2.5_int4_h200.sh): same shape of problem at a higher cap (141 GB HBM). Add MAX_MODEL_LEN=131072 default.

MI355X had 17/18 pass; just retrying the 1 transient failure (170s short; it looked like a vLLM worker startup race, not a real bench failure).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
1 parent ec66535 commit 53df78a

3 files changed

Lines changed: 22 additions & 1 deletion


.github/configs/nvidia-master.yaml

Lines changed: 6 additions & 1 deletion
@@ -2612,7 +2612,12 @@ kimik2.5-fp4-b200-vllm:
 # does not have a B300-specific recipe, so this config reuses the existing
 # Kimi-K2.5 FP4 B200 vLLM recipe as-is until B300-specific tuning is available.
 kimik2.5-fp4-b300-vllm:
-  image: vllm/vllm-openai:v0.20.2
+  # v0.20.2 (cu129) lacks the flashinfer kernels for B300's reported SM
+  # (sm_12x); workers hit "Only SM 10.x and 11.x are supported" in the
+  # trtllm_fp4_block_scale_moe path. v0.20.0-cu130 is the Blackwell-targeted
+  # build that has the full sm_10x/sm_11x/sm_12x kernel set and is what the
+  # INT4 B300 sister already uses successfully.
+  image: vllm/vllm-openai:v0.20.0-cu130
   model: nvidia/Kimi-K2.5-NVFP4
   model-prefix: kimik2.5
   runner: b300
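
The assertion quoted in the comment above accepts only SM 10.x and 11.x. As a rough illustration, a preflight check along these lines could flag an unsupported GPU before the workers start. This is a minimal sketch, not part of the commit: the function name `fp4_moe_sm_supported` is hypothetical, and it assumes the compute capability is already available as a `major.minor` string.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in this repository): succeeds only when the
# major compute-capability version is one the quoted assertion accepts.
fp4_moe_sm_supported() {
    local cap="$1"            # e.g. "10.0" or "12.0"
    local major="${cap%%.*}"  # strip everything from the first dot onward
    case "$major" in
        10|11) return 0 ;;    # "Only SM 10.x and 11.x are supported"
        *)     return 1 ;;
    esac
}
```

On recent drivers the input string can usually be read with `nvidia-smi --query-gpu=compute_cap --format=csv,noheader`; whether that query is available on the benchmark runners is an assumption.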

benchmarks/single_node/agentic/kimik2.5_int4_h100.sh

Lines changed: 8 additions & 0 deletions
@@ -16,6 +16,13 @@ DURATION=${DURATION:-1800}
 MAX_DELAY=${MAX_DELAY:-60}
 ADVANCE_MIN=${ADVANCE_MIN:-0.0}
 ADVANCE_MAX=${ADVANCE_MAX:-0.7}
+# H100 80 GB HBM is too small for Kimi K2.5 INT4's native 1M-token context;
+# vLLM bails at engine init with "No available memory for the cache blocks".
+# Cap at 32K so the engine can reserve KV blocks. Long-context trace turns
+# beyond this cap are truncated by vLLM at request time.
+if [ -z "${MAX_MODEL_LEN:-}" ] || [ "$MAX_MODEL_LEN" = "0" ]; then
+    MAX_MODEL_LEN=32768
+fi
 
 if [[ -n "${SLURM_JOB_ID:-}" ]]; then
     echo "JOB $SLURM_JOB_ID running on ${SLURMD_NODENAME:-unknown}"
@@ -55,6 +62,7 @@ vllm serve $MODEL \
 --port $PORT \
 --gpu-memory-utilization 0.95 \
 --tensor-parallel-size $TP \
+--max-model-len $MAX_MODEL_LEN \
 --max-num-seqs $CONC \
 --reasoning-parser kimi_k2 \
 --tool-call-parser kimi_k2 \
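
The guard added in this file treats an unset, empty, or `0` value of `MAX_MODEL_LEN` as "use the default" and passes any other value through unchanged. A minimal sketch of that behavior in isolation follows; the function name `resolve_max_model_len` is hypothetical and exists only for illustration, since the script performs the check inline.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the same test the diff adds inline:
# empty or "0" falls back to the default, anything else is kept.
resolve_max_model_len() {
    local requested="${1:-}"
    local default_len="$2"
    if [ -z "$requested" ] || [ "$requested" = "0" ]; then
        echo "$default_len"
    else
        echo "$requested"
    fi
}

resolve_max_model_len "" 32768      # prints 32768
resolve_max_model_len 0 32768       # prints 32768
resolve_max_model_len 16384 32768   # prints 16384
```

In practice this means a caller can still lower or raise the cap by exporting `MAX_MODEL_LEN` before invoking the script.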

benchmarks/single_node/agentic/kimik2.5_int4_h200.sh

Lines changed: 8 additions & 0 deletions
@@ -16,6 +16,13 @@ DURATION=${DURATION:-1800}
 MAX_DELAY=${MAX_DELAY:-60}
 ADVANCE_MIN=${ADVANCE_MIN:-0.0}
 ADVANCE_MAX=${ADVANCE_MAX:-0.7}
+# H200 141 GB HBM is too tight for Kimi K2.5 INT4's native 1M-token context;
+# without a cap, vLLM either fails to allocate KV blocks at engine init or
+# aiperf's Configure-Profiling phase times out waiting for the slow KV
+# initialization. Cap at 131K for plenty of context with KV headroom.
+if [ -z "${MAX_MODEL_LEN:-}" ] || [ "$MAX_MODEL_LEN" = "0" ]; then
+    MAX_MODEL_LEN=131072
+fi
 
 if [[ -n "${SLURM_JOB_ID:-}" ]]; then
     echo "JOB $SLURM_JOB_ID running on ${SLURMD_NODENAME:-unknown}"
@@ -55,6 +62,7 @@ vllm serve $MODEL \
 --port $PORT \
 --gpu-memory-utilization 0.95 \
 --tensor-parallel-size $TP \
+--max-model-len $MAX_MODEL_LEN \
 --max-num-seqs $CONC \
 --reasoning-parser kimi_k2 \
 --tool-call-parser kimi_k2 \
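
The H100 and H200 scripts now carry the same guard with different defaults (32768 and 131072). If the two were ever consolidated, the per-runner defaults could be expressed as a single lookup. The sketch below is an assumption about how that might look, not something this commit does; `default_max_model_len_for` and the runner labels `h100` and `h200` used as its arguments are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical lookup of the default caps introduced by this commit.
default_max_model_len_for() {
    case "$1" in
        h100) echo 32768 ;;    # 32 * 1024
        h200) echo 131072 ;;   # 128 * 1024
        *)    return 1 ;;      # this commit defines no cap for other runners
    esac
}
```

A caller would still apply the unset-or-zero test first, so that an explicit `MAX_MODEL_LEN` from the environment keeps taking precedence over the lookup.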
