Commit d7411dd

cquil11 and claude committed
revert: drop MAX_MODEL_LEN cap from Kimi H100/H200 launchers
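Per the agentic benchmark design, the launchers must not cap context length. This reverts the H100 cap (MAX_MODEL_LEN=16384 with --gpu-memory-utilization 0.85) and the H200 cap (MAX_MODEL_LEN=131072). Both launchers go back to passing no --max-model-len flag at all, so vLLM uses the model's native context length, and the H100 launcher goes to --gpu-memory-utilization 0.95. Any OOM or KV-init failures will be diagnosed separately rather than sidestepped via a cap.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>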
1 parent 957d020 commit d7411dd

2 files changed

Lines changed: 1 addition & 18 deletions

benchmarks/single_node/agentic/kimik2.5_int4_h100.sh

Lines changed: 1 addition & 10 deletions
@@ -16,14 +16,6 @@ DURATION=${DURATION:-1800}
 MAX_DELAY=${MAX_DELAY:-60}
 ADVANCE_MIN=${ADVANCE_MIN:-0.0}
 ADVANCE_MAX=${ADVANCE_MAX:-0.7}
-# H100 80 GB HBM is barely enough for Kimi K2.5 INT4 (~44 GB/GPU weights at
-# TP=8) plus KV reservation plus MoE intermediate buffers. R2 at MAX=32K hit
-# CUDA OOM in fused_marlin_moe even at conc=1. Drop to 16K so KV reservation
-# is half, and pair with --gpu-memory-utilization 0.85 (below) to leave room
-# for the ~900 MiB-per-call MoE workspace.
-if [ -z "${MAX_MODEL_LEN:-}" ] || [ "$MAX_MODEL_LEN" = "0" ]; then
-    MAX_MODEL_LEN=16384
-fi
 
 if [[ -n "${SLURM_JOB_ID:-}" ]]; then
     echo "JOB $SLURM_JOB_ID running on ${SLURMD_NODENAME:-unknown}"
@@ -61,9 +53,8 @@ export VLLM_USE_FLASHINFER_MOE_INT4=1
 vllm serve $MODEL \
     --host 0.0.0.0 \
     --port $PORT \
-    --gpu-memory-utilization 0.85 \
+    --gpu-memory-utilization 0.95 \
     --tensor-parallel-size $TP \
-    --max-model-len $MAX_MODEL_LEN \
     --max-num-seqs $CONC \
     --reasoning-parser kimi_k2 \
     --tool-call-parser kimi_k2 \
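
The removed H100 comment and the 0.85 to 0.95 change can be read as a rough per-GPU memory budget. The sketch below is only back-of-the-envelope arithmetic: it assumes 80 GB of HBM and about 44 GB of weights per GPU (both figures come from the removed comment) and treats --gpu-memory-utilization as the fraction of device memory vLLM may claim. The helper name `headroom_gb` is hypothetical, and the result ignores GB versus GiB differences and anything allocated outside vLLM's budget.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rough memory left for KV cache and MoE buffers per GPU.
# Usage: headroom_gb <hbm_gb> <gpu_memory_utilization> <weights_gb>
headroom_gb() {
    # awk handles the floating-point math that plain bash arithmetic cannot.
    awk -v hbm="$1" -v util="$2" -v weights="$3" \
        'BEGIN { printf "%.1f\n", hbm * util - weights }'
}

# Figures taken from the removed H100 comment (assumed, not measured here).
echo "util=0.85: $(headroom_gb 80 0.85 44) GB left after weights"
echo "util=0.95: $(headroom_gb 80 0.95 44) GB left after weights"
```

Under these assumptions, raising the utilization enlarges the pool vLLM manages, but it also shrinks the memory left outside that pool, which is where the removed comment expected the MoE workspace to fit. With the cap gone, the KV requirement is driven by the model's native context length.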

benchmarks/single_node/agentic/kimik2.5_int4_h200.sh

Lines changed: 0 additions & 8 deletions
@@ -16,13 +16,6 @@ DURATION=${DURATION:-1800}
 MAX_DELAY=${MAX_DELAY:-60}
 ADVANCE_MIN=${ADVANCE_MIN:-0.0}
 ADVANCE_MAX=${ADVANCE_MAX:-0.7}
-# H200 141 GB HBM is too tight for Kimi K2.5 INT4's native 1M-token context;
-# without a cap, vLLM either fails to allocate KV blocks at engine init or
-# aiperf's Configure-Profiling phase times out waiting for the slow KV
-# initialization. Cap at 131K for plenty of context with KV headroom.
-if [ -z "${MAX_MODEL_LEN:-}" ] || [ "$MAX_MODEL_LEN" = "0" ]; then
-    MAX_MODEL_LEN=131072
-fi
 
 if [[ -n "${SLURM_JOB_ID:-}" ]]; then
     echo "JOB $SLURM_JOB_ID running on ${SLURMD_NODENAME:-unknown}"
@@ -62,7 +55,6 @@ vllm serve $MODEL \
     --port $PORT \
     --gpu-memory-utilization 0.95 \
     --tensor-parallel-size $TP \
-    --max-model-len $MAX_MODEL_LEN \
     --max-num-seqs $CONC \
     --reasoning-parser kimi_k2 \
     --tool-call-parser kimi_k2 \
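
Since the point of this commit is that neither launcher passes --max-model-len any more, a small check like the one below could guard against the cap creeping back in. This is a sketch, not part of the commit: the function name `launcher_caps_context` is hypothetical, and it only looks for the flag text on non-comment lines, so it would miss a cap injected through another mechanism.

```shell
#!/usr/bin/env bash
# Hypothetical check: succeed (exit 0) if a launcher script still passes
# --max-model-len on a non-comment line, fail (exit 1) otherwise.
launcher_caps_context() {
    local script="$1"
    # Drop comment-only lines first so a note that merely mentions the flag
    # does not count; `--` stops grep from parsing the pattern as an option.
    grep -v '^[[:space:]]*#' "$script" | grep -q -- '--max-model-len'
}

# Example use from the repository root (paths as listed in this commit):
#   for f in benchmarks/single_node/agentic/kimik2.5_int4_h100.sh \
#            benchmarks/single_node/agentic/kimik2.5_int4_h200.sh; do
#       if launcher_caps_context "$f"; then
#           echo "context cap still present in $f" >&2
#       fi
#   done
```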
