Commit e23a541

Oseltamivir and claude committed
Revert DSv4 B200/B300 TRT (non-MTP) to 2dd03e6 + disable SWA scratch reuse
The c914d6d image's kv_cache_manager_v2 patch was wrong: freeing SWA scratch slots on the attention-DP revert->resize(shrink) path hits finish_event=None (a deferred request never forwarded), crashing every dpa=true job and hanging the engine.

Root cause is a V2-scheduler / SWA-scratch-reuse conflict: the V2 scheduler grows a context request's KV cache (incl. SWA scratch) before delay batching can defer it, so revert_allocate_context -> resize(shrink) must release scratch slots that have no finish_event.

Revert both non-MTP images to feat-deepseek_v4-2dd03e6 and set TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE=0 in the launchers so no scratch slots are allocated and the revert shrinks cleanly. MTP configs untouched (9aa3715).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent 242ab88 commit e23a541

4 files changed

Lines changed: 17 additions & 3 deletions

.github/configs/nvidia-master.yaml

Lines changed: 2 additions & 2 deletions
@@ -1801,7 +1801,7 @@ dsv4-fp4-b200-vllm-agentic:
     - { tp: 8, ep: 8, dp-attn: true, offloading: cpu, conc-list: [64, 128, 256] }

 dsv4-fp4-b200-trt:
-  image: ghcr.io#semianalysisai/trtllm-deepseek-v4:fix-dsv4-swa-scratch-revert-shrink-c914d6d
+  image: ghcr.io#semianalysisai/trtllm-deepseek-v4:feat-deepseek_v4-2dd03e6
   model: deepseek-ai/DeepSeek-V4-Pro
   model-prefix: dsv4
   runner: b200-dsv4
@@ -3049,7 +3049,7 @@ dsv4-fp4-b300-vllm-agentic:
     - { tp: 8, ep: 8, dp-attn: true, offloading: cpu, conc-list: [128, 256, 512] }

 dsv4-fp4-b300-trt:
-  image: ghcr.io#semianalysisai/trtllm-deepseek-v4:fix-dsv4-swa-scratch-revert-shrink-c914d6d
+  image: ghcr.io#semianalysisai/trtllm-deepseek-v4:feat-deepseek_v4-2dd03e6
   model: deepseek-ai/DeepSeek-V4-Pro
   model-prefix: dsv4
   runner: b300

benchmarks/single_node/fixed_seq_len/dsv4_fp4_b200_trt.sh

Lines changed: 7 additions & 0 deletions
@@ -47,6 +47,13 @@ sanitize_slurm_mpi_env_for_trtllm
 export NCCL_NVLS_ENABLE="${NCCL_NVLS_ENABLE:-0}"
 echo "NCCL_NVLS_ENABLE: $NCCL_NVLS_ENABLE"

+# Disable DSv4 SWA scratch reuse: with attention-DP the V2 scheduler grows ctx KV
+# (incl. SWA scratch) before delay batching defers a request, and the resulting
+# revert_allocate_context -> resize(shrink) can't release the scratch of a
+# never-forwarded request (no finish_event), crashing every dpa=true job.
+export TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE="${TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE:-0}"
+echo "TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE: $TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE"
+
 if [[ "$MODEL" != /* ]]; then
     hf download "$MODEL"
 fi
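
The `${VAR:-0}` expansion added to this launcher only supplies a default, so a value already exported by the caller takes precedence. The sketch below isolates that behaviour; `resolve_swa_scratch_reuse` is a hypothetical helper introduced for illustration and is not part of the launcher.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the launcher): mirrors the launcher's
# ${VAR:-0} default so the override behaviour can be checked in isolation.
resolve_swa_scratch_reuse() {
    # Use the caller's value if it is set and non-empty, otherwise 0.
    echo "${TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE:-0}"
}

# Unset: falls back to 0 (scratch reuse disabled).
unset TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE
default_value="$(resolve_swa_scratch_reuse)"

# Pre-set by the caller: the default does not overwrite it.
override_value="$(TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE=1 resolve_swa_scratch_reuse)"

# An empty string is treated like unset by the ':-' form.
empty_value="$(TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE= resolve_swa_scratch_reuse)"

echo "default=$default_value override=$override_value empty=$empty_value"
```

Exporting `TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE=1` before invoking the launcher would therefore re-enable scratch reuse, which per the commit message is the configuration that crashed dpa=true jobs.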

benchmarks/single_node/fixed_seq_len/dsv4_fp4_b300_trt.sh

Lines changed: 7 additions & 0 deletions
@@ -59,6 +59,13 @@ sanitize_slurm_mpi_env_for_trtllm
 export NCCL_NVLS_ENABLE="${NCCL_NVLS_ENABLE:-0}"
 echo "NCCL_NVLS_ENABLE: $NCCL_NVLS_ENABLE"

+# Disable DSv4 SWA scratch reuse: with attention-DP the V2 scheduler grows ctx KV
+# (incl. SWA scratch) before delay batching defers a request, and the resulting
+# revert_allocate_context -> resize(shrink) can't release the scratch of a
+# never-forwarded request (no finish_event), crashing every dpa=true job.
+export TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE="${TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE:-0}"
+echo "TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE: $TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE"
+
 nvidia-smi

 SERVER_LOG="$PWD/server.log"
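
Both launchers are meant to carry the same export line. As a rough sketch, a check like the following could confirm that a given set of launcher scripts contains it; the function names `launcher_disables_swa_scratch_reuse` and `check_launchers` are hypothetical and not part of this commit.

```shell
#!/usr/bin/env bash
# Hypothetical check (not part of the commit): succeed only if the script
# contains the exact export line that defaults scratch reuse to 0.
launcher_disables_swa_scratch_reuse() {
    local script="$1"
    grep -qF 'export TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE="${TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE:-0}"' "$script"
}

# Report each script and return non-zero if any of them lacks the line.
check_launchers() {
    local rc=0 script
    for script in "$@"; do
        if launcher_disables_swa_scratch_reuse "$script"; then
            echo "ok: $script"
        else
            echo "missing: $script"
            rc=1
        fi
    done
    return "$rc"
}
```

From the repository root, this would be called as `check_launchers benchmarks/single_node/fixed_seq_len/dsv4_fp4_b200_trt.sh benchmarks/single_node/fixed_seq_len/dsv4_fp4_b300_trt.sh`.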

perf-changelog.yaml

Lines changed: 1 addition & 1 deletion
@@ -3382,7 +3382,7 @@
     - dsv4-fp4-b200-trt
     - dsv4-fp4-b300-trt
   description:
-    - "Update the TensorRT-LLM DeepSeek-V4-Pro image to ghcr.io/semianalysisai/trtllm-deepseek-v4:fix-dsv4-swa-scratch-revert-shrink-c914d6d (TRT-LLM feat/deepseek_v4 @ 084cf2ba + kv_cache_manager_v2 fix: free SWA scratch slots on shrink instead of asserting, which crashed the engine on attention-DP context/generation reverts at high concurrency, e.g. b300 8k1k conc>=512)"
+    - "Revert the non-MTP TensorRT-LLM DeepSeek-V4-Pro image to ghcr.io/semianalysisai/trtllm-deepseek-v4:feat-deepseek_v4-2dd03e6 and disable DSv4 SWA scratch reuse via TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE=0 in the launcher. Root cause of the prior attention-DP hangs/crashes: the V2 scheduler grows a context request's KV cache (incl. SWA scratch slots) before delay batching can defer it, so revert_allocate_context -> resize(shrink) must release scratch slots of a never-forwarded request, which has no finish_event -> crash on every dpa=true job. Disabling scratch reuse stops those slots from being allocated so the revert shrinks cleanly."
   pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1636

 - config-keys:
