Use official TRT-LLM image (1.3.0rc15.post1) for DSv4 B300 TRT (non-MTP + MTP) #1636
Oseltamivir wants to merge 14 commits into
Conversation
…03e6 Bumps the TensorRT-LLM DeepSeek-V4-Pro image for dsv4-fp4-b200-trt and dsv4-fp4-b300-trt to ghcr.io/semianalysisai/trtllm-deepseek-v4:feat-deepseek_v4-2dd03e6. The -mtp variants are intentionally left on feat-deepseek_v4-9aa3715. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your single-node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.
PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow
As a rule of thumb, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26783090679
- config-keys:
    - dsv4-fp4-b200-trt
    - dsv4-fp4-b300-trt
  description:
    - "Update the TensorRT-LLM DeepSeek-V4-Pro image to ghcr.io/semianalysisai/trtllm-deepseek-v4:feat-deepseek_v4-2dd03e6"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX
🟡 The new perf-changelog.yaml entry leaves the pr-link as the unfilled template placeholder https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX. It should be pull/1636 to match the actual PR number and the convention of every preceding entry, otherwise the pr-link is broken for both humans and any tooling that consumes the changelog.
Extended reasoning...
What the bug is. The diff appends a new entry to perf-changelog.yaml (lines 3358-3363) for the DSv4 TRT image bump on dsv4-fp4-b200-trt and dsv4-fp4-b300-trt. The last field of that entry is pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX. XXXX is the literal template placeholder — it was never filled in with the actual PR number.
Why this is a real issue. Every other recent entry in the same file follows the convention of using the real PR number — the five entries immediately above this one link to pull/1602, pull/1624, pull/1616, pull/1626, and pull/1630 respectively. The PR metadata for this change shows it is PR #1636, so the value should be https://github.com/SemiAnalysisAI/InferenceX/pull/1636. With XXXX left in place, the link does not resolve to any PR, breaking the traceability that the pr-link field exists to provide.
Impact. This does not affect the actual image bump or any sweep behavior — the runtime is unchanged. The damage is to the changelog's documentation/audit value: anyone trying to find the originating PR for these two config-key changes from the changelog hits a dead 404, and any tooling that parses pr-link (e.g., to cross-link sweep results back to PRs, generate release notes, or validate entries) will either fail or produce a broken link.
Why existing checks didn't prevent it. There appears to be no schema validation that rejects XXXX as a PR number; the YAML is valid syntactically and the link is a syntactically valid URL — it just points nowhere meaningful. The placeholder is the kind of thing only a reviewer or a numeric-PR-id linter would catch.
Fix. Replace the placeholder with the real PR number:
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1636
Step-by-step proof.
- Open perf-changelog.yaml at line 3363.
- Observe the literal line: pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX.
- Compare to the five entries immediately above (lines ending around 3332, 3338, 3344, 3350, 3356), which read pull/1602, pull/1624, pull/1616, pull/1626, pull/1630: all real PR numbers.
- Check the PR metadata in this review: PR number is 1636.
- Click (or curl) https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX: it does not resolve to a PR. Click https://github.com/SemiAnalysisAI/InferenceX/pull/1636: it resolves to this PR. The placeholder thus makes the field useless for its stated purpose.
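The comment above notes that only a reviewer or a numeric-PR-id linter would catch this placeholder. A minimal sketch of such a linter follows. It assumes each changelog entry carries a single-line `pr-link:` field holding a GitHub pull URL; the function name `find_bad_pr_links`, the sample text, and the `example-org` entry are hypothetical and not part of this repo.

```python
import re

# Assumption: each entry has a line of the form
# "pr-link: https://github.com/<org>/<repo>/pull/<number>".
PR_LINK = re.compile(r"^\s*pr-link:\s*(\S+)\s*$")
NUMERIC_PR = re.compile(r"^https://github\.com/[^/]+/[^/]+/pull/\d+$")


def find_bad_pr_links(text):
    """Return (line_number, url) pairs whose pr-link does not end in a numeric PR id."""
    bad = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        match = PR_LINK.match(line)
        if match and not NUMERIC_PR.match(match.group(1)):
            bad.append((lineno, match.group(1)))
    return bad


# Illustrative input: the first entry is made up, the second mirrors the flagged entry.
sample = "\n".join([
    "- config-keys:",
    "    - example-config-key",
    "  pr-link: https://github.com/example-org/example-repo/pull/123",
    "- config-keys:",
    "    - dsv4-fp4-b200-trt",
    "    - dsv4-fp4-b300-trt",
    "  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX",
])

for lineno, url in find_bad_pr_links(sample):
    print(f"line {lineno}: pr-link is not a numeric PR URL: {url}")
```

The check works on raw text rather than parsed YAML so it needs nothing beyond the standard library. A CI job could run it over perf-changelog.yaml and fail when the returned list is non-empty.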
  dsv4-fp4-b200-trt:
-   image: ghcr.io/semianalysisai/trtllm-deepseek-v4:feat-deepseek_v4-9aa3715
+   image: ghcr.io/semianalysisai/trtllm-deepseek-v4:feat-deepseek_v4-2dd03e6
Is there any official NVIDIA RC that works?
Image is from dsv4 branch: https://github.com/NVIDIA/TensorRT-LLM/tree/feat/deepseek_v4
Main dsv4 failing DPA: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/26786937394
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26783097365
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26786056973
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26786107993
… (non-MTP) Swap dsv4-fp4-b200-trt and dsv4-fp4-b300-trt from the custom ghcr.io semianalysis feat/deepseek_v4 build to the official nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc15.post1 to test whether the official RC can serve DeepSeek-V4-Pro. The -mtp variants are unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26786937394
…on-MTP) The official nvcr.io tensorrt-llm/release:1.3.0rc15.post1 loads DSv4-Pro but its DP-attention path deadlocks/crashes under concurrent load (every dpa=true job hung or failed; only pure-TP conc-1 points passed). Revert to the stable custom build until upstream fixes DSv4 + attention-DP (NVIDIA/TensorRT-LLM#13431). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26803566770
Bump dsv4-fp4-b200-trt and dsv4-fp4-b300-trt to ghcr.io/semianalysisai/trtllm-deepseek-v4:fix-dsv4-swa-scratch-revert-shrink-c914d6d (TRT-LLM feat/deepseek_v4 @ 084cf2ba + kv_cache_manager_v2 fix). This resolves the engine crash on attention-DP context/generation reverts at high concurrency (the b300 8k1k conc>=512 "LLM is shutting down" hang). The -mtp variants stay on feat-deepseek_v4-9aa3715. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26803566770
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26811531104
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26811681728
…reuse The c914d6d image's kv_cache_manager_v2 patch was wrong: freeing SWA scratch slots on the attention-DP revert->resize(shrink) path hits finish_event=None (a deferred request never forwarded), crashing every dpa=true job and hanging the engine. Root cause is a V2-scheduler / SWA-scratch-reuse conflict: the V2 scheduler grows a context request's KV cache (incl. SWA scratch) before delay batching can defer it, so revert_allocate_context -> resize(shrink) must release scratch slots that have no finish_event. Revert both non-MTP images to feat-deepseek_v4-2dd03e6 and set TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE=0 in the launchers so no scratch slots are allocated and the revert shrinks cleanly. MTP configs untouched (9aa3715). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1f70cac to e23a541
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26843313476
B200 reverts to feat-deepseek_v4-9aa3715: the 2dd03e6 image OOMs on B200's smaller HBM at conc-256 once SWA scratch reuse is disabled. Only B300 moves to 2dd03e6 + TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE=0 in its launcher. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26912996470
…TP + MTP) Point dsv4-fp4-b300-trt and dsv4-fp4-b300-trt-mtp at the official nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc15.post1 (from the custom feat/deepseek_v4 builds 2dd03e6 / 9aa3715) and drop the TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE=0 launcher workaround so the official image runs with native behavior. B200 TRT unchanged (9aa3715). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26914210927
…PA crash The previous sweep crashed all dpa=true jobs with CUDA_ERROR_ILLEGAL_ADDRESS on rc15.post1 without the SWA scratch workaround. Re-add TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE=0 to both B300 TRT launchers (non-MTP and MTP) to determine whether the DPA crash is the same SWA-scratch bug or a separate FMHA kernel issue. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
    - dsv4-fp4-b300-trt-mtp
  description:
    - "Point the B300 TensorRT-LLM DeepSeek-V4-Pro configs (non-MTP dsv4-fp4-b300-trt and MTP dsv4-fp4-b300-trt-mtp) at the official NVIDIA release image nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc15.post1, replacing the custom ghcr.io semianalysis feat/deepseek_v4 builds (2dd03e6 and 9aa3715 respectively), to evaluate the official RC for DeepSeek-V4-Pro. Also drops the TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE=0 launcher workaround (specific to the custom build) so the official image runs with its native behavior. B200 TRT is unchanged (stays on feat-deepseek_v4-9aa3715)."
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1636
Duplicate PR changelog entries
Low Severity
This commit adds two separate perf-changelog.yaml blocks for the same PR link and the same dsv4-fp4-b300-trt / dsv4-fp4-b300-trt-mtp config keys, with conflicting descriptions. That duplicates maintenance and leaves readers unsure which entry reflects the shipped change.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit c2381b7. Configure here.
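The duplicate-entry finding above could be detected mechanically. The sketch below assumes the changelog has already been parsed into a list of dicts with `config-keys` and `pr-link` fields, as the diff snippets suggest. The function name `find_duplicate_entries`, the shortened descriptions, and the `example.invalid` entry are illustrative only.

```python
from collections import defaultdict


def find_duplicate_entries(entries):
    """Group entries by (pr-link, sorted config-keys); return groups with more than one entry."""
    groups = defaultdict(list)
    for index, entry in enumerate(entries):
        key = (entry["pr-link"], tuple(sorted(entry["config-keys"])))
        groups[key].append(index)
    return {key: indexes for key, indexes in groups.items() if len(indexes) > 1}


PR_1636 = "https://github.com/SemiAnalysisAI/InferenceX/pull/1636"

# Illustrative entries: descriptions are shortened stand-ins, not the real changelog text.
entries = [
    {
        "config-keys": ["dsv4-fp4-b300-trt", "dsv4-fp4-b300-trt-mtp"],
        "description": ["stand-in: official image, workaround dropped"],
        "pr-link": PR_1636,
    },
    {
        "config-keys": ["dsv4-fp4-b300-trt-mtp", "dsv4-fp4-b300-trt"],
        "description": ["stand-in: official image, workaround kept as a diagnostic"],
        "pr-link": PR_1636,
    },
    {
        "config-keys": ["example-config-key"],
        "description": ["stand-in: unrelated entry"],
        "pr-link": "https://example.invalid/pull/1",
    },
]

for (link, keys), indexes in find_duplicate_entries(entries).items():
    print(f"{link} has {len(indexes)} entries for {', '.join(keys)}: indexes {indexes}")
```

Sorting the config keys before grouping means two entries that list the same keys in a different order still count as duplicates. Whether a repeated (PR, keys) pair should be an error or only a warning is a policy choice for the maintainers.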
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit b09619e. Configure here.
    - dsv4-fp4-b300-trt-mtp
  description:
    - "Point the B300 TensorRT-LLM DeepSeek-V4-Pro configs (non-MTP dsv4-fp4-b300-trt and MTP dsv4-fp4-b300-trt-mtp) at the official NVIDIA release image nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc15.post1, replacing the custom ghcr.io semianalysis feat/deepseek_v4 builds (2dd03e6 and 9aa3715 respectively), to evaluate the official RC for DeepSeek-V4-Pro. Also drops the TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE=0 launcher workaround (specific to the custom build) so the official image runs with its native behavior. B200 TRT is unchanged (stays on feat-deepseek_v4-9aa3715)."
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1636
Changelog contradicts launcher workaround
Medium Severity
The new perf-changelog entry states the TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE=0 launcher workaround was dropped so the official image uses native behavior, but the same PR adds that export (default 0) to both B300 TRT launchers, so sweeps still disable SWA scratch reuse.
Additional Locations (2)
Reviewed by Cursor Bugbot for commit b09619e. Configure here.
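The mismatch described above (changelog says the workaround was dropped, launchers still export it with default 0) can be spot-checked with a small script. This is a rough sketch: the launcher excerpt is hypothetical, since the actual export line in dsv4_fp4_b300_trt.sh is not shown in this thread, and the description match is a naive substring test on the wording quoted in the changelog entry.

```python
import re

VAR = "TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE"

# Matches `export VAR=0`, `export VAR="0"`, and the shell-default form
# `export VAR="${VAR:-0}"`; the captured group is the value used when nothing overrides it.
EXPORT = re.compile(
    r'^\s*export\s+' + VAR + r'="?(?:\$\{' + VAR + r':-)?(\d+)\}?"?\s*$',
    re.MULTILINE,
)


def launcher_default(launcher_text):
    """Return the default value the launcher exports for VAR, or None if it exports nothing."""
    match = EXPORT.search(launcher_text)
    return match.group(1) if match else None


def changelog_contradicts_launcher(description, launcher_text):
    """True when the description says the =0 workaround was dropped but the launcher still exports 0."""
    claims_dropped = ("drops the " + VAR + "=0") in description
    return claims_dropped and launcher_default(launcher_text) == "0"


# Hypothetical launcher excerpt; the real script's wording may differ.
launcher = (
    'echo "SWA scratch reuse toggle"\n'
    'export TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE="${TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE:-0}"\n'
)
description = "Also drops the TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE=0 launcher workaround"

print(changelog_contradicts_launcher(description, launcher))
```

Note that a `${VAR:-0}` default can still be overridden from the calling environment, so the script only reports what a sweep does when nothing else sets the variable.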
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26999118817


Points both B300 DSv4 TRT configs at the official NVIDIA release image and adds the MTP sibling to the sweep:
- dsv4-fp4-b300-trt (non-MTP): feat-deepseek_v4-2dd03e6 → nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc15.post1
- dsv4-fp4-b300-trt-mtp (MTP): feat-deepseek_v4-9aa3715 → nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc15.post1

This drops the custom ghcr.io semianalysis feat/deepseek_v4 builds in favor of the official RC, to evaluate whether the official image can serve DeepSeek-V4-Pro (non-MTP and MTP). The non-MTP launcher's TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE=0 workaround (specific to the custom build) is removed so the official image runs with its native behavior, matching the MTP launcher which never had it.

Known risk
A prior run of 1.3.0rc15.post1 with attention-DP (dpa=true) served a couple of iterations and then crashed with CUDA_ERROR_ILLEGAL_ADDRESS in kv_cache_manager.free_resources (run 26786937394), a different failure from the custom build's SWA-scratch-revert crash. So dpa=true jobs may still fail on the official image; the pure-TP (dpa=false) cases are more likely to pass. MTP on the official RC is untested. This sweep is what tells us where it stands.

Scope
B200 TRT is unchanged (stays on feat-deepseek_v4-9aa3715); its OOM follow-up is tracked separately.

🤖 Generated with Claude Code
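Note
Medium Risk
Changes only benchmark images and launcher env vars, but official RC plus dp-attn: true search points have previously hit CUDA illegal-address crashes, so sweep stability is uncertain until results land.

Overview
B300 DSv4 TensorRT-LLM fixed-seq-len configs dsv4-fp4-b300-trt and dsv4-fp4-b300-trt-mtp now use nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc15.post1 instead of custom ghcr.io/semianalysisai/trtllm-deepseek-v4 tags, so sweeps evaluate the official RC for non-MTP and MTP on B300. B200 TRT configs are untouched.

Both dsv4_fp4_b300_trt.sh and dsv4_fp4_b300_trt_mtp.sh export TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE=0 by default (with echo logging) to see whether rc15.post1 attention-DP failures match the old SWA-scratch issue or a separate FMHA/kernel path.

perf-changelog.yaml records the image move and the SWA-scratch diagnostic toggle for PR #1636.

Reviewed by Cursor Bugbot for commit b09619e. Bugbot is set up for automated code reviews on this repo. Configure here.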
dsv4_fp4_b300_trt.shanddsv4_fp4_b300_trt_mtp.shexportTRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE=0by default (with echo logging) to see whetherrc15.post1attention-DP failures match the old SWA-scratch issue or a separate FMHA/kernel path.perf-changelog.yamlrecords the image move and the SWA-scratch diagnostic toggle for PR #1636.Reviewed by Cursor Bugbot for commit b09619e. Bugbot is set up for automated code reviews on this repo. Configure here.