[None][feat] DSv4: model, tokenizer, and integration coverage by lfr-0531 · Pull Request #15414 · NVIDIA/TensorRT-LLM

lfr-0531 · 2026-06-16T11:21:18Z

Description

Draft PR for PR-9 of the PR-14751 split. This lands the DeepSeek-V4 model/API/tokenizer/parser wiring plus DSv4 docs, examples, CI stage wiring, integration test-list entries, attribution metadata, and focused unit coverage.

This PR intentionally depends on the earlier split PRs:

PR-1 runtime/KV foundations: [None][feat] DSv4 prep: runtime cache foundations #15378
PR-2 compressor/mHC primitives: [None][feat] DSv4 prep: compressor and mHC primitives #15379
PR-3 IndexerTopK/TopK primitives: [None][feat] DSv4 prep: IndexerTopK and TopK primitives #15381
PR-4 attention op plumbing: [None][feat] DSv4 prep: attention op plumbing #15384
PR-5 attention fusion custom ops: [None][perf] DSv4 prep: attention fusion custom ops #15390
PR-6 DSv4 sparse cache manager: [None][feat] DSv4: sparse cache manager adapter #15394
PR-7 sparse MLA backend: [None][feat] DSv4: sparse MLA attention backend #15409
PR-8 MoE/routing stack: [None][feat] DSv4 prep: MoE routing and backend support #15402

Notable scope notes:

tensorrt_llm/_torch/models/modeling_gemma4.py is intentionally absent; the unnecessary PR8 Gemma4 compatibility hunk was removed.
Earlier primitive/cache/sparse-attention/MoE implementation files are not part of PR-9's own commits.
base_worker.py is not part of the latest PR-9 extraction; DSv4 disaggregated coverage is represented by DS stage/test-list wiring and should be validated in CI or on a model-share host.

Test plan

Built wheel with ccache:
- CCACHE_DIR=/home/scratch.fanrongl_coreai/ccache_trtllm python ./scripts/build_wheel.py --trt_root /usr/local/tensorrt --benchmarks --use_ccache --cuda_architectures "90-real;100-real" --configure_cmake
- Installed build/tensorrt_llm-1.3.0rc19-cp312-cp312-linux_x86_64.whl with pip install --force-reinstall --no-deps.
- Build exited 0. Attribution generation warned that ninja -t inputs returned no results for wheel targets, but wheel creation succeeded.
Import/path smoke with PYTHONPATH=$PWD confirmed tensorrt_llm and tensorrt_llm.bindings resolve from the PR-9 worktree, and verified DSv4 sparse config, MEGAMOE_DEEPGEMM, and the deepseek_v4 tokenizer alias.
After checking nvidia-smi for an idle GPU, ran with CUDA_VISIBLE_DEVICES=4:
- pytest tests/unittest/_torch/modeling/test_modeling_deepseekv4.py tests/unittest/_torch/modules/test_engram.py
- Result: 62 passed, 2 warnings.
Ran:
- pytest tests/unittest/_torch/test_model_config.py tests/unittest/llmapi/test_deepseek_v4_tokenizer.py tests/unittest/llmapi/test_reasoning_parser.py tests/unittest/llmapi/apps/test_tool_parsers.py tests/unittest/llmapi/test_llm_args.py tests/unittest/api_stability
- Result: 617 passed, 3 skipped, 4 warnings.
git diff --check HEAD passed.
Scope-leakage grep passed; no modeling_gemma4.py, earlier C++ primitive/cache/sparse-attention/MoE implementation files, or base_worker.py in PR-9's own manifest.
python -m py_compile examples/llm-api/quickstart_advanced.py scripts/test_to_stage_mapping.py passed.
python scripts/test_to_stage_mapping.py --tests "tests/integration/defs/accuracy/test_llm_api_pytorch.py::TestDeepSeekV4Flash::test_auto_dtype tests/integration/defs/accuracy/test_llm_api_pytorch.py::TestDeepSeekV4Flash::test_nvfp4_4gpus_static_eplb tests/integration/defs/accuracy/test_llm_api_pytorch.py::TestDeepSeekV4FlashBase::test_auto_dtype" returned:
- DGX_B200-4_GPUs-PyTorch-DS-1
- DGX_B300-4_GPUs-PyTorch-1
- DGX_B300-4_GPUs-PyTorch-DS-1
Targeted pre-commit on changed files passed with SKIP=type-check; the local type-check hook was attempted but exceeded 5 minutes while mypy was checking tensorrt_llm/_torch/pyexecutor/sampler.py, so it was interrupted to avoid a hang. All other targeted hooks passed, including test-list validation.
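The scope-leakage check mentioned in the test plan can be approximated in a few lines. The exact grep used for this PR is not shown, so the following is only a minimal sketch: `FORBIDDEN_BASENAMES` and `find_scope_leaks` are hypothetical names, the manifest is assumed to be a plain list of changed paths, and only the two file names called out in the description are covered (the earlier C++ implementation files are not modeled).

```python
from pathlib import PurePosixPath
from typing import Iterable, List

# File names the PR description says must NOT appear in PR-9's own manifest.
# Hypothetical constant; the real check was a grep whose pattern is not shown.
FORBIDDEN_BASENAMES = {"modeling_gemma4.py", "base_worker.py"}


def find_scope_leaks(manifest: Iterable[str]) -> List[str]:
    """Return every manifest path whose basename is on the forbidden list."""
    return [
        path for path in manifest
        if PurePosixPath(path).name in FORBIDDEN_BASENAMES
    ]


# A clean manifest yields no leaks; a manifest containing the Gemma4 file does.
clean = [
    "examples/llm-api/quickstart_advanced.py",
    "scripts/test_to_stage_mapping.py",
]
leaky = clean + ["tensorrt_llm/_torch/models/modeling_gemma4.py"]
print(find_scope_leaks(clean))
print(find_scope_leaks(leaky))
```

Matching on basenames keeps the sketch independent of directory layout; a stricter check would match full paths.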

Local waivers / caveats

DSv4 4-GPU accuracy nodeids were not run locally because LLM_MODELS_ROOT is unset and no local DeepSeek-V4-Flash / DeepSeek-V4-Flash-Base model directory was found under common model roots. These should be validated in the DS CI stages or on a host with the model share mounted.
Full Sphinx docs build was not run because the active venv does not have sphinx installed.

Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

Add DSv4 attention fusion helper ops and focused tests.

This extracts q-norm, MLA RoPE, fused inverse-RoPE FP8 quantization, and packed FP8 quantization registration from PR-14751 without pulling in the sparse MLA backend, MoE routing, or DSv4 model/tokenizer files.

Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

lfr-0531 · 2026-06-16T11:21:33Z

/bot run --extra-stage "DGX_B200-4_GPUs-PyTorch-DS-1,DGX_B300-4_GPUs-PyTorch-DS-1,GB200-4_GPUs-PyTorch-DS-1"

tensorrt-cicd · 2026-06-16T11:27:01Z

PR_Github #54577 [ run ] triggered by Bot. Commit: 2f33f8b Link to invocation

tensorrt-cicd · 2026-06-16T14:16:33Z

PR_Github #54577 [ run ] completed with state SUCCESS. Commit: 2f33f8b
/LLM/main/L0_MergeRequest_PR pipeline #43620 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

This reverts commit a7a5679. Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

This reverts commit 42aeb73. Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

- gate_forward: mark out_weights/out_indices as Tensor(a!)/(b!) so the in-place op is sound under functionalization / CUDA graphs.
- DeepSeekV4 hashed routing: substitute an empty bias tensor when the bias callable returns None, avoiding a dispatch failure on Tensor bias.
- TRTLLMGen FP8 block-scale: reject a per-expert swiglu_limit tensor; only the uniform swiglu_limit_scalar is consumed.
- MegaMoE DG weight dedup: record rebuild metadata before releasing redundant params so dynamic-EPLB reload can restore them.
- fp8_utils swiglu clamp: gate on swiglu_limit > 0.0 to match silu_and_mul and avoid a destructive 0.0 clamp.

Signed-off-by: xxi <xxi@nvidia.com>
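To illustrate why the swiglu clamp is gated on `swiglu_limit > 0.0`, here is a pure-Python sketch. It is not the fp8_utils implementation: the function names are hypothetical, and the clamp shape (upper bound on the gate, symmetric bound on the up projection) is an assumption made only for illustration. The point it shows is that an unconditional clamp with a limit of 0.0 zeroes every output, whereas the guarded version treats 0.0 as "no clamp".

```python
import math
from typing import List, Sequence


def silu(x: float) -> float:
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def swiglu_guarded(gate: Sequence[float], up: Sequence[float],
                   swiglu_limit: float = 0.0) -> List[float]:
    """Clamp only when a positive limit is configured; 0.0 disables the clamp."""
    if swiglu_limit > 0.0:
        gate = [min(g, swiglu_limit) for g in gate]
        up = [max(-swiglu_limit, min(u, swiglu_limit)) for u in up]
    return [silu(g) * u for g, u in zip(gate, up)]


def swiglu_unguarded(gate: Sequence[float], up: Sequence[float],
                     swiglu_limit: float = 0.0) -> List[float]:
    """Clamps unconditionally, so a limit of 0.0 collapses `up` to zero."""
    gate = [min(g, swiglu_limit) for g in gate]
    up = [max(-swiglu_limit, min(u, swiglu_limit)) for u in up]
    return [silu(g) * u for g, u in zip(gate, up)]


gate_vals = [1.0, -2.0]
up_vals = [2.0, -3.0]
print(swiglu_guarded(gate_vals, up_vals, 0.0))    # activations preserved
print(swiglu_unguarded(gate_vals, up_vals, 0.0))  # every entry is zero
```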

…e-routing Signed-off-by: xxi <xxi@nvidia.com> # Conflicts: # tensorrt_llm/_torch/modules/fused_moe/__init__.py

…A pointer shape

The DSv4 routing prep changed BaseMoeRoutingMethod.apply to take an extra input_ids argument and updated all MoE backend call sites to pass it positionally, but two routing-method subclasses defined outside routing.py were not updated:
- Gemma4MoeRoutingMethod (modeling_gemma4.py)
- Step3p7MoeRoutingMethod (modeling_step3p7.py)

Their one-argument apply() now raises a TypeError when invoked by the backends, breaking the Gemma4 MoE model tests. Add the optional input_ids parameter to match the base contract.

Also restore the .reshape(-1, 3) on the routed-expert MoE LoRA weight_pointers in CutlassFusedMoE._extract_moe_lora_tensors. The pointer table is built flat ([num_seqs * 3]) in PyTorchModelEngine._build_lora_params and the C++ fused_moe op asserts a [num_seqs, 3] shape (moeOp.cpp), so dropping the reshape breaks the eager MoE LoRA path.

Signed-off-by: xxi <xxi@nvidia.com>
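The signature mismatch described in this commit message can be reproduced with plain Python classes. The sketch below does not use the TensorRT-LLM classes: `BaseRouting`, `StaleRouting`, `FixedRouting`, and `backend_forward` are hypothetical stand-ins, and top-1 selection is used only to give `apply` something to return.

```python
from typing import List, Optional, Sequence


def top1(router_logits: Sequence[Sequence[float]]) -> List[int]:
    """Index of the largest logit in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in router_logits]


class BaseRouting:
    # Base contract after the change: apply() accepts an extra input_ids argument.
    def apply(self, router_logits, input_ids: Optional[Sequence[int]] = None):
        raise NotImplementedError


class StaleRouting(BaseRouting):
    # Subclass that was not updated: still the old one-argument signature.
    def apply(self, router_logits):
        return top1(router_logits)


class FixedRouting(BaseRouting):
    # Subclass updated to accept (and here ignore) the optional input_ids.
    def apply(self, router_logits, input_ids: Optional[Sequence[int]] = None):
        return top1(router_logits)


def backend_forward(routing: BaseRouting, router_logits, input_ids):
    # The call site passes input_ids positionally, as the commit describes.
    return routing.apply(router_logits, input_ids)


def raises_type_error(routing: BaseRouting, router_logits, input_ids) -> bool:
    """True if calling through the backend path raises TypeError."""
    try:
        backend_forward(routing, router_logits, input_ids)
    except TypeError:
        return True
    return False


logits = [[0.1, 0.9], [0.8, 0.2]]
print(raises_type_error(StaleRouting(), logits, [7, 8]))
print(backend_forward(FixedRouting(), logits, [7, 8]))
```

Giving `input_ids` a default of `None` keeps existing callers that pass only the logits working.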

…failure

The DSv4 prep bumped DeepGEMM to 67fc648, whose sm100_fp4_mqa_logits.cuh splits the TMEM accumulator load into two halves of kNumHeads/2 but still asserts the load width N is in {32, 64}. For the num_heads=32 indexer shape this drives N=16, tripping the static_assert and failing the runtime NVCC JIT compile (test_dsa_fp4_indexer::test_fp4_mqa_logits_shape_and_topk_intersection[32] on B200/B300).

DeepGEMM 245dc5d6 restores the N==16 TMEM loader (DeepGEMM NVIDIA#353, paged-MQA fixes) while preserving set_pdl(); 67fc648 is a direct ancestor so this is a purely additive bump. This is the same tag already adopted on feat/deepseek_v4 (PR NVIDIA#14940).

Updates the two attribution data files to match, keeping the trt-llm-oss-compliance check green.

Signed-off-by: xxi <xxi@nvidia.com>
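The arithmetic behind the static_assert failure is small enough to check by hand. The sketch below models it in Python rather than in the CUDA header. The function and constant names are hypothetical, and the "after" width set of {16, 32, 64} is an assumption inferred from the statement that the newer tag restores the N==16 loader.

```python
from typing import FrozenSet

# Load widths accepted at 67fc648, per the commit message.
WIDTHS_BEFORE_BUMP: FrozenSet[int] = frozenset({32, 64})
# Assumed accepted set once the N == 16 loader is restored.
WIDTHS_AFTER_BUMP: FrozenSet[int] = frozenset({16, 32, 64})


def tmem_load_width(num_heads: int) -> int:
    """The accumulator load is split into two halves of num_heads / 2."""
    return num_heads // 2


def width_assert_passes(num_heads: int, allowed: FrozenSet[int]) -> bool:
    """Model of the static_assert: the per-half load width must be allowed."""
    return tmem_load_width(num_heads) in allowed


for heads in (32, 64, 128):
    print(
        heads,
        tmem_load_width(heads),
        width_assert_passes(heads, WIDTHS_BEFORE_BUMP),
        width_assert_passes(heads, WIDTHS_AFTER_BUMP),
    )
```

With num_heads=32 the per-half width is 16, which is outside {32, 64}; that is the case the failing test exercises.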

Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

Signed-off-by: Fanrong Leng <fanrongl@nvidia.com>

# Conflicts:
# 3rdparty/fetch_content.json
# cpp/tensorrt_llm/thop/moeGateOp.cpp
# scripts/attribution/data/dependency_metadata.yml
# scripts/attribution/data/files_to_dependency.yml
# tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
# tensorrt_llm/_torch/modules/fused_moe/quantization.py
# tensorrt_llm/_torch/modules/fused_moe/routing.py
# tensorrt_llm/quantization/utils/fp8_utils.py

lfr-0531 added 14 commits June 16, 2026 06:45

[None][feat] DSv4 prep: runtime cache foundations

dae4143

Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

[None][feat] DSv4 prep: compressor and mHC primitives

7381699

Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

[None][feat] DSv4: sparse cache manager adapter

d58937e

Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

[None][feat] DSv4 prep: IndexerTopK and TopK primitives

c01efc3

Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

[None][feat] DSv4 prep: attention op plumbing

5125977

Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

[None][perf] address attention fusion review comments

f05eb11

Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

[None][perf] restore fused fp8 quant SM gate

1c20b17

Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

[None][feat] Update shared sparse MLA backend

b0d8590

Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

[None][feat] Add DeepSeek V4 sparse MLA backend

37cf02d

Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

[None][test] fix sparse MLA verification regressions

4711a56

Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

[None][feat] DSv4 prep: MoE routing and backend support

eee4ccd

Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

[None][feat] DSv4 model and API support

2202e8c

Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

[None][test] DSv4 integration coverage and docs

2f33f8b

Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

github-actions Bot assigned lfr-0531 Jun 16, 2026

lfr-0531 and others added 12 commits June 17, 2026 07:54

[None][feat] DSv4 prep: MoE routing and backend support

5f201da

Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

Revert "[None][feat] DSv4 prep: MoE routing and backend support"

e6fdbaf

This reverts commit a7a5679. Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

Reapply "[None][feat] DSv4 prep: MoE routing and backend support"

8aa1b08

This reverts commit 42aeb73. Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

[None][chore] Remove MoE LoRA hunk from PR8

d2ff3ed

Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

[None][refactor] Remove generic CUTLASS changes from PR8

efbe014

Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

Merge remote-tracking branch 'origin/main' into user/fanrongl/dsv4-mo…

e7ae0ea

…e-routing Signed-off-by: xxi <xxi@nvidia.com> # Conflicts: # tensorrt_llm/_torch/modules/fused_moe/__init__.py

Merge branch 'main' into user/fanrongl/dsv4-moe-routing

e6cbe06

Merge latest main into DSv4 sparse MLA branch

a83afbf

Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

Fix DeepSeek V4 sparse params import

946504c

Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

lfr-0531 added 3 commits June 23, 2026 11:20

Fix DSv4 sparse metadata compatibility

0a87d7e

Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

Merge latest DSv4 sparse MLA branch into DSv4 model API

7c38d06

Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

lfr-0531 wants to merge 29 commits into NVIDIA:main from lfr-0531:user/fanrongl/dsv4-model-api
