[None][feat] DSv4: model, tokenizer, and integration coverage#15414

Draft
lfr-0531 wants to merge 29 commits into
NVIDIA:mainfrom
lfr-0531:user/fanrongl/dsv4-model-api

Conversation

@lfr-0531

Collaborator

Description

Draft PR for PR-14751 split PR-9. This lands the DeepSeek-V4 model/API/tokenizer/parser wiring plus DSv4 docs, examples, CI stage wiring, integration test-list entries, attribution metadata, and focused unit coverage.

This PR intentionally depends on the earlier split PRs:

Notable scope notes:

  • tensorrt_llm/_torch/models/modeling_gemma4.py is intentionally absent; the unnecessary PR8 Gemma4 compatibility hunk was removed.
  • Earlier primitive/cache/sparse-attention/MoE implementation files are not part of PR-9's own commits.
  • base_worker.py is not part of the latest PR-9 extraction; DSv4 disaggregated coverage is represented by DS stage/test-list wiring and should be validated in CI or on a model-share host.

Test plan

  • Built wheel with ccache:
    • CCACHE_DIR=/home/scratch.fanrongl_coreai/ccache_trtllm python ./scripts/build_wheel.py --trt_root /usr/local/tensorrt --benchmarks --use_ccache --cuda_architectures "90-real;100-real" --configure_cmake
    • Installed build/tensorrt_llm-1.3.0rc19-cp312-cp312-linux_x86_64.whl with pip install --force-reinstall --no-deps.
    • Build exited 0. Attribution generation warned that ninja -t inputs returned no results for wheel targets, but wheel creation succeeded.
  • Import/path smoke with PYTHONPATH=$PWD confirmed tensorrt_llm and tensorrt_llm.bindings resolve from the PR-9 worktree, and verified DSv4 sparse config, MEGAMOE_DEEPGEMM, and the deepseek_v4 tokenizer alias.
  • After checking nvidia-smi, ran on idle CUDA_VISIBLE_DEVICES=4:
    • pytest tests/unittest/_torch/modeling/test_modeling_deepseekv4.py tests/unittest/_torch/modules/test_engram.py
    • Result: 62 passed, 2 warnings.
  • Ran:
    • pytest tests/unittest/_torch/test_model_config.py tests/unittest/llmapi/test_deepseek_v4_tokenizer.py tests/unittest/llmapi/test_reasoning_parser.py tests/unittest/llmapi/apps/test_tool_parsers.py tests/unittest/llmapi/test_llm_args.py tests/unittest/api_stability
    • Result: 617 passed, 3 skipped, 4 warnings.
  • git diff --check HEAD passed.
  • Scope-leakage grep passed; no modeling_gemma4.py, earlier C++ primitive/cache/sparse-attention/MoE implementation files, or base_worker.py in the PR-9 own manifest.
  • python -m py_compile examples/llm-api/quickstart_advanced.py scripts/test_to_stage_mapping.py passed.
  • python scripts/test_to_stage_mapping.py --tests "tests/integration/defs/accuracy/test_llm_api_pytorch.py::TestDeepSeekV4Flash::test_auto_dtype tests/integration/defs/accuracy/test_llm_api_pytorch.py::TestDeepSeekV4Flash::test_nvfp4_4gpus_static_eplb tests/integration/defs/accuracy/test_llm_api_pytorch.py::TestDeepSeekV4FlashBase::test_auto_dtype" returned:
    • DGX_B200-4_GPUs-PyTorch-DS-1
    • DGX_B300-4_GPUs-PyTorch-1
    • DGX_B300-4_GPUs-PyTorch-DS-1
  • Targeted pre-commit on changed files passed with SKIP=type-check; the local type-check hook was attempted but exceeded 5 minutes while mypy was checking tensorrt_llm/_torch/pyexecutor/sampler.py, so it was interrupted to avoid a hang. All other targeted hooks passed, including test-list validation.
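
The import/path smoke step in the test plan can be approximated with the standard library alone. The sketch below is an illustration, not the script used for this PR: `resolves_under` is a hypothetical helper name, and it only checks that a module's spec origin lies under a given root directory.

```python
import importlib.util
from pathlib import Path


def resolves_under(module_name: str, root: str) -> bool:
    """Return True if `module_name` is importable from somewhere under `root`.

    Hypothetical helper: it inspects the module spec instead of executing
    the target module's code (parent packages of a dotted name are imported).
    """
    spec = importlib.util.find_spec(module_name)
    if spec is None or spec.origin is None:
        return False
    origin = Path(spec.origin).resolve()
    # `origin` is the module file; the root must be one of its ancestors.
    return Path(root).resolve() in origin.parents
```

With PYTHONPATH pointing at the worktree, a call such as `resolves_under("tensorrt_llm", os.getcwd())` would be expected to return True only when the package is picked up from that checkout rather than from an installed wheel.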

Local waivers / caveats

  • DSv4 4-GPU accuracy nodeids were not run locally because LLM_MODELS_ROOT is unset and no local DeepSeek-V4-Flash / DeepSeek-V4-Flash-Base model directory was found under common model roots. These should be validated in the DS CI stages or on a host with the model share mounted.
  • Full Sphinx docs build was not run because the active venv does not have sphinx installed.
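
For reviewers reproducing the first caveat, a minimal sketch of the model lookup follows. It assumes model directories sit directly under LLM_MODELS_ROOT, which may not match the real layout of the model share; `find_model_dir` is a hypothetical helper, not part of the test harness.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_model_dir(name: str,
                   env: Mapping[str, str] = os.environ) -> Optional[Path]:
    """Return the model directory under LLM_MODELS_ROOT, or None.

    Assumption: the directory is a direct child of LLM_MODELS_ROOT.
    """
    root = env.get("LLM_MODELS_ROOT")
    if not root:
        # Unset or empty: same situation as described in the caveat above.
        return None
    candidate = Path(root) / name
    return candidate if candidate.is_dir() else None
```

A None result for both DeepSeek-V4-Flash and DeepSeek-V4-Flash-Base corresponds to the situation in which the 4-GPU accuracy tests could not be run locally.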

lfr-0531 added 14 commits June 16, 2026 06:45
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
Add DSv4 attention fusion helper ops and focused tests.

This extracts q-norm, MLA RoPE, fused inverse-RoPE FP8 quantization, and packed FP8 quantization registration from PR-14751 without pulling in the sparse MLA backend, MoE routing, or DSv4 model/tokenizer files.

Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
@lfr-0531

Collaborator Author

/bot run --extra-stage "DGX_B200-4_GPUs-PyTorch-DS-1,DGX_B300-4_GPUs-PyTorch-DS-1,GB200-4_GPUs-PyTorch-DS-1"

@tensorrt-cicd

Collaborator

PR_Github #54577 [ run ] triggered by Bot. Commit: 2f33f8b Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54577 [ run ] completed with state SUCCESS. Commit: 2f33f8b
/LLM/main/L0_MergeRequest_PR pipeline #43620 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

lfr-0531 and others added 12 commits June 17, 2026 07:54
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
This reverts commit a7a5679.

Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
This reverts commit 42aeb73.

Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
- gate_forward: mark out_weights/out_indices as Tensor(a!)/(b!) so the in-place op is sound under functionalization / CUDA graphs.
- DeepSeekV4 hashed routing: substitute an empty bias tensor when the bias callable returns None, avoiding a dispatch failure on Tensor bias.
- TRTLLMGen FP8 block-scale: reject a per-expert swiglu_limit tensor; only the uniform swiglu_limit_scalar is consumed.
- MegaMoE DG weight dedup: record rebuild metadata before releasing redundant params so dynamic-EPLB reload can restore them.
- fp8_utils swiglu clamp: gate on swiglu_limit > 0.0 to match silu_and_mul and avoid a destructive 0.0 clamp.

Signed-off-by: xxi <xxi@nvidia.com>
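
The last bullet of the commit message above (gating the clamp on swiglu_limit > 0.0) can be illustrated with a scalar model. This is a simplified sketch, not the fp8_utils kernel: the clamp form (gate capped from above, up clamped symmetrically) and the `guard` flag are assumptions made for illustration.

```python
import math


def silu(x: float) -> float:
    return x / (1.0 + math.exp(-x))


def swiglu(gate: float, up: float, swiglu_limit: float = 0.0,
           guard: bool = True) -> float:
    """Scalar SwiGLU with an optional clamp (assumed clamp form).

    guard=True  : clamp only when swiglu_limit > 0.0 (the fixed behavior).
    guard=False : clamp unconditionally, so a limit of 0.0 zeroes the output.
    """
    if swiglu_limit > 0.0 or not guard:
        gate = min(gate, swiglu_limit)
        up = max(-swiglu_limit, min(up, swiglu_limit))
    return silu(gate) * up
```

With `swiglu_limit=0.0` treated as "no limit", the guarded path leaves the inputs untouched, while the unguarded path clamps `up` to zero and discards the activation entirely.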
…e-routing

Signed-off-by: xxi <xxi@nvidia.com>

# Conflicts:
#	tensorrt_llm/_torch/modules/fused_moe/__init__.py
…A pointer shape

The DSv4 routing prep changed BaseMoeRoutingMethod.apply to take an
extra input_ids argument and updated all MoE backend call sites to pass
it positionally, but two routing-method subclasses defined outside
routing.py were not updated:

- Gemma4MoeRoutingMethod (modeling_gemma4.py)
- Step3p7MoeRoutingMethod (modeling_step3p7.py)

Their one-argument apply() now raises a TypeError when invoked by the
backends, breaking the Gemma4 MoE model tests. Add the optional
input_ids parameter to match the base contract.

Also restore the .reshape(-1, 3) on the routed-expert MoE LoRA
weight_pointers in CutlassFusedMoE._extract_moe_lora_tensors. The
pointer table is built flat ([num_seqs * 3]) in
PyTorchModelEngine._build_lora_params and the C++ fused_moe op asserts a
[num_seqs, 3] shape (moeOp.cpp), so dropping the reshape breaks the
eager MoE LoRA path.

Signed-off-by: xxi <xxi@nvidia.com>
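
The signature mismatch described in the commit message above can be reproduced in isolation. The classes below are hypothetical stand-ins, not the real BaseMoeRoutingMethod hierarchy; they only show why a subclass with a one-argument apply() fails once callers pass input_ids positionally.

```python
from typing import Optional, Sequence


class BaseRouting:
    """Stand-in for the base contract after the DSv4 routing prep."""

    def apply(self, router_logits: Sequence[float],
              input_ids: Optional[Sequence[int]] = None) -> int:
        raise NotImplementedError


class StaleRouting(BaseRouting):
    # Old one-argument form: not updated for the new contract.
    def apply(self, router_logits):
        return max(range(len(router_logits)), key=router_logits.__getitem__)


class FixedRouting(BaseRouting):
    # Accepts the optional input_ids argument, even though it ignores it.
    def apply(self, router_logits, input_ids=None):
        return max(range(len(router_logits)), key=router_logits.__getitem__)


def backend_call(routing: BaseRouting, router_logits, input_ids) -> int:
    # Backends pass input_ids positionally, as the commit message describes.
    return routing.apply(router_logits, input_ids)
```

Calling `backend_call` with a `StaleRouting` instance raises TypeError because apply() receives one positional argument too many; `FixedRouting` returns the index of the largest logit.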
…failure

The DSv4 prep bumped DeepGEMM to 67fc648, whose sm100_fp4_mqa_logits.cuh
splits the TMEM accumulator load into two halves of kNumHeads/2 but still
asserts the load width N is in {32, 64}. For the num_heads=32 indexer
shape this drives N=16, tripping the static_assert and failing the
runtime NVCC JIT compile (test_dsa_fp4_indexer
::test_fp4_mqa_logits_shape_and_topk_intersection[32] on B200/B300).

DeepGEMM 245dc5d6 restores the N==16 TMEM loader (DeepGEMM NVIDIA#353,
paged-MQA fixes) while preserving set_pdl(); 67fc648 is a direct ancestor
so this is a purely additive bump. This is the same tag already adopted
on feat/deepseek_v4 (PR NVIDIA#14940). Updates the two attribution data files
to match, keeping the trt-llm-oss-compliance check green.

Signed-off-by: xxi <xxi@nvidia.com>
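
The arithmetic behind the JIT failure described above is small enough to check by hand. The sketch below models only the width check; the allowed set for the newer DeepGEMM revision is an assumption inferred from "restores the N==16 TMEM loader", and the function names are hypothetical.

```python
# Load widths accepted by the static_assert at 67fc648, per the commit message.
ALLOWED_WIDTHS_67FC648 = frozenset({32, 64})
# Assumed set after 245dc5d6, which restores the N == 16 loader.
ALLOWED_WIDTHS_245DC5D6 = frozenset({16, 32, 64})


def tmem_load_width(num_heads: int) -> int:
    # The accumulator load is split into two halves of kNumHeads / 2.
    return num_heads // 2


def jit_compiles(num_heads: int, allowed_widths: frozenset) -> bool:
    """Model of the static_assert: the half-width must be an allowed N."""
    return tmem_load_width(num_heads) in allowed_widths
```

For the num_heads=32 indexer shape the half-width is 16, which is rejected under the first set and accepted under the second, matching the failure and the fix described in the commit message.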
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
lfr-0531 added 3 commits June 23, 2026 11:20
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
Signed-off-by: Fanrong Leng <fanrongl@nvidia.com>

# Conflicts:
#	3rdparty/fetch_content.json
#	cpp/tensorrt_llm/thop/moeGateOp.cpp
#	scripts/attribution/data/dependency_metadata.yml
#	scripts/attribution/data/files_to_dependency.yml
#	tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
#	tensorrt_llm/_torch/modules/fused_moe/quantization.py
#	tensorrt_llm/_torch/modules/fused_moe/routing.py
#	tensorrt_llm/quantization/utils/fp8_utils.py
3 participants