[None][feat] DSv4: model, tokenizer, and integration coverage#15414
Draft
lfr-0531 wants to merge 29 commits into
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
Add DSv4 attention fusion helper ops and focused tests.

This extracts q-norm, MLA RoPE, fused inverse-RoPE FP8 quantization, and packed FP8 quantization registration from PR-14751 without pulling in the sparse MLA backend, MoE routing, or DSv4 model/tokenizer files.

Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
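For readers unfamiliar with the RoPE / inverse-RoPE pair named in the commit above, here is a minimal pure-Python sketch of a generic rotary embedding and its inverse. It assumes the interleaved-pair layout and a base of 10000; the function name `rope` and its parameters are hypothetical, and the fused kernels and FP8 quantization in the PR are not modeled here.

```python
import math


def rope(x, pos, base=10000.0, inverse=False):
    """Rotate consecutive pairs (x[2k], x[2k+1]) by the angle pos * base**(-2k/d).

    Generic rotary-embedding sketch (interleaved layout is an assumption).
    With inverse=True the rotation angle is negated, which undoes rope().
    """
    d = len(x)
    if d % 2 != 0:
        raise ValueError("head dimension must be even")
    sign = -1.0 if inverse else 1.0
    out = []
    for i in range(0, d, 2):
        # i == 2k, so base ** (-i / d) is the frequency of pair k.
        theta = sign * pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + 1]
        out.extend([a * c - b * s, a * s + b * c])
    return out


def roundtrip_error(x, pos):
    """Largest absolute difference after applying RoPE and then inverse RoPE."""
    restored = rope(rope(x, pos), pos, inverse=True)
    return max(abs(u - v) for u, v in zip(x, restored))
```

Because each pair is only rotated, the vector norm is unchanged and `roundtrip_error` should be at floating-point noise level for any position.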
/bot run --extra-stage "DGX_B200-4_GPUs-PyTorch-DS-1,DGX_B300-4_GPUs-PyTorch-DS-1,GB200-4_GPUs-PyTorch-DS-1"
PR_Github #54577 [ run ] triggered by Bot. Commit:
PR_Github #54577 [ run ] completed with state
This reverts commit a7a5679. Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
This reverts commit 42aeb73. Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
- gate_forward: mark out_weights/out_indices as Tensor(a!)/(b!) so the in-place op is sound under functionalization / CUDA graphs.
- DeepSeekV4 hashed routing: substitute an empty bias tensor when the bias callable returns None, avoiding a dispatch failure on Tensor bias.
- TRTLLMGen FP8 block-scale: reject a per-expert swiglu_limit tensor; only the uniform swiglu_limit_scalar is consumed.
- MegaMoE DG weight dedup: record rebuild metadata before releasing redundant params so dynamic-EPLB reload can restore them.
- fp8_utils swiglu clamp: gate on swiglu_limit > 0.0 to match silu_and_mul and avoid a destructive 0.0 clamp.

Signed-off-by: xxi <xxi@nvidia.com>
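The last bullet above (gating the clamp on swiglu_limit > 0.0) can be illustrated with a scalar sketch. The clamp shape used here (upper clamp on the gate, symmetric clamp on the up projection) and all function names are assumptions for illustration, not the code in fp8_utils; the point is only that a limit of 0.0 should mean "no clamp" rather than "clamp everything to zero".

```python
import math


def silu(x: float) -> float:
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def swiglu_guarded(gate: float, up: float, swiglu_limit: float = 0.0) -> float:
    """Clamp only when a positive limit is configured (0.0 means disabled)."""
    if swiglu_limit > 0.0:
        gate = min(gate, swiglu_limit)
        up = max(-swiglu_limit, min(up, swiglu_limit))
    return silu(gate) * up


def swiglu_unguarded(gate: float, up: float, swiglu_limit: float = 0.0) -> float:
    """Always clamps: with the default 0.0 limit this zeroes the output."""
    gate = min(gate, swiglu_limit)
    up = max(-swiglu_limit, min(up, swiglu_limit))
    return silu(gate) * up
```

With the default limit, `swiglu_guarded(1.0, 2.0)` returns the plain SiLU-gated product, while `swiglu_unguarded(1.0, 2.0)` collapses to zero, which is the destructive behavior the bullet describes.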
…e-routing

Signed-off-by: xxi <xxi@nvidia.com>

# Conflicts:
#   tensorrt_llm/_torch/modules/fused_moe/__init__.py
…A pointer shape

The DSv4 routing prep changed BaseMoeRoutingMethod.apply to take an extra input_ids argument and updated all MoE backend call sites to pass it positionally, but two routing-method subclasses defined outside routing.py were not updated:

- Gemma4MoeRoutingMethod (modeling_gemma4.py)
- Step3p7MoeRoutingMethod (modeling_step3p7.py)

Their one-argument apply() now raises a TypeError when invoked by the backends, breaking the Gemma4 MoE model tests. Add the optional input_ids parameter to match the base contract.

Also restore the .reshape(-1, 3) on the routed-expert MoE LoRA weight_pointers in CutlassFusedMoE._extract_moe_lora_tensors. The pointer table is built flat ([num_seqs * 3]) in PyTorchModelEngine._build_lora_params and the C++ fused_moe op asserts a [num_seqs, 3] shape (moeOp.cpp), so dropping the reshape breaks the eager MoE LoRA path.

Signed-off-by: xxi <xxi@nvidia.com>
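The signature mismatch described in this commit can be reproduced with toy classes. Everything below is a stand-in: `ToyBaseRouting`, `StaleRouting`, `UpdatedRouting`, and `backend_route` are hypothetical names, and the argmax "routing" is a placeholder rather than the real routing logic.

```python
class ToyBaseRouting:
    """Stand-in for a base routing contract whose apply() gained input_ids."""

    def apply(self, router_logits, input_ids=None):
        raise NotImplementedError


class StaleRouting(ToyBaseRouting):
    """Subclass that still uses the old one-argument signature."""

    def apply(self, router_logits):
        return max(range(len(router_logits)), key=router_logits.__getitem__)


class UpdatedRouting(ToyBaseRouting):
    """Subclass that accepts the optional input_ids, matching the base."""

    def apply(self, router_logits, input_ids=None):
        return max(range(len(router_logits)), key=router_logits.__getitem__)


def backend_route(method, router_logits, input_ids):
    # The call site passes input_ids positionally, as the commit describes.
    return method.apply(router_logits, input_ids)


def stale_call_fails() -> bool:
    """Return True if the stale subclass raises TypeError at the call site."""
    try:
        backend_route(StaleRouting(), [0.1, 0.9, 0.3], [7, 8, 9])
    except TypeError:
        return True
    return False
```

Making the new parameter optional with a `None` default keeps older call sites that pass a single argument working while satisfying the positional call from the backends.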
…failure
The DSv4 prep bumped DeepGEMM to 67fc648, whose sm100_fp4_mqa_logits.cuh
splits the TMEM accumulator load into two halves of kNumHeads/2 but still
asserts the load width N is in {32, 64}. For the num_heads=32 indexer
shape this drives N=16, tripping the static_assert and failing the
runtime NVCC JIT compile (test_dsa_fp4_indexer
::test_fp4_mqa_logits_shape_and_topk_intersection[32] on B200/B300).
DeepGEMM 245dc5d6 restores the N==16 TMEM loader (DeepGEMM NVIDIA#353,
paged-MQA fixes) while preserving set_pdl(); 67fc648 is a direct ancestor
so this is a purely additive bump. This is the same tag already adopted
on feat/deepseek_v4 (PR NVIDIA#14940). Updates the two attribution data files
to match, keeping the trt-llm-oss-compliance check green.
Signed-off-by: xxi <xxi@nvidia.com>
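The arithmetic behind the static_assert failure can be checked in a few lines. This is only a sketch of the width calculation stated in the commit message; the helper names are hypothetical, and the "after" set assumes the newer DeepGEMM accepts N == 16 in addition to the widths named above.

```python
# Load widths accepted by the assertion, per the commit message.
ALLOWED_WIDTHS_67FC648 = {32, 64}
# Assumption: the newer DeepGEMM additionally accepts N == 16.
ALLOWED_WIDTHS_245DC5D6 = {16, 32, 64}


def half_load_width(num_heads: int) -> int:
    """Width of each half when the accumulator load is split in two."""
    return num_heads // 2


def load_width_ok(num_heads: int, allowed: set) -> bool:
    """Mirror of the compile-time check: is the half width a supported N?"""
    return half_load_width(num_heads) in allowed
```

For the num_heads=32 indexer shape the half width is 16, so `load_width_ok(32, ALLOWED_WIDTHS_67FC648)` is false while `load_width_ok(32, ALLOWED_WIDTHS_245DC5D6)` is true.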
Signed-off-by: Fanrong Leng <fanrongl@nvidia.com>

# Conflicts:
#   3rdparty/fetch_content.json
#   cpp/tensorrt_llm/thop/moeGateOp.cpp
#   scripts/attribution/data/dependency_metadata.yml
#   scripts/attribution/data/files_to_dependency.yml
#   tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
#   tensorrt_llm/_torch/modules/fused_moe/quantization.py
#   tensorrt_llm/_torch/modules/fused_moe/routing.py
#   tensorrt_llm/quantization/utils/fp8_utils.py
Description
Draft PR for PR-14751 split PR-9. This lands the DeepSeek-V4 model/API/tokenizer/parser wiring plus DSv4 docs, examples, CI stage wiring, integration test-list entries, attribution metadata, and focused unit coverage.
This PR intentionally depends on the earlier split PRs:
Notable scope notes:
- `tensorrt_llm/_torch/models/modeling_gemma4.py` is intentionally absent; the unnecessary PR8 Gemma4 compatibility hunk was removed.
- `base_worker.py` is not part of the latest PR-9 extraction; DSv4 disaggregated coverage is represented by DS stage/test-list wiring and should be validated in CI or on a model-share host.

Test plan
- `CCACHE_DIR=/home/scratch.fanrongl_coreai/ccache_trtllm python ./scripts/build_wheel.py --trt_root /usr/local/tensorrt --benchmarks --use_ccache --cuda_architectures "90-real;100-real" --configure_cmake`
- `build/tensorrt_llm-1.3.0rc19-cp312-cp312-linux_x86_64.whl` with `pip install --force-reinstall --no-deps`.
- `ninja -t inputs` returned no results for wheel targets, but wheel creation succeeded.
- `PYTHONPATH=$PWD` confirmed `tensorrt_llm` and `tensorrt_llm.bindings` resolve from the PR-9 worktree, and verified DSv4 sparse config, `MEGAMOE_DEEPGEMM`, and the `deepseek_v4` tokenizer alias.
- `nvidia-smi`, ran on idle `CUDA_VISIBLE_DEVICES=4`:
  - `pytest tests/unittest/_torch/modeling/test_modeling_deepseekv4.py tests/unittest/_torch/modules/test_engram.py`
  - `pytest tests/unittest/_torch/test_model_config.py tests/unittest/llmapi/test_deepseek_v4_tokenizer.py tests/unittest/llmapi/test_reasoning_parser.py tests/unittest/llmapi/apps/test_tool_parsers.py tests/unittest/llmapi/test_llm_args.py tests/unittest/api_stability`
- `git diff --check HEAD` passed.
- `modeling_gemma4.py`, earlier C++ primitive/cache/sparse-attention/MoE implementation files, or `base_worker.py` in the PR-9 own manifest.
- `python -m py_compile examples/llm-api/quickstart_advanced.py scripts/test_to_stage_mapping.py` passed.
- `python scripts/test_to_stage_mapping.py --tests "tests/integration/defs/accuracy/test_llm_api_pytorch.py::TestDeepSeekV4Flash::test_auto_dtype tests/integration/defs/accuracy/test_llm_api_pytorch.py::TestDeepSeekV4Flash::test_nvfp4_4gpus_static_eplb tests/integration/defs/accuracy/test_llm_api_pytorch.py::TestDeepSeekV4FlashBase::test_auto_dtype"` returned:
  - `DGX_B200-4_GPUs-PyTorch-DS-1`
  - `DGX_B300-4_GPUs-PyTorch-1`
  - `DGX_B300-4_GPUs-PyTorch-DS-1`
- `SKIP=type-check`; the local `type-check` hook was attempted but exceeded 5 minutes while mypy was checking `tensorrt_llm/_torch/pyexecutor/sampler.py`, so it was interrupted to avoid a hang. All other targeted hooks passed, including test-list validation.

Local waivers / caveats
- `LLM_MODELS_ROOT` is unset and no local `DeepSeek-V4-Flash`/`DeepSeek-V4-Flash-Base` model directory was found under common model roots. These should be validated in the DS CI stages or on a host with the model share mounted.
- `sphinx` installed.
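
The `python -m py_compile` step in the test plan can also be run programmatically when several files need a syntax-only check. The sketch below uses the standard-library `py_compile` module; the helper names `syntax_check` and `demo` and the temporary files are hypothetical and stand in for the real example and script paths.

```python
import py_compile
import tempfile
from pathlib import Path


def syntax_check(paths):
    """Return {path: error message} for files that fail to byte-compile."""
    failures = {}
    for path in paths:
        try:
            # doraise=True turns compile errors into PyCompileError.
            py_compile.compile(str(path), doraise=True)
        except py_compile.PyCompileError as exc:
            failures[str(path)] = exc.msg
    return failures


def demo():
    """Check one valid and one invalid file; return the failing file names."""
    with tempfile.TemporaryDirectory() as tmp:
        good = Path(tmp, "good.py")
        good.write_text("x = 1\n")
        bad = Path(tmp, "bad.py")
        bad.write_text("def broken(:\n")
        failures = syntax_check([good, bad])
        return sorted(Path(p).name for p in failures)
```

This only confirms that the files parse and compile to bytecode; it does not import them or run any tests, so it complements rather than replaces the pytest runs listed in the test plan.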