[None][feat] AutoDeploy: Add DeepSeekV4 Flash Support #15019

Open
bmarimuthu-nv wants to merge 28 commits into NVIDIA:main from nv-auto-deploy:bala/dsv4-p1

Conversation

@bmarimuthu-nv commented Jun 5, 2026

Collaborator

Summary by CodeRabbit

  • New Features

    • Added DeepSeek V4 Flash model support with sparse attention and MXFP4 quantized mixture-of-experts.
    • Extended FP8 quantization for linear layers with new input scale formatting options.
    • Added grouped linear operations for enhanced performance optimization.
  • Documentation

    • Updated custom operations registry to reflect new attention and quantization operators.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@bmarimuthu-nv

Collaborator Author

@CodeRabbit review

@coderabbitai

coderabbitai Bot commented Jun 8, 2026

Contributor
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai

coderabbitai Bot commented Jun 8, 2026

Contributor

📝 Walkthrough

DeepSeek V4 AutoDeploy support was added across registry YAMLs, custom ops, checkpoint/layout readers, transforms, and tests. The PR wires sparse attention, MXFP4 MoE, grouped FP8 quantization, sharding hints, and DeepSeek V4 model loading into the auto-deploy flow.

Changes

DeepSeek V4 AutoDeploy

| Layer | File(s) | Summary |
| --- | --- | --- |
| Registry and example configs | `examples/auto_deploy/model_registry/configs/*`, `examples/auto_deploy/model_registry/models.yaml`, `tensorrt_llm/_torch/auto_deploy/config/default.yaml`, `tensorrt_llm/_torch/auto_deploy/custom_ops/README.md`, `tensorrt_llm/_torch/auto_deploy/custom_ops/attention/__init__.py`, `tensorrt_llm/_torch/auto_deploy/compile/*` | New DeepSeek V4 registry YAMLs, model registry wiring, and related auto-deploy docs/config defaults were added. |
| Custom ops and helpers | `tensorrt_llm/_torch/auto_deploy/custom_ops/...`, `tensorrt_llm/_torch/auto_deploy/utils/...` | New DeepSeek sparse attention, MXFP4 MoE, grouped linear, quantization, sharding, and tensor utility ops were added or extended. |
| Model loading and checkpoint layouts | `tensorrt_llm/_torch/auto_deploy/models/...` | DeepSeek V4 model classes, quantized checkpoint layouts, safetensors metadata readers, and quant-config reader behavior were added. |
| Transforms and sharding | `tensorrt_llm/_torch/auto_deploy/transform/library/...`, `tensorrt_llm/_torch/auto_deploy/utils/dist_config.py` | DeepSeek V4 cached-attention insertion, MXFP4 lowering, quantization rewrites, RMSNorm fusion, and sharding IR changes were added. |
| Model and transform tests | `tests/integration/...`, `tests/unittest/...` | DeepSeek V4 integration, semantic, quantization, sharding, compilation, and fusion tests were added or updated. |

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py (1)

1482-1501: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Honor explicit dist_backend="torch" in _get_dist_ops.

Line 1493 currently selects TRTLLM ops whenever they are available, even when the caller explicitly requests "torch". With this PR’s new backend threading, that can silently ignore config and route to the wrong distributed op.

Suggested fix
 def _get_dist_ops(backend: str):
@@
-    if backend == "trtllm" or is_trtllm_op_available():
+    if backend == "trtllm":
+        return (
+            torch.ops.auto_deploy.trtllm_dist_all_gather.default,
+            torch.ops.auto_deploy.trtllm_dist_all_reduce.default,
+        )
+    if backend == "torch":
+        return (
+            torch.ops.auto_deploy.torch_dist_all_gather.default,
+            torch.ops.auto_deploy.torch_dist_all_reduce.default,
+        )
+    if backend == "auto" and is_trtllm_op_available():
         return (
             torch.ops.auto_deploy.trtllm_dist_all_gather.default,
             torch.ops.auto_deploy.trtllm_dist_all_reduce.default,
         )
     return (
         torch.ops.auto_deploy.torch_dist_all_gather.default,
         torch.ops.auto_deploy.torch_dist_all_reduce.default,
     )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py` around lines
1482 - 1501, The function _get_dist_ops currently prefers TRT-LLM ops whenever
they exist, ignoring an explicit dist_backend="torch"; change the selection
logic so that TRT-LLM ops are chosen only when backend == "trtllm" or backend ==
"auto" and is_trtllm_op_available() is true, otherwise return the Torch ops for
backend == "torch" (or any other non-trtllm value). Keep the existing handling
of enum-like inputs (checking backend.value) and the existing return values
(torch.ops.auto_deploy.trtllm_dist_all_gather.default /
trtllm_dist_all_reduce.default vs
torch.ops.auto_deploy.torch_dist_all_gather.default /
torch_dist_all_reduce.default).
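
The selection rule requested above can be sketched in isolation. This is a minimal illustration under stated assumptions, not the code in `sharding.py`: `select_dist_ops`, the `Backend` enum, and the `trtllm_available` flag are hypothetical stand-ins for `_get_dist_ops` and `is_trtllm_op_available()`, and the ops are represented by name strings so the snippet runs without TensorRT-LLM installed.

```python
from enum import Enum
from typing import Tuple, Union

# Op names as plain strings; the real function returns torch.ops.auto_deploy.* overloads.
TRTLLM_OPS = ("trtllm_dist_all_gather", "trtllm_dist_all_reduce")
TORCH_OPS = ("torch_dist_all_gather", "torch_dist_all_reduce")


class Backend(Enum):
    """Hypothetical enum, used only to exercise the enum-like input path."""

    AUTO = "auto"
    TORCH = "torch"
    TRTLLM = "trtllm"


def select_dist_ops(backend: Union[str, Enum], trtllm_available: bool) -> Tuple[str, str]:
    """Return (all_gather, all_reduce) op names for the requested backend."""
    # Accept enum-like inputs by reading .value, as the review asks to preserve.
    name = backend.value if isinstance(backend, Enum) else backend
    if name == "trtllm":
        return TRTLLM_OPS
    if name == "auto" and trtllm_available:
        return TRTLLM_OPS
    # "torch", "auto" without TRT-LLM ops, and any other value use the torch ops.
    return TORCH_OPS
```

With this shape, `select_dist_ops("torch", True)` returns the torch pair even when the TRT-LLM ops are available, which is the behavior the comment asks for.
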
tensorrt_llm/_torch/auto_deploy/utils/dist_config.py (1)

132-157: ⚠️ Potential issue | 🔴 Critical

Add dist_backend support to DistConfig.from_sharding_params()

from_sharding_params() in tensorrt_llm/_torch/auto_deploy/utils/dist_config.py does not accept dist_backend, but IRShardingConfig._init_dist_config() calls it with dist_backend=self.dist_backend (in tensorrt_llm/_torch/auto_deploy/transform/library/sharding_ir.py). This will raise a TypeError and also prevents overriding the default "auto" backend. Add an optional dist_backend parameter and forward it into the DistConfig(...) construction.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/auto_deploy/utils/dist_config.py` around lines 132 - 157,
from_sharding_params currently doesn't accept or forward dist_backend which
causes a TypeError when IRShardingConfig._init_dist_config calls
from_sharding_params(dist_backend=...), so add an optional parameter
dist_backend: str = "auto" to the from_sharding_params signature and pass
dist_backend=dist_backend into the DistConfig(...) constructor invocation so the
DistConfig created reflects the requested backend and avoids the TypeError.
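
The forwarding change can be illustrated with a reduced stand-in. `DistConfigSketch` below is hypothetical: it uses a dataclass with only three fields so the example stays self-contained, whereas the real `DistConfig` is a Pydantic model and its `from_sharding_params()` takes other parameters that are not shown in this review.

```python
from dataclasses import dataclass


@dataclass
class DistConfigSketch:
    """Hypothetical, reduced stand-in for DistConfig."""

    world_size: int = 1
    rank: int = 0
    dist_backend: str = "auto"

    @classmethod
    def from_sharding_params(
        cls, world_size: int, rank: int, dist_backend: str = "auto"
    ) -> "DistConfigSketch":
        # Forward dist_backend to the constructor instead of dropping it,
        # so callers that pass dist_backend=... no longer hit a TypeError.
        return cls(world_size=world_size, rank=rank, dist_backend=dist_backend)


explicit = DistConfigSketch.from_sharding_params(world_size=2, rank=1, dist_backend="torch")
default = DistConfigSketch.from_sharding_params(world_size=2, rank=0)
```

Keeping `"auto"` as the default leaves existing callers unchanged while letting `IRShardingConfig._init_dist_config()` override it.
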
tensorrt_llm/_torch/auto_deploy/utils/node_utils.py (1)

663-669: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Also register the grouped FineGrained FP8 op as a fake-quant linear.

Adding it only to is_finegrained_fp8_linear_op() is incomplete: is_any_lin_op() still goes through is_fake_quantized_linear_op(), so helpers like get_weight_shape() and layer/subgraph discovery will continue to skip grouped FineGrained FP8 nodes.

Suggested follow-up
 def is_fake_quantized_linear_op(node: Node) -> bool:
     quantized_linear_op = {
         torch.ops.auto_deploy.torch_fake_quant_fp8_linear,
         torch.ops.auto_deploy.torch_fake_quant_nvfp4_linear,
         torch.ops.auto_deploy.torch_fake_quant_finegrained_fp8_linear,
+        torch.ops.auto_deploy.torch_fake_quant_grouped_finegrained_fp8_linear,
     }

     return is_op(node, quantized_linear_op)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/auto_deploy/utils/node_utils.py` around lines 663 - 669,
The grouped FineGrained FP8 op
(torch.ops.auto_deploy.torch_fake_quant_grouped_finegrained_fp8_linear) needs to
be treated as a fake-quant linear as well so it is discovered by is_any_lin_op
-> is_fake_quantized_linear_op flows; update the fake-quant linear
registry/check by adding that symbol to whatever list or predicate used by
is_fake_quantized_linear_op (or the helper it calls) so functions like
get_weight_shape and layer/subgraph discovery will include grouped FineGrained
FP8 nodes; search for is_fake_quantized_linear_op and add
torch.ops.auto_deploy.torch_fake_quant_grouped_finegrained_fp8_linear to the
same checks used for other fake-quant linear ops (keeping
is_finegrained_fp8_linear_op unchanged).
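
A toy model of the predicate chain shows why registering the op in only one place is not enough. Everything here is a simplified assumption: ops are name strings rather than `torch.ops` objects, `"plain_linear"` is a made-up name for the unquantized case, and `is_any_lin_op` is reduced to the single dependency the comment describes.

```python
GROUPED = "torch_fake_quant_grouped_finegrained_fp8_linear"

# Set checked by the fake-quant predicate before the suggested follow-up.
FAKE_QUANT_BEFORE = {
    "torch_fake_quant_fp8_linear",
    "torch_fake_quant_nvfp4_linear",
    "torch_fake_quant_finegrained_fp8_linear",
}
# Same set with the grouped op registered.
FAKE_QUANT_AFTER = FAKE_QUANT_BEFORE | {GROUPED}


def is_any_lin_op(op_name: str, fake_quant_ops: set) -> bool:
    """Simplified: a node is linear if it is plain or in the fake-quant set."""
    return op_name == "plain_linear" or op_name in fake_quant_ops


missed = is_any_lin_op(GROUPED, FAKE_QUANT_BEFORE)  # False: helpers skip the node
found = is_any_lin_op(GROUPED, FAKE_QUANT_AFTER)    # True: helpers see the node
```
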
tensorrt_llm/_torch/auto_deploy/custom_ops/quantization/torch_quant.py (1)

707-715: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Stop inferring FineGrained FP8 block size from the scale tensor shape.

triton.cdiv(N, scale_n) / triton.cdiv(K, scale_k) recovers the tail-expanded size, not the checkpoint’s canonical block size. For a 576-row weight stored on a 128-row grid, this computes 116, so the Triton path and the BF16 dequant fallback both apply each scale row to the wrong weights. The grouped path repeats the same inference, so tailed FineGrained FP8 matrices will still mis-dequantize.

Please preserve the declared block size (128x128 here, or a value threaded from checkpoint_layout) instead of recomputing it from weight_scale_inv.shape.

Also applies to: 791-799, 874-897

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/quantization/torch_quant.py`
around lines 707 - 715, The code is incorrectly inferring FineGrained FP8 block
size by using triton.cdiv on weight_scale_inv.shape (e.g., N,K =
weight_quantized.shape; scale_n,scale_k = weight_scale_inv.shape; block_n =
triton.cdiv(N, scale_n); block_k = triton.cdiv(K, scale_k)), which recovers a
tail-expanded size instead of the canonical block size; change the logic to use
the declared block size (the canonical checkpoint block dimensions provided by
the checkpoint_layout or a threaded block_size argument) wherever block_size is
computed/used (including the instances around the block_size assignment and the
grouped path at the other ranges), stop deriving block_size from
weight_scale_inv.shape, and thread or pass the correct block_size into
_safe_act_quant and any other consumers so scale rows are applied to the
intended 128x128 (or checkpoint-provided) blocks.
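
The arithmetic behind the 576-row example can be checked directly. The sketch below reimplements ceiling division locally (the same formula `triton.cdiv` uses) and compares the scale row each weight row maps to under the declared block size versus the inferred one. `scale_row_for` is a hypothetical helper for illustration only.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division, the same arithmetic as triton.cdiv."""
    return (a + b - 1) // b


N = 576               # weight rows
DECLARED_BLOCK = 128  # canonical checkpoint block size

scale_rows = cdiv(N, DECLARED_BLOCK)  # 5 scale rows stored in the checkpoint
inferred_block = cdiv(N, scale_rows)  # 116, not 128


def scale_row_for(weight_row: int, block: int) -> int:
    """Index of the scale row that covers a given weight row."""
    return weight_row // block


# Weight rows that would be paired with the wrong scale row.
mismatched = [
    r for r in range(N)
    if scale_row_for(r, DECLARED_BLOCK) != scale_row_for(r, inferred_block)
]
first_bad_row = mismatched[0]  # 116: belongs to scale row 0, but is assigned row 1
```

The mismatch only appears when the dimension is not a multiple of the block size; for `N = 512` both computations give 128, which is why the problem is specific to tailed matrices.
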
🧹 Nitpick comments (5)
tensorrt_llm/_torch/auto_deploy/utils/dist_config.py (1)

44-44: ⚡ Quick win

Consider adding a field description.

The new dist_backend field lacks a Field(description="..."). As per coding guidelines, user-facing Pydantic fields should include descriptions. While this is consistent with the existing fields in this class, consider adding descriptions to clarify the purpose and valid values for users.

📝 Suggested improvement
-    dist_backend: Literal["auto", "torch", "trtllm"] = Field(default="auto")
+    dist_backend: Literal["auto", "torch", "trtllm"] = Field(
+        default="auto",
+        description="Distributed backend for communication primitives. "
+                    "'auto' selects based on runtime; 'torch' uses PyTorch distributed; "
+                    "'trtllm' uses TensorRT-LLM native backend."
+    )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/auto_deploy/utils/dist_config.py` at line 44, Add a
user-facing description to the Pydantic field dist_backend by replacing
Field(default="auto") with Field(default="auto", description="Choose distributed
backend: 'auto' to auto-detect, 'torch' to use PyTorch distributed, or 'trtllm'
to use the TRT-LLM backend."), matching the style and wording used for other
fields in the same config class (dist_backend is the field to update).

Source: Coding guidelines

tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py (1)

630-638: ⚡ Quick win

Add an explicit return type annotation on _apply.

Suggested patch
-    def _apply(self, *args, **kwargs):
+    def _apply(self, *args, **kwargs) -> Tuple[GraphModule, TransformInfo]:
         if self.config.backend is None:
             self.config.backend = "deepseek_v4_sparse"
         elif self.config.backend != "deepseek_v4_sparse":
             raise ValueError(
                 "insert_cached_deepseek_v4_sparse_attention only supports "
                 f"backend='deepseek_v4_sparse', got {self.config.backend!r}."
             )
         return super()._apply(*args, **kwargs)

As per coding guidelines, always annotate Python function return types.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py` around lines
630 - 638, The _apply method is missing a return type annotation; update the def
_apply signature to include an explicit return type (e.g., -> Any or the
specific return type returned by super()._apply) so it complies with typing
guidelines, and import typing.Any if needed; ensure the annotated function still
sets self.config.backend and calls/returns super()._apply(*args, **kwargs).

Source: Coding guidelines

tensorrt_llm/_torch/auto_deploy/models/custom/modeling_deepseek_v4.py (1)

1604-1617: ⚡ Quick win

Add return annotations on the new public model accessors.

These methods are part of the exported model surface and currently skip return typing, which makes the new class harder to use with the repo’s type-checking conventions.

💡 Suggested fix
-    def get_input_embeddings(self):
+    def get_input_embeddings(self) -> nn.Embedding:
         return self.embed
 
-    def set_input_embeddings(self, new_embeddings):
+    def set_input_embeddings(self, new_embeddings: nn.Embedding) -> None:
         self.embed = new_embeddings
 
-    def get_output_embeddings(self):
+    def get_output_embeddings(self) -> nn.Linear:
         return self.head
 
-    def set_output_embeddings(self, new_embeddings):
+    def set_output_embeddings(self, new_embeddings: nn.Linear) -> None:
         self.head = new_embeddings
 
-    def get_decoder(self):
+    def get_decoder(self) -> "DeepseekV4ForCausalLM":
         return self

As per coding guidelines, "Always annotate Python function return types; use None if function does not return anything."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_deepseek_v4.py` around
lines 1604 - 1617, The new public accessors lack return type annotations; add
explicit type hints: annotate get_input_embeddings as -> Optional[nn.Module] (or
-> nn.Module if embed is always present), set_input_embeddings as -> None,
get_output_embeddings as -> Optional[nn.Module] (or -> nn.Module),
set_output_embeddings as -> None, and get_decoder as -> "DeepseekModel" (or ->
nn.Module/self type) to match the class; also ensure typing.Optional and
torch.nn as nn (or the appropriate model class name) are imported so the
annotations type-check correctly.

Source: Coding guidelines

tests/unittest/auto_deploy/multigpu/transformations/library/test_apply_sharding_hints.py (2)

76-84: ⚡ Quick win

Add return annotations to the new helpers and methods.

These new FX/sharding helpers all omit -> annotations, even though this repo requires return types on Python functions. That makes the GraphModule/optimizer contracts harder to follow in a file that is already very meta-heavy.

♻️ Representative fix
 class DeepSeekV4IRContractBlock(nn.Module):
-    def __init__(self):
+    def __init__(self) -> None:
         super().__init__()
         ...

-    def forward(self, x):
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
         ...

 def _make_optimizer(
     world_size: int,
     rank: int = 0,
     dist_backend: str | None = None,
     *,
     simple_shard_only: bool = False,
-):
+) -> InferenceOptimizer:
     ...

-def _export_deepseek_v4_contract_block():
+def _export_deepseek_v4_contract_block() -> GraphModule:
     ...

-def _register_mxfp4_checkpoint_layout_hooks(gm, checkpoint_layout):
+def _register_mxfp4_checkpoint_layout_hooks(gm, checkpoint_layout) -> GraphModule:
     ...

As per coding guidelines, "Always annotate Python function return types; use None if function does not return anything." Based on learnings, Python 3.10+ features can be used throughout the repo, including tests.

Also applies to: 218-323, 430-826

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/unittest/auto_deploy/multigpu/transformations/library/test_apply_sharding_hints.py`
around lines 76 - 84, Several new functions/methods in this test (notably the
class __init__ and forward shown, plus the FX/sharding helper functions
referenced elsewhere) are missing return type annotations; update each function
signature to include an explicit return annotation (use -> None for methods that
do not return a value, or the appropriate concrete type for helpers that return
tensors/GraphModule/etc.). Look for and update signatures for __init__, forward,
and the helper functions mentioned around lines 218-323 and 430-826 so each has
a return annotation while preserving parameter names and existing logic (e.g.,
def forward(self, x) -> torch.Tensor: or def helper(...) -> None / -> SomeType
as appropriate).

Sources: Coding guidelines, Learnings


326-427: ⚡ Quick win

Add -> None to the new test cases.

The new tests follow the repo’s naming/style rules, but they still miss explicit return annotations. In this codebase, tests are expected to annotate returns too.

♻️ Representative fix
-def test_apply_hints_grouped_fp8_linear_trusts_group_sharded_view_input():
+def test_apply_hints_grouped_fp8_linear_trusts_group_sharded_view_input() -> None:
     ...

-def test_deepseek_v4_ir_contract_linear_view_sparse_attention():
+def test_deepseek_v4_ir_contract_linear_view_sparse_attention() -> None:
     ...

-def test_stacked_mxfp4_routing_driven_rank1_preserves_expert_start_with_torch_backend():
+def test_stacked_mxfp4_routing_driven_rank1_preserves_expert_start_with_torch_backend() -> None:
     ...

As per coding guidelines, "Always annotate Python function return types; use None if function does not return anything." Based on learnings, Python 3.10+ features can be used throughout the repo, including tests.

Also applies to: 483-495, 829-1172

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/unittest/auto_deploy/multigpu/transformations/library/test_apply_sharding_hints.py`
around lines 326 - 427, Several new test functions are missing explicit return
annotations; add "-> None" to each test function signature (e.g.,
test_apply_hints_grouped_fp8_linear_trusts_group_sharded_view_input,
test_apply_hints_grouped_fp8_linear_slices_plain_global_input_groups,
test_simple_shard_only_does_not_ordinary_shard_grouped_fp8_linear,
test_apply_hints_default_dist_backend_uses_auto_selection,
test_apply_hints_torch_dist_backend_forces_torch_all_reduce,
test_deepseek_v4_ir_contract_linear_view_sparse_attention and the other tests
referenced in ranges 483-495 and 829-1172) so each def line reads like "def
<test_name>(...) -> None:"; keep signatures and bodies unchanged otherwise.

Sources: Coding guidelines, Learnings

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@tensorrt_llm/_torch/auto_deploy/custom_ops/attention/deepseek_v4_sparse_attention.py`:
- Line 231: The local variable compressed_len is declared but never used; change
the unpacking from "batch_size, compressed_len, ratio, _ = tensor.shape" to
ignore that value (e.g., "batch_size, _, ratio, _ = tensor.shape") so the unused
name is removed and lint warnings disappear; update the unpacking wherever
"tensor.shape" is decomposed in this scope to use "_" instead of
"compressed_len".

In `@tensorrt_llm/_torch/auto_deploy/models/checkpoint_metadata.py`:
- Around line 68-72: The code currently only catches json.JSONDecodeError when
reading safetensors index and should also normalize UnicodeDecodeError into
QuantizedCheckpointLayoutError so corrupted/invalid UTF-8 is reported as a
layout error (ensure autodetect_quant_config_reader() sees it). Update the
try/except in the block that opens the index (the one that raises
QuantizedCheckpointLayoutError) to catch both json.JSONDecodeError and
UnicodeDecodeError (e.g., except (json.JSONDecodeError, UnicodeDecodeError) as
error) and re-raise QuantizedCheckpointLayoutError(f"Invalid safetensors index
JSON: {path}") from error; apply the same change to the similar handler around
the other block referenced (the one at the 90-95 region) so all malformed
safetensors decoding errors are normalized.
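
A minimal sketch of the normalized handler, assuming the index is read as UTF-8 text. `read_index_json` and `classify` are hypothetical helpers, and `QuantizedCheckpointLayoutError` is redefined locally as a stand-in for the class named in the review.

```python
import json
import os
import tempfile


class QuantizedCheckpointLayoutError(ValueError):
    """Local stand-in for the error type named in the review."""


def read_index_json(path: str) -> dict:
    """Load a safetensors index, reporting any decode failure as a layout error."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    except (json.JSONDecodeError, UnicodeDecodeError) as error:
        # Both malformed JSON and invalid UTF-8 surface as the same error type.
        raise QuantizedCheckpointLayoutError(
            f"Invalid safetensors index JSON: {path}"
        ) from error


def classify(raw: bytes) -> str:
    """Write raw bytes to a temp index file and report how reading it ends."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "model.safetensors.index.json")
        with open(path, "wb") as f:
            f.write(raw)
        try:
            read_index_json(path)
        except QuantizedCheckpointLayoutError as error:
            return type(error.__cause__).__name__
        return "ok"
```

Chaining with `from error` keeps the original exception available as `__cause__`, so the underlying decode failure is still visible in tracebacks.
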

In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_deepseek_v4.py`:
- Around line 789-801: The code currently constructs self.experts =
nn.ModuleList([...DeepseekV4MLP...]) before checking ad_use_mxfp4_experts, which
unnecessarily instantiates large dense experts for the MXFP4 path; change the
logic so that if self.ad_use_mxfp4_experts is True you skip creating
DeepseekV4MLP instances (set self.experts = nn.ModuleList() or defer creation)
and still call self._register_mxfp4_runtime_buffers() and
self._register_load_state_dict_pre_hook(self._load_mxfp4_checkpoint_experts);
adjust any code that expects len(self.experts) so it can handle an empty
ModuleList until the packed expert buffers are loaded.

In `@tensorrt_llm/_torch/auto_deploy/transform/library/mxfp4_moe.py`:
- Line 287: The loop binding layout names to extracted args uses
zip({_MXFP4_LAYOUT_ARG_NAMES}, expert_args) which can silently truncate
mismatches; change this to a strict check by either using
zip({_MXFP4_LAYOUT_ARG_NAMES}, expert_args, strict=True) or pre-validate lengths
(raise a descriptive exception if len(_MXFP4_LAYOUT_ARG_NAMES) !=
len(expert_args)) before the loop so any schema drift fails loudly; update the
loop over name and arg (the for ... in zip(...) block) to use the chosen strict
approach and include an informative error message mentioning
_MXFP4_LAYOUT_ARG_NAMES and expert_args.
- Around line 573-575: The call to _get_mxfp4_expert_dims currently unpacks into
num_experts, hidden_size, intermediate_size but those values aren’t used
(RUF059); change the unpacking to either drop unused values or prefix them with
an underscore (for example _num_experts, _hidden_size, _intermediate_size or
simply _, _, _) so the linter stops complaining and intent is clear; locate the
call to _get_mxfp4_expert_dims in the mxfp4_moe code and update the left-hand
side of the assignment accordingly.
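
The silent truncation that the first item warns about is easy to reproduce. In this sketch the layout names are made up for illustration and `bind_layout_args` is a hypothetical helper; it uses the pre-validation variant, and on Python 3.10+ `zip(names, args, strict=True)` would raise a `ValueError` for the same mismatch.

```python
from typing import Any, Dict, Sequence

# Hypothetical layout names; the real _MXFP4_LAYOUT_ARG_NAMES is not shown in the review.
LAYOUT_ARG_NAMES = ("gate_up_blocks", "gate_up_scales", "down_blocks")


def bind_layout_args(names: Sequence[str], expert_args: Sequence[Any]) -> Dict[str, Any]:
    """Bind names to args, failing loudly when the lengths differ."""
    if len(names) != len(expert_args):
        raise ValueError(
            f"Layout/arg mismatch: {len(names)} names {tuple(names)} "
            f"vs {len(expert_args)} expert_args"
        )
    return dict(zip(names, expert_args))


# Plain zip drops the unmatched name without any error.
truncated = dict(zip(LAYOUT_ARG_NAMES, ["a", "b"]))

try:
    bind_layout_args(LAYOUT_ARG_NAMES, ["a", "b"])
    raised = False
except ValueError:
    raised = True
```
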

In `@tests/integration/defs/accuracy/test_llm_api_autodeploy.py`:
- Around line 72-80: The loop that resolves
DEEPSEEK_V4_FLASH_MODEL_DIR/DEEPSEEK_V4_MODEL_DIR currently checks candidates =
(model_path, model_path / "DeepSeek-V4-Flash") which returns the root model_path
first; change the candidate order to check the subdirectory first (e.g.,
(model_path / "DeepSeek-V4-Flash", model_path)) so the DeepSeek-V4-Flash
directory is preferred when present, keeping the rest of the logic (env_var,
model_path, exists() check, and returning str(candidate)) the same.
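
The reordered lookup can be sketched as a small function. `resolve_model_dir` is a hypothetical name and the environment-variable handling from the test is omitted; only the candidate order and the `exists()` check follow the description above.

```python
import tempfile
from pathlib import Path
from typing import Optional


def resolve_model_dir(model_path: Path, subdir: str = "DeepSeek-V4-Flash") -> Optional[str]:
    """Prefer the model subdirectory, then fall back to the root path."""
    for candidate in (model_path / subdir, model_path):
        if candidate.exists():
            return str(candidate)
    return None


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    root_only = resolve_model_dir(root)             # no subdirectory yet: root wins
    (root / "DeepSeek-V4-Flash").mkdir()
    with_subdir = resolve_model_dir(root)           # subdirectory now preferred
    expected_root = str(root)
    expected_subdir = str(root / "DeepSeek-V4-Flash")
    missing = resolve_model_dir(root / "does-not-exist")
```
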

In `@tests/unittest/auto_deploy/singlegpu/custom_ops/moe/test_torch_mxfp4_moe.py`:
- Around line 741-745: The test uses regex patterns in pytest.raises(...,
match="...") that contain metacharacters; change the string literals to raw
strings (prefix with r) for both calls to pytest.raises that reference
_resolve_mxfp4_expert_block_size (the two occurrences: one passing
{"expert_block_size": 64} and one passing _UnsupportedBlockSizeLayout()), so the
match argument becomes a raw string (e.g., r"...") to avoid
escaping/interpretation issues and satisfy the linter.
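
Why metacharacters in a `match` pattern matter can be shown with `re` alone, since `pytest.raises(..., match=...)` applies the pattern with `re.search`. The error message below is invented for illustration and is not the message raised by `_resolve_mxfp4_expert_block_size`.

```python
import re

# Hypothetical error text containing regex metacharacters.
message = "Unsupported expert_block_size=64 (expected 32)."

# Unescaped parentheses form a capture group, so the literal "(" is never matched.
unescaped = "expert_block_size=64 (expected 32)"
# A raw string with escaped parentheses matches the literal text.
escaped = r"expert_block_size=64 \(expected 32\)"

unescaped_hit = re.search(unescaped, message) is not None  # False
escaped_hit = re.search(escaped, message) is not None      # True
# re.escape builds an equivalent pattern from the literal text.
auto_hit = re.search(re.escape("(expected 32)"), message) is not None  # True
```

Using a raw string makes it explicit that the value is a pattern and avoids doubling backslashes when escapes are needed.
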

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: abd8c747-53f9-4de8-8268-c0cf335189c9

📥 Commits

Reviewing files that changed from the base of the PR and between 86f9602 and 30aaf48.

📒 Files selected for processing (46)
  • examples/auto_deploy/model_registry/configs/deepseek_v4_pr1_5layer.yaml
  • examples/auto_deploy/model_registry/configs/deepseek_v4_pr1_full.yaml
  • examples/auto_deploy/model_registry/configs/deepseek_v4_pr1_single_rank_smoke.yaml
  • examples/auto_deploy/model_registry/models.yaml
  • tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py
  • tensorrt_llm/_torch/auto_deploy/compile/piecewise_utils.py
  • tensorrt_llm/_torch/auto_deploy/config/default.yaml
  • tensorrt_llm/_torch/auto_deploy/custom_ops/README.md
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention/__init__.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention/deepseek_v4_sparse_attention.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention/torch_attention.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/mxfp4_moe.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/torch_moe.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/linear/linear.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/quantization/torch_quant.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/sharding_ops.py
  • tensorrt_llm/_torch/auto_deploy/models/checkpoint_metadata.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/__init__.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_deepseek_v4.py
  • tensorrt_llm/_torch/auto_deploy/models/quant_checkpoint_layout.py
  • tensorrt_llm/_torch/auto_deploy/models/quant_config_reader.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/mxfp4_moe.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/quantization.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/rms_norm.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/sharding_ir.py
  • tensorrt_llm/_torch/auto_deploy/utils/dist_config.py
  • tensorrt_llm/_torch/auto_deploy/utils/node_utils.py
  • tensorrt_llm/_torch/auto_deploy/utils/quantization_utils.py
  • tests/integration/defs/accuracy/test_llm_api_autodeploy.py
  • tests/unittest/_torch/auto_deploy/unit/models/test_deepseek_v4_quant_checkpoint_layout.py
  • tests/unittest/_torch/auto_deploy/unit/quantization/test_deepseek_v4_finegrained_fp8_linear.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_deepseek_v4_modeling.py
  • tests/unittest/auto_deploy/multigpu/transformations/library/test_apply_sharding_hints.py
  • tests/unittest/auto_deploy/singlegpu/compile/test_captured_graph.py
  • tests/unittest/auto_deploy/singlegpu/compile/test_piecewise_utils.py
  • tests/unittest/auto_deploy/singlegpu/custom_ops/attention/test_deepseek_v4_sparse_attention.py
  • tests/unittest/auto_deploy/singlegpu/custom_ops/moe/test_torch_moe_swiglu_limit.py
  • tests/unittest/auto_deploy/singlegpu/custom_ops/moe/test_torch_mxfp4_moe.py
  • tests/unittest/auto_deploy/singlegpu/custom_ops/quantization/test_quant.py
  • tests/unittest/auto_deploy/singlegpu/models/test_hf.py
  • tests/unittest/auto_deploy/singlegpu/transformations/library/test_fuse_rmsnorm.py
  • tests/unittest/auto_deploy/singlegpu/utils/test_example_configs.py
  • tests/unittest/auto_deploy/singlegpu/utils/test_node_utils_sharding.py
  • tests/unittest/auto_deploy/singlegpu/utils/test_quantization_utils.py

Comment thread tensorrt_llm/_torch/auto_deploy/models/checkpoint_metadata.py
Comment thread tensorrt_llm/_torch/auto_deploy/models/custom/modeling_deepseek_v4.py Outdated
Comment thread tensorrt_llm/_torch/auto_deploy/transform/library/mxfp4_moe.py Outdated
Comment thread tensorrt_llm/_torch/auto_deploy/transform/library/mxfp4_moe.py Outdated
Comment thread tests/integration/defs/accuracy/test_llm_api_autodeploy.py Outdated
@bmarimuthu-nv

Collaborator Author

Branch file-change summary

I reviewed the branch diff against the merge base with upstream/main (b2bb0ad11372943e7948d9f219d2dedd3892a533) at branch head 30aaf4830c.

Diff size: 46 files, about 14.6k inserted lines and 183 deleted lines.

| Status | File | Summary |
| --- | --- | --- |
| Added | `examples/auto_deploy/model_registry/configs/deepseek_v4_pr1_5layer.yaml` | Adds the main 5-layer DeepSeek-V4-Flash PR1 smoke config using deepseek_v4_sparse, torch-cudagraph, FP8 linear quantization, MXFP4 MoE, and 8-way TP/EP sharding hints. |
| Added | `examples/auto_deploy/model_registry/configs/deepseek_v4_pr1_full.yaml` | Adds the full-layer DeepSeek V4 PR1 generation-quality config with the same attention/quant/sharding path and checkpoint preload disabled. |
| Added | `examples/auto_deploy/model_registry/configs/deepseek_v4_pr1_single_rank_smoke.yaml` | Adds a single-rank smoke override that keeps the checkpoint/quant/cached-attention path but disables sharding hints. |
| Modified | `examples/auto_deploy/model_registry/models.yaml` | Registers deepseek-ai/DeepSeek-V4-Flash with the PR1 5-layer registry config. |
| Modified | `tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py` | Minor logging cleanup for the trailing static lm_head eager fallback in piecewise CUDA graph capture. |
| Modified | `tensorrt_llm/_torch/auto_deploy/compile/piecewise_utils.py` | Marks auto_deploy::torch_deepseek_v4_sparse_attention_with_cache as a dynamic cached op for piecewise graph splitting/out-buffer handling. |
| Modified | `tensorrt_llm/_torch/auto_deploy/config/default.yaml` | Adds mxfp4_backend: auto as the default for quantize_mxfp4_moe. |
| Modified | `tensorrt_llm/_torch/auto_deploy/custom_ops/README.md` | Documents the new torch-reference MXFP4 MoE ops and EP variants. |
| Modified | `tensorrt_llm/_torch/auto_deploy/custom_ops/attention/__init__.py` | Exposes the new DeepSeek V4 sparse attention module. |
| Added | `tensorrt_llm/_torch/auto_deploy/custom_ops/attention/deepseek_v4_sparse_attention.py` | Adds the DeepSeek V4 sparse attention source and cached reference backend, including SWA/MHC cache handling, compression ratios 0, 4, and 128, prefill/decode paths, fake/export support, and AttentionRegistry descriptor. |
| Modified | `tensorrt_llm/_torch/auto_deploy/custom_ops/attention/torch_attention.py` | Casts sink-normalized attention probabilities to value dtype before the value matmul. |
| Modified | `tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/mxfp4_moe.py` | Adds torch-reference MXFP4 decode/routing/EP ops, routing-driven variants, DeepSeek/GPT-OSS gate-up handling, and lazy optional triton_kernels imports. |
| Modified | `tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/torch_moe.py` | Adds swiglu_limit support for clamping gated MLP gate/up projections before activation/product. |
| Modified | `tensorrt_llm/_torch/auto_deploy/custom_ops/linear/linear.py` | Adds auto_deploy::torch_grouped_linear for grouped projections such as DeepSeek V4 wo_a. |
| Modified | `tensorrt_llm/_torch/auto_deploy/custom_ops/quantization/torch_quant.py` | Adds UE8M0 activation-scale rounding, grouped fine-grained FP8 linear, block FP8 dequant helpers, and a TRT-LLM fine-grained FP8 linear wrapper. |
| Modified | `tensorrt_llm/_torch/auto_deploy/custom_ops/sharding_ops.py` | Adds tp_min_local_shape metadata to auto_deploy.view for sharding-aware parameter reshapes. |
| Added | `tensorrt_llm/_torch/auto_deploy/models/checkpoint_metadata.py` | Adds safetensors index/header metadata readers used to validate model-specific quantized checkpoint layouts without loading tensor payloads. |
| Modified | `tensorrt_llm/_torch/auto_deploy/models/custom/__init__.py` | Registers the DeepSeek V4 custom model module for lazy import. |
| Added | `tensorrt_llm/_torch/auto_deploy/models/custom/modeling_deepseek_v4.py` | Adds the semantic DeepSeek V4 AutoDeploy model/config/factory, checkpoint-key remapping, sparse attention wiring, compressor/indexer logic, routing, MXFP4 expert load path, and DeepSeek V4 quant checkpoint layout registration. |
| Added | `tensorrt_llm/_torch/auto_deploy/models/quant_checkpoint_layout.py` | Adds reusable checkpoint-layout infrastructure for fine-grained FP8 tensors and packed MXFP4 expert tensors, including validation, scale decoding, packing, and quant-config normalization. |
| Modified | `tensorrt_llm/_torch/auto_deploy/models/quant_config_reader.py` | Extends HF quant config detection to build model-owned checkpoint layouts, read safetensors metadata, validate layout contracts, and inject extra model/quant config. |
| Modified | `tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py` | Adds an opt-in cached DeepSeek V4 sparse attention transform that defaults to backend: deepseek_v4_sparse. |
| Modified | `tensorrt_llm/_torch/auto_deploy/transform/library/mxfp4_moe.py` | Adds MXFP4 backend selection, packed expert runtime-buffer registration, checkpoint-layout load hooks, and torch/triton lowering selection. |
| Modified | `tensorrt_llm/_torch/auto_deploy/transform/library/quantization.py` | Makes fine-grained FP8 quantization checkpoint-layout aware and adds targeted grouped-linear quantization. |
| Modified | `tensorrt_llm/_torch/auto_deploy/transform/library/rms_norm.py` | Casts FlashInfer RMSNorm FP32 weights to input dtype while preserving FX metadata. |
| Modified | `tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py` | Extends legacy sharding support to torch MXFP4 MoE ops and schema-based MXFP4 EP rewrite. |
| Modified | `tensorrt_llm/_torch/auto_deploy/transform/library/sharding_ir.py` | Adds grouped FP8 sharding, view-fed parameter sharding, attention sink sharding, configurable distributed backend selection, and generalized stacked MXFP4 EP rewrites. |
| Modified | `tensorrt_llm/_torch/auto_deploy/utils/dist_config.py` | Adds dist_backend to distributed sharding config serialization and mapping conversion. |
| Modified | `tensorrt_llm/_torch/auto_deploy/utils/node_utils.py` | Recognizes grouped fine-grained FP8 and new torch MXFP4 MoE ops in helper predicates. |
| Modified | `tensorrt_llm/_torch/auto_deploy/utils/quantization_utils.py` | Adds fake FP8/FP4 activation quantization helpers, power-of-two scale helper, and Hadamard rotation. |
| Modified | `tests/integration/defs/accuracy/test_llm_api_autodeploy.py` | Adds DeepSeek V4 Flash PR1 real-checkpoint smoke tests for single-rank and 8-rank registry flows. |
| Added | `tests/unittest/_torch/auto_deploy/unit/models/test_deepseek_v4_quant_checkpoint_layout.py` | Tests HF reader selection of the DeepSeek V4 checkpoint layout and generic non-DeepSeek fallback. |
| Added | `tests/unittest/_torch/auto_deploy/unit/quantization/test_deepseek_v4_finegrained_fp8_linear.py` | Tests DeepSeek V4 FP8 scale alias handling, UE8M0 decoding, grouped-linear quantization, and generic FP8 behavior. |
| Added | `tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_deepseek_v4_modeling.py` | Adds broad semantic-model coverage: config, remapping, rotary/RMSNorm/MLP/routing, dense/sparse attention parity, export, sharding, MXFP4 graph handling, and factory registration. |
| Modified | `tests/unittest/auto_deploy/multigpu/transformations/library/test_apply_sharding_hints.py` | Adds IR sharding tests for grouped FP8, DeepSeek V4-shaped graphs, dist backend selection, list MoE all-reduce, and stacked/routing-driven MXFP4 EP load hooks. |
| Modified | `tests/unittest/auto_deploy/singlegpu/compile/test_captured_graph.py` | Adds monolithic decode CUDA graph fallback coverage for non-matching prefill shapes. |
| Modified | `tests/unittest/auto_deploy/singlegpu/compile/test_piecewise_utils.py` | Tests DeepSeek sparse cached attention dynamic-op and out-buffer classification. |
| Added | `tests/unittest/auto_deploy/singlegpu/custom_ops/attention/test_deepseek_v4_sparse_attention.py` | Adds source/cached sparse attention tests for sink handling, masks, duplicates, compression, prefill/decode cache behavior, CUDA graph replay, fake/export behavior, and cache transform insertion. |
| Added | `tests/unittest/auto_deploy/singlegpu/custom_ops/moe/test_torch_moe_swiglu_limit.py` | Tests the new swiglu_limit behavior in torch_moe. |
| Added | `tests/unittest/auto_deploy/singlegpu/custom_ops/moe/test_torch_mxfp4_moe.py` | Tests torch MXFP4 MoE, EP partitioning, routing-driven DeepSeek layout, CUDA graph capture, backend selection, and expert block-size validation. |
| Modified | `tests/unittest/auto_deploy/singlegpu/custom_ops/quantization/test_quant.py` | Adds UE8M0 fine-grained FP8 and grouped fine-grained FP8 reference tests. |
| Modified | `tests/unittest/auto_deploy/singlegpu/models/test_hf.py` | Adds coverage for disable_preload=True using accelerate checkpoint loading. |
| Modified | `tests/unittest/auto_deploy/singlegpu/transformations/library/test_fuse_rmsnorm.py` | Tests FlashInfer RMSNorm FP32-weight casting to input dtype. |
| Modified | `tests/unittest/auto_deploy/singlegpu/utils/test_example_configs.py` | Adds DeepSeek V4 PR1 registry dry-run validation. |
| Modified | `tests/unittest/auto_deploy/singlegpu/utils/test_node_utils_sharding.py` | Tests grouped fine-grained FP8 op shardability detection. |
| Modified | `tests/unittest/auto_deploy/singlegpu/utils/test_quantization_utils.py` | Tests fake FP8/FP4 quantization and Hadamard rotation export/shape behavior. |

@bmarimuthu-nv

Collaborator Author

Checkpoint layout handling and why DeepSeek V4 needs it

This branch adds a model-owned checkpoint-layout layer for quantized HF checkpoints. The main flow is:

  1. AutoModelForCausalLMFactory resolves the checkpoint directory and calls _load_quantization_config().
  2. HFQuantConfigReader.from_file() reads config.json.
  3. If the model registers a QuantizedCheckpointLayout, the reader reads safetensors header metadata and validates the physical checkpoint tensor contract before returning the quant config.
  4. The resulting quant config carries checkpoint_layout, which downstream FP8 and MXFP4 transforms use for load hooks, scale handling, and expert packing.

The safetensors metadata reader only reads names, dtypes, and shapes from the safetensors header. It does not load tensor payloads. That is useful here because DeepSeek V4 needs to decide early whether this checkpoint physically matches the model-specific packed layout.
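
To make the header-only read concrete, here is a minimal sketch, not the reader added in this PR. It relies only on the public safetensors file format: an 8-byte little-endian header length followed by a JSON header that maps tensor names to dtype, shape, and byte offsets. The function names, the `demo.proj.*` tensor names, and the dtypes in the dummy file are illustrative assumptions.

```python
import json
import os
import struct
import tempfile


def read_safetensors_header(path):
    """Return {tensor_name: (dtype, shape)} without reading tensor payloads."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional string map, not a tensor entry.
    header.pop("__metadata__", None)
    return {name: (info["dtype"], tuple(info["shape"])) for name, info in header.items()}


def write_dummy_safetensors(path, tensors):
    """Hypothetical demo helper: write a tiny valid file with zeroed payloads."""
    header = {"__metadata__": {"format": "pt"}}
    offset = 0
    for name, (dtype, shape, nbytes) in tensors.items():
        # Offsets are relative to the start of the payload, after the header.
        header[name] = {
            "dtype": dtype,
            "shape": list(shape),
            "data_offsets": [offset, offset + nbytes],
        }
        offset += nbytes
    blob = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(blob)) + blob + bytes(offset))


with tempfile.TemporaryDirectory() as tmp:
    ckpt = os.path.join(tmp, "demo.safetensors")
    # Names and dtypes below are made up for the demo.
    write_dummy_safetensors(
        ckpt,
        {
            "demo.proj.weight": ("F8_E4M3", (256, 128), 256 * 128),
            "demo.proj.scale": ("U8", (2, 1), 2),
        },
    )
    meta = read_safetensors_header(ckpt)

print(meta)
```

Because only the header is parsed, the cost is independent of checkpoint size, which is what makes early layout validation cheap.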

What is special in DeepSeek V4:

  • quantization_config.quant_method == "fp8" is not enough to describe the full checkpoint. The DeepSeek V4 path contains fine-grained FP8 linear tensors plus packed MXFP4 routed expert tensors.
  • The FP8 linears are targeted by model-specific module-name patterns. The layout targets attention projections, the indexer projection, and shared experts, while excluding embeddings, lm head, gate, compressor, norms, HC params, attention sinks, and MTP.
  • The FP8 companion scale tensors use checkpoint names like <module>.scale, but runtime expects <module>.weight_scale_inv. The layout validates scale dtype/shape and decodes UE8M0 scales before loading.
  • The routed expert checkpoint tensors are stored per layer, per expert, and per projection: w1, w2, w3, each with weight and scale. Runtime wants packed buffers: gate_up_proj_blocks, gate_up_proj_scales, down_proj_blocks, and down_proj_scales.
  • The layout encodes DeepSeek's gate/up/down meaning: w1 is gate, w3 is up, w2 is down, and the runtime packed gate-up order is w3, then w1.

So the new infra is not just parsing another quantization method. It bridges an HF checkpoint's physical tensor layout to the runtime graph layout AutoDeploy needs, while failing early if required companion tensors, dtypes, or block shapes are missing.
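
A rough sketch of three of the layout responsibilities described above: scale-key aliasing, UE8M0 decoding, and the fail-early companion check. This is an illustration under assumptions, not the `QuantizedCheckpointLayout` code. All function names and the `demo.*` tensor names are hypothetical. The one hard fact used is the E8M0 encoding, where a byte `e` represents `2 ** (e - 127)` and `0xFF` is NaN.

```python
import math


def decode_ue8m0(byte):
    """UE8M0 stores only a biased exponent: value = 2 ** (byte - 127)."""
    if not 0 <= byte <= 254:
        raise ValueError("0xFF encodes NaN in E8M0; other values must fit in one byte")
    return math.ldexp(1.0, byte - 127)


def remap_scale_key(name):
    """Map the checkpoint name '<module>.scale' to runtime '<module>.weight_scale_inv'."""
    suffix = ".scale"
    if name.endswith(suffix):
        return name[: -len(suffix)] + ".weight_scale_inv"
    return name


def missing_scale_companions(tensor_names, is_quantized_weight):
    """List the '<module>.scale' tensors that targeted weights require but lack."""
    names = set(tensor_names)
    missing = []
    for name in sorted(names):
        if name.endswith(".weight") and is_quantized_weight(name):
            scale = name[: -len(".weight")] + ".scale"
            if scale not in names:
                missing.append(scale)
    return missing


# Hypothetical header names; the predicate stands in for module-name patterns.
names = ["demo.q_proj.weight", "demo.q_proj.scale", "demo.o_proj.weight", "demo.norm.weight"]
missing = missing_scale_companions(names, lambda n: "proj" in n)
print(missing)
```

Running the companion check on header metadata alone is what lets the reader reject a mismatched checkpoint before any weights are loaded.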

@bmarimuthu-nv

Collaborator Author

MoE forward and torch_mxfp4_moe_from_routing

The DeepSeek V4 MoE path is routing-driven.

torch_mxfp4_moe is the generic packed-MXFP4 op that takes router weights/bias and top_k; it computes routing inside the op.

torch_mxfp4_moe_from_routing is different: it receives selected_experts and routing_weights that were already computed by the model, then applies the selected packed MXFP4 experts.

That distinction matters for DeepSeek V4 because routing is model-specific:

  • DeepseekV4MoEGate.forward() computes scores with sqrt(softplus(router_logits)).
  • Early hash-routing layers use tid2eid[input_ids] to choose experts.
  • Other layers choose experts with (scores + bias).topk(top_k).
  • The selected scores are gathered, optionally normalized, and multiplied by routed_scaling_factor.

Then DeepseekV4MoE.forward() passes those precomputed selected_experts and routing_weights into torch_mxfp4_moe_from_routing.
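
For readers who want the routing steps in one place, here is a single-token sketch in plain Python. It is an illustration, not the `DeepseekV4MoEGate` code: `route_token` and its parameters are hypothetical, and it assumes the gathered weights come from the unbiased scores, with the bias used only for selection.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def route_token(router_logits, bias, top_k, routed_scaling_factor,
                normalize=True, hash_experts=None):
    """Return (selected_experts, routing_weights) for one token."""
    scores = [math.sqrt(softplus(v)) for v in router_logits]
    if hash_experts is not None:
        # Hash-routing layers: experts come from a lookup such as tid2eid[input_id].
        selected = list(hash_experts)
    else:
        # Other layers: top-k over biased scores.
        biased = [s + b for s, b in zip(scores, bias)]
        selected = sorted(range(len(scores)), key=lambda e: biased[e], reverse=True)[:top_k]
    # Assumption: gather the unbiased scores of the chosen experts.
    weights = [scores[e] for e in selected]
    if normalize:
        total = sum(weights)
        weights = [w / total for w in weights]
    return selected, [w * routed_scaling_factor for w in weights]


selected, weights = route_token(
    router_logits=[0.0, 2.0, -1.0, 1.0],
    bias=[0.0, 0.0, 5.0, 0.0],
    top_k=2,
    routed_scaling_factor=1.5,
)
print(selected, weights)
```

In this example the large bias pulls expert 2 into the selection even though its raw score is the lowest, which is why the routing result cannot be reproduced by an op that only sees router weights and `top_k`.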

The packed expert buffers follow the DeepSeek checkpoint/runtime layout:

  • gate_up_proj_blocks/scales contain the packed gate/up projections.
  • down_proj_blocks/scales contain the packed down projection.
  • gate_up_order="up_gate" means the runtime buffer is interpreted as up first, then gate.
  • swiglu_mode="deepseek" applies the DeepSeek SwiGLU product after the optional swiglu_limit clamp.

The output of the routed packed experts is reshaped back to the original hidden-state shape and added to the shared expert output. In short: DeepSeek computes routing in the semantic model, while the MXFP4 op focuses on applying the already-selected quantized experts.
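
A toy sketch of the per-token math after routing, written on plain Python lists. Several details are assumptions rather than facts from this PR: that `up_gate` means the first half of the gate-up output is `up` and the second half is `gate`, that the clamp caps `gate` from above and bounds `up` to `[-limit, limit]`, and that the product is `silu(gate) * up`. The function names are hypothetical.

```python
import math


def silu(x):
    return x / (1.0 + math.exp(-x))


def swiglu_up_gate(gate_up, limit=None):
    """Apply SwiGLU to one row laid out as [up..., gate...] (assumed layout)."""
    half = len(gate_up) // 2
    up, gate = gate_up[:half], gate_up[half:]
    if limit is not None:
        # Assumed clamp: one-sided on gate, symmetric on up.
        gate = [min(g, limit) for g in gate]
        up = [max(-limit, min(u, limit)) for u in up]
    return [silu(g) * u for g, u in zip(gate, up)]


def combine(shared_out, expert_outs, routing_weights):
    """Add the weighted routed-expert outputs to the shared-expert output."""
    out = list(shared_out)
    for y, w in zip(expert_outs, routing_weights):
        out = [o + w * v for o, v in zip(out, y)]
    return out


clamped = swiglu_up_gate([100.0, 100.0], limit=7.0)
mixed = combine([1.0, 1.0], [[2.0, 0.0], [0.0, 4.0]], [0.5, 0.25])
print(clamped, mixed)
```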

@bmarimuthu-nv

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #52886 [ run ] triggered by Bot. Commit: 0a30c44 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #52886 [ run ] completed with state SUCCESS. Commit: 0a30c44
/LLM/main/L0_MergeRequest_PR pipeline #42136 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@bmarimuthu-nv

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #52974 [ run ] triggered by Bot. Commit: cf78bbc Link to invocation

@bmarimuthu-nv

Collaborator Author

DeepSeek V4 Flash AutoDeploy e2e output sanity check

Ran the 10 default text prompts from examples/auto_deploy/build_and_run_ad.py after the latest fixes.

  • Build time: 351.611s
  • Total time: 615.554s
  • Outputs: 10
  • Configs: dashboard_default.yaml, world_size_8.yaml, deepseek_v4_flash.yaml
  • Commit tested: 8f54a7f934

1. Prompt

How big is the universe? 

Response:

That's one of the most profound questions we can ask. The short answer is: **we don't know, but we can measure the *observable* universe.**

Let's break it down into the two key concepts: the **Observable Universe** and the **Entire Universe**.

### 1. The Observable Universe (What we can see)

This is the part of the universe from which light has had time to reach us since the Big Bang, 13.8 billion years ago.

2. Prompt

In simple words and a single sentence, explain the concept of gravity: 

Response:

Gravity is the invisible force that pulls objects with mass towards one another, like how Earth pulls everything toward its center, keeping us on the ground and the Moon in orbit.

3. Prompt

How to fix slicing in golf? 

Response:

Fixing a slice is one of the most common goals in golf. A slice is caused by **side-spin**, which results from a combination of an **open clubface** relative to the swing path and an **outside-to-in swing path**.

Here is a step-by-step guide to fix it, starting with the easiest fix and moving to the technical swing change.

### The Simple Fix: Grip & Setup (50% of slices)

Most amateur slices happen before the club even moves.

**

4. Prompt

Where is the capital of Iceland? 

Response:

The capital of Iceland is **Reykjavík**. It is located in the southwestern part of the country, on the southern shore of Faxaflói Bay.

5. Prompt

What are the three laws of thermodynamics? 

Response:

Here are the three laws of thermodynamics, explained in their classic form along with a practical interpretation:

### 1. The First Law (Law of Conservation of Energy)
**Classic Statement:** Energy cannot be created or destroyed in an isolated system; it can only change forms or be transferred from one system to another.

**Practical Meaning:** The total amount of energy in the universe is constant. If you put heat into a system, that energy either increases the system's internal energy or does work on its

6. Prompt

Summarize the plot of Romeo and Juliet in two sentences: 

Response:

In Verona, two teenagers from feuding families fall in love at first sight and secretly marry, but a series of violent clashes and a tragic miscommunication lead to their double suicide. Their deaths ultimately shock their warring families into ending their bitter feud.

7. Prompt

Write a Python function that checks if a number is prime.

Response:

Here's a Python function that checks if a number is prime:

```python
def is_prime(n):
    """
    Check if a number is prime.
    
    Args:
        n: Integer to check
    
    Returns:
        True if n is prime, False otherwise
    """
    # Handle edge cases
    if n <= 1:
        return False
    if n <= 3:
        return True
    
    # Check if n is divisible by 2 or 3
```

8. Prompt

Explain the difference between a compiler and an interpreter: 

Response:

The core difference between a compiler and an interpreter lies in **when** they translate your source code (the human-readable instructions you write) into machine code (the 1s and 0s the computer understands).

Here’s the breakdown:

### The Simple Analogy: A Translator

Imagine you have a book written in English, and your friend only speaks Spanish.

- **Interpreter:** Your friend sits next to you. You read the English book aloud, one sentence at a time. Your

9. Prompt

What causes the northern lights? 

Response:

This is a great question. The northern lights (aurora borealis) and their southern counterpart (aurora australis) are caused by the interaction between the solar wind and Earth's magnetic field and atmosphere.

Here is the process broken down step-by-step:

### The Short Answer

The Sun constantly sends out a stream of charged particles (the *solar wind*). When these particles reach Earth, they are funneled by our planet's magnetic field toward the poles. There, they collide with

10. Prompt

What are the health benefits of drinking green tea?

Response:

Drinking green tea is associated with several health benefits due to its rich content of antioxidants, particularly catechins, as well as other bioactive compounds. Here are some evidence-based health benefits:

1. **Rich in Antioxidants**  
   Green tea is packed with polyphenols like epigallocatechin gallate (EGCG), which help neutralize harmful free radicals, reducing oxidative stress and damage to cells.

2. **May Support Heart Health**  
   Regular consumption of green tea has been linked to lower

@bmarimuthu-nv

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #52974 [ run ] completed with state SUCCESS. Commit: cf78bbc
/LLM/main/L0_MergeRequest_PR pipeline #42209 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Comment thread tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
@suyoggupta suyoggupta marked this pull request as ready for review June 9, 2026 19:39
@suyoggupta suyoggupta requested a review from a team as a code owner June 9, 2026 19:39
@suyoggupta suyoggupta requested a review from MrGeva June 9, 2026 19:39
@bmarimuthu-nv bmarimuthu-nv changed the title [None][feat] AutoDeploy: Add DeepSeekV4 Support [None][feat] AutoDeploy: Add DeepSeekV4 Flash Support Jun 9, 2026
@govind-ramnarayan

Collaborator

Here is a review from Claude and Codex, generated with some review guidelines I've been using recently. Figured I'd just paste it here so you can see if any of this is reasonable to add :)

[P1] Real DeepSeek V4 checkpoints may fail quantized-layout validation before load-hook key remapping.
HFQuantConfigReader reads raw safetensors tensor names and calls layout.validate_checkpoint_metadata() before the DeepSeek V4 load hook renames keys. The layout patterns expect canonical names like layers..ffn.experts..w1/w2/w3, but the model includes a rename hook for model.* prefixes and gate_proj/up_proj/down_proj aliases. The unit fixture uses already-canonical names, so it would not catch a real HF-style checkpoint. Please validate against the actual DeepSeek-V4-Flash safetensors headers or normalize tensor names with the same rename map before metadata validation.

[P2] Cached sparse attention lacks same-forward mixed prefill+decode coverage.
The production cached op has a decode-only fast path, but mixed batches go through the generic path with host metadata conversion and per-sequence cache writes. Existing tests cover context-only, decode-only, and staged prefill-then-decode, but not one call with both num_prefill > 0 and num_decode > 0. Please add a mixed BatchInfo test for ratio 0 and one compressed mode.

[P2] dist_backend="torch" does not force torch collectives.
_get_dist_ops("torch") still returns TRT-LLM collectives whenever TRT-LLM ops are available. Since this PR starts passing config.dist_backend into sharding call sites and adds a test asserting torch backend behavior, torch should bypass the availability probe; only "auto" should probe.
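
A minimal sketch of the selection rule the last item asks for. `select_dist_ops` is a hypothetical stand-in that returns a label, not the real `_get_dist_ops` signature or return type, and `trtllm_ops_available` is an assumed zero-argument probe.

```python
def select_dist_ops(dist_backend, trtllm_ops_available):
    """Pick a collective backend; only "auto" consults the availability probe."""
    if dist_backend == "torch":
        # Explicit request: never probe, never fall through to TRT-LLM ops.
        return "torch"
    if dist_backend == "trtllm":
        if not trtllm_ops_available():
            raise RuntimeError("dist_backend='trtllm' requested but TRT-LLM ops are unavailable")
        return "trtllm"
    if dist_backend == "auto":
        return "trtllm" if trtllm_ops_available() else "torch"
    raise ValueError(f"unknown dist_backend: {dist_backend!r}")


def probe_must_not_run():
    raise AssertionError("availability probe called for an explicit backend")


forced = select_dist_ops("torch", probe_must_not_run)
auto_with_trtllm = select_dist_ops("auto", lambda: True)
auto_without_trtllm = select_dist_ops("auto", lambda: False)
print(forced, auto_with_trtllm, auto_without_trtllm)
```

Passing a probe that raises, as in `probe_must_not_run`, is one way a unit test could assert that the explicit `"torch"` path skips detection entirely.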

@bmarimuthu-nv bmarimuthu-nv requested a review from a team as a code owner June 10, 2026 04:20
@bmarimuthu-nv

Collaborator Author

/bot run --disable-fail-fast

Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>
@bmarimuthu-nv

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #54102 [ run ] triggered by Bot. Commit: 94e65e4 Link to invocation

@bmarimuthu-nv

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #54191 [ run ] triggered by Bot. Commit: 94e65e4 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54102 [ run ] completed with state ABORTED. Commit: 94e65e4

Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54191 [ run ] completed with state SUCCESS. Commit: 94e65e4
/LLM/main/L0_MergeRequest_PR pipeline #43270 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

6 participants