[None][feat] AutoDeploy: Add DeepSeekV4 Flash Support#15019
bmarimuthu-nv wants to merge 28 commits into
Conversation
|
@CodeRabbit review |
✅ Action performed
Review finished.
|
📝 Walkthrough
DeepSeek V4 AutoDeploy support was added across registry YAMLs, custom ops, checkpoint/layout readers, transforms, and tests. The PR wires sparse attention, MXFP4 MoE, grouped FP8 quantization, sharding hints, and DeepSeek V4 model loading into the auto-deploy flow.
Changes
DeepSeek V4 AutoDeploy
|
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py (1)
1482-1501: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Honor explicit `dist_backend="torch"` in `_get_dist_ops`.
Line 1493 currently selects TRTLLM ops whenever they are available, even when the caller explicitly requests `"torch"`. With this PR’s new backend threading, that can silently ignore config and route to the wrong distributed op.
Suggested fix
 def _get_dist_ops(backend: str):
@@
-    if backend == "trtllm" or is_trtllm_op_available():
+    if backend == "trtllm":
+        return (
+            torch.ops.auto_deploy.trtllm_dist_all_gather.default,
+            torch.ops.auto_deploy.trtllm_dist_all_reduce.default,
+        )
+    if backend == "torch":
+        return (
+            torch.ops.auto_deploy.torch_dist_all_gather.default,
+            torch.ops.auto_deploy.torch_dist_all_reduce.default,
+        )
+    if backend == "auto" and is_trtllm_op_available():
         return (
             torch.ops.auto_deploy.trtllm_dist_all_gather.default,
             torch.ops.auto_deploy.trtllm_dist_all_reduce.default,
         )
     return (
         torch.ops.auto_deploy.torch_dist_all_gather.default,
         torch.ops.auto_deploy.torch_dist_all_reduce.default,
     )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py` around lines 1482 - 1501, The function _get_dist_ops currently prefers TRT-LLM ops whenever they exist, ignoring an explicit dist_backend="torch"; change the selection logic so that TRT-LLM ops are chosen only when backend == "trtllm" or backend == "auto" and is_trtllm_op_available() is true, otherwise return the Torch ops for backend == "torch" (or any other non-trtllm value). Keep the existing handling of enum-like inputs (checking backend.value) and the existing return values (torch.ops.auto_deploy.trtllm_dist_all_gather.default / trtllm_dist_all_reduce.default vs torch.ops.auto_deploy.torch_dist_all_gather.default / torch_dist_all_reduce.default).
tensorrt_llm/_torch/auto_deploy/utils/dist_config.py (1)
132-157: ⚠️ Potential issue | 🔴 Critical
Add `dist_backend` support to `DistConfig.from_sharding_params()`
`from_sharding_params()` in `tensorrt_llm/_torch/auto_deploy/utils/dist_config.py` does not accept `dist_backend`, but `IRShardingConfig._init_dist_config()` calls it with `dist_backend=self.dist_backend` (in `tensorrt_llm/_torch/auto_deploy/transform/library/sharding_ir.py`). This will raise a `TypeError` and also prevents overriding the default `"auto"` backend. Add an optional `dist_backend` parameter and forward it into the `DistConfig(...)` construction.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/auto_deploy/utils/dist_config.py` around lines 132 - 157, from_sharding_params currently doesn't accept or forward dist_backend which causes a TypeError when IRShardingConfig._init_dist_config calls from_sharding_params(dist_backend=...), so add an optional parameter dist_backend: str = "auto" to the from_sharding_params signature and pass dist_backend=dist_backend into the DistConfig(...) constructor invocation so the DistConfig created reflects the requested backend and avoids the TypeError.
tensorrt_llm/_torch/auto_deploy/utils/node_utils.py (1)
663-669: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Also register the grouped FineGrained FP8 op as a fake-quant linear.
Adding it only to `is_finegrained_fp8_linear_op()` is incomplete: `is_any_lin_op()` still goes through `is_fake_quantized_linear_op()`, so helpers like `get_weight_shape()` and layer/subgraph discovery will continue to skip grouped FineGrained FP8 nodes.
Suggested follow-up
 def is_fake_quantized_linear_op(node: Node) -> bool:
     quantized_linear_op = {
         torch.ops.auto_deploy.torch_fake_quant_fp8_linear,
         torch.ops.auto_deploy.torch_fake_quant_nvfp4_linear,
         torch.ops.auto_deploy.torch_fake_quant_finegrained_fp8_linear,
+        torch.ops.auto_deploy.torch_fake_quant_grouped_finegrained_fp8_linear,
     }
     return is_op(node, quantized_linear_op)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/auto_deploy/utils/node_utils.py` around lines 663 - 669, The grouped FineGrained FP8 op (torch.ops.auto_deploy.torch_fake_quant_grouped_finegrained_fp8_linear) needs to be treated as a fake-quant linear as well so it is discovered by is_any_lin_op -> is_fake_quantized_linear_op flows; update the fake-quant linear registry/check by adding that symbol to whatever list or predicate used by is_fake_quantized_linear_op (or the helper it calls) so functions like get_weight_shape and layer/subgraph discovery will include grouped FineGrained FP8 nodes; search for is_fake_quantized_linear_op and add torch.ops.auto_deploy.torch_fake_quant_grouped_finegrained_fp8_linear to the same checks used for other fake-quant linear ops (keeping is_finegrained_fp8_linear_op unchanged).
tensorrt_llm/_torch/auto_deploy/custom_ops/quantization/torch_quant.py (1)
707-715: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift
Stop inferring FineGrained FP8 block size from the scale tensor shape.
`triton.cdiv(N, scale_n)` / `triton.cdiv(K, scale_k)` recovers the tail-expanded size, not the checkpoint’s canonical block size. For a 576-row weight stored on a 128-row grid, this computes `116`, so the Triton path and the BF16 dequant fallback both apply each scale row to the wrong weights. The grouped path repeats the same inference, so tailed FineGrained FP8 matrices will still mis-dequantize.
Please preserve the declared block size (128x128 here, or a value threaded from `checkpoint_layout`) instead of recomputing it from `weight_scale_inv.shape`.
Also applies to: 791-799, 874-897
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/auto_deploy/custom_ops/quantization/torch_quant.py` around lines 707 - 715, The code is incorrectly inferring FineGrained FP8 block size by using triton.cdiv on weight_scale_inv.shape (e.g., N,K = weight_quantized.shape; scale_n,scale_k = weight_scale_inv.shape; block_n = triton.cdiv(N, scale_n); block_k = triton.cdiv(K, scale_k)), which recovers a tail-expanded size instead of the canonical block size; change the logic to use the declared block size (the canonical checkpoint block dimensions provided by the checkpoint_layout or a threaded block_size argument) wherever block_size is computed/used (including the instances around the block_size assignment and the grouped path at the other ranges), stop deriving block_size from weight_scale_inv.shape, and thread or pass the correct block_size into _safe_act_quant and any other consumers so scale rows are applied to the intended 128x128 (or checkpoint-provided) blocks.
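The arithmetic behind the 576-row example can be checked with a few lines of plain Python. This is only a sketch of the block-size inference described above, not the code in `torch_quant.py`: `cdiv` is a local stand-in for `triton.cdiv`, and the variable names are illustrative assumptions.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division; same arithmetic as triton.cdiv for positive ints."""
    return -(-a // b)


rows = 576            # weight rows from the review comment
declared_block = 128  # canonical block size declared by the checkpoint

# The checkpoint stores one scale row per declared block, including the tail.
scale_rows = cdiv(rows, declared_block)

# Inferring the block size back from the scale shape does not round-trip.
inferred_block = cdiv(rows, scale_rows)

# Weight rows whose scale index differs between the two block sizes.
mismatched_rows = [
    r for r in range(rows) if r // declared_block != r // inferred_block
]

print(scale_rows, inferred_block, mismatched_rows[0])
```

With these numbers the grid has 5 scale rows and the inferred block is 116, so weight row 116 is already paired with the second scale row even though it belongs to the first 128-row block.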
🧹 Nitpick comments (5)
tensorrt_llm/_torch/auto_deploy/utils/dist_config.py (1)
44-44: ⚡ Quick win
Consider adding a field description.
The new `dist_backend` field lacks a `Field(description="...")`. As per coding guidelines, user-facing Pydantic fields should include descriptions. While this is consistent with the existing fields in this class, consider adding descriptions to clarify the purpose and valid values for users.
📝 Suggested improvement
-    dist_backend: Literal["auto", "torch", "trtllm"] = Field(default="auto")
+    dist_backend: Literal["auto", "torch", "trtllm"] = Field(
+        default="auto",
+        description="Distributed backend for communication primitives. "
+        "'auto' selects based on runtime; 'torch' uses PyTorch distributed; "
+        "'trtllm' uses TensorRT-LLM native backend."
+    )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/auto_deploy/utils/dist_config.py` at line 44, Add a user-facing description to the Pydantic field dist_backend by replacing Field(default="auto") with Field(default="auto", description="Choose distributed backend: 'auto' to auto-detect, 'torch' to use PyTorch distributed, or 'trtllm' to use the TRT-LLM backend."), matching the style and wording used for other fields in the same config class (dist_backend is the field to update).
Source: Coding guidelines
tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py (1)
630-638: ⚡ Quick win
Add an explicit return type annotation on `_apply`.
Suggested patch
-    def _apply(self, *args, **kwargs):
+    def _apply(self, *args, **kwargs) -> Tuple[GraphModule, TransformInfo]:
         if self.config.backend is None:
             self.config.backend = "deepseek_v4_sparse"
         elif self.config.backend != "deepseek_v4_sparse":
             raise ValueError(
                 "insert_cached_deepseek_v4_sparse_attention only supports "
                 f"backend='deepseek_v4_sparse', got {self.config.backend!r}."
             )
         return super()._apply(*args, **kwargs)
As per coding guidelines, always annotate Python function return types.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py` around lines 630 - 638, The _apply method is missing a return type annotation; update the def _apply signature to include an explicit return type (e.g., -> Any or the specific return type returned by super()._apply) so it complies with typing guidelines, and import typing.Any if needed; ensure the annotated function still sets self.config.backend and calls/returns super()._apply(*args, **kwargs).
Source: Coding guidelines
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_deepseek_v4.py (1)
1604-1617: ⚡ Quick win
Add return annotations on the new public model accessors.
These methods are part of the exported model surface and currently skip return typing, which makes the new class harder to use with the repo’s type-checking conventions.
💡 Suggested fix
-    def get_input_embeddings(self):
+    def get_input_embeddings(self) -> nn.Embedding:
         return self.embed

-    def set_input_embeddings(self, new_embeddings):
+    def set_input_embeddings(self, new_embeddings: nn.Embedding) -> None:
         self.embed = new_embeddings

-    def get_output_embeddings(self):
+    def get_output_embeddings(self) -> nn.Linear:
         return self.head

-    def set_output_embeddings(self, new_embeddings):
+    def set_output_embeddings(self, new_embeddings: nn.Linear) -> None:
         self.head = new_embeddings

-    def get_decoder(self):
+    def get_decoder(self) -> "DeepseekV4ForCausalLM":
         return self
As per coding guidelines, "Always annotate Python function return types; use `None` if function does not return anything."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_deepseek_v4.py` around lines 1604 - 1617, The new public accessors lack return type annotations; add explicit type hints: annotate get_input_embeddings as -> Optional[nn.Module] (or -> nn.Module if embed is always present), set_input_embeddings as -> None, get_output_embeddings as -> Optional[nn.Module] (or -> nn.Module), set_output_embeddings as -> None, and get_decoder as -> "DeepseekModel" (or -> nn.Module/self type) to match the class; also ensure typing.Optional and torch.nn as nn (or the appropriate model class name) are imported so the annotations type-check correctly.
Source: Coding guidelines
tests/unittest/auto_deploy/multigpu/transformations/library/test_apply_sharding_hints.py (2)
76-84: ⚡ Quick win
Add return annotations to the new helpers and methods.
These new FX/sharding helpers all omit `->` annotations, even though this repo requires return types on Python functions. That makes the GraphModule/optimizer contracts harder to follow in a file that is already very meta-heavy.
♻️ Representative fix
 class DeepSeekV4IRContractBlock(nn.Module):
-    def __init__(self):
+    def __init__(self) -> None:
         super().__init__()
         ...

-    def forward(self, x):
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
         ...

 def _make_optimizer(
     world_size: int,
     rank: int = 0,
     dist_backend: str | None = None,
     *,
     simple_shard_only: bool = False,
-):
+) -> InferenceOptimizer:
     ...

-def _export_deepseek_v4_contract_block():
+def _export_deepseek_v4_contract_block() -> GraphModule:
     ...

-def _register_mxfp4_checkpoint_layout_hooks(gm, checkpoint_layout):
+def _register_mxfp4_checkpoint_layout_hooks(gm, checkpoint_layout) -> GraphModule:
     ...
As per coding guidelines, "Always annotate Python function return types; use `None` if function does not return anything." Based on learnings, Python 3.10+ features can be used throughout the repo, including tests.
Also applies to: 218-323, 430-826
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/auto_deploy/multigpu/transformations/library/test_apply_sharding_hints.py` around lines 76 - 84, Several new functions/methods in this test (notably the class __init__ and forward shown, plus the FX/sharding helper functions referenced elsewhere) are missing return type annotations; update each function signature to include an explicit return annotation (use -> None for methods that do not return a value, or the appropriate concrete type for helpers that return tensors/GraphModule/etc.). Look for and update signatures for __init__, forward, and the helper functions mentioned around lines 218-323 and 430-826 so each has a return annotation while preserving parameter names and existing logic (e.g., def forward(self, x) -> torch.Tensor: or def helper(...) -> None / -> SomeType as appropriate).
Sources: Coding guidelines, Learnings
326-427: ⚡ Quick win
Add `-> None` to the new test cases.
The new tests follow the repo’s naming/style rules, but they still miss explicit return annotations. In this codebase, tests are expected to annotate returns too.
♻️ Representative fix
-def test_apply_hints_grouped_fp8_linear_trusts_group_sharded_view_input():
+def test_apply_hints_grouped_fp8_linear_trusts_group_sharded_view_input() -> None:
     ...

-def test_deepseek_v4_ir_contract_linear_view_sparse_attention():
+def test_deepseek_v4_ir_contract_linear_view_sparse_attention() -> None:
     ...

-def test_stacked_mxfp4_routing_driven_rank1_preserves_expert_start_with_torch_backend():
+def test_stacked_mxfp4_routing_driven_rank1_preserves_expert_start_with_torch_backend() -> None:
     ...
As per coding guidelines, "Always annotate Python function return types; use `None` if function does not return anything." Based on learnings, Python 3.10+ features can be used throughout the repo, including tests.
Also applies to: 483-495, 829-1172
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/auto_deploy/multigpu/transformations/library/test_apply_sharding_hints.py` around lines 326 - 427, Several new test functions are missing explicit return annotations; add "-> None" to each test function signature (e.g., test_apply_hints_grouped_fp8_linear_trusts_group_sharded_view_input, test_apply_hints_grouped_fp8_linear_slices_plain_global_input_groups, test_simple_shard_only_does_not_ordinary_shard_grouped_fp8_linear, test_apply_hints_default_dist_backend_uses_auto_selection, test_apply_hints_torch_dist_backend_forces_torch_all_reduce, test_deepseek_v4_ir_contract_linear_view_sparse_attention and the other tests referenced in ranges 483-495 and 829-1172) so each def line reads like "def <test_name>(...) -> None:"; keep signatures and bodies unchanged otherwise.
Sources: Coding guidelines, Learnings
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@tensorrt_llm/_torch/auto_deploy/custom_ops/attention/deepseek_v4_sparse_attention.py`:
- Line 231: The local variable compressed_len is declared but never used; change
the unpacking from "batch_size, compressed_len, ratio, _ = tensor.shape" to
ignore that value (e.g., "batch_size, _, ratio, _ = tensor.shape") so the unused
name is removed and lint warnings disappear; update the unpacking wherever
"tensor.shape" is decomposed in this scope to use "_" instead of
"compressed_len".
In `@tensorrt_llm/_torch/auto_deploy/models/checkpoint_metadata.py`:
- Around line 68-72: The code currently only catches json.JSONDecodeError when
reading safetensors index and should also normalize UnicodeDecodeError into
QuantizedCheckpointLayoutError so corrupted/invalid UTF-8 is reported as a
layout error (ensure autodetect_quant_config_reader() sees it). Update the
try/except in the block that opens the index (the one that raises
QuantizedCheckpointLayoutError) to catch both json.JSONDecodeError and
UnicodeDecodeError (e.g., except (json.JSONDecodeError, UnicodeDecodeError) as
error) and re-raise QuantizedCheckpointLayoutError(f"Invalid safetensors index
JSON: {path}") from error; apply the same change to the similar handler around
the other block referenced (the one at the 90-95 region) so all malformed
safetensors decoding errors are normalized.
In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_deepseek_v4.py`:
- Around line 789-801: The code currently constructs self.experts =
nn.ModuleList([...DeepseekV4MLP...]) before checking ad_use_mxfp4_experts, which
unnecessarily instantiates large dense experts for the MXFP4 path; change the
logic so that if self.ad_use_mxfp4_experts is True you skip creating
DeepseekV4MLP instances (set self.experts = nn.ModuleList() or defer creation)
and still call self._register_mxfp4_runtime_buffers() and
self._register_load_state_dict_pre_hook(self._load_mxfp4_checkpoint_experts);
adjust any code that expects len(self.experts) so it can handle an empty
ModuleList until the packed expert buffers are loaded.
In `@tensorrt_llm/_torch/auto_deploy/transform/library/mxfp4_moe.py`:
- Line 287: The loop binding layout names to extracted args uses
zip({_MXFP4_LAYOUT_ARG_NAMES}, expert_args) which can silently truncate
mismatches; change this to a strict check by either using
zip({_MXFP4_LAYOUT_ARG_NAMES}, expert_args, strict=True) or pre-validate lengths
(raise a descriptive exception if len(_MXFP4_LAYOUT_ARG_NAMES) !=
len(expert_args)) before the loop so any schema drift fails loudly; update the
loop over name and arg (the for ... in zip(...) block) to use the chosen strict
approach and include an informative error message mentioning
_MXFP4_LAYOUT_ARG_NAMES and expert_args.
- Around line 573-575: The call to _get_mxfp4_expert_dims currently unpacks into
num_experts, hidden_size, intermediate_size but those values aren’t used
(RUF059); change the unpacking to either drop unused values or prefix them with
an underscore (for example _num_experts, _hidden_size, _intermediate_size or
simply _, _, _) so the linter stops complaining and intent is clear; locate the
call to _get_mxfp4_expert_dims in the mxfp4_moe code and update the left-hand
side of the assignment accordingly.
In `@tests/integration/defs/accuracy/test_llm_api_autodeploy.py`:
- Around line 72-80: The loop that resolves
DEEPSEEK_V4_FLASH_MODEL_DIR/DEEPSEEK_V4_MODEL_DIR currently checks candidates =
(model_path, model_path / "DeepSeek-V4-Flash") which returns the root model_path
first; change the candidate order to check the subdirectory first (e.g.,
(model_path / "DeepSeek-V4-Flash", model_path)) so the DeepSeek-V4-Flash
directory is preferred when present, keeping the rest of the logic (env_var,
model_path, exists() check, and returning str(candidate)) the same.
In `@tests/unittest/auto_deploy/singlegpu/custom_ops/moe/test_torch_mxfp4_moe.py`:
- Around line 741-745: The test uses regex patterns in pytest.raises(...,
match="...") that contain metacharacters; change the string literals to raw
strings (prefix with r) for both calls to pytest.raises that reference
_resolve_mxfp4_expert_block_size (the two occurrences: one passing
{"expert_block_size": 64} and one passing _UnsupportedBlockSizeLayout()), so the
match argument becomes a raw string (e.g., r"...") to avoid
escaping/interpretation issues and satisfy the linter.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: abd8c747-53f9-4de8-8268-c0cf335189c9
📒 Files selected for processing (46)
- examples/auto_deploy/model_registry/configs/deepseek_v4_pr1_5layer.yaml
- examples/auto_deploy/model_registry/configs/deepseek_v4_pr1_full.yaml
- examples/auto_deploy/model_registry/configs/deepseek_v4_pr1_single_rank_smoke.yaml
- examples/auto_deploy/model_registry/models.yaml
- tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py
- tensorrt_llm/_torch/auto_deploy/compile/piecewise_utils.py
- tensorrt_llm/_torch/auto_deploy/config/default.yaml
- tensorrt_llm/_torch/auto_deploy/custom_ops/README.md
- tensorrt_llm/_torch/auto_deploy/custom_ops/attention/__init__.py
- tensorrt_llm/_torch/auto_deploy/custom_ops/attention/deepseek_v4_sparse_attention.py
- tensorrt_llm/_torch/auto_deploy/custom_ops/attention/torch_attention.py
- tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/mxfp4_moe.py
- tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/torch_moe.py
- tensorrt_llm/_torch/auto_deploy/custom_ops/linear/linear.py
- tensorrt_llm/_torch/auto_deploy/custom_ops/quantization/torch_quant.py
- tensorrt_llm/_torch/auto_deploy/custom_ops/sharding_ops.py
- tensorrt_llm/_torch/auto_deploy/models/checkpoint_metadata.py
- tensorrt_llm/_torch/auto_deploy/models/custom/__init__.py
- tensorrt_llm/_torch/auto_deploy/models/custom/modeling_deepseek_v4.py
- tensorrt_llm/_torch/auto_deploy/models/quant_checkpoint_layout.py
- tensorrt_llm/_torch/auto_deploy/models/quant_config_reader.py
- tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py
- tensorrt_llm/_torch/auto_deploy/transform/library/mxfp4_moe.py
- tensorrt_llm/_torch/auto_deploy/transform/library/quantization.py
- tensorrt_llm/_torch/auto_deploy/transform/library/rms_norm.py
- tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py
- tensorrt_llm/_torch/auto_deploy/transform/library/sharding_ir.py
- tensorrt_llm/_torch/auto_deploy/utils/dist_config.py
- tensorrt_llm/_torch/auto_deploy/utils/node_utils.py
- tensorrt_llm/_torch/auto_deploy/utils/quantization_utils.py
- tests/integration/defs/accuracy/test_llm_api_autodeploy.py
- tests/unittest/_torch/auto_deploy/unit/models/test_deepseek_v4_quant_checkpoint_layout.py
- tests/unittest/_torch/auto_deploy/unit/quantization/test_deepseek_v4_finegrained_fp8_linear.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_deepseek_v4_modeling.py
- tests/unittest/auto_deploy/multigpu/transformations/library/test_apply_sharding_hints.py
- tests/unittest/auto_deploy/singlegpu/compile/test_captured_graph.py
- tests/unittest/auto_deploy/singlegpu/compile/test_piecewise_utils.py
- tests/unittest/auto_deploy/singlegpu/custom_ops/attention/test_deepseek_v4_sparse_attention.py
- tests/unittest/auto_deploy/singlegpu/custom_ops/moe/test_torch_moe_swiglu_limit.py
- tests/unittest/auto_deploy/singlegpu/custom_ops/moe/test_torch_mxfp4_moe.py
- tests/unittest/auto_deploy/singlegpu/custom_ops/quantization/test_quant.py
- tests/unittest/auto_deploy/singlegpu/models/test_hf.py
- tests/unittest/auto_deploy/singlegpu/transformations/library/test_fuse_rmsnorm.py
- tests/unittest/auto_deploy/singlegpu/utils/test_example_configs.py
- tests/unittest/auto_deploy/singlegpu/utils/test_node_utils_sharding.py
- tests/unittest/auto_deploy/singlegpu/utils/test_quantization_utils.py
Branch file-change summary
I reviewed the branch diff against the merge base with
Diff size: 46 files, about 14.6k inserted lines and 183 deleted lines.
|
Checkpoint layout handling and why DeepSeek V4 needs it
This branch adds a model-owned checkpoint-layout layer for quantized HF checkpoints. The main flow is:
The safetensors metadata reader only reads names, dtypes, and shapes from the safetensors header. It does not load tensor payloads. That is useful here because DeepSeek V4 needs to decide early whether this checkpoint physically matches the model-specific packed layout. What is special in DeepSeek V4:
So the new infra is not just parsing another quantization method. It bridges an HF checkpoint's physical tensor layout to the runtime graph layout AutoDeploy needs, while failing early if required companion tensors, dtypes, or block shapes are missing.
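A header-only read of the kind described above can be sketched with the standard library alone. The safetensors file format starts with an 8-byte little-endian length followed by a UTF-8 JSON header, so names, dtypes, and shapes are available without touching tensor payloads. The function name, the `LayoutError` class (a stand-in for the PR's `QuantizedCheckpointLayoutError`), and the demo tensor are assumptions for illustration, not the reader added in this branch.

```python
import json
import struct
import tempfile
from pathlib import Path


class LayoutError(ValueError):
    """Hypothetical stand-in for the PR's QuantizedCheckpointLayoutError."""


def read_safetensors_header(path: Path) -> dict[str, tuple[str, tuple[int, ...]]]:
    """Return {tensor name: (dtype, shape)} without reading tensor payloads."""
    with open(path, "rb") as f:
        prefix = f.read(8)
        if len(prefix) != 8:
            raise LayoutError(f"Truncated safetensors file: {path}")
        (header_len,) = struct.unpack("<Q", prefix)  # little-endian u64
        raw = f.read(header_len)
    if len(raw) != header_len:
        raise LayoutError(f"Truncated safetensors header: {path}")
    try:
        header = json.loads(raw.decode("utf-8"))
    except (json.JSONDecodeError, UnicodeDecodeError) as error:
        # Normalize both failure modes into one layout error.
        raise LayoutError(f"Invalid safetensors header: {path}") from error
    return {
        name: (entry["dtype"], tuple(entry["shape"]))
        for name, entry in header.items()
        if name != "__metadata__"  # optional free-form metadata entry
    }


# Demo: build a tiny file with one made-up FP8 tensor entry and read it back.
demo_header = {"w": {"dtype": "F8_E4M3", "shape": [2, 4], "data_offsets": [0, 8]}}
blob = json.dumps(demo_header).encode("utf-8")
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.safetensors"
    demo_path.write_bytes(struct.pack("<Q", len(blob)) + blob + bytes(8))
    meta = read_safetensors_header(demo_path)
```

Because only the header is parsed, a layout check built on this kind of reader can reject a mismatched checkpoint before any weights are loaded.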
MoE forward and
|
30aaf48 to
0a30c44
Compare
|
/bot run --disable-fail-fast |
|
PR_Github #52886 [ run ] triggered by Bot. Commit: |
|
PR_Github #52886 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #52974 [ run ] triggered by Bot. Commit: |
|
DeepSeek V4 Flash AutoDeploy e2e output sanity check
Ran the 10 default text prompts from
|
/bot run --disable-fail-fast |
|
PR_Github #52974 [ run ] completed with state
|
|
Here is a review from Claude and Codex, run with some review guidelines I've been using recently. I figured I'd just paste it here so you can see if any of it is reasonable to add :)
[P1] Real DeepSeek V4 checkpoints may fail quantized-layout validation before load-hook key remapping.
[P2] Cached sparse attention lacks same-forward mixed prefill+decode coverage.
[P2] dist_backend="torch" does not force torch collectives. |
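To make the last [P2] item concrete, here is a small standalone sketch of backend resolution in which an explicit request always wins and availability is only consulted for "auto". `resolve_dist_backend` and its parameters are hypothetical names for illustration; this is not the code in sharding.py.

```python
def resolve_dist_backend(requested: str, trtllm_available: bool) -> str:
    """Pick a collective backend; an explicit request is never overridden."""
    if requested == "trtllm":
        return "trtllm"
    if requested == "torch":
        # Honored even when TRT-LLM ops are available.
        return "torch"
    if requested == "auto":
        # Availability only matters when the caller did not choose.
        return "trtllm" if trtllm_available else "torch"
    # ASSUMPTION: rejecting unknown values; falling back to torch is another option.
    raise ValueError(f"unknown dist_backend: {requested!r}")


print(resolve_dist_backend("torch", trtllm_available=True))
```

The point of the sketch is the ordering of the checks: testing availability before the explicit "torch" case is what lets a configured backend be ignored without any error.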
Force-pushed from 8f71a83 to bfefc5e
|
/bot run --disable-fail-fast |
Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>
Force-pushed from 0e0b866 to 94e65e4
|
/bot run --disable-fail-fast |
|
PR_Github #54102 [ run ] triggered by Bot. Commit: |
|
/bot run --disable-fail-fast |
|
PR_Github #54191 [ run ] triggered by Bot. Commit: |
|
PR_Github #54102 [ run ] completed with state |
|
PR_Github #54191 [ run ] completed with state
|
Summary by CodeRabbit
New Features
Documentation
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment /bot help.