[None][feat] DSv4 prep: compressor and mHC primitives#15379

Open
lfr-0531 wants to merge 8 commits into NVIDIA:main from lfr-0531:user/fanrongl/dsv4-compressor-mhc

@lfr-0531 lfr-0531 commented Jun 15, 2026

Collaborator

Description

This is PR-2 from the DSv4 umbrella split. It lands standalone compressor and mHC primitives from #14751 without pulling in the DSv4 sparse cache manager, sparse MLA backend, MoE routing, or model/tokenizer changes.

Included:

  • compressor CUDA kernels, Python wrapper, thop registration, and kernel/TF32 unit tests
  • mHC CUDA kernels, Python wrapper, thop registration, and unit tests
  • minimal CMake/thop build wiring for compressor and mHC only
  • header-only deepseek_v4/__init__.py so standalone compressor import does not import the full DSv4 backend

Intentionally excluded:

  • test_compressor_module.py and cache_manager.py (PR-6)
  • sparse MLA backend implementation (PR-7)
  • IndexerTopK/TopK (PR-3)
  • attention fusion/custom ops beyond compressor/mHC (PR-4/PR-5)
  • MoE/routing and DSv4 model/tokenizer/API changes (PR-8/PR-9)

Verification

Base/source:

Build/install:

  • python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt --benchmarks --use_ccache --cuda_architectures "90-real;100-real" --configure_cmake
  • Built wheel: build/tensorrt_llm-1.3.0rc18-cp312-cp312-linux_x86_64.whl
  • Build logs confirmed mhcOp.cpp and compressorOp.cpp compiled into th_common.
  • python -m pip install --force-reinstall --no-deps build/tensorrt_llm-1.3.0rc18-cp312-cp312-linux_x86_64.whl
  • Note: attribution generation warned that ninja -t inputs returned no results for wheel targets, but the wheel build exited with status 0.

Import/custom-op smoke:

  • from tensorrt_llm.bindings.internal import thop succeeded.
  • deepseek_v4.compressor was imported from this worktree path.
  • cache_manager and backend deepseek_v4.py modules were not loaded by compressor import.
  • torch.ops.trtllm exposed:
    • compressor_prefill_reduction
    • compressor_paged_kv_compress
    • compressor_postprocess_scatter
    • mhc_big_fuse
    • mhc_gemm_sqrsum_fma
    • mhc_post_mapping
    • mhc_fused_hc
    • mhc_hc_head_apply

Tests:

  • Checked GPU state before each test; all 8 B300 GPUs were idle. Tests ran with CUDA_VISIBLE_DEVICES=0.
  • timeout 1200 python -m pytest -q --tb=short -ra tests/unittest/_torch/attention/sparse/deepseek_v4/test_compressor_kernel.py: 63 passed, 22 skipped, 3 warnings in 4.41s
  • timeout 1200 python -m pytest -q --tb=short -ra tests/unittest/_torch/attention/sparse/deepseek_v4/test_compressor_tf32.py: 4 passed, 2 warnings in 2.53s
  • timeout 1800 python -m pytest -q --tb=short -ra tests/unittest/_torch/modules/test_mhc.py: 50 passed, 3 warnings in 4.16s

Scope/pre-commit:

  • pre-commit run --files $(git diff --name-only HEAD) passed before commit; commit hooks passed.
  • Scope check returned no output:
    git diff --name-only HEAD | rg 'cache_manager.py|attention_backend/sparse/deepseek_v4/deepseek_v4.py|IndexerTopK|indexerTopK|RoutingKernelTopK|modeling_deepseekv4|tokenizer/deepseek_v4|fused_moe|moeGate|deepseekV4QNorm|fp8Quantize.cpp|mlaRopeInplaceOp|triton_fused_inv_rope'

Summary by CodeRabbit

  • New Features

    • Added KV cache compression support for DeepSeek V4 with multiple compression strategies for improved inference efficiency.
    • Implemented Multi-Head Hyper-Connection (mHC) mechanism with optimized CUDA kernels for enhanced model performance.
    • Added PyTorch integration layers for both compression and mHC operations with automatic kernel selection.
  • Tests

    • Added comprehensive test suites validating compression and mHC functionality against reference implementations.
    • Added profiling and regression tests to ensure correctness and detect performance issues.

Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
@lfr-0531 lfr-0531 requested review from a team as code owners June 15, 2026 14:56
@coderabbitai

coderabbitai Bot commented Jun 15, 2026

Contributor

📝 Walkthrough

Adds two new CUDA kernel libraries for DeepSeek-V4: mhcKernels (Multi-Head Hyper-Connection boundary ops with FMA and SM100 TF32/UMMA fused paths, autotuning, and Sinkhorn normalization) and compressorKernels (KV-cache compression for prefill and decode with FP8/MXFP4 postprocess-scatter). Both are built as CMake OBJECT libraries, registered as Torch custom ops via THOP, wrapped in Python nn.Module classes, and covered by unit tests and benchmarks.
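The walkthrough names Sinkhorn normalization without describing it. For orientation only, the generic Sinkhorn-Knopp iteration alternately rescales the rows and columns of a positive matrix until both approach unit sums. The plain-Python sketch below illustrates that generic technique; the function name, iteration count, epsilon, and sample matrix are assumptions made for illustration and are not taken from mhcKernels.cu, whose actual variant may differ.

```python
def sinkhorn_normalize(matrix, num_iters=20, eps=1e-12):
    """Generic Sinkhorn-Knopp: alternate row and column normalization.

    Assumes a square matrix of strictly positive floats. num_iters and eps
    are illustrative defaults, not values from the PR.
    """
    m = [row[:] for row in matrix]  # work on a copy
    n = len(m)
    for _ in range(num_iters):
        # Rescale every row so it sums to (approximately) 1.
        for i in range(n):
            s = sum(m[i]) + eps
            m[i] = [v / s for v in m[i]]
        # Rescale every column so it sums to (approximately) 1.
        for j in range(n):
            s = sum(m[i][j] for i in range(n)) + eps
            for i in range(n):
                m[i][j] /= s
    return m


result = sinkhorn_normalize([[1.0, 2.0], [3.0, 4.0]])
row_sums = [sum(row) for row in result]
col_sums = [sum(result[i][j] for i in range(2)) for j in range(2)]
```

After enough iterations both row_sums and col_sums are close to 1, which is the property a doubly stochastic mixing matrix needs.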

Changes

DeepSeek-V4 mHC and KV Compressor Kernel Stack

Layer / File(s) Summary
Build wiring for mHC and compressor targets
cpp/tensorrt_llm/CMakeLists.txt, cpp/tensorrt_llm/kernels/CMakeLists.txt, cpp/tensorrt_llm/kernels/compressorKernels/CMakeLists.txt, cpp/tensorrt_llm/kernels/mhcKernels/CMakeLists.txt, cpp/tensorrt_llm/thop/CMakeLists.txt
Registers mhcKernels and compressorKernels as add_subdirectory entries with glob exclusions, creates each as an OBJECT library with PIC/CUDA-device-symbol/fast-math settings, links both into TRTLLM_LINK_LIBS, and adds mhcOp.cpp and compressorOp.cpp to th_common.
Compressor CUDA kernel contracts and implementation
cpp/tensorrt_llm/kernels/compressorKernels/compressorKernels.h, cpp/tensorrt_llm/kernels/compressorKernels/compressorKernels.cu
Declares pagedKvCompressLaunch, prefillReductionLaunch, and postProcessScatterLaunch plus CacheScaleType; implements decode-path pagedKvCompressKernel (online softmax over paged KV/score), prefill-path prefillReductionKernel (window-wise softmax with remainder state persistence), and fused postProcessScatterKernel (RMSNorm + RoPE + Hadamard butterfly + paged scatter for kNone/FP8/MXFP4 cache modes) with X-macro instantiation tables.
Compressor THOP op registration and Python module
cpp/tensorrt_llm/thop/compressorOp.cpp, tensorrt_llm/_torch/attention_backend/sparse/deepseek_v4/compressor.py, tensorrt_llm/_torch/attention_backend/sparse/deepseek_v4/__init__.py
compressorOp.cpp wraps the three launch functions with stream/dtype/contiguity validation and optional-tensor nullptr handling, registering them under TORCH_LIBRARY_FRAGMENT(trtllm). compressor.py defines KVCacheDtype, resolve_kv_cache_dtype, and Compressor (an nn.Module that projects tokens, dispatches prefill/decode compress ops, allocates FP8/MXFP4 quantization buffers, and calls compressor_postprocess_scatter).
mHC core CUDA kernel contracts and base kernels
cpp/tensorrt_llm/kernels/mhcKernels/mhcKernels.h, cpp/tensorrt_llm/kernels/mhcKernels/mhcKernels.cu
mhcKernels.h declares eight launcher functions across big-fuse, GEMM-sqrsum-FMA, head-apply, post-mapping, and four fused boundary variants. mhcKernels.cu implements mhcBigFuseKernel (split-K reduce + RMSnorm + sigmoid + Sinkhorn + optional kFuseNorm), mhcGemmSqrsumFmaKernel (split-N FP32 FMA GEMM with fused per-row sqrsum), mhcPostMappingKernel, mhcHcHeadApplyKernel, and all dispatch/validation wrappers.
mHC fused FMA path
cpp/tensorrt_llm/kernels/mhcKernels/mhc_fused_fma.cuh, cpp/tensorrt_llm/kernels/mhcKernels/mhcFusedHcKernel.cu (FMA sections)
mhc_fused_fma.cuh defines fused_pmap_gemm_fma_ksplit (pmap + GEMM split-K writing Yp/Rp partials) and fused_pmap_gemm_fma_allinone (single-kernel pmap + GEMM + inline bigFuse with done_counter election, Sinkhorn, and optional RMSNorm). mhcFusedHcKernel.cu adds workspace zeroing, shape validation, fused_pmap_gemm_fma_ksplit-based mhcFusedHcFmaLaunch, and mhcFusedHcFmaAllInOneLaunch with PDL dispatch.
mHC TF32/SM100 fused path
cpp/tensorrt_llm/kernels/mhcKernels/fused_tf32_pmap_gemm.cuh, cpp/tensorrt_llm/kernels/mhcKernels/mhcFusedHcKernel.cu (TF32 sections)
fused_tf32_pmap_gemm.cuh implements fused_tf32_pmap_gemm_rout_atomic_impl (TMA/SMEM/TMEM-pipelined UMMA TF32 GEMM with atomic split-K and sqr_sum reduction) and fused_allinone_tf32_pmap_gemm_atomic_impl (fuses pmap + tcgen05 GEMM + inline bigFuse with cross-split done_counter and optional fused RMSNorm). mhcFusedHcKernel.cu adds cuTensorMap LRU descriptor caching, mhcFusedHcLaunch (two-phase: tcgen05 + mhcBigFuseLaunch), and mhcFusedHcAllInOneLaunch (single kernel).
mHC THOP op registration and Python modules
cpp/tensorrt_llm/thop/mhcOp.cpp, tensorrt_llm/_torch/modules/mhc/hyper_connection.py, tensorrt_llm/_torch/modules/mhc/mhc_cuda.py, tensorrt_llm/_torch/modules/mhc/__init__.py
mhcOp.cpp registers five Torch ops under trtllm with backend-integer dispatch and optional norm_weight handling. hyper_connection.py defines HCState dataclass, mHC module (pre_mapping, fused_hc, post_mapping), and HCHead with skip_forward. mhc_cuda.py provides low-level CUDA wrappers, MhcPreMappingRunner and MhcFusedHcRunner TunableRunner subclasses with AutoTuner.choose_one() integration, _FusedHcWorkspaceCache with bounded LRU caching, and top-level mhc_pre_mapping_fused, mhc_fused_hc, and mhc_hc_head_cuda APIs.
Unit tests and benchmarks
tests/unittest/_torch/modules/test_mhc.py, tests/unittest/_torch/attention/sparse/deepseek_v4/test_compressor_tf32.py, tests/unittest/_torch/attention/sparse/deepseek_v4/__init__.py
test_mhc.py includes vanilla PyTorch reference implementations, CUDA profiling helpers, data generators, correctness tests for all mHC ops (fused-norm, all backends, realistic-scale regression, CUDA graph capture/replay), an HC-head test, a pre-mapping benchmark harness, and a session-scoped timing reporter. test_compressor_tf32.py validates BF16 GEMM precision for wkv_gate against an FP32 reference using cosine similarity and relative error thresholds.
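The compressor summary above refers to an online softmax over paged KV/score data. As a hedged illustration of that general technique (not the kernel's actual code), the sketch below computes a softmax-weighted sum one page at a time by carrying a running maximum, normalizer, and accumulator. The function names and the scalar-valued pages are assumptions chosen to keep the example small; it also assumes every score is finite.

```python
import math


def online_softmax_weighted_sum(score_pages, value_pages):
    """Softmax-weighted sum of values, consuming one page at a time.

    Keeps a running max, running normalizer, and running weighted sum, so
    earlier pages never need to be revisited. Assumes finite scores.
    """
    running_max = float("-inf")
    denom = 0.0
    acc = 0.0
    for scores, values in zip(score_pages, value_pages):
        new_max = max(running_max, max(scores))
        # Rescale the partial results accumulated under the old max.
        scale = math.exp(running_max - new_max)
        denom *= scale
        acc *= scale
        for s, v in zip(scores, values):
            w = math.exp(s - new_max)
            denom += w
            acc += w * v
        running_max = new_max
    return acc / denom


def two_pass_reference(score_pages, value_pages):
    """Reference: flatten all pages, then apply an ordinary softmax."""
    scores = [s for page in score_pages for s in page]
    values = [v for page in value_pages for v in page]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


score_pages = [[1.0, 2.0], [3.0, 0.5]]
value_pages = [[10.0, 20.0], [30.0, 40.0]]
streamed = online_softmax_weighted_sum(score_pages, value_pages)
reference = two_pass_reference(score_pages, value_pages)
```

The streamed result matches the two-pass reference up to floating-point rounding, which is why the running state can be persisted between pages or between prefill and decode steps.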

Sequence Diagram(s)

sequenceDiagram
    participant PyLayer as Model Layer (Python)
    participant mHC as mHC Module
    participant MhcFusedHcRunner
    participant mhcFusedHcLaunch as CUDA: mhcFusedHcLaunch
    participant mhcBigFuseLaunch as CUDA: mhcBigFuseLaunch
    participant Compressor as Compressor Module
    participant compressorKernels as CUDA: Compressor Kernels
    participant PagedKVCache as Paged KV Cache

    PyLayer->>mHC: fused_hc(x_prev, residual_prev, post_mix_prev, comb_mix_prev, norm_weight)
    mHC->>MhcFusedHcRunner: AutoTuner.choose_one() selects backend tactic
    MhcFusedHcRunner->>mhcFusedHcLaunch: dispatch (FMA/TF32/all-in-one) with workspaces
    mhcFusedHcLaunch->>mhcBigFuseLaunch: y_acc, r_acc -> split-K reduce + Sinkhorn + optional RMSNorm
    mhcBigFuseLaunch-->>mHC: residual_cur, post_mix_cur, comb_mix_cur, layer_input_cur

    PyLayer->>Compressor: forward(x, metadata)
    Compressor->>compressorKernels: prefillReductionLaunch (prefill tokens)
    Compressor->>compressorKernels: pagedKvCompressLaunch (decode tokens)
    compressorKernels->>PagedKVCache: update online softmax state, emit kv_comp
    Compressor->>compressorKernels: postProcessScatterLaunch (RMSNorm+RoPE+Hadamard, cache_scale_type)
    compressorKernels->>PagedKVCache: scatter normalized/quantized KV (FP8/MXFP4/BF16)
    Compressor-->>PyLayer: (kv_out or quant_output, scale_output)

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Suggested reviewers

  • schetlur-nv
  • chang-l
  • byshiue
  • kaiyux
  • xxi-nv
  • leslie-fang25
  • laikhtewari
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 39.22%, which is insufficient. The required threshold is 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Title check (✅ Passed): The title '[None][feat] DSv4 prep: compressor and mHC primitives' clearly and specifically describes the main change: adding compressor and mHC (Multi-Head Hyper-Connection) primitives as preparation for DeepSeek-V4 support.
  • Description check (✅ Passed): The PR description is comprehensive, clearly explaining what is included (compressor CUDA kernels, mHC CUDA kernels, Python wrappers, CMake wiring) and explicitly listing what is intentionally excluded. It includes detailed verification steps showing successful build, imports, tests, and scope validation.
  • Linked Issues check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): Check skipped because no linked issues were found for this pull request.

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 7

🧹 Nitpick comments (4)
tensorrt_llm/_torch/attention_backend/sparse/deepseek_v4/compressor.py (1)

56-59: 💤 Low value

Unknown string keys raise cryptic KeyError.

resolve_kv_cache_dtype will raise a bare KeyError if passed an unrecognized string. A more informative error message would help users diagnose configuration issues.

Suggested improvement
 def resolve_kv_cache_dtype(kv_cache_dtype: Union[str, KVCacheDtype]) -> KVCacheDtype:
     if isinstance(kv_cache_dtype, str):
-        return _KV_CACHE_DTYPE_MAP[kv_cache_dtype]
+        if kv_cache_dtype not in _KV_CACHE_DTYPE_MAP:
+            raise ValueError(
+                f"Unknown kv_cache_dtype: '{kv_cache_dtype}'. "
+                f"Valid options: {list(_KV_CACHE_DTYPE_MAP.keys())}"
+            )
+        return _KV_CACHE_DTYPE_MAP[kv_cache_dtype]
     return kv_cache_dtype
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/attention_backend/sparse/deepseek_v4/compressor.py`
around lines 56 - 59, The resolve_kv_cache_dtype function currently raises a
bare KeyError when given an unrecognized string key, providing no helpful
guidance to users. Add error handling around the _KV_CACHE_DTYPE_MAP dictionary
lookup in resolve_kv_cache_dtype to catch the KeyError and re-raise a more
informative error message that includes the invalid key that was provided and
optionally lists the valid/supported options from the _KV_CACHE_DTYPE_MAP
dictionary. This will help users diagnose configuration issues more easily.
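A self-contained sketch of the catch-and-re-raise variant described in the prompt above may help when trying the change in isolation. The enum members and map keys here are hypothetical placeholders, since the real KVCacheDtype and _KV_CACHE_DTYPE_MAP contents are not shown in this thread.

```python
from enum import Enum
from typing import Union


class KVCacheDtype(Enum):  # hypothetical members, for illustration only
    BF16 = "bf16"
    FP8 = "fp8"


# Hypothetical map; the real keys live in compressor.py.
_KV_CACHE_DTYPE_MAP = {member.value: member for member in KVCacheDtype}


def resolve_kv_cache_dtype(kv_cache_dtype: Union[str, KVCacheDtype]) -> KVCacheDtype:
    """Map a string to KVCacheDtype, naming the bad key and valid options."""
    if isinstance(kv_cache_dtype, str):
        try:
            return _KV_CACHE_DTYPE_MAP[kv_cache_dtype]
        except KeyError:
            valid = ", ".join(sorted(_KV_CACHE_DTYPE_MAP))
            raise ValueError(
                f"Unknown kv_cache_dtype: {kv_cache_dtype!r}. Valid options: {valid}"
            ) from None
    return kv_cache_dtype


try:
    resolve_kv_cache_dtype("bogus")
    error_message = ""
except ValueError as exc:
    error_message = str(exc)
```

Using from None suppresses the chained KeyError so the traceback shows only the descriptive message.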
tensorrt_llm/_torch/modules/mhc/mhc_cuda.py (1)

150-150: ⚡ Quick win

Consider raising a descriptive error instead of assertion.

The assertion guards against calling the function when DeepGEMM is unavailable, but assertions are removed in optimized Python (-O). A clear RuntimeError would be more robust.

♻️ Suggested refactor
-    assert dg_fn is not None, "DeepGEMM is not available"
+    if dg_fn is None:
+        raise RuntimeError(
+            "DeepGEMM is not available. Install deep_gemm or use FMA backend."
+        )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/modules/mhc/mhc_cuda.py` at line 150, The assertion at
line 150 checking if dg_fn is not None can be removed when Python runs in
optimized mode (-O flag), making the guard ineffective. Replace the assert
statement with an explicit if condition that raises a RuntimeError instead,
maintaining the descriptive error message "DeepGEMM is not available" so the
check is always enforced regardless of Python optimization settings.
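The claim that assertions vanish under -O can be checked without restarting the interpreter, because the built-in compile() accepts an optimize argument with the same effect. The sketch below is a standalone demonstration; the guarded function is a stand-in written for this example, not the code in mhc_cuda.py.

```python
SOURCE = '''
def guarded(dg_fn):
    assert dg_fn is not None, "DeepGEMM is not available"
    return "called"
'''


def load_guarded(optimize):
    """Compile SOURCE at the given optimization level and return guarded()."""
    namespace = {}
    code = compile(SOURCE, "<sketch>", "exec", optimize=optimize)
    exec(code, namespace)
    return namespace["guarded"]


def raises_assertion(fn):
    """Return True if calling fn(None) trips the assert."""
    try:
        fn(None)
    except AssertionError:
        return True
    return False


checked = raises_assertion(load_guarded(optimize=0))   # assert is kept
stripped = raises_assertion(load_guarded(optimize=1))  # same effect as python -O
```

Here checked is True and stripped is False: at optimization level 1 the guard is gone and the function proceeds with dg_fn set to None, which is the failure mode an explicit RuntimeError avoids.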
tests/unittest/_torch/modules/test_mhc.py (1)

959-961: ⚡ Quick win

Add strict=True to zip for safety.

Python 3.10+ supports zip(..., strict=True), which raises ValueError if the iterables have different lengths. This catches bugs where the returned tuple doesn't match the expected structure.

♻️ Suggested fix
-    for ge, ee, name in zip(
-        graph_out, eager_out, ["residual", "post_mix", "comb_mix", "layer_input"]
-    ):
+    for ge, ee, name in zip(
+        graph_out, eager_out, ["residual", "post_mix", "comb_mix", "layer_input"], strict=True
+    ):

Apply the same change to line 976.

Also applies to: 976-978

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/modules/test_mhc.py` around lines 959 - 961, The zip()
function calls in the test file lack the strict=True parameter, which would help
catch bugs where iterables have different lengths. In
tests/unittest/_torch/modules/test_mhc.py at lines 959-961 (the zip call with
graph_out, eager_out, and the name list), add strict=True as a parameter to the
zip() function to enable length validation. Apply the identical change at lines
976-978 where another zip() call appears that needs the same strict=True
parameter added.
tensorrt_llm/_torch/modules/mhc/hyper_connection.py (1)

114-116: ⚡ Quick win

Consider replacing assertions with explicit ValueError for dtype/shape checks.

Assertions are removed when Python runs with -O (optimized mode), which can lead to silent failures in production. Explicit ValueError or TypeError is more robust for runtime validation in forward methods.

♻️ Suggested refactor
-    assert x.dtype == torch.bfloat16
-    assert self.mult == x.shape[-2]
-    assert self.hidden_size == x.shape[-1]
+    if x.dtype != torch.bfloat16:
+        raise TypeError(f"pre_mapping requires bfloat16 input, got {x.dtype}")
+    if self.mult != x.shape[-2]:
+        raise ValueError(f"Expected shape[-2]={self.mult}, got {x.shape[-2]}")
+    if self.hidden_size != x.shape[-1]:
+        raise ValueError(f"Expected shape[-1]={self.hidden_size}, got {x.shape[-1]}")

Apply the same pattern to lines 188-189 in fused_hc.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/modules/mhc/hyper_connection.py` around lines 114 - 116,
Replace the assertions that validate dtype and shape in the forward method
(checking x.dtype against torch.bfloat16, self.mult against x.shape[-2], and
self.hidden_size against x.shape[-1]) with explicit ValueError or TypeError
raises instead, since assertions are stripped when Python runs with -O flag and
will silently fail in production. Apply the same pattern to the assertions at
lines 188-189 in the fused_hc method to ensure consistent runtime validation
across both locations.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/tensorrt_llm/kernels/compressorKernels/compressorKernels.cu`:
- Around line 162-176: The packE2M1x2 function silently returns 0 on pre-SM100
architectures instead of failing, which causes data corruption when
kMXFP4Blockwise mode is enabled on unsupported GPUs. Add a runtime architecture
validation in postProcessScatterLaunch (around line 1692) that rejects
kMXFP4Blockwise mode if the GPU compute capability is less than SM100 (major
version less than 10). Use TLLM_CHECK_WITH_INFO to perform this validation and
report the actual GPU architecture in the error message so users know why their
request was rejected. This prevents silent data corruption by catching the
incompatibility at runtime rather than letting packE2M1x2 silently degrade.

In `@cpp/tensorrt_llm/kernels/mhcKernels/mhcFusedHcKernel.cu`:
- Around line 530-542: The FMA kernel launchers use tile_n and tile_m parameters
in modulo and division operations without first validating that these values are
positive, which can cause host crashes or undefined behavior if they are zero or
negative. Add TLLM_CHECK_WITH_INFO validation calls to ensure tile_n is greater
than zero before the modulo operation FHC_SHAPE_N % tile_n (around line 534),
and ensure tile_m is greater than zero before the division operation m_batches =
... / tile_m (around line 804). Apply the same validation checks to the second
FMA launcher function in the consolidated_sites range (lines 795-820) to prevent
the same crash paths in both implementations.

In `@cpp/tensorrt_llm/kernels/mhcKernels/mhcKernels.cu`:
- Around line 867-875: The mhcHcHeadApplyLaunch wrapper function accepts a
runtime parameter mult without validation, but the kernel mhcHcHeadApplyKernel
only allocates s_pre[8] in shared memory. If mult exceeds 8, it causes
out-of-bounds memory writes and reads. Add a validation check in
mhcHcHeadApplyLaunch before the kernel launch statement to ensure mult does not
exceed 8, either through an assertion, error return, or exception to prevent
invalid kernel behavior.
- Around line 847-863: The default case in the switch statement for tileN in
mhcGemmSqrsumFmaLaunch incorrectly maps all unrecognized tileN values to the
TN=24 kernel, which causes out-of-bounds writes when tileN is unsupported (e.g.,
tileN=16). Add a validation check using TLLM_CHECK_WITH_INFO after the existing
divisibility check on line 849 to explicitly reject unsupported tileN values,
ensuring only the valid cases (1, 2, 3, 4, 6, 8, 12, and 24) are allowed to
proceed to the switch statement.
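The two mhcKernels.cu findings above both come down to validating launch parameters before they reach the kernel. The Python sketch below mirrors that logic only to make the conditions explicit; the real fix would be written in C++ with TLLM_CHECK_WITH_INFO. The function names are hypothetical, and the assumption that the existing divisibility check is n % tile_n is a guess, since the checked quantity is not shown here.

```python
SUPPORTED_TILE_N = (1, 2, 3, 4, 6, 8, 12, 24)  # cases listed in the review comment
MAX_MULT = 8  # size of the s_pre[8] shared-memory array named in the review


def validate_tile_n(n, tile_n):
    """Reject tile sizes with no matching kernel instead of defaulting to 24."""
    if tile_n not in SUPPORTED_TILE_N:
        raise ValueError(
            f"Unsupported tileN={tile_n}; expected one of {SUPPORTED_TILE_N}"
        )
    if n % tile_n != 0:  # assumed form of the existing divisibility check
        raise ValueError(f"n={n} is not divisible by tileN={tile_n}")
    return tile_n


def validate_mult(mult):
    """Reject mult values that would overrun the fixed-size shared array."""
    if not 0 < mult <= MAX_MULT:
        raise ValueError(f"mult={mult} must be in the range 1..{MAX_MULT}")
    return mult


def is_rejected(check, *args):
    try:
        check(*args)
    except ValueError:
        return True
    return False


# 16 divides 48, so divisibility alone would not catch the unsupported tile.
tile_16_rejected = is_rejected(validate_tile_n, 48, 16)
mult_9_rejected = is_rejected(validate_mult, 9)
```

The tileN=16 case shows why the membership test is needed in addition to divisibility: it passes the modulo check yet has no dedicated kernel.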

In `@cpp/tensorrt_llm/thop/compressorOp.cpp`:
- Around line 78-118: The compressorPostProcessScatterOp function is missing
contiguity validation checks for several input tensors that the kernel expects
to have contiguous memory layout. Add TORCH_CHECK statements after the existing
position_ids contiguity check (around line 100) to validate that kv_comp,
rms_weight, block_offsets, and kv_cache are all contiguous, using the same
pattern as the existing cos_sin_table and position_ids checks with the
is_contiguous() method and appropriate error messages identifying each tensor.

In `@cpp/tensorrt_llm/thop/mhcOp.cpp`:
- Around line 94-100: The norm_weight tensor validation in the mhc_fused_hc
function is missing device checks, which could allow CPU or different-device
tensors to be passed to the CUDA kernel, causing illegal memory access. Add
TORCH_CHECK validations to ensure that norm_weight is a CUDA tensor and is on
the same device as the other tensors being used in the operation. Insert these
device checks in the existing norm_weight validation block (after checking
dtype, contiguity, and numel) to verify the tensor is on a CUDA device before
dereferencing its pointer.
- Around line 103-155: The code currently handles backend values 3, 2, and 1
with explicit if conditions, but any other value silently falls through to the
final mhcFusedHcLaunch call (backend 0 fallback), which masks bugs like typos or
autotuner errors. Add a guard after the if conditions to reject unknown backend
values (anything other than 0, 1, 2, or 3) by throwing an error or asserting,
before reaching the mhcFusedHcLaunch call. This ensures invalid backend values
fail fast instead of silently executing the wrong kernel.
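The fail-fast dispatch requested in the last finding can be illustrated with a small lookup table. This is a Python sketch of the control flow only; the launcher labels are placeholders, because the mapping from backend integers 1 to 3 onto specific launch functions is not spelled out in this thread.

```python
# Placeholder labels; only backend 0 -> mhcFusedHcLaunch is stated in the review.
LAUNCHERS = {
    0: "mhcFusedHcLaunch",
    1: "launcher_for_backend_1",
    2: "launcher_for_backend_2",
    3: "launcher_for_backend_3",
}


def select_launcher_fallthrough(backend):
    """Current behavior as described: unknown values land on backend 0."""
    if backend == 3:
        return LAUNCHERS[3]
    if backend == 2:
        return LAUNCHERS[2]
    if backend == 1:
        return LAUNCHERS[1]
    return LAUNCHERS[0]


def select_launcher_strict(backend):
    """Proposed behavior: reject anything outside the known set."""
    if backend not in LAUNCHERS:
        raise ValueError(
            f"Unsupported mHC backend {backend}; expected one of {sorted(LAUNCHERS)}"
        )
    return LAUNCHERS[backend]


masked = select_launcher_fallthrough(7)  # silently runs the backend-0 path

try:
    select_launcher_strict(7)
    strict_rejected = False
except ValueError:
    strict_rejected = True
```

With the fall-through version a typo such as backend=7 still produces output, just from the wrong kernel; the strict version surfaces the mistake at the call site.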

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e0251fc7-d3e0-4b27-b4fc-d3a7189d27f1

📥 Commits

Reviewing files that changed from the base of the PR and between 20b6068 and 8b66876.

📒 Files selected for processing (23)
  • cpp/tensorrt_llm/CMakeLists.txt
  • cpp/tensorrt_llm/kernels/CMakeLists.txt
  • cpp/tensorrt_llm/kernels/compressorKernels/CMakeLists.txt
  • cpp/tensorrt_llm/kernels/compressorKernels/compressorKernels.cu
  • cpp/tensorrt_llm/kernels/compressorKernels/compressorKernels.h
  • cpp/tensorrt_llm/kernels/mhcKernels/CMakeLists.txt
  • cpp/tensorrt_llm/kernels/mhcKernels/fused_tf32_pmap_gemm.cuh
  • cpp/tensorrt_llm/kernels/mhcKernels/mhcFusedHcKernel.cu
  • cpp/tensorrt_llm/kernels/mhcKernels/mhcKernels.cu
  • cpp/tensorrt_llm/kernels/mhcKernels/mhcKernels.h
  • cpp/tensorrt_llm/kernels/mhcKernels/mhc_fused_fma.cuh
  • cpp/tensorrt_llm/thop/CMakeLists.txt
  • cpp/tensorrt_llm/thop/compressorOp.cpp
  • cpp/tensorrt_llm/thop/mhcOp.cpp
  • tensorrt_llm/_torch/attention_backend/sparse/deepseek_v4/__init__.py
  • tensorrt_llm/_torch/attention_backend/sparse/deepseek_v4/compressor.py
  • tensorrt_llm/_torch/modules/mhc/__init__.py
  • tensorrt_llm/_torch/modules/mhc/hyper_connection.py
  • tensorrt_llm/_torch/modules/mhc/mhc_cuda.py
  • tests/unittest/_torch/attention/sparse/deepseek_v4/__init__.py
  • tests/unittest/_torch/attention/sparse/deepseek_v4/test_compressor_kernel.py
  • tests/unittest/_torch/attention/sparse/deepseek_v4/test_compressor_tf32.py
  • tests/unittest/_torch/modules/test_mhc.py

Comment thread cpp/tensorrt_llm/kernels/compressorKernels/compressorKernels.cu
Comment thread cpp/tensorrt_llm/kernels/mhcKernels/mhcFusedHcKernel.cu
Comment thread cpp/tensorrt_llm/kernels/mhcKernels/mhcKernels.cu
Comment thread cpp/tensorrt_llm/kernels/mhcKernels/mhcKernels.cu
Comment thread cpp/tensorrt_llm/thop/compressorOp.cpp
Comment thread cpp/tensorrt_llm/thop/mhcOp.cpp
Comment thread cpp/tensorrt_llm/thop/mhcOp.cpp
@lfr-0531

Collaborator Author

/bot run --disable-fail-fast

@lfr-0531 lfr-0531 requested a review from mingyangHao June 15, 2026 16:01
@tensorrt-cicd

Collaborator

PR_Github #54321 [ run ] triggered by Bot. Commit: 8b66876 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54321 [ run ] completed with state SUCCESS. Commit: 8b66876
/LLM/main/L0_MergeRequest_PR pipeline #43391 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@longlee0622

Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #54397 [ run ] triggered by Bot. Commit: 8b66876 Link to invocation

@lfr-0531 lfr-0531 force-pushed the user/fanrongl/dsv4-compressor-mhc branch from 09f114d to 8b66876 Compare June 16, 2026 01:41
@lfr-0531

Collaborator Author

/bot kill

@lfr-0531

Collaborator Author

/bot run --disable-fail-fast

Comment thread tests/unittest/_torch/modules/test_mhc.py
Comment thread cpp/tensorrt_llm/kernels/CMakeLists.txt
Comment thread cpp/tensorrt_llm/kernels/compressorKernels/compressorKernels.cu Outdated
@tensorrt-cicd

Collaborator

PR_Github #54411 [ run ] triggered by Bot. Commit: e72c27b Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54412 [ kill ] triggered by Bot. Commit: e72c27b Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54411 [ run ] completed with state ABORTED. Commit: e72c27b

Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54397 [ run ] completed with state ABORTED. Commit: 8b66876

Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54412 [ kill ] completed with state SUCCESS. Commit: e72c27b
Successfully killed previous jobs for commit e72c27b

Link to invocation

Signed-off-by: Mingyang Hao <200044211+mingyangHao@users.noreply.github.com>
Signed-off-by: Mingyang Hao <200044211+mingyangHao@users.noreply.github.com>
@mingyangHao mingyangHao force-pushed the user/fanrongl/dsv4-compressor-mhc branch from c75b5fc to 6e4e004 Compare June 16, 2026 04:45
@mingyangHao

Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #54516 [ run ] completed with state SUCCESS. Commit: 6e4e004
/LLM/main/L0_MergeRequest_PR pipeline #43576 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@lfr-0531

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #54601 [ run ] triggered by Bot. Commit: 6e4e004 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54601 [ run ] completed with state SUCCESS. Commit: 6e4e004
/LLM/main/L0_MergeRequest_PR pipeline #43641 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@lfr-0531

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #54696 [ run ] triggered by Bot. Commit: 6debcbd Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54696 [ run ] completed with state FAILURE. Commit: 6debcbd
/LLM/main/L0_MergeRequest_PR pipeline #43727 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@lfr-0531

Collaborator Author

/bot run

@tensorrt-cicd

Collaborator

PR_Github #54779 [ run ] triggered by Bot. Commit: 6debcbd Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54779 [ run ] completed with state SUCCESS. Commit: 6debcbd
/LLM/main/L0_MergeRequest_PR pipeline #43797 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@lfr-0531

Collaborator Author

/bot run

@tensorrt-cicd

Collaborator

PR_Github #54811 [ run ] triggered by Bot. Commit: 6debcbd Link to invocation

@lfr-0531 lfr-0531 requested review from yuxianq and removed request for PerkzZheng June 17, 2026 09:00
@tensorrt-cicd

Collaborator

PR_Github #54811 [ run ] completed with state SUCCESS. Commit: 6debcbd
/LLM/main/L0_MergeRequest_PR pipeline #43826 completed with status: 'SUCCESS'

CI Report

Link to invocation

@lfr-0531 lfr-0531 enabled auto-merge (squash) June 17, 2026 11:36

TORCH_LIBRARY_IMPL(trtllm, CUDA, m)
{
m.impl("mhc_big_fuse", &mhcBigFuseOp);

Collaborator

Should we add register_fake for all ops to tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py?

Collaborator Author

All the newly added custom ops are in-place, so there's no need to add them to register_fake. I just updated the inplace_map in tensorrt_llm/_torch/compilation/utils.py.


TORCH_LIBRARY_IMPL(trtllm, CUDA, m)
{
m.impl("compressor_paged_kv_compress", &compressorPagedKvCompressOp);

Collaborator

Should we add register_fake for all ops to tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py?

Collaborator Author

Same as above.

Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
@lfr-0531 lfr-0531 requested a review from a team as a code owner June 23, 2026 11:40
@lfr-0531 lfr-0531 requested a review from hyukn June 23, 2026 11:40
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
@lfr-0531

Collaborator Author

/bot run

@tensorrt-cicd

Collaborator

PR_Github #55244 [ run ] triggered by Bot. Commit: 31b1c4b Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #55244 [ run ] completed with state SUCCESS. Commit: 31b1c4b
/LLM/main/L0_MergeRequest_PR pipeline #44202 completed with status: 'SUCCESS'

CI Report

Link to invocation

7 participants