[None][feat] DSv4 prep: compressor and mHC primitives by lfr-0531 · Pull Request #15379 · NVIDIA/TensorRT-LLM

lfr-0531 · 2026-06-15T14:56:18Z

Description

This is PR-2 from the DSv4 umbrella split. It lands standalone compressor and mHC primitives from #14751 without pulling in the DSv4 sparse cache manager, sparse MLA backend, MoE routing, or model/tokenizer changes.

Included:

compressor CUDA kernels, Python wrapper, thop registration, and kernel/TF32 unit tests
mHC CUDA kernels, Python wrapper, thop registration, and unit tests
minimal CMake/thop build wiring for compressor and mHC only
header-only deepseek_v4/__init__.py so standalone compressor import does not import the full DSv4 backend

Intentionally excluded:

test_compressor_module.py and cache_manager.py (PR-6)
sparse MLA backend implementation (PR-7)
IndexerTopK/TopK (PR-3)
attention fusion/custom ops beyond compressor/mHC (PR-4/PR-5)
MoE/routing and DSv4 model/tokenizer/API changes (PR-8/PR-9)

Verification

Base/source:

Restored from the latest head of #14751 ([None][feat] Support DeepSeek-V4 model): aea4ae426b619bbc1e8411e8028de7fc77747664.
Branch created from github/main 20b606838773ed96194a20f37b54441f4281e4d0.

Build/install:

python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt --benchmarks --use_ccache --cuda_architectures "90-real;100-real" --configure_cmake
Built wheel: build/tensorrt_llm-1.3.0rc18-cp312-cp312-linux_x86_64.whl
Build logs confirmed mhcOp.cpp and compressorOp.cpp compiled into th_common.
python -m pip install --force-reinstall --no-deps build/tensorrt_llm-1.3.0rc18-cp312-cp312-linux_x86_64.whl
Note: attribution generation warned that ninja -t inputs returned no results for wheel targets, but the wheel build exited with status 0.

Import/custom-op smoke:

from tensorrt_llm.bindings.internal import thop succeeded.
deepseek_v4.compressor was imported from this worktree path.
cache_manager and backend deepseek_v4.py modules were not loaded by compressor import.
torch.ops.trtllm exposed:
- compressor_prefill_reduction
- compressor_paged_kv_compress
- compressor_postprocess_scatter
- mhc_big_fuse
- mhc_gemm_sqrsum_fma
- mhc_post_mapping
- mhc_fused_hc
- mhc_hc_head_apply
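The op list above can be turned into a repeatable smoke check. The following is a minimal sketch, not the check the author ran: `missing_ops` and `fake_namespace` are hypothetical names, and the stand-in namespace exists only so the example runs without a TensorRT-LLM build.

```python
from types import SimpleNamespace

# Op names copied from the list above.
EXPECTED_OPS = [
    "compressor_prefill_reduction",
    "compressor_paged_kv_compress",
    "compressor_postprocess_scatter",
    "mhc_big_fuse",
    "mhc_gemm_sqrsum_fma",
    "mhc_post_mapping",
    "mhc_fused_hc",
    "mhc_hc_head_apply",
]


def missing_ops(namespace, names=EXPECTED_OPS):
    """Return the op names that are not attributes of `namespace`."""
    return [name for name in names if not hasattr(namespace, name)]


# Hypothetical stand-in that deliberately omits the last op.
fake_namespace = SimpleNamespace(**{name: object() for name in EXPECTED_OPS[:-1]})
print(missing_ops(fake_namespace))  # ['mhc_hc_head_apply']
```

With a real build, the same function could presumably be called as `missing_ops(torch.ops.trtllm)` after importing `tensorrt_llm`, on the assumption that a missing op surfaces as an `AttributeError` on the namespace.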

Tests:

Checked GPU state before each test: all 8 B300 GPUs were idle, and the tests ran with CUDA_VISIBLE_DEVICES=0.
timeout 1200 python -m pytest -q --tb=short -ra tests/unittest/_torch/attention/sparse/deepseek_v4/test_compressor_kernel.py: 63 passed, 22 skipped, 3 warnings in 4.41s
timeout 1200 python -m pytest -q --tb=short -ra tests/unittest/_torch/attention/sparse/deepseek_v4/test_compressor_tf32.py: 4 passed, 2 warnings in 2.53s
timeout 1800 python -m pytest -q --tb=short -ra tests/unittest/_torch/modules/test_mhc.py: 50 passed, 3 warnings in 4.16s

Scope/pre-commit:

pre-commit run --files $(git diff --name-only HEAD) passed before commit; commit hooks passed.
Scope check returned no output:
git diff --name-only HEAD | rg 'cache_manager.py|attention_backend/sparse/deepseek_v4/deepseek_v4.py|IndexerTopK|indexerTopK|RoutingKernelTopK|modeling_deepseekv4|tokenizer/deepseek_v4|fused_moe|moeGate|deepseekV4QNorm|fp8Quantize.cpp|mlaRopeInplaceOp|triton_fused_inv_rope'
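The scope check above is a pattern filter over the changed file names. A rough Python equivalent is sketched below, assuming the file list has already been collected; `out_of_scope` is a hypothetical helper, and the last sample path is made up to show a hit. Unlike the `rg` pattern, this version escapes each entry, so dots match literally.

```python
import re

# Patterns copied from the rg command above, matched as literal substrings.
EXCLUDED_PATTERNS = [
    "cache_manager.py",
    "attention_backend/sparse/deepseek_v4/deepseek_v4.py",
    "IndexerTopK",
    "indexerTopK",
    "RoutingKernelTopK",
    "modeling_deepseekv4",
    "tokenizer/deepseek_v4",
    "fused_moe",
    "moeGate",
    "deepseekV4QNorm",
    "fp8Quantize.cpp",
    "mlaRopeInplaceOp",
    "triton_fused_inv_rope",
]
EXCLUDED_RE = re.compile("|".join(re.escape(p) for p in EXCLUDED_PATTERNS))


def out_of_scope(paths):
    """Return the paths that match any excluded pattern."""
    return [path for path in paths if EXCLUDED_RE.search(path)]


changed = [
    "cpp/tensorrt_llm/thop/mhcOp.cpp",
    "tensorrt_llm/_torch/modules/mhc/mhc_cuda.py",
    # Hypothetical path, included only to demonstrate a match.
    "some/dir/cache_manager.py",
]
print(out_of_scope(changed))  # ['some/dir/cache_manager.py']
```

An empty result corresponds to the "returned no output" outcome reported above.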

Summary by CodeRabbit

New Features
- Added KV cache compression support for DeepSeek V4 with multiple compression strategies for improved inference efficiency.
- Implemented Manifold-Constrained Hyper-Connection (mHC) mechanism with optimized CUDA kernels for enhanced model performance.
- Added PyTorch integration layers for both compression and mHC operations with automatic kernel selection.
Tests
- Added comprehensive test suites validating compression and mHC functionality against reference implementations.
- Added profiling and regression tests to ensure correctness and detect performance issues.

Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

coderabbitai · 2026-06-15T15:14:39Z

📝 Walkthrough

Walkthrough

Adds two new CUDA kernel libraries for DeepSeek-V4: mhcKernels (Manifold-Constrained Hyper-Connection boundary ops with FMA and SM100 TF32/UMMA fused paths, autotuning, and Sinkhorn normalization) and compressorKernels (KV-cache compression for prefill and decode with FP8/MXFP4 postprocess-scatter). Both are built as CMake OBJECT libraries, registered as Torch custom ops via THOP, wrapped in Python nn.Module classes, and covered by unit tests and benchmarks.

Changes

DeepSeek-V4 mHC and KV Compressor Kernel Stack

Layer / File(s)	Summary
Build wiring for mHC and compressor targets `cpp/tensorrt_llm/CMakeLists.txt`, `cpp/tensorrt_llm/kernels/CMakeLists.txt`, `cpp/tensorrt_llm/kernels/compressorKernels/CMakeLists.txt`, `cpp/tensorrt_llm/kernels/mhcKernels/CMakeLists.txt`, `cpp/tensorrt_llm/thop/CMakeLists.txt`	Registers `mhcKernels` and `compressorKernels` as `add_subdirectory` entries with glob exclusions, creates each as an `OBJECT` library with PIC/CUDA-device-symbol/fast-math settings, links both into `TRTLLM_LINK_LIBS`, and adds `mhcOp.cpp` and `compressorOp.cpp` to `th_common`.
Compressor CUDA kernel contracts and implementation `cpp/tensorrt_llm/kernels/compressorKernels/compressorKernels.h`, `cpp/tensorrt_llm/kernels/compressorKernels/compressorKernels.cu`	Declares `pagedKvCompressLaunch`, `prefillReductionLaunch`, and `postProcessScatterLaunch` plus `CacheScaleType`; implements decode-path `pagedKvCompressKernel` (online softmax over paged KV/score), prefill-path `prefillReductionKernel` (window-wise softmax with remainder state persistence), and fused `postProcessScatterKernel` (RMSNorm + RoPE + Hadamard butterfly + paged scatter for kNone/FP8/MXFP4 cache modes) with X-macro instantiation tables.
Compressor THOP op registration and Python module `cpp/tensorrt_llm/thop/compressorOp.cpp`, `tensorrt_llm/_torch/attention_backend/sparse/deepseek_v4/compressor.py`, `tensorrt_llm/_torch/attention_backend/sparse/deepseek_v4/__init__.py`	`compressorOp.cpp` wraps the three launch functions with stream/dtype/contiguity validation and optional-tensor `nullptr` handling, registering them under `TORCH_LIBRARY_FRAGMENT(trtllm)`. `compressor.py` defines `KVCacheDtype`, `resolve_kv_cache_dtype`, and `Compressor` (an `nn.Module` that projects tokens, dispatches prefill/decode compress ops, allocates FP8/MXFP4 quantization buffers, and calls `compressor_postprocess_scatter`).
mHC core CUDA kernel contracts and base kernels `cpp/tensorrt_llm/kernels/mhcKernels/mhcKernels.h`, `cpp/tensorrt_llm/kernels/mhcKernels/mhcKernels.cu`	`mhcKernels.h` declares eight launcher functions across big-fuse, GEMM-sqrsum-FMA, head-apply, post-mapping, and four fused boundary variants. `mhcKernels.cu` implements `mhcBigFuseKernel` (split-K reduce + RMSnorm + sigmoid + Sinkhorn + optional `kFuseNorm`), `mhcGemmSqrsumFmaKernel` (split-N FP32 FMA GEMM with fused per-row sqrsum), `mhcPostMappingKernel`, `mhcHcHeadApplyKernel`, and all dispatch/validation wrappers.
mHC fused FMA path `cpp/tensorrt_llm/kernels/mhcKernels/mhc_fused_fma.cuh`, `cpp/tensorrt_llm/kernels/mhcKernels/mhcFusedHcKernel.cu` (FMA sections)	`mhc_fused_fma.cuh` defines `fused_pmap_gemm_fma_ksplit` (pmap + GEMM split-K writing `Yp`/`Rp` partials) and `fused_pmap_gemm_fma_allinone` (single-kernel pmap + GEMM + inline bigFuse with `done_counter` election, Sinkhorn, and optional RMSNorm). `mhcFusedHcKernel.cu` adds workspace zeroing, shape validation, `fused_pmap_gemm_fma_ksplit`-based `mhcFusedHcFmaLaunch`, and `mhcFusedHcFmaAllInOneLaunch` with PDL dispatch.
mHC TF32/SM100 fused path `cpp/tensorrt_llm/kernels/mhcKernels/fused_tf32_pmap_gemm.cuh`, `cpp/tensorrt_llm/kernels/mhcKernels/mhcFusedHcKernel.cu` (TF32 sections)	`fused_tf32_pmap_gemm.cuh` implements `fused_tf32_pmap_gemm_rout_atomic_impl` (TMA/SMEM/TMEM-pipelined UMMA TF32 GEMM with atomic split-K and sqr_sum reduction) and `fused_allinone_tf32_pmap_gemm_atomic_impl` (fuses pmap + tcgen05 GEMM + inline bigFuse with cross-split `done_counter` and optional fused RMSNorm). `mhcFusedHcKernel.cu` adds cuTensorMap LRU descriptor caching, `mhcFusedHcLaunch` (two-phase: tcgen05 + `mhcBigFuseLaunch`), and `mhcFusedHcAllInOneLaunch` (single kernel).
mHC THOP op registration and Python modules `cpp/tensorrt_llm/thop/mhcOp.cpp`, `tensorrt_llm/_torch/modules/mhc/hyper_connection.py`, `tensorrt_llm/_torch/modules/mhc/mhc_cuda.py`, `tensorrt_llm/_torch/modules/mhc/__init__.py`	`mhcOp.cpp` registers five Torch ops under `trtllm` with backend-integer dispatch and optional `norm_weight` handling. `hyper_connection.py` defines `HCState` dataclass, `mHC` module (`pre_mapping`, `fused_hc`, `post_mapping`), and `HCHead` with `skip_forward`. `mhc_cuda.py` provides low-level CUDA wrappers, `MhcPreMappingRunner` and `MhcFusedHcRunner` `TunableRunner` subclasses with `AutoTuner.choose_one()` integration, `_FusedHcWorkspaceCache` with bounded LRU caching, and top-level `mhc_pre_mapping_fused`, `mhc_fused_hc`, and `mhc_hc_head_cuda` APIs.
Unit tests and benchmarks `tests/unittest/_torch/modules/test_mhc.py`, `tests/unittest/_torch/attention/sparse/deepseek_v4/test_compressor_tf32.py`, `tests/unittest/_torch/attention/sparse/deepseek_v4/__init__.py`	`test_mhc.py` includes vanilla PyTorch reference implementations, CUDA profiling helpers, data generators, correctness tests for all mHC ops (fused-norm, all backends, realistic-scale regression, CUDA graph capture/replay), an HC-head test, a pre-mapping benchmark harness, and a session-scoped timing reporter. `test_compressor_tf32.py` validates BF16 GEMM precision for `wkv_gate` against an FP32 reference using cosine similarity and relative error thresholds.
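The compressor rows above mention an online softmax over KV and score values. As a reference for that general technique only, here is a minimal pure-Python sketch of a streaming softmax-weighted sum. It is not the kernel's implementation: it assumes one finite scalar score per entry, whereas the kernel's actual score layout is not described in this thread, and the function names are hypothetical.

```python
import math


def online_softmax_pool(scores, values):
    """Softmax-weighted sum of `values`, computed in a single streaming pass."""
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * len(values[0])
    for score, value in zip(scores, values):
        new_max = max(running_max, score)
        # Rescale earlier partial sums to the new maximum (exp(-inf) is 0.0).
        scale = math.exp(running_max - new_max)
        weight = math.exp(score - new_max)
        denom = denom * scale + weight
        acc = [a * scale + weight * x for a, x in zip(acc, value)]
        running_max = new_max
    return [a / denom for a in acc]


def two_pass_softmax_pool(scores, values):
    """Reference version that sees all scores before weighting."""
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [
        sum(w * v[i] for w, v in zip(weights, values)) / total
        for i in range(len(values[0]))
    ]


scores = [0.5, 3.0, -1.0, 2.0]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, -1.0]]
print(online_softmax_pool(scores, values))
print(two_pass_softmax_pool(scores, values))
```

The streaming form only needs the running maximum, denominator, and accumulator between steps, which is the kind of state a chunked or paged pass has to carry across boundaries.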

Sequence Diagram(s)

sequenceDiagram
    participant PyLayer as Model Layer (Python)
    participant mHC as mHC Module
    participant MhcFusedHcRunner
    participant mhcFusedHcLaunch as CUDA: mhcFusedHcLaunch
    participant mhcBigFuseLaunch as CUDA: mhcBigFuseLaunch
    participant Compressor as Compressor Module
    participant compressorKernels as CUDA: Compressor Kernels
    participant PagedKVCache as Paged KV Cache

    PyLayer->>mHC: fused_hc(x_prev, residual_prev, post_mix_prev, comb_mix_prev, norm_weight)
    mHC->>MhcFusedHcRunner: AutoTuner.choose_one() selects backend tactic
    MhcFusedHcRunner->>mhcFusedHcLaunch: dispatch (FMA/TF32/all-in-one) with workspaces
    mhcFusedHcLaunch->>mhcBigFuseLaunch: y_acc, r_acc -> split-K reduce + Sinkhorn + optional RMSNorm
    mhcBigFuseLaunch-->>mHC: residual_cur, post_mix_cur, comb_mix_cur, layer_input_cur

    PyLayer->>Compressor: forward(x, metadata)
    Compressor->>compressorKernels: prefillReductionLaunch (prefill tokens)
    Compressor->>compressorKernels: pagedKvCompressLaunch (decode tokens)
    compressorKernels->>PagedKVCache: update online softmax state, emit kv_comp
    Compressor->>compressorKernels: postProcessScatterLaunch (RMSNorm+RoPE+Hadamard, cache_scale_type)
    compressorKernels->>PagedKVCache: scatter normalized/quantized KV (FP8/MXFP4/BF16)
    Compressor-->>PyLayer: (kv_out or quant_output, scale_output)
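The diagram lists Sinkhorn as one step of the big-fuse stage. For readers unfamiliar with it, the sketch below shows the textbook Sinkhorn iteration (alternating row and column normalization of a positive matrix) in plain Python. The iteration count, the input matrix, and the function name are assumptions for illustration; how the kernel builds its input or how many iterations it runs is not stated in this thread.

```python
def sinkhorn(matrix, iterations=20):
    """Alternately normalize rows and columns of a strictly positive matrix."""
    m = [row[:] for row in matrix]
    for _ in range(iterations):
        # Row step: every row sums to 1.
        m = [[x / sum(row) for x in row] for row in m]
        # Column step: every column sums to 1.
        col_sums = [sum(col) for col in zip(*m)]
        m = [[x / c for x, c in zip(row, col_sums)] for row in m]
    return m


result = sinkhorn([[1.0, 2.0], [3.0, 4.0]])
print([sum(row) for row in result])        # row sums, close to 1
print([sum(col) for col in zip(*result)])  # column sums, close to 1
```

After enough iterations the matrix is approximately doubly stochastic: both row and column sums approach 1.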

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Suggested reviewers

schetlur-nv
chang-l
byshiue
kaiyux
xxi-nv
leslie-fang25
laikhtewari

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 39.22% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title '[None][feat] DSv4 prep: compressor and mHC primitives' clearly and specifically describes the main change: adding compressor and mHC (Manifold-Constrained Hyper-Connection) primitives as preparation for DeepSeek-V4 support. |
| Description check | ✅ Passed | The PR description is comprehensive, clearly explaining what is included (compressor CUDA kernels, mHC CUDA kernels, Python wrappers, CMake wiring) and explicitly listing what is intentionally excluded. It includes detailed verification steps showing successful build, imports, tests, and scope validation. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai

Actionable comments posted: 7

🧹 Nitpick comments (4)

tensorrt_llm/_torch/attention_backend/sparse/deepseek_v4/compressor.py (1)

56-59: 💤 Low value

Unknown string keys raise cryptic KeyError.

resolve_kv_cache_dtype will raise a bare KeyError if passed an unrecognized string. A more informative error message would help users diagnose configuration issues.

Suggested improvement

 def resolve_kv_cache_dtype(kv_cache_dtype: Union[str, KVCacheDtype]) -> KVCacheDtype:
     if isinstance(kv_cache_dtype, str):
-        return _KV_CACHE_DTYPE_MAP[kv_cache_dtype]
+        if kv_cache_dtype not in _KV_CACHE_DTYPE_MAP:
+            raise ValueError(
+                f"Unknown kv_cache_dtype: '{kv_cache_dtype}'. "
+                f"Valid options: {list(_KV_CACHE_DTYPE_MAP.keys())}"
+            )
+        return _KV_CACHE_DTYPE_MAP[kv_cache_dtype]
     return kv_cache_dtype

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/attention_backend/sparse/deepseek_v4/compressor.py`
around lines 56 - 59, The resolve_kv_cache_dtype function currently raises a
bare KeyError when given an unrecognized string key, providing no helpful
guidance to users. Add error handling around the _KV_CACHE_DTYPE_MAP dictionary
lookup in resolve_kv_cache_dtype to catch the KeyError and re-raise a more
informative error message that includes the invalid key that was provided and
optionally lists the valid/supported options from the _KV_CACHE_DTYPE_MAP
dictionary. This will help users diagnose configuration issues more easily.
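A runnable sketch of the catch-and-re-raise variant described in this prompt follows. The enum members and map contents are hypothetical placeholders, since the real `KVCacheDtype` definition is not shown in this thread; only the error-handling shape is the point.

```python
from enum import Enum
from typing import Union


class KVCacheDtype(Enum):
    # Hypothetical members; the real enum in compressor.py may differ.
    BF16 = "bf16"
    FP8 = "fp8"
    MXFP4 = "mxfp4"


# Hypothetical map from user-facing strings to enum members.
_KV_CACHE_DTYPE_MAP = {member.value: member for member in KVCacheDtype}


def resolve_kv_cache_dtype(kv_cache_dtype: Union[str, KVCacheDtype]) -> KVCacheDtype:
    """Map a string to KVCacheDtype, naming the bad key and the valid options."""
    if isinstance(kv_cache_dtype, str):
        try:
            return _KV_CACHE_DTYPE_MAP[kv_cache_dtype]
        except KeyError:
            raise ValueError(
                f"Unknown kv_cache_dtype: {kv_cache_dtype!r}. "
                f"Valid options: {sorted(_KV_CACHE_DTYPE_MAP)}"
            ) from None
    return kv_cache_dtype


print(resolve_kv_cache_dtype("fp8"))
try:
    resolve_kv_cache_dtype("int3")
except ValueError as exc:
    print(exc)
```

Using `from None` suppresses the chained `KeyError` so the traceback shows only the descriptive message.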

tensorrt_llm/_torch/modules/mhc/mhc_cuda.py (1)

150-150: ⚡ Quick win

Consider raising a descriptive error instead of assertion.

The assertion guards against calling the function when DeepGEMM is unavailable, but assertions are removed in optimized Python (-O). A clear RuntimeError would be more robust.

♻️ Suggested refactor

-    assert dg_fn is not None, "DeepGEMM is not available"
+    if dg_fn is None:
+        raise RuntimeError(
+            "DeepGEMM is not available. Install deep_gemm or use FMA backend."
+        )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/modules/mhc/mhc_cuda.py` at line 150, The assertion at
line 150 checking if dg_fn is not None can be removed when Python runs in
optimized mode (-O flag), making the guard ineffective. Replace the assert
statement with an explicit if condition that raises a RuntimeError instead,
maintaining the descriptive error message "DeepGEMM is not available" so the
check is always enforced regardless of Python optimization settings.
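The difference between the two guards can be shown without launching a separate interpreter, because the built-in `compile()` accepts an `optimize` argument that mirrors the `-O` flag. This is an illustrative sketch; the function names are hypothetical and `None` stands in for an unavailable `dg_fn`.

```python
SOURCE = '''
def guarded_assert(fn):
    assert fn is not None, "DeepGEMM is not available"
    return "ran"

def guarded_raise(fn):
    if fn is None:
        raise RuntimeError("DeepGEMM is not available")
    return "ran"
'''


def load(optimize):
    """Compile SOURCE at the given optimization level and return its namespace."""
    namespace = {}
    exec(compile(SOURCE, "<demo>", "exec", optimize=optimize), namespace)
    return namespace


def outcome(fn):
    """Call fn(None) and report either the return value or the exception name."""
    try:
        return fn(None)
    except Exception as exc:
        return type(exc).__name__


normal = load(0)     # asserts kept, like plain `python`
optimized = load(1)  # asserts stripped, like `python -O`

print(outcome(normal["guarded_assert"]))     # AssertionError
print(outcome(optimized["guarded_assert"]))  # ran
print(outcome(optimized["guarded_raise"]))   # RuntimeError
```

Under optimization the assert-based guard lets the call proceed with a missing function, while the explicit `raise` still fires.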

tests/unittest/_torch/modules/test_mhc.py (1)

959-961: ⚡ Quick win

Add strict=True to zip for safety.

Python 3.10+ supports zip(..., strict=True), which raises ValueError if the iterables have different lengths. This catches bugs where the returned tuple doesn't match the expected structure.

♻️ Suggested fix

-    for ge, ee, name in zip(
-        graph_out, eager_out, ["residual", "post_mix", "comb_mix", "layer_input"]
-    ):
+    for ge, ee, name in zip(
+        graph_out, eager_out, ["residual", "post_mix", "comb_mix", "layer_input"], strict=True
+    ):

Apply the same change to line 976.

Also applies to: 976-978

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/modules/test_mhc.py` around lines 959 - 961, The zip()
function calls in the test file lack the strict=True parameter, which would help
catch bugs where iterables have different lengths. In
tests/unittest/_torch/modules/test_mhc.py at lines 959-961 (the zip call with
graph_out, eager_out, and the name list), add strict=True as a parameter to the
zip() function to enable length validation. Apply the identical change at lines
976-978 where another zip() call appears that needs the same strict=True
parameter added.

tensorrt_llm/_torch/modules/mhc/hyper_connection.py (1)

114-116: ⚡ Quick win

Consider replacing assertions with explicit ValueError for dtype/shape checks.

Assertions are removed when Python runs with -O (optimized mode), which can lead to silent failures in production. Explicit ValueError or TypeError is more robust for runtime validation in forward methods.

♻️ Suggested refactor

-    assert x.dtype == torch.bfloat16
-    assert self.mult == x.shape[-2]
-    assert self.hidden_size == x.shape[-1]
+    if x.dtype != torch.bfloat16:
+        raise TypeError(f"pre_mapping requires bfloat16 input, got {x.dtype}")
+    if self.mult != x.shape[-2]:
+        raise ValueError(f"Expected shape[-2]={self.mult}, got {x.shape[-2]}")
+    if self.hidden_size != x.shape[-1]:
+        raise ValueError(f"Expected shape[-1]={self.hidden_size}, got {x.shape[-1]}")

Apply the same pattern to lines 188-189 in fused_hc.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/modules/mhc/hyper_connection.py` around lines 114 - 116,
Replace the assertions that validate dtype and shape in the forward method
(checking x.dtype against torch.bfloat16, self.mult against x.shape[-2], and
self.hidden_size against x.shape[-1]) with explicit ValueError or TypeError
raises instead, since assertions are stripped when Python runs with -O flag and
will silently fail in production. Apply the same pattern to the assertions at
lines 188-189 in the fused_hc method to ensure consistent runtime validation
across both locations.
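The suggested checks can be exercised in isolation with a stand-in object. In the sketch below, `FakeTensor` and `check_pre_mapping_input` are hypothetical names, and the dtype is a plain string so the example runs without PyTorch; the real code would compare against `torch.bfloat16`.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class FakeTensor:
    """Hypothetical stand-in exposing only the attributes the checks read."""
    dtype: str
    shape: Tuple[int, ...]


def check_pre_mapping_input(x, mult, hidden_size, expected_dtype="bfloat16"):
    """Raise TypeError for a wrong dtype and ValueError for a wrong shape."""
    if x.dtype != expected_dtype:
        raise TypeError(f"pre_mapping requires {expected_dtype} input, got {x.dtype}")
    if len(x.shape) < 2:
        raise ValueError(f"Expected at least 2 dimensions, got shape {x.shape}")
    if x.shape[-2] != mult:
        raise ValueError(f"Expected shape[-2]={mult}, got {x.shape[-2]}")
    if x.shape[-1] != hidden_size:
        raise ValueError(f"Expected shape[-1]={hidden_size}, got {x.shape[-1]}")


check_pre_mapping_input(FakeTensor("bfloat16", (2, 4, 128)), mult=4, hidden_size=128)
try:
    check_pre_mapping_input(FakeTensor("float32", (2, 4, 128)), mult=4, hidden_size=128)
except TypeError as exc:
    print(exc)
```

Splitting the errors this way lets callers distinguish a dtype problem from a shape problem, and neither check disappears under `-O`.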

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/tensorrt_llm/kernels/compressorKernels/compressorKernels.cu`:
- Around line 162-176: The packE2M1x2 function silently returns 0 on pre-SM100
architectures instead of failing, which causes data corruption when
kMXFP4Blockwise mode is enabled on unsupported GPUs. Add a runtime architecture
validation in postProcessScatterLaunch (around line 1692) that rejects
kMXFP4Blockwise mode if the GPU compute capability is less than SM100 (major
version less than 10). Use TLLM_CHECK_WITH_INFO to perform this validation and
report the actual GPU architecture in the error message so users know why their
request was rejected. This prevents silent data corruption by catching the
incompatibility at runtime rather than letting packE2M1x2 silently degrade.

In `@cpp/tensorrt_llm/kernels/mhcKernels/mhcFusedHcKernel.cu`:
- Around line 530-542: The FMA kernel launchers use tile_n and tile_m parameters
in modulo and division operations without first validating that these values are
positive, which can cause host crashes or undefined behavior if they are zero or
negative. Add TLLM_CHECK_WITH_INFO validation calls to ensure tile_n is greater
than zero before the modulo operation FHC_SHAPE_N % tile_n (around line 534),
and ensure tile_m is greater than zero before the division operation m_batches =
... / tile_m (around line 804). Apply the same validation checks to the second
FMA launcher function in the consolidated_sites range (lines 795-820) to prevent
the same crash paths in both implementations.

In `@cpp/tensorrt_llm/kernels/mhcKernels/mhcKernels.cu`:
- Around line 867-875: The mhcHcHeadApplyLaunch wrapper function accepts a
runtime parameter mult without validation, but the kernel mhcHcHeadApplyKernel
only allocates s_pre[8] in shared memory. If mult exceeds 8, it causes
out-of-bounds memory writes and reads. Add a validation check in
mhcHcHeadApplyLaunch before the kernel launch statement to ensure mult does not
exceed 8, either through an assertion, error return, or exception to prevent
invalid kernel behavior.
- Around line 847-863: The default case in the switch statement for tileN in
mhcGemmSqrsumFmaLaunch incorrectly maps all unrecognized tileN values to the
TN=24 kernel, which causes out-of-bounds writes when tileN is unsupported (e.g.,
tileN=16). Add a validation check using TLLM_CHECK_WITH_INFO after the existing
divisibility check on line 849 to explicitly reject unsupported tileN values,
ensuring only the valid cases (1, 2, 3, 4, 6, 8, 12, and 24) are allowed to
proceed to the switch statement.

In `@cpp/tensorrt_llm/thop/compressorOp.cpp`:
- Around line 78-118: The compressorPostProcessScatterOp function is missing
contiguity validation checks for several input tensors that the kernel expects
to have contiguous memory layout. Add TORCH_CHECK statements after the existing
position_ids contiguity check (around line 100) to validate that kv_comp,
rms_weight, block_offsets, and kv_cache are all contiguous, using the same
pattern as the existing cos_sin_table and position_ids checks with the
is_contiguous() method and appropriate error messages identifying each tensor.

In `@cpp/tensorrt_llm/thop/mhcOp.cpp`:
- Around line 94-100: The norm_weight tensor validation in the mhc_fused_hc
function is missing device checks, which could allow CPU or different-device
tensors to be passed to the CUDA kernel, causing illegal memory access. Add
TORCH_CHECK validations to ensure that norm_weight is a CUDA tensor and is on
the same device as the other tensors being used in the operation. Insert these
device checks in the existing norm_weight validation block (after checking
dtype, contiguity, and numel) to verify the tensor is on a CUDA device before
dereferencing its pointer.
- Around line 103-155: The code currently handles backend values 3, 2, and 1
with explicit if conditions, but any other value silently falls through to the
final mhcFusedHcLaunch call (backend 0 fallback), which masks bugs like typos or
autotuner errors. Add a guard after the if conditions to reject unknown backend
values (anything other than 0, 1, 2, or 3) by throwing an error or asserting,
before reaching the mhcFusedHcLaunch call. This ensures invalid backend values
fail fast instead of silently executing the wrong kernel.
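Two of the comments above share one theme: an unrecognized `tileN` or `backend` value should be rejected rather than routed to a default kernel. The Python sketch below illustrates that fail-fast pattern only; the real fix would live in the C++ launchers using `TLLM_CHECK_WITH_INFO` or `TORCH_CHECK`. The handler labels for backends 1 to 3 are hypothetical placeholders, since this thread does not say which launcher each value selects.

```python
# Tile sizes listed as valid in the review comment above.
SUPPORTED_TILE_N = (1, 2, 3, 4, 6, 8, 12, 24)


def check_tile_n(tile_n):
    """Reject tile sizes that have no dedicated kernel instead of using a default."""
    if tile_n not in SUPPORTED_TILE_N:
        raise ValueError(
            f"Unsupported tileN={tile_n}; expected one of {SUPPORTED_TILE_N}"
        )
    return tile_n


def dispatch(backend, handlers):
    """Call the handler for `backend`, failing fast on unknown values."""
    try:
        handler = handlers[backend]
    except KeyError:
        raise ValueError(
            f"Unknown backend {backend}; expected one of {sorted(handlers)}"
        ) from None
    return handler()


# Backend 0 maps to mhcFusedHcLaunch per the comment; the rest are placeholders.
handlers = {
    0: lambda: "mhcFusedHcLaunch",
    1: lambda: "placeholder_backend_1",
    2: lambda: "placeholder_backend_2",
    3: lambda: "placeholder_backend_3",
}

print(check_tile_n(8))        # 8
print(dispatch(0, handlers))  # mhcFusedHcLaunch
try:
    dispatch(7, handlers)
except ValueError as exc:
    print(exc)
```

A table lookup has no implicit fallthrough, so a typo or a bad autotuner tactic surfaces as an error at the call site.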



ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e0251fc7-d3e0-4b27-b4fc-d3a7189d27f1

📥 Commits

Reviewing files that changed from the base of the PR and between 20b6068 and 8b66876.

📒 Files selected for processing (23)

cpp/tensorrt_llm/CMakeLists.txt
cpp/tensorrt_llm/kernels/CMakeLists.txt
cpp/tensorrt_llm/kernels/compressorKernels/CMakeLists.txt
cpp/tensorrt_llm/kernels/compressorKernels/compressorKernels.cu
cpp/tensorrt_llm/kernels/compressorKernels/compressorKernels.h
cpp/tensorrt_llm/kernels/mhcKernels/CMakeLists.txt
cpp/tensorrt_llm/kernels/mhcKernels/fused_tf32_pmap_gemm.cuh
cpp/tensorrt_llm/kernels/mhcKernels/mhcFusedHcKernel.cu
cpp/tensorrt_llm/kernels/mhcKernels/mhcKernels.cu
cpp/tensorrt_llm/kernels/mhcKernels/mhcKernels.h
cpp/tensorrt_llm/kernels/mhcKernels/mhc_fused_fma.cuh
cpp/tensorrt_llm/thop/CMakeLists.txt
cpp/tensorrt_llm/thop/compressorOp.cpp
cpp/tensorrt_llm/thop/mhcOp.cpp
tensorrt_llm/_torch/attention_backend/sparse/deepseek_v4/__init__.py
tensorrt_llm/_torch/attention_backend/sparse/deepseek_v4/compressor.py
tensorrt_llm/_torch/modules/mhc/__init__.py
tensorrt_llm/_torch/modules/mhc/hyper_connection.py
tensorrt_llm/_torch/modules/mhc/mhc_cuda.py
tests/unittest/_torch/attention/sparse/deepseek_v4/__init__.py
tests/unittest/_torch/attention/sparse/deepseek_v4/test_compressor_kernel.py
tests/unittest/_torch/attention/sparse/deepseek_v4/test_compressor_tf32.py
tests/unittest/_torch/modules/test_mhc.py

lfr-0531 · 2026-06-15T16:00:43Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-15T16:06:43Z

PR_Github #54321 [ run ] triggered by Bot. Commit: 8b66876 Link to invocation

tensorrt-cicd · 2026-06-15T21:33:51Z

PR_Github #54321 [ run ] completed with state SUCCESS. Commit: 8b66876
/LLM/main/L0_MergeRequest_PR pipeline #43391 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

longlee0622 · 2026-06-16T01:26:12Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-16T01:32:13Z

PR_Github #54397 [ run ] triggered by Bot. Commit: 8b66876 Link to invocation

lfr-0531 · 2026-06-16T01:56:59Z

/bot kill

lfr-0531 · 2026-06-16T01:57:05Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-16T02:03:27Z

PR_Github #54411 [ run ] triggered by Bot. Commit: e72c27b Link to invocation

tensorrt-cicd · 2026-06-16T02:03:51Z

PR_Github #54412 [ kill ] triggered by Bot. Commit: e72c27b Link to invocation

tensorrt-cicd · 2026-06-16T02:03:56Z

PR_Github #54411 [ run ] completed with state ABORTED. Commit: e72c27b

Link to invocation

tensorrt-cicd · 2026-06-16T02:04:08Z

PR_Github #54397 [ run ] completed with state ABORTED. Commit: 8b66876

Link to invocation

tensorrt-cicd · 2026-06-16T02:04:16Z

PR_Github #54412 [ kill ] completed with state SUCCESS. Commit: e72c27b
Successfully killed previous jobs for commit e72c27b

Link to invocation

Signed-off-by: Mingyang Hao <200044211+mingyangHao@users.noreply.github.com>

mingyangHao · 2026-06-16T06:40:27Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-16T13:38:15Z

PR_Github #54516 [ run ] completed with state SUCCESS. Commit: 6e4e004
/LLM/main/L0_MergeRequest_PR pipeline #43576 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

lfr-0531 · 2026-06-16T14:22:50Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-16T14:29:25Z

PR_Github #54601 [ run ] triggered by Bot. Commit: 6e4e004 Link to invocation

tensorrt-cicd · 2026-06-16T16:22:03Z

PR_Github #54601 [ run ] completed with state SUCCESS. Commit: 6e4e004
/LLM/main/L0_MergeRequest_PR pipeline #43641 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

lfr-0531 · 2026-06-17T00:53:32Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-17T00:58:56Z

PR_Github #54696 [ run ] triggered by Bot. Commit: 6debcbd Link to invocation

tensorrt-cicd · 2026-06-17T04:39:37Z

PR_Github #54696 [ run ] completed with state FAILURE. Commit: 6debcbd
/LLM/main/L0_MergeRequest_PR pipeline #43727 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

lfr-0531 · 2026-06-17T06:40:22Z

/bot run

tensorrt-cicd · 2026-06-17T06:47:00Z

PR_Github #54779 [ run ] triggered by Bot. Commit: 6debcbd Link to invocation

tensorrt-cicd · 2026-06-17T08:33:51Z

PR_Github #54779 [ run ] completed with state SUCCESS. Commit: 6debcbd
/LLM/main/L0_MergeRequest_PR pipeline #43797 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

lfr-0531 · 2026-06-17T08:38:33Z

/bot run

tensorrt-cicd · 2026-06-17T08:45:18Z

PR_Github #54811 [ run ] triggered by Bot. Commit: 6debcbd Link to invocation

tensorrt-cicd · 2026-06-17T10:45:54Z

PR_Github #54811 [ run ] completed with state SUCCESS. Commit: 6debcbd
/LLM/main/L0_MergeRequest_PR pipeline #43826 completed with status: 'SUCCESS'

CI Report

Link to invocation

yuxianq · 2026-06-23T10:30:53Z

+
+TORCH_LIBRARY_IMPL(trtllm, CUDA, m)
+{
+    m.impl("mhc_big_fuse", &mhcBigFuseOp);


Should we add register_fake for all ops to tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py?

All the newly added custom ops are in-place, so there's no need to add them to register_fake. I just updated the inplace_map in tensorrt_llm/_torch/compilation/utils.py.

yuxianq · 2026-06-23T10:31:12Z

+
+TORCH_LIBRARY_IMPL(trtllm, CUDA, m)
+{
+    m.impl("compressor_paged_kv_compress", &compressorPagedKvCompressOp);


Should we add register_fake for all ops to tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py?

Same as above.

Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

lfr-0531 · 2026-06-23T12:45:37Z

/bot run

tensorrt-cicd · 2026-06-23T12:51:51Z

PR_Github #55244 [ run ] triggered by Bot. Commit: 31b1c4b Link to invocation

tensorrt-cicd · 2026-06-23T16:23:41Z

PR_Github #55244 [ run ] completed with state SUCCESS. Commit: 31b1c4b
/LLM/main/L0_MergeRequest_PR pipeline #44202 completed with status: 'SUCCESS'

CI Report

Link to invocation

[None][feat] DSv4 prep: compressor and mHC primitives

8b66876

Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

lfr-0531 requested review from a team as code owners June 15, 2026 14:56

lfr-0531 requested review from PerkzZheng and leslie-fang25 June 15, 2026 14:56

github-actions Bot assigned lfr-0531 Jun 15, 2026

coderabbitai Bot reviewed Jun 15, 2026

lfr-0531 requested a review from mingyangHao June 15, 2026 16:01

lfr-0531 force-pushed the user/fanrongl/dsv4-compressor-mhc branch from 09f114d to 8b66876 Compare June 16, 2026 01:41

Merge branch 'main' into user/fanrongl/dsv4-compressor-mhc

e72c27b

longlee0622 reviewed Jun 16, 2026

Comment thread tests/unittest/_torch/modules/test_mhc.py

Comment thread cpp/tensorrt_llm/kernels/CMakeLists.txt

Comment thread cpp/tensorrt_llm/kernels/compressorKernels/compressorKernels.cu Outdated

[None][fix] Add DeepSeek V4 attention type enum

0f3020e

Signed-off-by: Mingyang Hao <200044211+mingyangHao@users.noreply.github.com>

longlee0622 approved these changes Jun 16, 2026

lfr-0531 mentioned this pull request Jun 16, 2026

[None][feat] DSv4: sparse cache manager adapter #15394

Draft

1 task

[None][fix] Address DSv4 primitive review comments

6e4e004

Signed-off-by: Mingyang Hao <200044211+mingyangHao@users.noreply.github.com>

mingyangHao force-pushed the user/fanrongl/dsv4-compressor-mhc branch from c75b5fc to 6e4e004 Compare June 16, 2026 04:45

Merge branch 'main' into user/fanrongl/dsv4-compressor-mhc

6debcbd

lfr-0531 requested review from yuxianq and removed request for PerkzZheng June 17, 2026 09:00

mingyangHao approved these changes Jun 17, 2026

lfr-0531 enabled auto-merge (squash) June 17, 2026 11:36

leslie-fang25 approved these changes Jun 23, 2026

yuxianq reviewed Jun 23, 2026

[None][fix] Track DSV4 inplace custom ops

e6a20eb

Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

lfr-0531 requested a review from a team as a code owner June 23, 2026 11:40

lfr-0531 requested a review from hyukn June 23, 2026 11:40

[None][fix] Correct DSV4 inplace output indices

199918d

Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

liji-nv approved these changes Jun 23, 2026

Merge branch 'main' into user/fanrongl/dsv4-compressor-mhc

31b1c4b
