[#14828][feat] AutoDeploy: enable trtllm multi KV cache pool (gpt-oss branch) by MrGeva · Pull Request #253 · nv-auto-deploy/TensorRT-LLM

MrGeva · 2026-06-04T05:56:19Z

Summary

Enables multiple KV cache memory pools in AutoDeploy's trtllm attention backend, then re-enables it on this gpt-oss branch by reverting the temporary single-pool workaround (commit 0679581675). Targets taylor/gpt-oss-0511_rebase_0511 so it composes with the gpt-oss accuracy CI test + optimizations already on that branch. Implements NVIDIA#14828.

Two commits:

Multi KV pool support in trtllm attention — removes the three trtllm-specific blockers:
- Gate: requires_uniform_kv_caches → False, so _identify_managed_kv_resources no longer raises with >1 pool (AD trtllm uses per-layer KV views, so block_offset_multiplier is uniformly kv_factor=2).
- Per-group block_offsets: the planner keeps an address-stable buffer per KV window group (keyed by the group's cache_loc input ptr) so per-group prepare_trtllm_metadata invocations don't clobber a single shared buffer. CUDA-graph safe.
- Cyclic-window staging: the trtllm kernel masks the window internally via cyclic indexing, so the executor passes the full per-window block table + global KV length (mirroring the PyTorch backend) instead of host-slicing. New AttentionDescriptor.kernel_handles_cyclic_swa() (True only for trtllm) is plumbed through the kvcache transform -> CachedSequenceInterface -> executor. triton/flashinfer keep host-slicing.
Re-enable multi-pool on the gpt-oss branch — restores sliding_window propagation in get_cache_initializers so non-uniform-window models (gpt-oss: 128/4096) form one pool per window again.
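
The per-group buffer idea can be illustrated with a minimal pure-Python sketch. `BlockOffsetPlanner`, `buffer_for`, and `prepare` are hypothetical names rather than the planner in this PR, and a Python list stands in for a device tensor whose address must stay fixed across CUDA-graph replays.

```python
class BlockOffsetPlanner:
    """Toy planner: one persistent block_offsets buffer per KV window group."""

    def __init__(self, max_entries: int):
        self.max_entries = max_entries
        self._buffers = {}  # group key (cache_loc pointer) -> buffer

    def buffer_for(self, cache_loc_ptr: int) -> list:
        # Allocate on first use only; later calls return the same object,
        # so its identity (standing in for a device address) never changes.
        if cache_loc_ptr not in self._buffers:
            self._buffers[cache_loc_ptr] = [0] * self.max_entries
        return self._buffers[cache_loc_ptr]

    def prepare(self, cache_loc_ptr: int, offsets: list) -> list:
        if len(offsets) > self.max_entries:
            raise ValueError("offsets exceed the preallocated buffer")
        buf = self.buffer_for(cache_loc_ptr)
        buf[: len(offsets)] = offsets  # in-place write, no reallocation
        return buf


planner = BlockOffsetPlanner(max_entries=4)
swa = planner.prepare(0x1000, [1, 2])       # group with window 128
full = planner.prepare(0x2000, [7, 8, 9])   # group with window 4096
```

Because each group writes into its own buffer, preparing the second group leaves the first group's offsets untouched, and a later `prepare` for the same key returns the same object.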

Validation (gpt-oss-20b, AutoDeploy trtllm, 1xGB200, `gpt_oss.yaml`)

Built with auto-dev install -s. gsm8k via the same tensorrt_llm.evaluate.GSM8K evaluator as TestGPTOSS (200 samples, reasoning_effort=low), A/B vs the single-pool workaround:

| Metric | Multi-pool | Single-pool (workaround) |
| --- | --- | --- |
| KV pools | 2 (window 128 + 4096) | 1 (all 24 layers @ 4096) |
| gsm8k (200) | 85.500 | 85.500 |
| KV mem (fixed 0.8 frac) | 118.96 GB | 118.97 GB |
| Max KV tokens | 6,271,552 | 524,288 |

Accuracy: no regression — multi-pool == single-pool == 85.5 (ref gpt-oss-20b = 85.823). (Contrast: the earlier naive host-sliced attempt scored 6%.)
KV-cache capacity: ~12x more tokens at equal memory — the 12 SWA layers use a 128-window pool instead of over-allocating the full 4096 window (single-pool's 524,288 = exactly batch 128 x seq 4096). Implies higher achievable concurrency / longer context at the same GPU memory.
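
The capacity ratio follows directly from the two token counts reported above; this is a quick arithmetic check, not a new measurement.

```python
# Token counts taken from the validation table above.
single_pool_tokens = 128 * 4096   # batch 128 x seq 4096
multi_pool_tokens = 6_271_552

ratio = multi_pool_tokens / single_pool_tokens
print(f"single-pool tokens: {single_pool_tokens}")
print(f"capacity ratio: {ratio:.2f}x")
```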

Caveat: the ~12x is a static KV-capacity measurement from the cache-manager resize_kv_cache init logs, not a throughput/latency benchmark. Accuracy is a real end-to-end measurement; realized serving speedup was not benchmarked here.

Unit tests

Per-group block_offsets (non-clobbering + stable ptr), cyclic-view staging, kernel_handles_cyclic_swa plumbing, the multi-pool gate, and a forward-level two-pool trtllm test — all pass on the rebased branch (tests/unittest/auto_deploy/singlegpu/...).

CI

AutoDeploy change — run with the AutoDeploy stages:

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

🤖 Generated with Claude Code

Wraps torch.ops.trtllm.bf16_mxe2m1_block_scale_moe_runner -- the trtllm-gen MXFP4-weight x BF16-activation MoE kernel that PT's W4A16MXFP4TRTLLMGenFusedMoEMethod uses today on B200 by default. Op signature: takes pre-shuffled MXFP4 weights, UE8M0 scales, float32 biases, and per-expert SwiGLU params. At forward time only zero-pads activations to the kernel's expected H_pad and slices the output back to valid_hidden_size. The matching weight-prep helper, transform, and ShardingInfo arrive in following steps. Op verified to register via torch.library and produce the expected schema. No graph/transform changes yet -- this op is inert until step 3 wires it into a transform. Refs: cc_reports/gpt-oss-120b/MOE_TRTLLM_GEN_PLAN.md (step 1 of 6) Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

Mirrors PT MXFP4WeightTRTLLMGenFusedMoEMethod weight-loading path (quantization.py:4135-4500). Reuses PT helpers maybe_pad_for_mxfp4, trtllmgen_maybe_get_cached_*_permute_indices, _get_weight_alignment. Steps: reshape HF [E, 2I, H/32, 16] -> [E, 2I, H/2], pad to alignment (input_hidden_alignment//2=256 cols, weight_alignment=128 rows), pad matching scales, shuffle per expert via torch.ops.trtllm.shuffle_matrix, cast biases to float32. Returns PreparedMXFP4Weights dataclass. Step-2 scope: tp_size=1 only; TP slicing arrives in step 5. Smoke-tested on gpt-oss-120b shapes (E=128, I=H=2880) on B200 -- output shapes match PT byte-for-byte. Refs: cc_reports/gpt-oss-120b/MOE_TRTLLM_GEN_PLAN.md (step 2 of 6) Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

Cleaner graph integration: the op now takes raw router_weight + bias + top_k and computes RenormalizeMoeRoutingMethod-style routing internally (F.linear -> topk -> softmax-of-topk), then dispatches to the kernel with pre-computed topk_weights / topk_ids. This makes the upcoming transform (step 3) a single 1:1 op rewrite of torch_moe_dense_mlp -> trtllm_mxfp4_w4a16_moe_fused without needing a separate routing op upstream. Refs: cc_reports/gpt-oss-120b/MOE_TRTLLM_GEN_PLAN.md (step 1 of 6) Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
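
As a rough illustration of the routing order described above (linear, then top-k, then softmax over only the selected logits), here is a pure-Python sketch for a single token. The function names are hypothetical and plain lists stand in for tensors; the real op operates on batched tensors.

```python
import math


def linear(x, weight, bias):
    """y = W x + b for one token; weight is [num_experts][hidden]."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def renormalize_routing(x, router_weight, router_bias, top_k):
    logits = linear(x, router_weight, router_bias)
    # Expert ids of the top_k largest logits, highest first.
    topk_ids = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:top_k]
    # Softmax over the selected logits only (max subtracted for stability).
    peak = max(logits[e] for e in topk_ids)
    exps = [math.exp(logits[e] - peak) for e in topk_ids]
    total = sum(exps)
    topk_weights = [v / total for v in exps]
    return topk_weights, topk_ids


weights, ids = renormalize_routing(
    x=[1.0, 0.0],
    router_weight=[[1.0, 0.0], [2.0, 0.0], [0.0, 0.0], [3.0, 0.0]],
    router_bias=[0.0, 0.0, 0.0, 0.0],
    top_k=2,
)
```

The selected weights always sum to 1, which is what distinguishes softmax-of-topk from taking the top-k of a full softmax.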

Runs in post_load_fusion stage. Picks up triton_mxfp4_moe nodes from quantize_mxfp4_moe, runs the step-2 weight prep, registers prepared params on the experts module, and rewrites the call to auto_deploy::trtllm_mxfp4_w4a16_moe_fused. Frees the original raw HF-layout MXFP4 params after rewrite. Step-3 V4 scope: EP=1 (triton_mxfp4_moe without _ep) only. EP variant is covered by step 5 with MXFP4TRTLLMGenSharding. Refs: cc_reports/gpt-oss-120b/MOE_TRTLLM_GEN_PLAN.md (step 3 of 6) Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

Reorder positional args in ``Bf16MxE2m1BlockScaleMoERunner.get_valid_tactics`` to match the C++ signature of ``Bf16MxE2m1BlockScaleMoeRunner::getValidConfigs`` (``cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp:516``): ``(topK, hiddenSize, intermediateSize, numLocalExperts, numTokens, validHiddenSize, validIntermediateSize)``. Commit 86cfb3e (cubin update + valid_*_size plumbing) added ``valid_hidden_size`` / ``valid_intermediate_size`` params to all three trtllm-gen MoE runners' Python wrappers. The other two siblings (``MxE4m3MxE2m1`` line 968, ``E4m3MxE2m1`` line 1274) appended the new args at the end correctly; only ``Bf16MxE2m1`` placed them in the middle, so the autotuner was passing ``valid_*`` values into the ``numLocalExperts`` / ``numTokens`` slots and ``local_num_experts`` / ``num_tokens`` into the ``valid_*`` slots. Effect: the cubin filter saw garbage shape parameters, returned an empty tactic list, and the autotune cache stayed empty -- so at run time the kernel fell back to ``getDefaultValidConfigIndex`` and asserted "No valid config found for the given problem shape MNK" on the first MoE call (e.g. AD's ``resize_kv_cache`` memory probe at ``max_num_tokens=8192``). This Python-only reorder restores parity with the C++ binding; no recompile needed. Found while onboarding gpt-oss-120b on AutoDeploy with the ``bf16_mxe2m1`` MoE path; reproduces in any non-tuning-mode call to the op (e.g. PT's ``MXFP4WeightTRTLLMGenFusedMoEMethod`` users hit it on the first prefill). Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
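
The failure mode is an ordinary positional-argument mismatch. The sketch below uses a hypothetical Python stand-in for the C++ binding, with the parameter order quoted above and illustrative shape values, to show which slots receive which values before and after the reorder.

```python
def get_valid_configs(top_k, hidden_size, intermediate_size,
                      num_local_experts, num_tokens,
                      valid_hidden_size, valid_intermediate_size):
    """Stand-in for the binding: reports what the last four slots received."""
    return {
        "num_local_experts": num_local_experts,
        "num_tokens": num_tokens,
        "valid_hidden_size": valid_hidden_size,
        "valid_intermediate_size": valid_intermediate_size,
    }


# Illustrative values only.
s = dict(top_k=4, hidden=3072, inter=2944, experts=128, tokens=8192,
         valid_hidden=2880, valid_inter=2880)

# Buggy wrapper: valid_* placed in the middle of the positional list.
buggy = get_valid_configs(s["top_k"], s["hidden"], s["inter"],
                          s["valid_hidden"], s["valid_inter"],
                          s["experts"], s["tokens"])

# Fixed wrapper: valid_* appended at the end, matching the binding.
fixed = get_valid_configs(s["top_k"], s["hidden"], s["inter"],
                          s["experts"], s["tokens"],
                          s["valid_hidden"], s["valid_inter"])
```

In the buggy call the binding sees 2880 local experts and a valid hidden size of 128, which is the kind of nonsensical shape that would lead a tactic filter to reject every candidate.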

Bring ``prepare_mxfp4_weights_for_trtllm_gen`` and the ``trtllm_mxfp4_w4a16_moe_fused`` op into structural parity with PT's ``MXFP4WeightTRTLLMGenFusedMoEMethod`` (quantization.py:4135) so the trtllm-gen MoE kernel sees the same byte layout PT exercises.
mxfp4_weight_prep.py changes:
* Per-expert ``I_pad = roundUp(I, weight_alignment) = 2944`` first; derive ``2I_pad = 5888`` and ``I/2_pad = 1472`` from that. Previously we padded ``2I = 5760`` directly, which is already 128-aligned and thus a no-op, leaving w1's effective ``I = 2880`` while w2's column padding pushed ``I = 2944`` -- an inconsistent intermediate dim across the two gemms.
* W1 hidden axis padded to ``input_hidden_alignment = 512`` (``H_w1_pad = 3072``), W2 hidden axis padded to ``weight_alignment = 128`` (``H_w2_pad = 2944``), matching PT's ``create_weights`` (lines 3715-3717 of quantization.py).
* De-interleave gate / up rows from the on-disk row-interleaved storage (``gate_up_proj_blocks[:, ::2, :]`` = gate, ``[:, 1::2, :]`` = up) and pad each half to ``I_pad`` separately before stacking as ``[up | gate]``. PT's chunk-then-copy dance (modeling_gpt_oss.py:695-706 + quantization.py:4252-4258) ends up with the same physical layout.
* Add ``torch.ops.trtllm.block_scale_interleave`` after ``shuffle_matrix`` for both fc1 and fc2 scales -- PT does both ops (quantization.py:4382, 4439); skipping the second was a partial bug.
trtllm_moe.py change:
* Routing softmax in fp32 instead of bf16 -- matches PT's ``RenormalizeMoeRoutingMethod``, which casts to fp32 for the topk softmax then back to the activation dtype.
Status: kernel builds and runs cleanly with these changes, and pure GEMM throughput is at the V4 target (~9.28 ms ITL / ~108 tok/s for gpt-oss-120b vs V3 Triton's 127.79 ms / 7.96 tok/s -- 13.5x).
However, content correctness is still blocked by an upstream NaN bug in the trtllm-gen MoE kernel itself: PT's own ``TRTLLMGenFusedMoE.forward`` on gpt-oss-120b at this TRT-LLM commit also produces NaN logits, so even byte-correct prep cannot rescue the output. Tracking note: re-validate when the upstream fix lands; if correctness is restored, proceed to step 5 (TP-MoE sharding).
Refs: cc_reports/gpt-oss-120b/MOE_TRTLLM_GEN_PLAN.md (step 4 of 6), RESUME_V4.md.
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
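
The padded sizes quoted in this commit can be reproduced with a small round-up helper. This is only an arithmetic check of the numbers in the message; `round_up` is a local helper, not the PT function.

```python
def round_up(x: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is >= x."""
    return (x + multiple - 1) // multiple * multiple


I = H = 2880                      # gpt-oss-120b intermediate / hidden size
weight_alignment = 128
input_hidden_alignment = 512

I_pad = round_up(I, weight_alignment)            # pad I first ...
two_I_pad = 2 * I_pad                            # ... then derive 2I_pad
half_I_pad = I_pad // 2                          # ... and I/2_pad
H_w1_pad = round_up(H, input_hidden_alignment)   # W1 hidden axis
H_w2_pad = round_up(H, weight_alignment)         # W2 hidden axis

# The earlier approach padded 2I directly, which changes nothing.
old_two_I_pad = round_up(2 * I, weight_alignment)
```

Padding `2I` directly leaves 5760 rows, whereas padding `I` first yields 5888, which is why the two GEMMs previously disagreed on the intermediate dimension.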

Mirrors modeling_gpt_oss.py but routes every attention Linear through ``torch.ops.auto_deploy.torch_linear_simple`` with sharding hint kwargs (``tp_mode``, ``tp_min_local_shape``, ``layer_type``), and inserts ``torch.ops.auto_deploy.view`` (``tp_scaled_dim=2``) for q/k/v/attn_out reshapes plus a trailing ``torch.ops.auto_deploy.all_reduce`` placeholder after the rowwise o_proj. Same pattern qwen3_ir / qwen3_5_moe_ir use.
Sharding strategy emitted into the graph:
* q_proj / k_proj / v_proj -> colwise (+ tp_min_local_shape=head_dim for GQA: 64 Q heads / 8 KV heads at TP=8)
* view (q/k/v/attn_out) -> tp_scaled_dim=2 (head-count dim)
* o_proj -> rowwise + auto_deploy.all_reduce
Out of scope here (matches qwen_ir convention):
* MoE router + experts stay replicated -- the V4 trtllm-gen MoE op (``trtllm_mxfp4_w4a16_moe_fused``) has no ShardableNode yet. Step 5 of MOE_TRTLLM_GEN_PLAN.md (V6) registers TP-MoE for that op.
* lm_head stays as plain nn.Linear -- no canonical sharding-IR pattern for col-parallel-then-all-gather in this codebase yet.
Registration:
* GptOssForCausalLM still registers via ``register_custom_model_cls`` (last-registration-wins).
* ``models/custom/__init__.py`` adds modeling_gpt_oss_ir to the ``AD_USE_IR_MODELS`` opt-in block, alongside deepseek_ir, nemotron_h_ir, qwen3_5_moe_ir.
Validated end-to-end on gpt-oss-120b 8xB200 with the new V5 yaml (world_size=8, apply_sharding_hints with shard_layers=["mha"], detect_sharding+sharding_transform_executor disabled): apply_sharding_hints processed 324 nodes / skipped 37 (the MoE nodes carry layer_type="moe"), strip_sharding_hints stripped 288 hints, fuse_allreduce_residual_rmsnorm matched 36 -- attention TP=8 fully wired through.
Refs: cc_reports/gpt-oss-120b/MOE_TRTLLM_GEN_PLAN.md (V5 step), RESUME_V4.md (still valid for the trtllm-gen NaN tracking).
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

Step 5 of MOE_TRTLLM_GEN_PLAN.md: extend the V4 trtllm-gen MoE op with TP-sharding on the intermediate axis so MoE compute itself splits across ranks (V5 only sharded attention; MoE was replicated and dominated cost).
prepare_mxfp4_weights_for_trtllm_gen:
* Add tp_rank arg.
* Compute TP-aware alignment via _get_weight_alignment so per-rank intermediate is itself 128-aligned after pad-before-shard (matches PT load_expert_w3_w1_weight / load_expert_w2_weight).
* Pre-pad intermediate axis to alignment_tp, then slice [tp_rank * I_pr, (tp_rank+1) * I_pr] on gate/up rows, scales, biases (col-parallel) and on dn_3d cols (row-parallel, /2 for packed mxfp4).
* Slice down_scales on dim 2 with /scaling_vector_size stride.
* Clamp valid_intermediate to min(intermediate_size, slice_stop) - slice_start.
QuantizeMXFP4MoETrtllmGen transform:
* Read moe_tp_size / moe_tp_rank / allreduce_strategy from shared_config.dist_config.
* Forward to prepare_mxfp4_weights_for_trtllm_gen.
* After the V4 op rewrite, when moe_tp_size > 1 insert auto_deploy.all_reduce so partial [..., hidden] outputs from each rank sum across ranks before the residual add. fc2_bias is divided by tp_size in the prep helper so the post-AR sum reproduces the unsharded bias.
Smoke-tested:
* tp=1 -> fc1=[8, 5888, 1536] valid_I=2880 (no regression).
* tp=8 rank=0 -> fc1=[8, 768, 1536] valid_I=384.
* tp=8 rank=7 -> fc1=[8, 768, 1536] valid_I=192 (last rank partial).
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
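
The per-rank slice bounds and `valid_I` values in the smoke test can be reproduced with a short sketch. It assumes the TP-aware alignment is `128 * tp_size`, which is an assumption chosen so that each rank's share is 128-aligned and not a statement about what `_get_weight_alignment` returns; `tp_slice` is a hypothetical helper.

```python
def round_up(x: int, multiple: int) -> int:
    return (x + multiple - 1) // multiple * multiple


def tp_slice(intermediate_size: int, tp_size: int, tp_rank: int, base_alignment: int = 128):
    # Assumption: pad to base_alignment * tp_size before sharding.
    alignment_tp = base_alignment * tp_size
    i_pad = round_up(intermediate_size, alignment_tp)
    i_per_rank = i_pad // tp_size
    start = tp_rank * i_per_rank
    stop = (tp_rank + 1) * i_per_rank
    # Rows past the true intermediate size are padding, not valid data.
    valid = min(intermediate_size, stop) - start
    return i_per_rank, start, stop, valid


tp1 = tp_slice(2880, tp_size=1, tp_rank=0)
rank0 = tp_slice(2880, tp_size=8, tp_rank=0)
rank7 = tp_slice(2880, tp_size=8, tp_rank=7)

# Dividing the row-parallel bias by tp_size makes the all-reduce sum
# reproduce the unsharded bias (illustrative scalar value).
bias, tp = 0.5, 8
summed_bias = sum(bias / tp for _ in range(tp))
```

With these assumptions the fc1 row count is `2 * i_per_rank`, giving 5888 at tp=1 and 768 at tp=8, consistent with the shapes listed in the smoke test.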

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

…ayout
prepare_mxfp4_weights_for_trtllm_gen padded per-expert biases but never row-shuffled them, while it did shuffle the weights and scales. The trtllm-gen bf16_mxe2m1_block_scale_moe_runner kernel adds bias[i] to post-shuffle output row i of GEMM1/GEMM2, so leaving biases in pre-shuffle order made the kernel attribute the wrong bias to each row and the MoE output came out as noise (gpt-oss-120b GSM8K dropped to 2.05% vs the 90.30% reference).
PT's MXFP4WeightTRTLLMGenFusedMoEMethod (quantization.py:4204-4319) runs the very same row permutation on the bias destination buffer: load_expert_w3_w1_weight applies the gated-act-gemm interleave + epilogue-tile reorder to the 1-D [2*I_pad] gated bias, and load_expert_w2_weight applies the epilogue-tile reorder to the 1-D [H_pad] down bias. Mirror that in the AD prep helper via two new _shuffle_per_expert_bias_w3_w1 / _shuffle_per_expert_bias_w2 helpers so the AD prep stays byte-identical with PT.
Add tests/unittest/auto_deploy/singlegpu/custom_ops/moe/test_mxfp4_weight_prep.py (3 tests) to pin the invariant: fc1 bias matches a manual gated+TMA permute, fc2 bias matches the manual TMA permute, and the full prep output is byte-identical to a per-expert PT-style reference loader (weights, scales, and biases all checked). Without the fix all three tests fail (98.8% mismatch on the bias rows); with it they pass.
End-to-end validation on gpt-oss-120b at world_size=1 with quantize_mxfp4_moe_trtllm_gen enabled:
- GSM8K (test_mxfp4_gsm8k[120b]): 2.05% -> 90.37% (threshold 87.10%, reference 90.30%) -> PASS.
- ITL (V4 single-GPU, ISL=1000 OSL=1000 conc=1, 20 reqs): 8.53 ms p50 / 117.4 tok/s/user with content valid (OSL=1000 verified).
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
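
Why an unshuffled bias corrupts the output can be seen with a toy row permutation. The permutation below is an arbitrary stand-in, not the gated-act interleave or epilogue-tile reorder used by the kernel.

```python
def gemm_rows(weight, x):
    """One output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


perm = [2, 0, 3, 1]                       # new row i comes from old row perm[i]
weight = [[1, 0], [0, 1], [1, 1], [2, 0]]
bias = [10, 20, 30, 40]
x = [1, 2]

# Reference: unshuffled weights with their own biases.
reference = [y + b for y, b in zip(gemm_rows(weight, x), bias)]

shuffled_weight = [weight[p] for p in perm]
shuffled_bias = [bias[p] for p in perm]
gemm_out = gemm_rows(shuffled_weight, x)

# Bug: weights shuffled, bias left in the original order.
buggy = [y + b for y, b in zip(gemm_out, bias)]
# Fix: apply the same row permutation to the bias.
fixed = [y + b for y, b in zip(gemm_out, shuffled_bias)]

expected = [reference[p] for p in perm]   # reference rows in shuffled order
```

Only the `fixed` result equals the reference rows in shuffled order; the `buggy` result adds each bias to a row it does not belong to.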

…gen MoE
The V4 single-GPU + trtllm-gen MXFP4 MoE path is the now-correctness-validated baseline for gpt-oss-120b on B200 (previous commit fixes the weight-prep bias shuffle so the trtllm-gen kernel produces correct logits). Update examples/auto_deploy/model_registry/configs/gpt_oss_120b.yaml to that configuration so the standalone AD serving config matches the live recommendation:
- world_size 4 -> 1 (single GPU; the model fits in 192 GB HBM at MXFP4 and there is no AR overhead at BS=1).
- Enable transform `quantize_mxfp4_moe_trtllm_gen` so the post-load fusion stage rewrites `triton_mxfp4_moe` to `auto_deploy::trtllm_mxfp4_w4a16_moe_fused` and dispatches to `torch.ops.trtllm.bf16_mxe2m1_block_scale_moe_runner` -- the same kernel PT exercises via `MXFP4WeightTRTLLMGenFusedMoEMethod`.
Measured on the same standalone serving config (ISL=1000, OSL=1000, conc=1, 20 reqs, `DISABLE_HARMONY_ADAPTER=1` + `--use-server-token-count`):
- ITL p50 8.53 ms / 117.4 tok/s/user (vs Triton-MXFP4 baseline 122 ms ITL / 8 tok/s/user, ~15x speedup).
- GSM8K accuracy 90.37 % (threshold 87.10 %, reference 90.30 %).
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

The HF config.json for openai/gpt-oss-120b ships without a `torch_dtype`/`dtype` field. Under transformers 5.x, AD's meta-device build path (`build_model` transform -> `_build_model` -> `custom_model_cls._from_config(model_config)`) reads `config.dtype` to decide the construction dtype; when it is None, `_from_config` skips the `local_torch_dtype` context and the model is created in fp32. `load_or_random_init` then loads bf16 safetensors weights cast to fp32 (`load_state_dict(assign=False)`), so the entire model runs in fp32.
That breaks trtllm attention: `cpp/tensorrt_llm/common/attentionOp.cpp` disables `mEnableContextFMHA` for any dtype that is not fp16/bf16, falls back to unfused MHA, and the context workspace formula (`size * batch * num_heads * seq * seq` for qk + qk_float) tries to allocate ~1 TB during the `resize_kv_cache` forward pass.
Server log:
[common] Fall back to unfused MHA because of unsupported data type.
[thop] Attention workspace size is not enough, increase the size from 268435456 bytes to 1110551169280 bytes
RuntimeError: CUDA out of memory. Tried to allocate 1034.28 GiB.
Adding `model_kwargs.dtype: bfloat16` makes `_recursive_update_config` set `config.dtype = torch.bfloat16` before `_from_config` runs, so the model is constructed in bf16 and FMHA stays on (~40 MB workspace).
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Previously fuse_gemms skipped any linear with bias (TODO at the gather-loop in FuseGemms._apply). This excluded the most common multi-GEMM fusion target -- Q/K/V projections that always have bias in models like gpt-oss.
Bias support:
* Allow children with bias in the gather loop.
* Require uniform bias state across siblings (all-or-none) -- mixed bias would need zero-padding which we don't do.
* Stack biases via torch.cat on dim=0, mirroring weight stacking.
* Validate each bias is per-channel 1D and matches its weight's out_features; reject non-standard shapes (broadcast bias, scalar).
* Validate biases come from get_attr nodes (statically known).
* Validate uniform bias dtype across children.
* Wire fused get_attr bias node into the fused linear call args.
Verified on gpt-oss-120b V4 (single-GPU, BS=1 conc=1 ISL=OSL=1000):
* fuse_gemms matches=36 (one per layer, Q+K+V stacked).
* ITL: 10.68 ms -> 9.22 ms (-1.46 ms / -13.7%).
* TPS/user: 93.77 -> 109.03 (+16%).
* Output Token Count = 1000 / 1000 verified across all 20 requests.
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
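
The stacking rule can be sketched without torch: concatenating weights and biases along the output dimension gives the same result as running the sibling linears separately. `fuse_linears` is a hypothetical helper, with nested lists standing in for tensors; it only mirrors the all-or-none and shape checks listed above.

```python
def linear(x, weight, bias=None):
    """y = W x (+ b) for one input vector; weight is [out_features][in_features]."""
    out = [sum(w * v for w, v in zip(row, x)) for row in weight]
    return out if bias is None else [y + b for y, b in zip(out, bias)]


def fuse_linears(weights, biases):
    """Stack sibling linears that share one input into a single weight/bias pair."""
    has_bias = [b is not None for b in biases]
    if any(has_bias) and not all(has_bias):
        raise ValueError("mixed bias state: fusion would need zero-padding")
    fused_weight = [row for w in weights for row in w]   # concatenate on dim 0
    if not any(has_bias):
        return fused_weight, None
    for w, b in zip(weights, biases):
        if len(b) != len(w):                             # one bias entry per output row
            raise ValueError("bias length must match out_features")
    fused_bias = [v for b in biases for v in b]
    return fused_weight, fused_bias


x = [1.0, 2.0]
q = ([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5])
k = ([[1.0, 1.0]], [1.0])
v = ([[2.0, 0.0]], [-1.0])

fused_w, fused_b = fuse_linears([q[0], k[0], v[0]], [q[1], k[1], v[1]])
separate = linear(x, *q) + linear(x, *k) + linear(x, *v)
fused = linear(x, fused_w, fused_b)
```

The fused output is then split back into its Q, K, and V segments by row count, which is why each bias must be a per-channel vector matching its weight's `out_features`.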

For trtllm_attention_mha_with_cache the 'out' parameter sits in the middle of the schema (after out_scale, before rotary_cos_sin, ...). The cached-attn insertion in transform/library/kvcache.py passes None for 'out' positionally to preserve positional ordering of the parameters that follow. The previous _inject_out_param implementation then set out=out_placeholder as a kwarg on top of that, producing a duplicate binding ("received N+1 arguments"). Fix: detect the schema index of 'out', convert any positional args at/after that index into kwargs (skipping the positional 'out' itself), and bind 'out' as a kwarg. Raise a clear error if the dynamic cached op has no 'out' parameter at all. This is load-bearing for the gpt-oss-120b TP=2 cached-attention path under AD_USE_IR_MODELS=1 -- without it, every dynamic-shape decode call fails on the kvcache-inserted attention op. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
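
A minimal sketch of the rebinding described above, using a list of parameter names in place of a real op schema. `bind_out_as_kwarg` and the five-parameter schema are hypothetical, not the actual `_inject_out_param` implementation or the real op signature.

```python
def bind_out_as_kwarg(schema_names, args, kwargs, out_value):
    """Rebind 'out' as a keyword so it is never passed twice."""
    if "out" not in schema_names:
        raise ValueError("cached attention op has no 'out' parameter")
    out_idx = schema_names.index("out")
    new_args = tuple(args[:out_idx])          # positionals before 'out' stay positional
    new_kwargs = dict(kwargs)
    # Positionals after 'out' become kwargs; the positional 'out' itself is dropped.
    for i in range(out_idx + 1, len(args)):
        new_kwargs[schema_names[i]] = args[i]
    new_kwargs["out"] = out_value
    return new_args, new_kwargs


schema = ["q", "k", "out_scale", "out", "rotary_cos_sin"]
call_args = ("Q", "K", None, None, "cos_sin_table")   # None passed positionally for 'out'

new_args, new_kwargs = bind_out_as_kwarg(schema, call_args, {}, out_value="out_buffer")
```

After rebinding, `out` appears exactly once, as a keyword, and every parameter that followed it is also passed by name, so positional ordering no longer matters.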

transformers 5.x moved `config.rope_theta` into `config.rope_scaling` (e.g. `config.rope_scaling['rope_theta'] = 150000` for gpt-oss-120b). The previous `getattr(config, "rope_theta", 10000.0)` silently fell back to the 10000.0 default, which is 15x off the actual 150000 base GPT-OSS uses. That broke RoPE position encoding entirely. Mirror what PT's modeling_gpt_oss.py already does after the transformers 5.3.0 upgrade (NVIDIA#12829): use the `get_hf_rope_theta()` helper from `tensorrt_llm._utils`. Apply to both `modeling_gpt_oss.py` and `modeling_gpt_oss_ir.py`. Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
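
The silent fallback is easy to reproduce with a stand-in config object. `resolve_rope_theta` below is a hypothetical illustration of a lookup that checks both locations; it is not the implementation of `get_hf_rope_theta`.

```python
from types import SimpleNamespace


def resolve_rope_theta(config, default=10000.0):
    """Hypothetical lookup: top-level attribute first, then rope_scaling."""
    theta = getattr(config, "rope_theta", None)
    if theta is not None:
        return theta
    rope_scaling = getattr(config, "rope_scaling", None) or {}
    return rope_scaling.get("rope_theta", default)


# Stand-in for a transformers 5.x style config as described above.
new_style = SimpleNamespace(rope_scaling={"rope_theta": 150000})

old_lookup = getattr(new_style, "rope_theta", 10000.0)   # silently returns the default
new_lookup = resolve_rope_theta(new_style)
```

The old lookup raises no error and simply returns 10000.0, which is why the wrong RoPE base went unnoticed until outputs degraded.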

…ted-topk trtllm-gen MoE C++ routing was refactored in main (NVIDIA#13328) such that bf16_mxe2m1_block_scale_moe_runner with router_logits=None + only topk_weights/topk_ids kwargs silently produces broken routing. Model emits degenerate token loops instead of normal tokens. Mirror PT's invocation pattern (and source AD_W4A8_FUSED_ROUTING=1 path from commit 7719712): pass router_logits directly and let the kernel do fused topk + softmax internally. Note: routing_bias stays None because the linear-layer bias is already folded into router_logits via F.linear(x, w, b); the kernel's routing_bias is a separate per-expert offset. Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

…PE-fused decode Root cause: with trtllm attn_backend, AD applies RoPE in modeling code via torch_rope_with_explicit_cos_sin and passes post-RoPE Q/K to thop.attention. PT, in contrast, passes raw Q/K + the YARN rotary_cos_sin table so the kernel applies RoPE internally. The two RoPE paths produce slightly different cos/sin numerics (modeling-side uses our cached fp32 table while the kernel computes its own), and the difference compounds through the KV cache: prefill stores K rotated externally, decode reads cached K and computes attention with Q rotated externally — minor cos/sin differences turn into ~60% rel_RMSE on the layer-0 attn_out at decode step 1. Enabling fuse_rope_into_trtllm_attention folds RoPE into the kernel call so AD takes the same path as PT, eliminating the divergence. Verified on a 4-layer gpt-oss-120b subset by dumping per-stage activations in both PT and AD modeling and comparing PT residual vs AD layer output: L0 attn_out decode_1 rel_RMSE: 129% -> 1% L0 residual decode_1 rel_RMSE: 60% -> 0.7% First two generated tokens now match exactly between PT and AD. Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
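
The commit does not define rel_RMSE. One common reading, assumed here, is the RMS of the elementwise error divided by the RMS of the reference; the sketch below implements that definition on plain lists.

```python
import math


def rel_rmse(actual, reference):
    """RMS of (actual - reference) divided by RMS of reference (assumed definition)."""
    if len(actual) != len(reference):
        raise ValueError("length mismatch")
    n = len(reference)
    err_rms = math.sqrt(sum((a - r) ** 2 for a, r in zip(actual, reference)) / n)
    ref_rms = math.sqrt(sum(r ** 2 for r in reference) / n)
    return err_rms / ref_rms


identical = rel_rmse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
doubled = rel_rmse([2.0, 0.0], [1.0, 0.0])   # error as large as the reference itself
```

Under this definition a value above 100% means the error is larger than the signal, so the pre-fix attn_out was essentially uncorrelated with the reference.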

Adds the W4A8MXFP4MXFP8 activation-quantization path mirroring PT's W4A8MXFP4MXFP8TRTLLMGenFusedMoEMethod:
* New op: auto_deploy.trtllm_mxfp4_w4a8_moe_fused. Same args as W4A16 op; pre-quantizes activation via torch.ops.trtllm.mxfp8_quantize(False, alignment=512) and dispatches to torch.ops.trtllm.mxe4m3_mxe2m1_block_scale_moe_runner. Uses the same MXFP4 weights as W4A16 (no checkpoint re-prep).
* Transform config: QuantizeMXFP4MoETrtllmGenConfig.quant_act. Choose 'bf16' (default; W4A16, bf16 input cubin family bmm_Bfloat16_MxE2m1Bfloat16) or 'mxfp8' (W4A8, MXFP8 input cubin family bmm_MxE4m3_MxE2m1MxE4m3, 9 us/call median vs 27 us for bf16).
KNOWN LIMITATION (Phase 2 blocker, this commit): The autotuner's get_valid_configs() returns empty for the decode shape (num_tokens=1, hidden_padded=3072) when called against mxe4m3_mxe2m1_block_scale_moe_runner with the gpt-oss-120b weight shapes. The runner falls back to a default tactic that's significantly slower than the bf16 path's tactic. Empirical decode regression on gpt-oss-120b TP=2 BS=1: ITL p50 7.48 ms (W4A16) -> 9.21 ms (W4A8) / TPS 127 -> 102.
The kernels and weights are compatible -- W4A16 path with the SAME weight tensors finds tactics for tileN=8 cleanly. The W4A8 path's get_valid_configs filters something (likely C++ runner internal shape/scale validation) that rejects all tileN=8 candidates at decode shape. Needs C++ runner investigation before this can land as a perf win.
The infrastructure (op + config flag) is committed because:
1. The op definition is correct API-wise (compiled, registers, runs).
2. The autotune compatibility is a pure C++ runner issue, not an AD-side issue.
3. Future fix in the C++ runner makes this op production-ready without further AD work.
Set 'quant_act: mxfp8' explicitly in yaml to opt in (default stays bf16).
Bench dir (regression run): auto-deploy/gpt-oss-120b/v8_tp2_fg_arfix_w4a8_bench_sweep_conc_1_20260508_022925/
yaml: auto-deploy/gpt-oss-120b/gpt_oss_120b_v8_tp2_fg_w4a8.yaml
script: auto-deploy/gpt-oss-120b/run_gpt_conc1_V8_tp2_fg_w4a8.sh
Notes: cc_reports/gpt-oss-120b/report.md §3.10 + §5.3 C7 (to be added).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

trtllm_mxfp4_w4a8_moe_fused now always passes router_logits to mxe4m3_mxe2m1_block_scale_moe_runner; the C++ runner does fused topk + softmax + cast internally (matches PT's run_fp4_block_scale_moe path). This eliminates ~5 elementwise launches per layer × 36 ≈ 180 launches/iter on gpt-oss-120b. Replaces the earlier AD_W4A8_FUSED_ROUTING env-flag gate (which defaulted to off) with unconditional fused routing — the fused path is correct and faster, so there's no reason to keep the Python topk/softmax fallback. Bench result on gpt-oss-120b W4A8 tp=2 (hot 2nd-run, paired with fuse_rope_into_trtllm_attention yaml flag): ITL p50 7.56 -> 6.12 ms (-1.44 ms / +22% TPS), correctness preserved (OSL=1000, mismatch=0). Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

Set quantize_mxfp4_moe_trtllm_gen.quant_act=mxfp8 so the trtllm-gen MoE transform rewrites the MoE call to trtllm_mxfp4_w4a8_moe_fused (MXFP4 weights x MXFP8 activations) instead of trtllm_mxfp4_w4a16_moe_fused (MXFP4 weights x bf16 activations). Matches PT's W4A8MXFP4MXFP8TRTLLMGenFusedMoEMethod path. Verified: GSM8K @ 50 samples = 88% (ref 90.3%) — PASSED, no regression from W4A16 baseline (which also passes within statistical noise). Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

…rmsnorm
Move post-MoE allreduce insertion from immediately-after the V4 MoE op to immediately-after the downstream aten.view consumer.
Before: MoE -> AR -> view -> add -> norm
After: MoE -> view -> AR -> add -> norm
The fuse_allreduce_residual_rmsnorm matcher in tensorrt_llm/_torch/auto_deploy/transform/library/collectives.py requires AR to be the immediate predecessor of the residual add (no intervening view). Pre-fix only the 36 post-attn ARs got fused; the 36 post-MoE ARs ran as plain ncclDevKernel_AllReduce_Sum_RING_LL with no overlap. Post-fix the matcher catches all 72 ARs per rank.
Numerically equivalent: view is a free reshape and AR is element-wise across ranks.
gpt-oss-120b TP=2 BS=1 conc=1 OSL=1000 verified, 1000/1000 tokens:
* V8 TP=2 baseline: ITL p50 8.70 ms / 109.52 TPS
* V8 TP=2 + this: ITL p50 7.48 ms / 127.04 TPS (-1.22 ms / +16%)
* V4+fg single-GPU: ITL p50 8.05 ms / 124.32 TPS
First multi-GPU config to BEAT V4 single-GPU on this workload.
Bench: auto-deploy/gpt-oss-120b/v8_tp2_fg_arfix_bench_sweep_conc_1_20260508_021633/
Notes: cc_reports/gpt-oss-120b/report.md §3.10 + §5.1 O1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
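
The effect of the reordering on a strict adjacency matcher can be shown with a toy op sequence. `count_fusable` is a hypothetical stand-in that only checks for the contiguous pattern all_reduce, add, norm; the real matcher works on an FX graph.

```python
PATTERN = ["all_reduce", "add", "norm"]


def count_fusable(ops):
    """Count positions where all_reduce directly precedes add and then norm."""
    return sum(
        1 for i in range(len(ops) - len(PATTERN) + 1)
        if ops[i:i + len(PATTERN)] == PATTERN
    )


before = ["moe", "all_reduce", "view", "add", "norm"]   # view breaks adjacency
after = ["moe", "view", "all_reduce", "add", "norm"]    # AR now feeds add directly

fusable_before = count_fusable(before)
fusable_after = count_fusable(after)
```

Moving the all-reduce past the view changes nothing numerically, but it puts the ops in the one order the matcher accepts.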

Adds a third pytest.param entry to ``TestGPTOSS.test_mxfp4_gsm8k`` that runs gpt-oss-120b at TP=2 by overriding the model registry yaml's ``world_size: 1`` via a new ``world_size_override`` parameter. Existing 20b and 120b TP=1 cases are preserved (override = None means "use yaml default"). The 120b-tp2 case is gated by ``skip_less_device(2)`` so it skips automatically on single-GPU runs. Pairs with the post-MoE allreduce placement fix (9b1dca4705 [ad-mxfp4-moe] Fix post-MoE AR placement for fuse_allreduce_residual_rmsnorm) so the TP=2 accuracy path is exercised in CI alongside the TP=1 baseline. Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

…ariant
The sharding-IR ``modeling_gpt_oss_ir.py`` already covered every feature of the non-IR legacy file (same RMSNorm / RoPE / attention with sinks / MoE router / experts), differing only in:
* attention Linears go through ``torch.ops.auto_deploy.torch_linear_simple`` with sharding hint kwargs so TP > 1 attention sharding works without an external graph rewrite;
* view ops on q/k/v/attn_out use ``torch.ops.auto_deploy.view`` with ``tp_scaled_dim=2`` so the head-count dim scales with TP;
* the post-attention all-reduce is expressed as a ``torch.ops.auto_deploy.all_reduce`` placeholder.
Consolidate: rename ``modeling_gpt_oss_ir.py`` into the default ``modeling_gpt_oss.py`` (the legacy non-IR variant is removed) and drop the ``AD_USE_IR_MODELS`` opt-in entry for gpt-oss in ``models/custom/__init__.py``. This matches the trajectory in upstream PR NVIDIA#13478 (other models being migrated to sharding-IR as default).
GSM8K full 1319-sample validation, gpt-oss-120b @ TP=2 (post-rebase, W4A8 mxfp8 activations):
* Pre-IR (legacy modeling): 88.55 % (±0.88), 992 s
* Post-IR (sharding-IR): 88.55 % (±0.88), 902 s
* Reference (PT): 90.30 %
Accuracy is identical to the post-rebase TP=2 baseline; total run-time is ~9 % faster. The existing ``quantize_mxfp4_moe_trtllm_gen`` post-load transform continues to handle MXFP4 weight prep and op retargeting on top of the IR modeling.
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

Adds ``make_mxfp4_trtllm_gen_load_hook`` to ``mxfp4_weight_prep.py`` -- a state_dict pre-hook factory that runs the trtllm-gen weight prep (pad + shuffle + per-rank slice + block-scale interleave) at ``load_state_dict`` time instead of in a post-load transform. Mirrors the GLM5 / DeepSeek MLA pattern (see ``modeling_glm4_moe_lite.py`` / ``mla_rope_utils._rope_deinterleave_load_hook``): the hook walks the layer prefix in the incoming state dict, calls ``prepare_mxfp4_weights_for_trtllm_gen`` per layer, pops the six raw HF MXFP4 keys (``gate_up_proj_{blocks,scales,bias}`` / ``down_proj_{blocks,scales,bias}``) and inserts the six prepared keys (``fc1_w_trtllm_gen`` / ``fc1_w_scale_trtllm_gen`` / ``fc1_bias_trtllm_gen`` / ``fc2_*``) at the same experts subpath. TP info is read from ``torch.distributed`` (fallback ``(1, 0)`` when not initialized). This patch only lands the helper; integration with the modeling code (register prepared-shape params + the hook in ``GptOssExperts.__init__`` and simplify the ``quantize_mxfp4_moe_trtllm_gen`` transform to a graph retarget) is a follow-up. Once integrated, the trtllm-gen MXFP4 path will allocate only prepared-shape parameters on the experts module (peak working set identical to the steady state), avoiding the brief raw + prepared double allocation in the current post-load-fusion flow (~150 GB on gpt-oss-120b 128 experts x 36 layers). Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
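
The key-swapping behaviour of such a hook can be sketched on a plain dict. The hook signature is simplified relative to PyTorch's pre-hook interface, `make_load_hook` and `fake_prepare` are hypothetical, and the three `fc2_*` key names are assumed to mirror the `fc1_*` names, since the message only gives them as a glob.

```python
RAW_KEYS = [
    f"{proj}_{kind}"
    for proj in ("gate_up_proj", "down_proj")
    for kind in ("blocks", "scales", "bias")
]
# Assumption: fc2 names mirror the fc1 names given in the commit message.
PREPARED_KEYS = [
    f"{fc}_{kind}_trtllm_gen"
    for fc in ("fc1", "fc2")
    for kind in ("w", "w_scale", "bias")
]


def make_load_hook(prepare_fn):
    """Return a hook that swaps raw keys for prepared keys under a prefix."""
    def hook(state_dict, prefix):
        # Pop all raw tensors first so only one layout stays in the dict.
        raw = {key: state_dict.pop(prefix + key) for key in RAW_KEYS}
        prepared = prepare_fn(raw)
        for key in PREPARED_KEYS:
            state_dict[prefix + key] = prepared[key]
    return hook


def fake_prepare(raw):
    # Placeholder for the real pad + shuffle + slice + interleave step.
    return {key: f"prepared({len(raw)} raw tensors)" for key in PREPARED_KEYS}


prefix = "model.layers.0.mlp.experts."
state_dict = {prefix + key: f"raw:{key}" for key in RAW_KEYS}
state_dict["model.layers.0.mlp.router.weight"] = "router"

make_load_hook(fake_prepare)(state_dict, prefix)
```

Popping the raw entries before inserting the prepared ones is what keeps only one copy of the expert weights referenced by the state dict at a time.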

…oad-hook prep
Moves the trtllm-gen MXFP4 weight preparation for gpt-oss from a post-load FX transform into the modeling layer + state_dict pre-hook. The experts module registers prepared-shape parameters directly and the load hook converts raw HF MXFP4 entries (gate_up_proj_blocks / _scales / _bias and down_proj_*) into the prepared keys at load time. Net effect: peak weight memory goes from raw+prepared (~150 GB on 120b: 128 experts x 36 layers) to just prepared.
Key changes:
* ``GptOssExperts`` registers ``fc1_w_trtllm_gen`` / ``fc1_w_scale_trtllm_gen`` / ``fc1_bias_trtllm_gen`` / ``fc2_*`` plus per-expert SwiGLU constants when the HF config advertises MXFP4. Overrides ``_apply`` to protect the kernel-required dtypes (uint8 weights, ue8m0 scales, float32 bias / swiglu) from ``model.to(bf16)``.
* ``GptOssMLP.forward`` dispatches to ``trtllm_mxfp4_w4a*_moe_fused`` directly, letting the C++ runner do fused topk+softmax inside the kernel. Activation precision selected via ``AD_MXFP4_QUANT_ACT``.
* New ``make_mxfp4_trtllm_gen_load_hook`` factory in ``custom_ops/fused_moe/mxfp4_weight_prep.py``. Reads TP info from ``torch.distributed`` (falls back to (1, 0)), runs ``prepare_mxfp4_weights_for_trtllm_gen`` on the raw state-dict tensors, pops the six raw keys, and writes the six prepared keys plus the three SwiGLU constants. SwiGLU injection is critical: under HF accelerate's ``init_empty_weights`` the literal alpha/beta/limit registered in ``__init__`` get demoted to meta, and the HF safetensors has no swiglu keys, so without the hook they'd stay zero-init and the SwiGLU output would be garbage (was GSM8K 0.076% before this fix).
* ``AD_MXFP4_TRTLLM_GEN_MODELING`` defaults to "1" (modeling-side path is the default for any MXFP4 gpt-oss). Setting it to "0" falls back to the legacy post-load transforms, which now early-return when the modeling path is active.
* Drop the leftover ``NUM_SAMPLES=50`` debug mock in the test file so CI runs the full 1319-sample GSM8K eval.
Validated:
* TP=1 GSM8K: 90.98% (ref 90.30%, threshold 87.10%) PASSED.
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

…ing override

The modeling-side MXFP4 trtllm-gen path landed in ceff972 broke TP=2: the runtime DistConfig defaulted to MoE-EP topology (``moe_tp_size=1, moe_ep_size=world_size``) while the load hook TP-sliced the intermediate dim, so the kernel ran with TP-shape weights but no AR across ranks. Each rank's partial intermediate-sum flowed straight into the residual stream of the next layer, producing rank-divergent state and a 300-second hang at the first sampler event.

Fix:

* ``GptOssMLP.forward``: emit ``auto_deploy.all_reduce(out, "moe")`` unconditionally after the post-MoE view. Two reasons it must be unconditional and use the ``"moe"`` layer_type:
  - The previous ``if _tp_size > 1`` guard constant-folded under FX export whenever ``torch.distributed`` was not initialised at ``GptOssExperts.__init__`` time, dropping the AR even on TP > 1.
  - The placeholder layer_type must be in ``apply_sharding_hints``'s ``shard_layers`` list for ``AllReduceShardableNode`` to rewrite it into a real dist all_reduce. ``"auto"`` was filtered out.

  On TP=1 the placeholder is stripped to a passthrough by ``apply_sharding_hints``, so the always-emit is a no-op there.
* ``test_mxfp4_gsm8k``: when ``model_id == "120b"`` and ``world_size == 2``, pass an inline ``transforms`` kwarg that flips ``apply_sharding_hints`` to enabled with ``dist_mapping: {tp: 2, moe_tp: 2, moe_ep: 1}`` and ``shard_layers: ["mha", "moe"]`` (mirrors the perf-yaml MoE-TP topology used in auto-deploy/gpt-oss-120b/). Inline override avoids shipping a TP-specific yaml in the model registry.

Validated:

* TP=1 GSM8K: 90.98% (unchanged).
* TP=2 GSM8K: 88.55% (matches the baseline before Phase 2).

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
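A toy illustration of why the missing all-reduce matters, using plain Python lists rather than real kernels or collectives: when the intermediate dimension is split across ranks, each rank only holds a partial sum of the down projection, and only the sum over ranks equals the unsharded result. The function name and numbers are made up for the example.

```python
def down_proj(h, w):
    # Dot product over the intermediate dimension (one output element).
    return sum(hi * wi for hi, wi in zip(h, w))


h = [1.0, 2.0, 3.0, 4.0]    # intermediate activations
w = [0.5, -1.0, 2.0, 0.25]  # one row of the down projection

full = down_proj(h, w)      # unsharded reference

tp_size = 2
shard = len(h) // tp_size
# Each rank sees only its slice of the intermediate dimension.
partials = [
    down_proj(h[r * shard:(r + 1) * shard], w[r * shard:(r + 1) * shard])
    for r in range(tp_size)
]
all_reduced = sum(partials)  # what a sum all-reduce would produce on every rank
```

Without the reduction, rank 0 would carry ``partials[0]`` and rank 1 would carry ``partials[1]`` into the next layer, which is the rank-divergent state the commit message describes.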

…textvar

Extends the modeling-side MXFP4 trtllm-gen weight prep landed in ceff972 / bad1871 to handle MoE-EP topology correctly. Before this commit the hook always intermediate-TP-sliced based on world_size, which happened to produce numerically correct output on EP=2 (because ``apply_sharding_hints`` inserts a real AR whenever ``dc.tp_size > 1``) but kept the full 128-expert weight footprint on every rank, so there were no EP memory savings.

Plumbing:

* ``utils/dist_config.py``: add ``_ACTIVE_DIST_CONFIG`` ContextVar, ``use_dist_config`` context manager, and ``get_active_dist_config`` getter. ``contextvars`` rather than a bare module-level global so threading / asyncio contexts stay isolated.
* ``transform/library/build_model.py``: ``BuildModel`` and ``BuildAndLoadFactoryModel`` wrap their factory build calls in ``with use_dist_config(shared_config.dist_config):`` so modeling code constructed inside the factory can read the active topology at ``__init__`` time. This is needed for registering rank-correct parameter shapes BEFORE ``load_state_dict`` runs.

Hook + prep:

* ``custom_ops/fused_moe/mxfp4_weight_prep.py``:
  - Rename ``_get_default_tp_info`` to ``_get_default_dist_info`` and return ``(moe_tp_size, moe_tp_rank, moe_ep_size, moe_ep_rank)``.
  - ``make_mxfp4_trtllm_gen_load_hook`` gains a required ``num_experts`` arg and replaces ``tp_info_fn`` with ``dist_info_fn``. The hook now EP-slices the six raw HF MXFP4 tensors on their leading expert axis BEFORE calling ``prepare_mxfp4_weights_for_trtllm_gen``.
  - Diagnostic print reports ``(moe_tp=Nr<rank>, moe_ep=Mr<rank>)``.

Modeling:

* ``models/custom/modeling_gpt_oss.py``:
  - New ``_resolve_moe_dist_info()`` helper reads the active ``DistConfig`` (preferred) or falls back to torch.distributed.
  - ``GptOssExperts._register_mxfp4_trtllm_gen_params`` allocates ``E_local = num_experts // moe_ep_size`` experts, stores ``_local_expert_offset = moe_ep_rank * E_local``.
  - ``GptOssMLP.forward`` passes ``e._local_expert_offset`` (was hardcoded ``0``) to the trtllm-gen op.
  - ``GptOssForCausalLM.__init__`` snapshots ``_resolve_moe_dist_info()`` into a closure variable and passes ``dist_info_fn=lambda: _dist_info`` to the hook factory. Binds the hook's slicing decision to the same topology the parameters were registered against at ``__init__`` time.

Test:

* ``test_llm_api_autodeploy.py``: add a 4th tuple element ``moe_topology`` (``None`` / ``"tp"`` / ``"ep"``) to ``MODEL_PARAMS`` and a new ``120b-ep2`` parametrize entry. When non-``None``, the test passes an inline ``transforms`` override with the matching ``dist_mapping`` plus ``shard_layers: ["mha", "moe"]``.

Validated (full 1319-sample GSM8K, threshold 87.10%):

* TP=1: 89.99% (moe_tp=1r0, moe_ep=1r0, shape (128, 5888, 1536)).
* TP=2: 88.55% (moe_tp=2r{0,1}, moe_ep=1r0, shape (128, 3072, 1536)).
* EP=2: 88.02% (moe_tp=1r0, moe_ep=2r{0,1}, shape (64, 5888, 1536)). Real EP: per-rank fc1_bias_abs_max differs (3.234 vs 2.578) and local_expert_offset is rank-dependent (0 / 64). Per-rank weight footprint ~40 GB (vs ~75 GB for TP=2's intermediate-halved-but-full-experts layout).

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
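The ContextVar plumbing and the per-rank expert range from this commit message can be sketched with the standard library alone. ``ToyDistConfig`` and ``local_expert_range`` are hypothetical stand-ins for the real ``DistConfig`` and registration code; only the names ``use_dist_config`` / ``get_active_dist_config`` and the two formulas are taken from the message.

```python
import contextvars
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyDistConfig:
    """Hypothetical stand-in for the real DistConfig."""
    moe_tp_size: int = 1
    moe_tp_rank: int = 0
    moe_ep_size: int = 1
    moe_ep_rank: int = 0


# ContextVar keeps the active config isolated per thread / asyncio context.
_ACTIVE_DIST_CONFIG = contextvars.ContextVar("active_dist_config", default=None)


@contextmanager
def use_dist_config(cfg):
    token = _ACTIVE_DIST_CONFIG.set(cfg)
    try:
        yield cfg
    finally:
        _ACTIVE_DIST_CONFIG.reset(token)  # restore whatever was active before


def get_active_dist_config():
    return _ACTIVE_DIST_CONFIG.get()


def local_expert_range(num_experts, cfg):
    # E_local = num_experts // moe_ep_size; offset = moe_ep_rank * E_local
    e_local = num_experts // cfg.moe_ep_size
    return e_local, cfg.moe_ep_rank * e_local


with use_dist_config(ToyDistConfig(moe_ep_size=2, moe_ep_rank=1)):
    inside = local_expert_range(128, get_active_dist_config())
outside = get_active_dist_config()
```

With 128 experts and EP=2, rank 1 gets 64 local experts starting at offset 64, which lines up with the rank-dependent ``local_expert_offset`` (0 / 64) reported in the validation notes.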

Mirrors the same one-liner fix already in ``gpt_oss_120b.yaml``. The HF ``config.json`` for gpt-oss-{20b,120b} omits the ``torch_dtype`` / ``dtype`` field, so transformers 5.x's ``_from_config`` (used by AD's meta-device build path) falls back to fp32. With fp32 activations, trtllm attention's FMHA path is disabled (it only supports fp16/bf16) and the unfused-MHA workspace explodes:

    Attention workspace size is not enough, increase the size from 268435456 bytes to 9928387479808 bytes
    CUDA out of memory. Tried to allocate 9246.53 GiB.

Pinning ``model_kwargs.dtype: bfloat16`` lets the FMHA path stay active.

Validated:

* gpt-oss-20b GSM8K full 1319 samples: 85.82% (ref 85.823, PASSED).
* Modeling-side trtllm-gen hook works on 20b too: ``prepped 24/24 layers (moe_tp=1r0, moe_ep=1r0)``, ``fc1_w_shape=(32, 5888, 1536)``, i.e. 32 experts × 24 decoder layers, consistent with the 20b config (120b is 128 × 36).

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

…elpers Modeling-side __init__ code no longer reads the active DistConfig via the contextvars-backed get_active_dist_config (that path moved to the transform sharding load hook + FuseMXFP4Moe), so the helpers in dist_config.py have no callers. Removes _ACTIVE_DIST_CONFIG / get_active_dist_config / use_dist_config and their dead imports. build_model.py is unchanged. Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

…ments from base Brings back the architecture summary, AD-canonical-ops list, and inline forward annotations from the 3ae0b70 base that got dropped during the sharding-IR rewrite, while keeping the new sharding-hint sections of the docstring + the existing code. Also trims the now-redundant lm_head / registration comments (covered by the module docstring or stale). Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

Covers the new bias-aware fusion added to FuseGemms vs the 3ae0b70 base:

* All-bias siblings fuse into one linear with stacked bias (concat dim=0).
* Mixed bias / no-bias siblings on the same parent get bucketed separately (one fused with-bias linear + one fused no-bias linear).

Existing FusableModel3's stale "no bias support yet" note is updated to reflect the new bucketing behavior.

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
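The bucketing rule described above can be illustrated with a small list-based sketch. It is not the FuseGemms implementation: ``linear`` and ``fuse_by_bias`` are hypothetical helpers, and weights are plain nested lists so the example runs without any tensor library.

```python
def linear(x, weight, bias=None):
    # weight: list of rows; returns one output element per row.
    out = [sum(xi * wi for xi, wi in zip(x, row)) for row in weight]
    if bias is not None:
        out = [o + b for o, b in zip(out, bias)]
    return out


def fuse_by_bias(siblings):
    """Bucket sibling linears by bias presence and concatenate along dim 0.

    siblings: list of (name, weight, bias_or_None).
    Returns {has_bias: (names, fused_weight, fused_bias_or_None)}.
    """
    buckets = {}
    for name, weight, bias in siblings:
        has_bias = bias is not None
        names, fused_w, fused_b = buckets.setdefault(
            has_bias, ([], [], [] if has_bias else None)
        )
        names.append(name)
        fused_w.extend(weight)      # stack output rows
        if has_bias:
            fused_b.extend(bias)    # stack biases in the same order
    return buckets


x = [1.0, 2.0]
siblings = [
    ("q", [[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5]),
    ("k", [[2.0, 0.0]], [1.0]),
    ("gate", [[1.0, 1.0]], None),
    ("up", [[0.0, 3.0]], None),
]
buckets = fuse_by_bias(siblings)
with_bias_out = linear(x, buckets[True][1], buckets[True][2])
no_bias_out = linear(x, buckets[False][1])
```

Each fused output is just the concatenation of the per-sibling outputs, so splitting it back by row counts recovers the original results.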

…ly override Pipeline trace confirms the only module-level dtype walk is ``QuantConfigReader.post_process_model``'s ``model.to(new_dtype)``, which fires *before* PATTERN_MATCHER. At that point ``_dtype_protected_params`` is unset (FuseMXFP4Moe sets it later) so the override degenerates to a normal ``nn.Module._apply``. No subsequent ``gm.to(dtype)`` exists. Removes the override + both transform-side ``_dtype_protected_params`` setters + the matching docstrings. Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

…-out lines * Remove "match PT test_w4_1gpu" trailing comment on GSM8K_MAX_OUTPUT_LEN. * Remove the MODEL_PARAMS entry-format docstring — the parametrize names and the if-elif moe_topology dispatch below are already self-describing. * Remove the commented-out ``marks=pytest.mark.skip_less_device(4)``. Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

… only for MoE topology Source of truth for ``world_size`` is the registry (``_get_registry_yaml_extra``'s 2nd return value — defaults to 1 when yaml doesn't carry an explicit ``world_size_N.yaml``); MODEL_PARAMS' 3rd column is a thin ``world_size_override`` used only for the ``120b-tp2`` / ``120b-ep2`` cases that exercise MoE-TP / MoE-EP on top of the same yaml. Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

Consolidates the prep-helper invariants (previously in test_prepare_trtllm_gen_moe_mxfp4_weights.py) and the unified op's act_dtype-dispatch contract into a single file at tests/unittest/auto_deploy/singlegpu/custom_ops/moe/test_trtllm_quant_mxfp4_trtllm_gen_moe.py.

Coverage:

* fc1 / fc2 bias rows must follow the SAME TMA permute as the weights (gated-act-gemm + epilogue-tile reorder for w3/w1; epilogue-tile reorder only for w2). This guards against the gpt-oss-120b GSM8K 2% bug.
* Byte-identical match against PT's MXFP4 reference loader.
* ``act_dtype="bf16"`` and ``act_dtype="mxfp8"`` both run end-to-end on Blackwell+; invalid ``act_dtype`` raises ``ValueError``.

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

Pins the POST_LOAD_FUSION contract of ``FuseMXFP4Moe``:

* Raw HF MXFP4 buffers (``gate_up_proj_{blocks,scales,bias}`` / ``down_proj_{blocks,scales,bias}``) are deleted and replaced by the six prepared ``*_trtllm`` params on the experts module.
* The ``trtllm_quant_mxfp4_trtllm_gen_moe_fused`` op's weight/bias arg slots (4..9) are re-pointed at the new prepared get_attr nodes.
* ``moe_tp_size > 1`` divides ONLY ``fc2_bias_trtllm`` by ``moe_tp_size`` (so the post-AR sum reproduces the unsharded bias); all other prepared tensors match the TP=1 prep output byte-for-byte.
* Re-running on an already-prepped graph is a no-op (idempotent skip).

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
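The reason for dividing only the fc2 bias by ``moe_tp_size`` can be checked with a few numbers. The sketch below uses made-up per-rank partial outputs and a hypothetical ``shard_fc2_bias`` helper; the all-reduce is modelled as an element-wise sum over ranks.

```python
def shard_fc2_bias(bias, moe_tp_size):
    # Each rank adds bias / tp so the post-all-reduce sum adds the bias once.
    return [b / moe_tp_size for b in bias]


def all_reduce_sum(per_rank):
    return [sum(vals) for vals in zip(*per_rank)]


bias = [0.5, -1.0, 2.0]
partials = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # per-rank fc2 partial outputs
tp = len(partials)

reference = [p + b for p, b in zip(all_reduce_sum(partials), bias)]

scaled = shard_fc2_bias(bias, tp)
sharded = all_reduce_sum([[p + b for p, b in zip(rank, scaled)] for rank in partials])

# Leaving the bias undivided would add it once per rank.
naive = all_reduce_sum([[p + b for p, b in zip(rank, bias)] for rank in partials])
```

``sharded`` matches ``reference``, while ``naive`` is off by ``(tp - 1) * bias``.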

…g_transform_executor disables ``apply_sharding_hints`` is the only sharding pass actually used here; the explicit ``enabled: false`` lines for ``detect_sharding`` and ``sharding_transform_executor`` are no-ops (those passes are off by default in this pipeline) and just add noise. Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

- 20B now inherits the same MXFP4/sharding/fuse transforms as 120B (it was missing them despite being MXFP4 too).
- world_size moves out of the model yaml and is supplied by the registry's world_size_N.yaml overlay.
- models.yaml, cookbook, and supported-models.md all point at the unified config.

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

get_sm_version() is already @lru_cache(maxsize=1), so the manual _SM_VERSION cache adds nothing. The try/except fallback to 0 was dead defensive code: this branch only triggers on CUDA bf16 tensors, where torch.cuda.get_device_properties(0) cannot fail. Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

…XFP4 prep _tp_slice_intermediate_axis() pre-pads I to i_padded_tp before slicing, and _get_weight_alignment() guarantees the alignment is a multiple of tp_size, so the helper already handles non-tp-divisible intermediate sizes (the original I is never reused downstream — only per_rank_i is). The guard rejected exactly the shapes the helper was designed to support. Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
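A rough sketch of the pad-then-slice arithmetic this message relies on. The real alignment comes from ``_get_weight_alignment``; here the alignment is simply assumed to be a base value times ``tp_size`` (which makes it a multiple of ``tp_size``), and both the base alignment of 128 and the intermediate size of 2880 are illustrative assumptions, as is the helper name ``per_rank_intermediate``.

```python
def pad_up(x, a):
    # Round x up to the next multiple of a.
    return ((x + a - 1) // a) * a


def per_rank_intermediate(i, tp_size, base_alignment=128):
    # Assumption: the alignment is a multiple of tp_size, so the padded size
    # always divides evenly across ranks even when the original I does not.
    alignment = base_alignment * tp_size
    i_padded_tp = pad_up(i, alignment)
    per_rank_i = i_padded_tp // tp_size
    return i_padded_tp, per_rank_i


tp1 = per_rank_intermediate(2880, tp_size=1)
tp2 = per_rank_intermediate(2880, tp_size=2)
tp7 = per_rank_intermediate(2880, tp_size=7)  # 2880 is not divisible by 7
```

Since only ``per_rank_i`` is used downstream, a non-divisible case such as ``tp_size=7`` still yields an integer per-rank size. Under these assumed numbers, twice ``per_rank_i`` gives 5888 for TP=1 and 3072 for TP=2, which is consistent with the fc1 shapes reported elsewhere in this PR, though that correspondence is an inference and not something the commit states.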

…llers already contiguous where needed). - _shuffle_per_expert: drop per-expert loop; batched torch.index_select on dim=1 (permute derived once on stacked[0], _PERMUTE_CACHE is shape-keyed). - default.yaml: comment fuse_mxfp4_moe.expect_mem_change with alignment-padding rationale. Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

- _register_mxfp4_expert_params (Triton): torch.zeros -> torch.empty with device=gu_w.device.
- _apply_trtllm: raw_specs + make_swiglu_param_tensors now use device=gu_w_t.device.
- Avoids materializing giant CPU buffers on meta-device builds (GPT-OSS-120B); load hook overwrites bytes anyway.

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

This reverts commit ef50383. Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

… only for HF-loaded params Previous "Keep MXFP4 placeholders on existing param device" change (ef50383) also routed make_swiglu_param_tensors through param_device, which is meta on the normal build. swiglu_alpha/beta/limit (1.702/1.0/7.0) are NOT in HF safetensors, so meta tensors silently dropped the values and tanked GSM8K. Restore the memory-saving device reuse for raw HF buffers (blocks/scales/bias) and keep SwiGLU constants on CPU with real values; add a comment so it isn't re-broken. Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

Replace three inline ((x + a - 1) // a) * a expressions with tensorrt_llm.math_utils.pad_up. Same arithmetic, less to misread. Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

…t inline Pull the GPT-OSS sharding invariants into gpt_oss.yaml so the model registry is the single source of truth: - detect_sharding.enabled=false - sharding_transform_executor.enabled=false Both were inlined in TestGPTOSS.test_mxfp4_gsm8k only for the tp2/ep2 parametrize cases; with them in yaml, trtllm-serve via this config now uses the same apply_sharding_hints-only sharding path as the test. Test inline keeps only the per-parametrize dist_mapping override; the already-duplicated apply_sharding_hints.{enabled, requires_shape_prop, shard_layers} keys (also present in yaml) are dropped. pydantic-settings deep-merges init kwargs into yaml-sourced transforms, so the effective config is unchanged across all 4 parametrize cases. Pattern matches _IR_SHARDING_TRANSFORMS used by the existing IR-sharding tests (TestNemotronSuperV3_IR, TestQwen3_5_MoE_IR). Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
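The effect of merging the test's inline override into the yaml-sourced transforms can be pictured with a generic recursive merge. This is a sketch of deep-merge semantics only, not pydantic-settings' implementation; ``deep_merge`` is a hypothetical helper and the dictionaries contain just the keys named in the commit message.

```python
def deep_merge(base, override):
    """Return a new dict where nested dicts are merged key by key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Keys that now live in gpt_oss.yaml (values as described in the message).
yaml_transforms = {
    "detect_sharding": {"enabled": False},
    "sharding_transform_executor": {"enabled": False},
    "apply_sharding_hints": {"enabled": True, "shard_layers": ["mha", "moe"]},
}
# The test keeps only the per-parametrize dist_mapping override.
test_override = {
    "apply_sharding_hints": {"dist_mapping": {"tp": 2, "moe_tp": 2, "moe_ep": 1}},
}

effective = deep_merge(yaml_transforms, test_override)
```

Because the merge is recursive, supplying only ``dist_mapping`` leaves ``enabled`` and ``shard_layers`` from the yaml intact, which is why the duplicated keys could be dropped from the test.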

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

…-window models

- PR NVIDIA#13745 made trtllm `get_cache_initializers` propagate `sliding_window`, so models with non-uniform windows (e.g. gpt-oss-120b: 128/4096) form >1 KV pool.
- trtllm enforces a single uniform pool (`requires_uniform_kv_caches`) → crash at cache_init: "KV resources are not uniform".
- Temporary fix: trtllm-only revert of the pool-alloc change (handler `sliding_window=0` → single full-seq pool, SWA layers over-allocate KV); the sliding-window mask is still applied via the op's own `sliding_window` arg. Validated gsm8k[120b]=90.4.
- Re-enabling multi-pool needs trtllm-native VSWA: per-pool `block_offsets` tables + real `pool_mapping` routing. The degenerate per-layer-pointer path can't address multiple pools, and flashinfer-style host-sliced views corrupt the cyclic-window trtllm kernel (naive attempt → gsm8k 6%).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

- Lazy-import _get_weight_alignment so fused_moe_mxfp4 imports without tensorrt_llm; match_dense_moe_pattern/quantize_mxfp4_moe/fuse_mxfp4_moe re-register in standalone (fixes KeyError cascade in llmc standalone tests).
- Exclude trtllm-gen-only MXFP4 tests from the standalone package.

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

- linear.py imported get_sm_version from tensorrt_llm._utils (added in the bf16->cublas_mm sm>=100 routing); standalone has no tensorrt_llm, so the central linear op module silently skipped registration.
- Caused 18 collection errors + 199 failures in llmc standalone tests.
- Use ..._compat.get_sm_version (works in both TRT-LLM and standalone modes).

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

…n trtllm attention

Enable multiple KV cache memory pools in AutoDeploy's trtllm attention so non-uniform sliding-window models (e.g. gpt-oss) work on the trtllm backend. The window-group machinery already existed for triton/flashinfer; this removes the three trtllm-specific blockers:

- Gate: requires_uniform_kv_caches now returns False, so the unified KVCacheManager may host more than one pool for trtllm.
- Per-group block_offsets: the trtllm planner keeps an address-stable block_offsets buffer per KV window group (keyed by the group's cache_loc input pointer) so per-group prepare_trtllm_metadata invocations no longer clobber a single shared buffer.
- Cyclic-window staging: the trtllm kernel masks the sliding window internally via cyclic indexing, so the executor passes the full per-window block table and global KV length (mirroring the PyTorch backend) instead of host-slicing. A new AttentionDescriptor.kernel_handles_cyclic_swa() capability (True only for trtllm) is plumbed through the kvcache transform and CachedSequenceInterface.

Adds unit coverage for per-group buffers, cyclic-view staging, the gate, the backend plumbing, and a forward-level two-pool trtllm test.

Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
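Two of the ideas above, per-group address-stable buffers and cyclic window indexing, can be sketched in a few lines. ``ToyPlanner`` and ``cyclic_slot`` are hypothetical, Python lists stand in for device buffers, the integer keys stand in for cache_loc pointers, and the modulo mapping is an assumption about what "cyclic indexing" means, not a description of the kernel.

```python
class ToyPlanner:
    """Keeps one reusable block_offsets buffer per KV window group."""

    def __init__(self):
        self._buffers = {}

    def prepare(self, group_key, offsets):
        buf = self._buffers.get(group_key)
        if buf is None:
            # First call for this group: allocate once, then reuse.
            buf = [0] * len(offsets)
            self._buffers[group_key] = buf
        buf[:] = offsets  # in-place write keeps the buffer object (address) stable
        return buf


def cyclic_slot(position, window):
    # Assumed mapping: token at `position` lands in slot position % window.
    return position % window


planner = ToyPlanner()
swa = planner.prepare(0x1000, [1, 2, 3])        # e.g. the 128-window group
full = planner.prepare(0x2000, [7, 8, 9])       # e.g. the 4096-window group
swa_again = planner.prepare(0x1000, [4, 5, 6])  # later step, same group
```

Keying by group means the second group's call cannot overwrite the first group's table, and returning the same object on every step is the property a captured CUDA graph would need from the real buffer.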

…ss branch

Reverts the temporary single-pool workaround (commit 0679581) now that the trtllm backend supports multiple KV pools (per-group block_offsets + cyclic full-table staging): restores sliding_window propagation in get_cache_initializers so non-uniform-window models form one pool per window. Also fixes the VSWA metadata test backend name after rebase (triton_paged -> triton).

Validated on gpt-oss-20b (AD-trtllm, GB200, gpt_oss.yaml):

- gsm8k(200) multi-pool = 85.500, single-pool = 85.500 (ref 85.823): no accuracy regression.
- Same 118.96GB KV budget: multi-pool exposes 6,271,552 KV tokens vs single-pool 524,288 (~12x), since the 12 SWA layers use a 128-window pool instead of over-allocating the full 4096 window.

Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>

MrGeva · 2026-06-04T12:10:31Z

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

taylor-yb-lee and others added 30 commits June 2, 2026 10:32

Update gpt-oss-120b acc ref value

7196efc

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

gpt-oss-120b tp2 sharding

1d4a228

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

Update yaml

5330fe0

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

taylor-yb-lee and others added 25 commits June 2, 2026 10:34

Revert "Keep MXFP4 placeholders on existing param device"

a12a5a3

This reverts commit ef50383. Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

[ad-mxfp4-moe] _compute_padded_dims: use pad_up helper

ceb45ec

Replace three inline ((x + a - 1) // a) * a expressions with tensorrt_llm.math_utils.pad_up. Same arithmetic, less to misread. Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

Add acc test to CI

6be7c48

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

remove redundant configs

ba9afc9

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

github-actions Bot assigned MrGeva Jun 4, 2026

taylor-yb-lee force-pushed the taylor/gpt-oss-0511_rebase_0511 branch 3 times, most recently from d9f7f0f to 9210625 Compare June 8, 2026 18:35


MrGeva wants to merge 73 commits into taylor/gpt-oss-0511_rebase_0511 from eg/ad-trtllm-multipool-on-taylor

Validation (gpt-oss-20b, AutoDeploy trtllm, 1xGB200, `gpt_oss.yaml`)