[#14828][feat] AutoDeploy: enable trtllm multi KV cache pool (gpt-oss branch)#253
Draft
MrGeva wants to merge 73 commits into
Wraps torch.ops.trtllm.bf16_mxe2m1_block_scale_moe_runner -- the trtllm-gen MXFP4-weight x BF16-activation MoE kernel that PT's W4A16MXFP4TRTLLMGenFusedMoEMethod uses today on B200 by default.
Op signature: takes pre-shuffled MXFP4 weights, UE8M0 scales, float32 biases, and per-expert SwiGLU params. At forward time the op only zero-pads activations to the kernel's expected H_pad and slices the output back to valid_hidden_size.
The matching weight-prep helper, transform, and ShardingInfo arrive in the following steps. The op is verified to register via torch.library and to produce the expected schema. No graph/transform changes yet -- this op is inert until step 3 wires it into a transform.
Refs: cc_reports/gpt-oss-120b/MOE_TRTLLM_GEN_PLAN.md (step 1 of 6)
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
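The pad-then-slice behaviour described above can be sketched in plain Python. This is a minimal illustration only: ``pad_last_dim``, ``slice_last_dim``, ``run_padded`` and ``fake_kernel`` are hypothetical names, nested lists stand in for tensors, and the real op calls the trtllm-gen kernel instead of the stand-in.

```python
def pad_last_dim(rows, padded_width, fill=0.0):
    """Zero-pad every row on the right up to padded_width columns."""
    return [row + [fill] * (padded_width - len(row)) for row in rows]


def slice_last_dim(rows, valid_width):
    """Drop the padding columns again, keeping the first valid_width."""
    return [row[:valid_width] for row in rows]


def run_padded(kernel, activations, h_pad, valid_hidden_size):
    """Pad activations to the kernel's width, run it, slice the output back."""
    padded = pad_last_dim(activations, h_pad)
    out = kernel(padded)
    return slice_last_dim(out, valid_hidden_size)


def fake_kernel(x):
    # Hypothetical stand-in for the MoE kernel: insists on width 8, doubles values.
    assert all(len(row) == 8 for row in x)
    return [[2 * v for v in row] for row in x]


result = run_padded(fake_kernel, [[1.0, 2.0, 3.0]], h_pad=8, valid_hidden_size=3)
```

The caller never sees the padded width; only the wrapper knows about ``h_pad``.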
Mirrors PT MXFP4WeightTRTLLMGenFusedMoEMethod weight-loading path (quantization.py:4135-4500). Reuses PT helpers maybe_pad_for_mxfp4, trtllmgen_maybe_get_cached_*_permute_indices, _get_weight_alignment. Steps: reshape HF [E, 2I, H/32, 16] -> [E, 2I, H/2], pad to alignment (input_hidden_alignment//2=256 cols, weight_alignment=128 rows), pad matching scales, shuffle per expert via torch.ops.trtllm.shuffle_matrix, cast biases to float32. Returns PreparedMXFP4Weights dataclass. Step-2 scope: tp_size=1 only; TP slicing arrives in step 5. Smoke-tested on gpt-oss-120b shapes (E=128, I=H=2880) on B200 -- output shapes match PT byte-for-byte. Refs: cc_reports/gpt-oss-120b/MOE_TRTLLM_GEN_PLAN.md (step 2 of 6) Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
Cleaner graph integration: the op now takes raw router_weight + bias + top_k and computes RenormalizeMoeRoutingMethod-style routing internally (F.linear -> topk -> softmax-of-topk), then dispatches to the kernel with pre-computed topk_weights / topk_ids. This makes the upcoming transform (step 3) a single 1:1 op rewrite of torch_moe_dense_mlp -> trtllm_mxfp4_w4a16_moe_fused without needing a separate routing op upstream. Refs: cc_reports/gpt-oss-120b/MOE_TRTLLM_GEN_PLAN.md (step 1 of 6) Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
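A minimal sketch of the routing order named above (linear, then top-k, then softmax over only the selected logits), written in plain Python. The helper names are hypothetical and lists stand in for tensors; the real op uses ``F.linear`` and torch ops.

```python
import math


def linear(x, weight, bias):
    """Router logits: one dot product per expert row, plus that expert's bias."""
    return [sum(xi * wi for xi, wi in zip(x, row)) + b for row, b in zip(weight, bias)]


def renormalize_route(x, router_weight, router_bias, top_k):
    """Pick the top_k experts, then softmax over just their logits."""
    logits = linear(x, router_weight, router_bias)
    topk_ids = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:top_k]
    top_logits = [logits[e] for e in topk_ids]
    peak = max(top_logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in top_logits]
    total = sum(exps)
    return topk_ids, [e / total for e in exps]


# Four toy experts, two-dimensional input; the logits come out as [1, 3, 2, 0].
ids, weights = renormalize_route(
    x=[1.0, 0.0],
    router_weight=[[1.0, 0.0], [3.0, 0.0], [2.0, 0.0], [0.0, 0.0]],
    router_bias=[0.0, 0.0, 0.0, 0.0],
    top_k=2,
)
```

Because the softmax runs after the top-k selection, the returned weights always sum to one over the chosen experts.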
Runs in post_load_fusion stage. Picks up triton_mxfp4_moe nodes from quantize_mxfp4_moe, runs the step-2 weight prep, registers prepared params on the experts module, and rewrites the call to auto_deploy::trtllm_mxfp4_w4a16_moe_fused. Frees the original raw HF-layout MXFP4 params after rewrite. Step-3 V4 scope: EP=1 (triton_mxfp4_moe without _ep) only. EP variant is covered by step 5 with MXFP4TRTLLMGenSharding. Refs: cc_reports/gpt-oss-120b/MOE_TRTLLM_GEN_PLAN.md (step 3 of 6) Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
Reorder positional args in ``Bf16MxE2m1BlockScaleMoERunner.get_valid_tactics`` to match the C++ signature of ``Bf16MxE2m1BlockScaleMoeRunner::getValidConfigs`` (``cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp:516``): ``(topK, hiddenSize, intermediateSize, numLocalExperts, numTokens, validHiddenSize, validIntermediateSize)``. Commit 86cfb3e (cubin update + valid_*_size plumbing) added ``valid_hidden_size`` / ``valid_intermediate_size`` params to all three trtllm-gen MoE runners' Python wrappers. The other two siblings (``MxE4m3MxE2m1`` line 968, ``E4m3MxE2m1`` line 1274) appended the new args at the end correctly; only ``Bf16MxE2m1`` placed them in the middle, so the autotuner was passing ``valid_*`` values into the ``numLocalExperts`` / ``numTokens`` slots and ``local_num_experts`` / ``num_tokens`` into the ``valid_*`` slots. Effect: the cubin filter saw garbage shape parameters, returned an empty tactic list, and the autotune cache stayed empty -- so at run time the kernel fell back to ``getDefaultValidConfigIndex`` and asserted "No valid config found for the given problem shape MNK" on the first MoE call (e.g. AD's ``resize_kv_cache`` memory probe at ``max_num_tokens=8192``). This Python-only reorder restores parity with the C++ binding; no recompile needed. Found while onboarding gpt-oss-120b on AutoDeploy with the ``bf16_mxe2m1`` MoE path; reproduces in any non-tuning-mode call to the op (e.g. PT's ``MXFP4WeightTRTLLMGenFusedMoEMethod`` users hit it on the first prefill). Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
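The slot mix-up can be reproduced with a stand-in function. This is a sketch only: ``get_valid_configs`` below is a hypothetical Python stand-in that echoes which value landed in which slot, and the shape values are illustrative rather than taken from the autotuner.

```python
def get_valid_configs(top_k, hidden_size, intermediate_size, num_local_experts,
                      num_tokens, valid_hidden_size, valid_intermediate_size):
    """Stand-in for the binding: report the value bound to each parameter."""
    return {
        "top_k": top_k,
        "hidden_size": hidden_size,
        "intermediate_size": intermediate_size,
        "num_local_experts": num_local_experts,
        "num_tokens": num_tokens,
        "valid_hidden_size": valid_hidden_size,
        "valid_intermediate_size": valid_intermediate_size,
    }


# Illustrative problem shape (assumed values, not measured ones).
shape = {
    "top_k": 4,
    "hidden_size": 3072,
    "intermediate_size": 2944,
    "num_local_experts": 128,
    "num_tokens": 8192,
    "valid_hidden_size": 2880,
    "valid_intermediate_size": 2880,
}

# Buggy call order: valid_* passed in the middle, experts/tokens at the end.
buggy = get_valid_configs(
    shape["top_k"], shape["hidden_size"], shape["intermediate_size"],
    shape["valid_hidden_size"], shape["valid_intermediate_size"],
    shape["num_local_experts"], shape["num_tokens"],
)

# Binding by keyword cannot land a value in the wrong slot.
fixed = get_valid_configs(**shape)
```

In the buggy call the binding sees 2880 local experts and 128 valid intermediate columns, which is the kind of shape a tactic filter would reject.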
Bring ``prepare_mxfp4_weights_for_trtllm_gen`` and the ``trtllm_mxfp4_w4a16_moe_fused`` op into structural parity with PT's ``MXFP4WeightTRTLLMGenFusedMoEMethod`` (quantization.py:4135) so the trtllm-gen MoE kernel sees the same byte layout PT exercises.
mxfp4_weight_prep.py changes:
* Per-expert ``I_pad = roundUp(I, weight_alignment) = 2944`` first; derive ``2I_pad = 5888`` and ``I/2_pad = 1472`` from that. Previously we padded ``2I = 5760`` directly, which is already 128-aligned and thus a no-op, leaving w1's effective ``I = 2880`` while w2's column padding pushed ``I = 2944`` -- an inconsistent intermediate dim across the two gemms.
* W1 hidden axis padded to ``input_hidden_alignment = 512`` (``H_w1_pad = 3072``), W2 hidden axis padded to ``weight_alignment = 128`` (``H_w2_pad = 2944``), matching PT's ``create_weights`` (lines 3715-3717 of quantization.py).
* De-interleave gate / up rows from the on-disk row-interleaved storage (``gate_up_proj_blocks[:, ::2, :]`` = gate, ``[:, 1::2, :]`` = up) and pad each half to ``I_pad`` separately before stacking as ``[up | gate]``. PT's chunk-then-copy dance (modeling_gpt_oss.py:695-706 + quantization.py:4252-4258) ends up with the same physical layout.
* Add ``torch.ops.trtllm.block_scale_interleave`` after ``shuffle_matrix`` for both fc1 and fc2 scales -- PT does both ops (quantization.py:4382, 4439); skipping the second was a partial bug.
trtllm_moe.py change:
* Routing softmax in fp32 instead of bf16 -- matches PT's ``RenormalizeMoeRoutingMethod``, which casts to fp32 for the topk softmax and then back to the activation dtype.
Status: the kernel builds and runs cleanly with these changes, and pure GEMM throughput is at the V4 target (~9.28 ms ITL / ~108 tok/s for gpt-oss-120b vs V3 Triton's 127.79 ms / 7.96 tok/s -- 13.5x).
However, content correctness is still blocked by an upstream NaN bug in the trtllm-gen MoE kernel itself: PT's own ``TRTLLMGenFusedMoE.forward`` on gpt-oss-120b at this TRT-LLM commit also produces NaN logits, so even byte-correct prep cannot rescue the output.
Tracking note: re-validate when the upstream fix lands; if correctness is restored, proceed to step 5 (TP-MoE sharding).
Refs: cc_reports/gpt-oss-120b/MOE_TRTLLM_GEN_PLAN.md (step 4 of 6), RESUME_V4.md.
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
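The padded sizes quoted in this commit follow from a single round-up helper. The sketch below only re-derives those numbers; ``round_up`` is a hypothetical stand-in for the ``roundUp`` named in the message.

```python
def round_up(value, multiple):
    """Smallest multiple of `multiple` that is >= value."""
    return (value + multiple - 1) // multiple * multiple


I = H = 2880                  # gpt-oss-120b intermediate / hidden size
WEIGHT_ALIGNMENT = 128
INPUT_HIDDEN_ALIGNMENT = 512

i_pad = round_up(I, WEIGHT_ALIGNMENT)            # pad I first ...
two_i_pad = 2 * i_pad                            # ... then derive the gate+up rows
half_i_pad = i_pad // 2                          # packed (two fp4 values per byte) columns
h_w1_pad = round_up(H, INPUT_HIDDEN_ALIGNMENT)   # W1 hidden axis
h_w2_pad = round_up(H, WEIGHT_ALIGNMENT)         # W2 hidden axis

# The old approach: padding 2I directly changes nothing because 5760 = 45 * 128.
old_two_i_pad = round_up(2 * I, WEIGHT_ALIGNMENT)
```

The last line shows why the earlier prep left w1 at an effective ``I = 2880``: the round-up on ``2I`` was a no-op, so only padding ``I`` first yields 5888 rows.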
Mirrors modeling_gpt_oss.py but routes every attention Linear through
``torch.ops.auto_deploy.torch_linear_simple`` with sharding hint kwargs
(``tp_mode``, ``tp_min_local_shape``, ``layer_type``), and inserts
``torch.ops.auto_deploy.view`` (``tp_scaled_dim=2``) for q/k/v/attn_out
reshapes plus a trailing ``torch.ops.auto_deploy.all_reduce`` placeholder
after the rowwise o_proj. Same pattern qwen3_ir / qwen3_5_moe_ir use.
Sharding strategy emitted into the graph:
q_proj / k_proj / v_proj -> colwise (+ tp_min_local_shape=head_dim
for GQA: 64 Q heads / 8 KV heads at TP=8)
view (q/k/v/attn_out) -> tp_scaled_dim=2 (head-count dim)
o_proj -> rowwise + auto_deploy.all_reduce
Out of scope here (matches qwen_ir convention):
* MoE router + experts stay replicated -- the V4 trtllm-gen MoE op
(``trtllm_mxfp4_w4a16_moe_fused``) has no ShardableNode yet. Step 5
of MOE_TRTLLM_GEN_PLAN.md (V6) registers TP-MoE for that op.
* lm_head stays as plain nn.Linear -- no canonical sharding-IR pattern
for col-parallel-then-all-gather in this codebase yet.
Registration:
* GptOssForCausalLM still registers via ``register_custom_model_cls``
(last-registration-wins).
* ``models/custom/__init__.py`` adds modeling_gpt_oss_ir to the
``AD_USE_IR_MODELS`` opt-in block, alongside deepseek_ir,
nemotron_h_ir, qwen3_5_moe_ir.
Validated end-to-end on gpt-oss-120b 8xB200 with the new V5 yaml
(world_size=8, apply_sharding_hints with shard_layers=["mha"],
detect_sharding+sharding_transform_executor disabled): apply_sharding_hints
processed 324 nodes / skipped 37 (the MoE nodes carry layer_type="moe"),
strip_sharding_hints stripped 288 hints, fuse_allreduce_residual_rmsnorm
matched 36 -- attention TP=8 fully wired through.
Refs: cc_reports/gpt-oss-120b/MOE_TRTLLM_GEN_PLAN.md (V5 step),
RESUME_V4.md (still valid for the trtllm-gen NaN tracking).
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
Step 5 of MOE_TRTLLM_GEN_PLAN.md: extend the V4 trtllm-gen MoE op with
TP-sharding on the intermediate axis so MoE compute itself splits across
ranks (V5 only sharded attention; MoE was replicated and dominated cost).
prepare_mxfp4_weights_for_trtllm_gen:
* Add tp_rank arg.
* Compute TP-aware alignment via _get_weight_alignment so per-rank
intermediate is itself 128-aligned after pad-before-shard (matches
PT load_expert_w3_w1_weight / load_expert_w2_weight).
* Pre-pad intermediate axis to alignment_tp, then slice
[tp_rank * I_pr, (tp_rank+1) * I_pr] on gate/up rows, scales, biases
(col-parallel) and on dn_3d cols (row-parallel, /2 for packed mxfp4).
* Slice down_scales on dim 2 with /scaling_vector_size stride.
* Clamp valid_intermediate to min(intermediate_size, slice_stop) -
slice_start.
QuantizeMXFP4MoETrtllmGen transform:
* Read moe_tp_size / moe_tp_rank / allreduce_strategy from
shared_config.dist_config.
* Forward to prepare_mxfp4_weights_for_trtllm_gen.
* After the V4 op rewrite, when moe_tp_size > 1 insert
auto_deploy.all_reduce so partial [..., hidden] outputs from each
rank sum across ranks before the residual add. fc2_bias is divided
by tp_size in the prep helper so the post-AR sum reproduces the
unsharded bias.
Smoke-tested:
* tp=1 -> fc1=[8, 5888, 1536] valid_I=2880 (no regression).
* tp=8 rank=0 -> fc1=[8, 768, 1536] valid_I=384.
* tp=8 rank=7 -> fc1=[8, 768, 1536] valid_I=192 (last rank partial).
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
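The smoke-test shapes above can be re-derived with a short sketch. It assumes the TP-aware alignment equals ``weight_alignment * tp_size`` (an assumption that reproduces the listed numbers, not a statement about ``_get_weight_alignment``); ``tp_slice`` is a hypothetical helper.

```python
def round_up(value, multiple):
    return (value + multiple - 1) // multiple * multiple


def tp_slice(intermediate_size, tp_size, tp_rank, weight_alignment=128):
    """Pad-before-shard: per-rank slice bounds and the valid (unpadded) width."""
    # Assumed: padding to weight_alignment * tp_size keeps every rank 128-aligned.
    i_pad = round_up(intermediate_size, weight_alignment * tp_size)
    i_per_rank = i_pad // tp_size
    start = tp_rank * i_per_rank
    stop = start + i_per_rank
    valid = min(intermediate_size, stop) - start
    return {
        "fc1_rows": 2 * i_per_rank,  # gate and up halves stacked
        "slice": (start, stop),
        "valid_intermediate": valid,
    }


tp1 = tp_slice(2880, tp_size=1, tp_rank=0)
tp8_first = tp_slice(2880, tp_size=8, tp_rank=0)
tp8_last = tp_slice(2880, tp_size=8, tp_rank=7)
```

Only the last rank carries padding: its slice is ``[2688, 3072)`` but just 192 of those columns are real, matching the ``valid_I=192`` entry.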
…ayout prepare_mxfp4_weights_for_trtllm_gen padded per-expert biases but never row-shuffled them, while it did shuffle the weights and scales. The trtllm-gen bf16_mxe2m1_block_scale_moe_runner kernel adds bias[i] to post-shuffle output row i of GEMM1/GEMM2, so leaving biases in pre-shuffle order made the kernel attribute the wrong bias to each row and the MoE output came out as noise (gpt-oss-120b GSM8K dropped to 2.05% vs the 90.30% reference). PT's MXFP4WeightTRTLLMGenFusedMoEMethod (quantization.py:4204-4319) runs the very same row permutation on the bias destination buffer: load_expert_w3_w1_weight applies the gated-act-gemm interleave + epilogue-tile reorder to the 1-D [2*I_pad] gated bias, and load_expert_w2_weight applies the epilogue-tile reorder to the 1-D [H_pad] down bias. Mirror that in the AD prep helper via two new _shuffle_per_expert_bias_w3_w1 / _shuffle_per_expert_bias_w2 helpers so the AD prep stays byte-identical with PT. Add tests/unittest/auto_deploy/singlegpu/custom_ops/moe/test_mxfp4_weight_prep.py (3 tests) to pin the invariant: fc1 bias matches a manual gated+TMA permute, fc2 bias matches the manual TMA permute, and the full prep output is byte-identical to a per-expert PT-style reference loader (weights, scales, and biases all checked). Without the fix all three tests fail (98.8% mismatch on the bias rows); with it they pass. End-to-end validation on gpt-oss-120b at world_size=1 with quantize_mxfp4_moe_trtllm_gen enabled: - GSM8K (test_mxfp4_gsm8k[120b]): 2.05% -> 90.37% (threshold 87.10%, reference 90.30%) -> PASS. - ITL (V4 single-GPU, ISL=1000 OSL=1000 conc=1, 20 reqs): 8.53 ms p50 / 117.4 tok/s/user with content valid (OSL=1000 verified). Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
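Why the bias must follow the weight rows can be shown with a toy permutation. This is a sketch under simplifying assumptions: a plain row permutation stands in for the kernel's gated-act interleave and epilogue-tile reorder, and all names are hypothetical.

```python
def matvec_bias(weight, x, bias):
    """y[i] = dot(weight[i], x) + bias[i], as a kernel adding bias per output row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def permute_rows(rows, perm):
    """Row i of the result is row perm[i] of the input."""
    return [rows[p] for p in perm]


weight = [[1, 0], [0, 1], [1, 1]]
bias = [10, 20, 30]
x = [1, 2]
perm = [2, 0, 1]  # toy stand-in for the shuffle applied to the weights

reference = matvec_bias(weight, x, bias)
expected = permute_rows(reference, perm)  # what the shuffled kernel should emit

shuffled_weight = permute_rows(weight, perm)
with_shuffled_bias = matvec_bias(shuffled_weight, x, permute_rows(bias, perm))
with_stale_bias = matvec_bias(shuffled_weight, x, bias)
```

With the bias left in pre-shuffle order every row picks up another row's bias, which is the failure mode the new ``_shuffle_per_expert_bias_*`` helpers remove.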
…gen MoE
The V4 single-GPU + trtllm-gen MXFP4 MoE path is the now-correctness-validated baseline for gpt-oss-120b on B200 (previous commit fixes the weight-prep bias shuffle so the trtllm-gen kernel produces correct logits). Update examples/auto_deploy/model_registry/configs/gpt_oss_120b.yaml to that configuration so the standalone AD serving config matches the live recommendation:
- world_size 4 -> 1 (single GPU; the model fits in 192 GB HBM at MXFP4 and there is no AR overhead at BS=1).
- Enable transform `quantize_mxfp4_moe_trtllm_gen` so the post-load fusion stage rewrites `triton_mxfp4_moe` to `auto_deploy::trtllm_mxfp4_w4a16_moe_fused` and dispatches to `torch.ops.trtllm.bf16_mxe2m1_block_scale_moe_runner` -- the same kernel PT exercises via `MXFP4WeightTRTLLMGenFusedMoEMethod`.
Measured on the same standalone serving config (ISL=1000, OSL=1000, conc=1, 20 reqs, `DISABLE_HARMONY_ADAPTER=1` + `--use-server-token-count`):
- ITL p50 8.53 ms / 117.4 tok/s/user (vs Triton-MXFP4 baseline 122 ms ITL / 8 tok/s/user, ~15x speedup).
- GSM8K accuracy 90.37 % (threshold 87.10 %, reference 90.30 %).
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
The HF config.json for openai/gpt-oss-120b ships without a
`torch_dtype`/`dtype` field. Under transformers 5.x, AD's meta-device
build path (`build_model` transform -> `_build_model` ->
`custom_model_cls._from_config(model_config)`) reads `config.dtype`
to decide the construction dtype; when it is None, `_from_config`
skips the `local_torch_dtype` context and the model is created in
fp32. `load_or_random_init` then loads bf16 safetensors weights cast
to fp32 (`load_state_dict(assign=False)`), so the entire model runs
in fp32.
That breaks trtllm attention: `cpp/tensorrt_llm/common/attentionOp.cpp`
disables `mEnableContextFMHA` for any dtype that is not fp16/bf16,
falls back to unfused MHA, and the context workspace formula
(`size * batch * num_heads * seq * seq` for qk + qk_float) tries to
allocate ~1 TB during the `resize_kv_cache` forward pass. Server log:
[common] Fall back to unfused MHA because of unsupported data type.
[thop] Attention workspace size is not enough, increase the
size from 268435456 bytes to 1110551169280 bytes
RuntimeError: CUDA out of memory. Tried to allocate 1034.28 GiB.
Adding `model_kwargs.dtype: bfloat16` makes `_recursive_update_config`
set `config.dtype = torch.bfloat16` before `_from_config` runs, so
the model is constructed in bf16 and FMHA stays on (~40 MB workspace).
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
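As a quick cross-check of the log lines quoted above, the byte counts convert as follows. The sketch only does unit arithmetic on the two numbers from the server log.

```python
GIB = 2**30
MIB = 2**20

default_workspace_bytes = 268435456        # "from 268435456 bytes"
requested_workspace_bytes = 1110551169280  # "to 1110551169280 bytes"

default_mib = default_workspace_bytes / MIB
requested_gib = round(requested_workspace_bytes / GIB, 2)
growth_factor = requested_workspace_bytes / default_workspace_bytes
```

The requested size matches the 1034.28 GiB in the ``RuntimeError``, a growth of roughly four thousand times over the 256 MiB starting workspace.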
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously fuse_gemms skipped any linear with bias (TODO at the
gather-loop in FuseGemms._apply). This excluded the most common
multi-GEMM fusion target -- Q/K/V projections that always have bias
in models like gpt-oss.
Bias support:
* Allow children with bias in the gather loop.
* Require uniform bias state across siblings (all-or-none) -- mixed
bias would need zero-padding which we don't do.
* Stack biases via torch.cat on dim=0, mirroring weight stacking.
* Validate each bias is per-channel 1D and matches its weight's
out_features; reject non-standard shapes (broadcast bias, scalar).
* Validate biases come from get_attr nodes (statically known).
* Validate uniform bias dtype across children.
* Wire fused get_attr bias node into the fused linear call args.
Verified on gpt-oss-120b V4 (single-GPU, BS=1 conc=1 ISL=OSL=1000):
* fuse_gemms matches=36 (one per layer, Q+K+V stacked).
* ITL: 10.68 ms -> 9.22 ms (-1.46 ms / -13.7%).
* TPS/user: 93.77 -> 109.03 (+16%).
* Output Token Count = 1000 / 1000 verified across all 20 requests.
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
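The stacking rule above (concatenate weights and biases on dim 0, require all-or-none bias) can be sketched with lists standing in for tensors. ``fuse_linears`` and ``split`` are hypothetical helpers, not the ``FuseGemms`` code.

```python
def linear(x, weight, bias=None):
    out = [sum(xi * wi for xi, wi in zip(x, row)) for row in weight]
    if bias is not None:
        out = [o + b for o, b in zip(out, bias)]
    return out


def fuse_linears(weights, biases):
    """Stack sibling linears that share an input into one weight and one bias."""
    has_bias = [b is not None for b in biases]
    if any(has_bias) and not all(has_bias):
        raise ValueError("mixed bias state: siblings must all have bias or none")
    for w, b in zip(weights, biases):
        if b is not None and len(b) != len(w):
            raise ValueError("bias length must equal the weight's out_features")
    fused_weight = [row for w in weights for row in w]  # cat on dim 0
    fused_bias = [v for b in biases for v in b] if all(has_bias) else None
    sizes = [len(w) for w in weights]
    return fused_weight, fused_bias, sizes


def split(out, sizes):
    chunks, start = [], 0
    for size in sizes:
        chunks.append(out[start:start + size])
        start += size
    return chunks


x = [1.0, 2.0]
q_w, q_b = [[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5]
k_w, k_b = [[1.0, 1.0]], [1.0]
v_w, v_b = [[2.0, 0.0]], [-1.0]

separate = [linear(x, q_w, q_b), linear(x, k_w, k_b), linear(x, v_w, v_b)]
fw, fb, sizes = fuse_linears([q_w, k_w, v_w], [q_b, k_b, v_b])
fused = split(linear(x, fw, fb), sizes)
```

One fused GEMM followed by a split reproduces the three separate projections, which is why the bias only needs concatenating, never padding, when every sibling has one.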
For trtllm_attention_mha_with_cache the 'out' parameter sits in the middle
of the schema (after out_scale, before rotary_cos_sin, ...). The cached-attn
insertion in transform/library/kvcache.py passes None for 'out' positionally
to preserve positional ordering of the parameters that follow. The previous
_inject_out_param implementation then set out=out_placeholder as a kwarg on
top of that, producing a duplicate binding ("received N+1 arguments").
Fix: detect the schema index of 'out', convert any positional args at/after
that index into kwargs (skipping the positional 'out' itself), and bind
'out' as a kwarg. Raise a clear error if the dynamic cached op has no 'out'
parameter at all.
This is load-bearing for the gpt-oss-120b TP=2 cached-attention path under
AD_USE_IR_MODELS=1 -- without it, every dynamic-shape decode call fails on
the kvcache-inserted attention op.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
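The rebinding described in the fix can be sketched without any FX machinery. The schema below is a shortened, illustrative parameter list and ``inject_out_kwarg`` / ``op`` are hypothetical stand-ins for ``_inject_out_param`` and the cached attention op.

```python
def inject_out_kwarg(param_names, args, kwargs, out_value):
    """Move positional args at/after 'out' into kwargs, then bind out=out_value."""
    if "out" not in param_names:
        raise ValueError("op schema has no 'out' parameter")
    out_idx = param_names.index("out")
    new_args = tuple(args[:out_idx])
    new_kwargs = dict(kwargs)
    for idx in range(out_idx, len(args)):
        name = param_names[idx]
        if name == "out":
            continue  # drop the positional placeholder (None) for 'out'
        new_kwargs[name] = args[idx]
    new_kwargs["out"] = out_value
    return new_args, new_kwargs


def op(q, out_scale, out, rotary_cos_sin):
    # Stand-in op whose 'out' parameter sits in the middle of the signature.
    return out, rotary_cos_sin


schema = ["q", "out_scale", "out", "rotary_cos_sin"]
args = ("Q", None, None, "table")  # 'out' passed positionally as None

# The old behaviour: adding out= on top of the positional None fails.
try:
    op(*args, out="BUF")
    duplicate_binding = False
except TypeError:
    duplicate_binding = True

new_args, new_kwargs = inject_out_kwarg(schema, args, {}, "BUF")
result = op(*new_args, **new_kwargs)
```

Everything before ``out`` stays positional, so the argument order of the leading parameters is untouched.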
transformers 5.x moved `config.rope_theta` into `config.rope_scaling` (e.g. `config.rope_scaling['rope_theta'] = 150000` for gpt-oss-120b). The previous `getattr(config, "rope_theta", 10000.0)` silently fell back to the 10000.0 default, which is 15x off the actual 150000 base GPT-OSS uses. That broke RoPE position encoding entirely. Mirror what PT's modeling_gpt_oss.py already does after the transformers 5.3.0 upgrade (NVIDIA#12829): use the `get_hf_rope_theta()` helper from `tensorrt_llm._utils`. Apply to both `modeling_gpt_oss.py` and `modeling_gpt_oss_ir.py`. Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
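The silent fallback is easy to reproduce with a toy config object. In this sketch ``resolve_rope_theta`` is a hypothetical stand-in written for illustration; it is not the implementation of ``get_hf_rope_theta``, and the config layout follows the description in this commit message.

```python
from types import SimpleNamespace


def resolve_rope_theta(config, default=10000.0):
    """Look for rope_theta on the config first, then inside rope_scaling."""
    theta = getattr(config, "rope_theta", None)
    if theta is not None:
        return float(theta)
    scaling = getattr(config, "rope_scaling", None) or {}
    if "rope_theta" in scaling:
        return float(scaling["rope_theta"])
    return default


# Config shaped as described above: the base lives inside rope_scaling.
new_style = SimpleNamespace(rope_scaling={"rope_theta": 150000})

old_lookup = getattr(new_style, "rope_theta", 10000.0)  # the previous code path
resolved = resolve_rope_theta(new_style)
ratio = resolved / old_lookup
```

The old lookup raises no error, it just returns the default, which is why the wrong base went unnoticed until outputs degraded.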
…ted-topk trtllm-gen MoE C++ routing was refactored in main (NVIDIA#13328) such that bf16_mxe2m1_block_scale_moe_runner with router_logits=None + only topk_weights/topk_ids kwargs silently produces broken routing. Model emits degenerate token loops instead of normal tokens. Mirror PT's invocation pattern (and source AD_W4A8_FUSED_ROUTING=1 path from commit 7719712): pass router_logits directly and let the kernel do fused topk + softmax internally. Note: routing_bias stays None because the linear-layer bias is already folded into router_logits via F.linear(x, w, b); the kernel's routing_bias is a separate per-expert offset. Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
…PE-fused decode
Root cause: with trtllm attn_backend, AD applies RoPE in modeling code via torch_rope_with_explicit_cos_sin and passes post-RoPE Q/K to thop.attention. PT, in contrast, passes raw Q/K + the YARN rotary_cos_sin table so the kernel applies RoPE internally. The two RoPE paths produce slightly different cos/sin numerics (modeling-side uses our cached fp32 table while the kernel computes its own), and the difference compounds through the KV cache: prefill stores K rotated externally, decode reads cached K and computes attention with Q rotated externally, so minor cos/sin differences turn into ~60% rel_RMSE on the layer-0 attn_out at decode step 1.
Enabling fuse_rope_into_trtllm_attention folds RoPE into the kernel call so AD takes the same path as PT, eliminating the divergence.
Verified on a 4-layer gpt-oss-120b subset by dumping per-stage activations in both PT and AD modeling and comparing PT residual vs AD layer output:
L0 attn_out decode_1 rel_RMSE: 129% -> 1%
L0 residual decode_1 rel_RMSE: 60% -> 0.7%
First two generated tokens now match exactly between PT and AD.
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
Adds the W4A8MXFP4MXFP8 activation-quantization path mirroring PT's W4A8MXFP4MXFP8TRTLLMGenFusedMoEMethod: * New op: auto_deploy.trtllm_mxfp4_w4a8_moe_fused Same args as W4A16 op; pre-quantizes activation via torch.ops.trtllm.mxfp8_quantize(False, alignment=512) and dispatches to torch.ops.trtllm.mxe4m3_mxe2m1_block_scale_moe_runner. Uses the same MXFP4 weights as W4A16 (no checkpoint re-prep). * Transform config: QuantizeMXFP4MoETrtllmGenConfig.quant_act Choose 'bf16' (default; W4A16, bf16 input cubin family bmm_Bfloat16_MxE2m1Bfloat16) or 'mxfp8' (W4A8, MXFP8 input cubin family bmm_MxE4m3_MxE2m1MxE4m3 — 9 us/call median vs 27 us for bf16). KNOWN LIMITATION (Phase 2 blocker, this commit): The autotuner's get_valid_configs() returns empty for the decode shape (num_tokens=1, hidden_padded=3072) when called against mxe4m3_mxe2m1_block_scale_moe_runner with the gpt-oss-120b weight shapes. The runner falls back to a default tactic that's significantly slower than the bf16 path's tactic. Empirical decode regression on gpt-oss-120b TP=2 BS=1: ITL p50 7.48 ms (W4A16) -> 9.21 ms (W4A8) / TPS 127 -> 102. The kernels and weights are compatible -- W4A16 path with the SAME weight tensors finds tactics for tileN=8 cleanly. The W4A8 path's get_valid_configs filters something (likely C++ runner internal shape/scale validation) that rejects all tileN=8 candidates at decode shape. Needs C++ runner investigation before this can land as a perf win. The infrastructure (op + config flag) is committed because: 1. The op definition is correct API-wise (compiled, registers, runs). 2. The autotune compatibility is a pure C++ runner issue, not an AD-side issue. 3. Future fix in the C++ runner makes this op production-ready without further AD work. Set 'quant_act: mxfp8' explicitly in yaml to opt in (default stays bf16). 
Bench dir (regression run): auto-deploy/gpt-oss-120b/v8_tp2_fg_arfix_w4a8_bench_sweep_conc_1_20260508_022925/ yaml: auto-deploy/gpt-oss-120b/gpt_oss_120b_v8_tp2_fg_w4a8.yaml script: auto-deploy/gpt-oss-120b/run_gpt_conc1_V8_tp2_fg_w4a8.sh Notes: cc_reports/gpt-oss-120b/report.md §3.10 + §5.3 C7 (to be added). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
trtllm_mxfp4_w4a8_moe_fused now always passes router_logits to mxe4m3_mxe2m1_block_scale_moe_runner; the C++ runner does fused topk + softmax + cast internally (matches PT's run_fp4_block_scale_moe path). This eliminates ~5 elementwise launches per layer × 36 ≈ 180 launches/iter on gpt-oss-120b. Replaces the earlier AD_W4A8_FUSED_ROUTING env-flag gate (which defaulted to off) with unconditional fused routing — the fused path is correct and faster, so there's no reason to keep the Python topk/softmax fallback. Bench result on gpt-oss-120b W4A8 tp=2 (hot 2nd-run, paired with fuse_rope_into_trtllm_attention yaml flag): ITL p50 7.56 -> 6.12 ms (-1.44 ms / +22% TPS), correctness preserved (OSL=1000, mismatch=0). Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
Set quantize_mxfp4_moe_trtllm_gen.quant_act=mxfp8 so the trtllm-gen MoE transform rewrites the MoE call to trtllm_mxfp4_w4a8_moe_fused (MXFP4 weights x MXFP8 activations) instead of trtllm_mxfp4_w4a16_moe_fused (MXFP4 weights x bf16 activations). Matches PT's W4A8MXFP4MXFP8TRTLLMGenFusedMoEMethod path. Verified: GSM8K @ 50 samples = 88% (ref 90.3%) — PASSED, no regression from W4A16 baseline (which also passes within statistical noise). Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
…rmsnorm
Move post-MoE allreduce insertion from immediately-after the V4 MoE op to immediately-after the downstream aten.view consumer.
Before: MoE -> AR -> view -> add -> norm
After: MoE -> view -> AR -> add -> norm
The fuse_allreduce_residual_rmsnorm matcher in tensorrt_llm/_torch/auto_deploy/transform/library/collectives.py requires AR to be the immediate predecessor of the residual add (no intervening view). Pre-fix only the 36 post-attn ARs got fused; the 36 post-MoE ARs ran as plain ncclDevKernel_AllReduce_Sum_RING_LL with no overlap. Post-fix the matcher catches all 72 ARs per rank.
Numerically equivalent: view is a free reshape and AR is element-wise across ranks.
gpt-oss-120b TP=2 BS=1 conc=1 OSL=1000 verified, 1000/1000 tokens:
V8 TP=2 baseline: ITL p50 8.70 ms / 109.52 TPS
V8 TP=2 + this: ITL p50 7.48 ms / 127.04 TPS (-1.22 ms / +16%)
V4+fg single-GPU: ITL p50 8.05 ms / 124.32 TPS
First multi-GPU config to BEAT V4 single-GPU on this workload.
Bench: auto-deploy/gpt-oss-120b/v8_tp2_fg_arfix_bench_sweep_conc_1_20260508_021633/
Notes: cc_reports/gpt-oss-120b/report.md §3.10 + §5.1 O1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
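A toy version of the adjacency requirement makes the before/after difference concrete. This is a deliberately simplified sketch: the op chain is a flat list of names and ``count_fusable`` is a hypothetical matcher, far simpler than the real pattern in collectives.py.

```python
def count_fusable(ops):
    """Count all_reduce ops that are the immediate predecessor of a residual add."""
    return sum(1 for a, b in zip(ops, ops[1:]) if a == "all_reduce" and b == "add")


before = ["moe", "all_reduce", "view", "add", "norm"]
after = ["moe", "view", "all_reduce", "add", "norm"]

fusable_before = count_fusable(before)
fusable_after = count_fusable(after)
```

Swapping the ``view`` and the ``all_reduce`` is the whole change; the matcher itself is untouched.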
Adds a third pytest.param entry to ``TestGPTOSS.test_mxfp4_gsm8k`` that runs gpt-oss-120b at TP=2 by overriding the model registry yaml's ``world_size: 1`` via a new ``world_size_override`` parameter. Existing 20b and 120b TP=1 cases are preserved (override = None means "use yaml default"). The 120b-tp2 case is gated by ``skip_less_device(2)`` so it skips automatically on single-GPU runs. Pairs with the post-MoE allreduce placement fix (9b1dca4705 [ad-mxfp4-moe] Fix post-MoE AR placement for fuse_allreduce_residual_rmsnorm) so the TP=2 accuracy path is exercised in CI alongside the TP=1 baseline. Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
…ariant The sharding-IR ``modeling_gpt_oss_ir.py`` already covered every feature of the non-IR legacy file (same RMSNorm / RoPE / attention with sinks / MoE router / experts), differing only in: * attention Linears go through ``torch.ops.auto_deploy.torch_linear_simple`` with sharding hint kwargs so TP > 1 attention sharding works without an external graph rewrite; * view ops on q/k/v/attn_out use ``torch.ops.auto_deploy.view`` with ``tp_scaled_dim=2`` so the head-count dim scales with TP; * the post-attention all-reduce is expressed as a ``torch.ops.auto_deploy.all_reduce`` placeholder. Consolidate: rename ``modeling_gpt_oss_ir.py`` into the default ``modeling_gpt_oss.py`` (the legacy non-IR variant is removed) and drop the ``AD_USE_IR_MODELS`` opt-in entry for gpt-oss in ``models/custom/__init__.py``. This matches the trajectory in upstream PR NVIDIA#13478 (other models being migrated to sharding-IR as default). GSM8K full 1319-sample validation, gpt-oss-120b @ TP=2 (post-rebase, W4A8 mxfp8 activations): Pre-IR (legacy modeling): 88.55 % (±0.88), 992 s Post-IR (sharding-IR): 88.55 % (±0.88), 902 s Reference (PT): 90.30 % Accuracy is identical to the post-rebase TP=2 baseline; total run-time is ~9 % faster. The existing ``quantize_mxfp4_moe_trtllm_gen`` post-load transform continues to handle MXFP4 weight prep and op retargeting on top of the IR modeling. Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
Adds ``make_mxfp4_trtllm_gen_load_hook`` to ``mxfp4_weight_prep.py`` --
a state_dict pre-hook factory that runs the trtllm-gen weight prep
(pad + shuffle + per-rank slice + block-scale interleave) at
``load_state_dict`` time instead of in a post-load transform.
Mirrors the GLM5 / DeepSeek MLA pattern (see
``modeling_glm4_moe_lite.py`` / ``mla_rope_utils._rope_deinterleave_load_hook``):
the hook walks the layer prefix in the incoming state dict, calls
``prepare_mxfp4_weights_for_trtllm_gen`` per layer, pops the six raw
HF MXFP4 keys (``gate_up_proj_{blocks,scales,bias}`` /
``down_proj_{blocks,scales,bias}``) and inserts the six prepared keys
(``fc1_w_trtllm_gen`` / ``fc1_w_scale_trtllm_gen`` /
``fc1_bias_trtllm_gen`` / ``fc2_*``) at the same experts subpath. TP
info is read from ``torch.distributed`` (fallback ``(1, 0)`` when not
initialized).
This patch only lands the helper; integration with the modeling code
(register prepared-shape params + the hook in ``GptOssExperts.__init__``
and simplify the ``quantize_mxfp4_moe_trtllm_gen`` transform to a graph
retarget) is a follow-up.
Once integrated, the trtllm-gen MXFP4 path will allocate only
prepared-shape parameters on the experts module (peak working set
identical to the steady state), avoiding the brief raw + prepared
double allocation in the current post-load-fusion flow (~150 GB on
gpt-oss-120b 128 experts x 36 layers).
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
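The pop-raw / insert-prepared flow of such a pre-hook can be sketched on a plain dict. Assumptions, all hypothetical: the ``model.layers.N.mlp.experts.`` key layout, the ``prepared_`` key names, and ``fake_prepare``, which stands in for ``prepare_mxfp4_weights_for_trtllm_gen``. Only the six raw key names come from the text above.

```python
RAW_KEYS = (
    "gate_up_proj_blocks", "gate_up_proj_scales", "gate_up_proj_bias",
    "down_proj_blocks", "down_proj_scales", "down_proj_bias",
)


def make_load_hook(prepare_fn):
    """Build a hook that swaps raw MXFP4 entries for prepared ones, per layer."""
    def hook(state_dict, prefix=""):
        anchor = RAW_KEYS[0]
        # Collect the experts sub-paths first so the dict is not mutated mid-scan.
        experts_prefixes = sorted(
            key[: -len(anchor)]
            for key in state_dict
            if key.startswith(prefix) and key.endswith(anchor)
        )
        for experts_prefix in experts_prefixes:
            raw = {name: state_dict.pop(experts_prefix + name) for name in RAW_KEYS}
            for name, tensor in prepare_fn(raw).items():
                state_dict[experts_prefix + name] = tensor
    return hook


def fake_prepare(raw):
    # Hypothetical stand-in: the real helper pads, shuffles and slices tensors.
    return {"prepared_" + name: value for name, value in raw.items()}


state = {
    f"model.layers.{layer}.mlp.experts.{name}": f"raw-{layer}-{name}"
    for layer in range(2)
    for name in RAW_KEYS
}
state["model.norm.weight"] = "untouched"

make_load_hook(fake_prepare)(state, prefix="model.")
```

Because the raw entries are popped as each layer is converted, raw and prepared copies of the same layer never coexist in the dict for longer than one layer's conversion.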
…oad-hook prep
Moves the trtllm-gen MXFP4 weight preparation for gpt-oss from a post-load FX transform into the modeling layer + state_dict pre-hook. The experts module registers prepared-shape parameters directly and the load hook converts raw HF MXFP4 entries (gate_up_proj_blocks / _scales / _bias and down_proj_*) into the prepared keys at load time. Net effect: peak weight memory goes from raw+prepared (~150 GB on 120b: 128 experts x 36 layers) to just prepared.
Key changes:
* ``GptOssExperts`` registers ``fc1_w_trtllm_gen`` / ``fc1_w_scale_trtllm_gen`` / ``fc1_bias_trtllm_gen`` / ``fc2_*`` plus per-expert SwiGLU constants when the HF config advertises MXFP4. Overrides ``_apply`` to protect the kernel-required dtypes (uint8 weights, ue8m0 scales, float32 bias / swiglu) from ``model.to(bf16)``.
* ``GptOssMLP.forward`` dispatches to ``trtllm_mxfp4_w4a*_moe_fused`` directly, letting the C++ runner do fused topk+softmax inside the kernel. Activation precision selected via ``AD_MXFP4_QUANT_ACT``.
* New ``make_mxfp4_trtllm_gen_load_hook`` factory in ``custom_ops/fused_moe/mxfp4_weight_prep.py``. Reads TP info from ``torch.distributed`` (falls back to (1, 0)), runs ``prepare_mxfp4_weights_for_trtllm_gen`` on the raw state-dict tensors, pops the six raw keys, and writes the six prepared keys plus the three SwiGLU constants. SwiGLU injection is critical: under HF accelerate's ``init_empty_weights`` the literal alpha/beta/limit registered in ``__init__`` get demoted to meta, and the HF safetensors has no swiglu keys, so without the hook they'd stay zero-init and the SwiGLU output would be garbage (was GSM8K 0.076% before this fix).
* ``AD_MXFP4_TRTLLM_GEN_MODELING`` defaults to "1" (modeling-side path is the default for any MXFP4 gpt-oss). Setting it to "0" falls back to the legacy post-load transforms, which now early-return when the modeling path is active.
* Drop the leftover ``NUM_SAMPLES=50`` debug mock in the test file so CI runs the full 1319-sample GSM8K eval.
Validated:
* TP=1 GSM8K: 90.98% (ref 90.30%, threshold 87.10%) PASSED.
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
…ing override
The modeling-side MXFP4 trtllm-gen path landed in ceff972 broke TP=2: the runtime DistConfig defaulted to MoE-EP topology (``moe_tp_size=1, moe_ep_size=world_size``) while the load hook TP-sliced the intermediate dim, so the kernel ran with TP-shape weights but no AR across ranks. Each rank's partial intermediate-sum flowed straight into the residual stream of the next layer, producing rank-divergent state and a 300-second hang at the first sampler event.
Fix:
* ``GptOssMLP.forward``: emit ``auto_deploy.all_reduce(out, "moe")`` unconditionally after the post-MoE view. Two reasons it must be unconditional and use the ``"moe"`` layer_type:
- The previous ``if _tp_size > 1`` guard constant-folded under FX export whenever ``torch.distributed`` was not initialised at ``GptOssExperts.__init__`` time, dropping the AR even on TP > 1.
- The placeholder layer_type must be in ``apply_sharding_hints``'s ``shard_layers`` list for ``AllReduceShardableNode`` to rewrite it into a real dist all_reduce. ``"auto"`` was filtered out.
On TP=1 the placeholder is stripped to a passthrough by ``apply_sharding_hints``, so the always-emit is a no-op there.
* ``test_mxfp4_gsm8k``: when ``model_id == "120b"`` and ``world_size == 2``, pass an inline ``transforms`` kwarg that flips ``apply_sharding_hints`` to enabled with ``dist_mapping: {tp: 2, moe_tp: 2, moe_ep: 1}`` and ``shard_layers: ["mha", "moe"]`` (mirrors the perf-yaml MoE-TP topology used in auto-deploy/gpt-oss-120b/). Inline override avoids shipping a TP-specific yaml in the model registry.
Validated:
* TP=1 GSM8K: 90.98% (unchanged).
* TP=2 GSM8K: 88.55% (matches the baseline before Phase 2).
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
…textvar
Extends the modeling-side MXFP4 trtllm-gen weight prep landed in ceff972 / bad1871 to handle MoE-EP topology correctly. Before this commit the hook always intermediate-TP-sliced based on world_size, which happened to produce numerically correct output on EP=2 (because ``apply_sharding_hints`` inserts a real AR whenever ``dc.tp_size > 1``) but kept the full 128-expert weight footprint on every rank, so there were no EP memory savings.
Plumbing:
* ``utils/dist_config.py``: add ``_ACTIVE_DIST_CONFIG`` ContextVar, ``use_dist_config`` context manager, and ``get_active_dist_config`` getter. ``contextvars`` rather than a bare module-level global so threading / asyncio contexts stay isolated.
* ``transform/library/build_model.py``: ``BuildModel`` and ``BuildAndLoadFactoryModel`` wrap their factory build calls in ``with use_dist_config(shared_config.dist_config):`` so modeling code constructed inside the factory can read the active topology at ``__init__`` time. This is needed for registering rank-correct parameter shapes BEFORE ``load_state_dict`` runs.
Hook + prep:
* ``custom_ops/fused_moe/mxfp4_weight_prep.py``:
  - Rename ``_get_default_tp_info`` to ``_get_default_dist_info`` and return ``(moe_tp_size, moe_tp_rank, moe_ep_size, moe_ep_rank)``.
  - ``make_mxfp4_trtllm_gen_load_hook`` gains a required ``num_experts`` arg and replaces ``tp_info_fn`` with ``dist_info_fn``. The hook now EP-slices the six raw HF MXFP4 tensors on their leading expert axis BEFORE calling ``prepare_mxfp4_weights_for_trtllm_gen``.
  - Diagnostic print reports ``(moe_tp=Nr<rank>, moe_ep=Mr<rank>)``.
Modeling:
* ``models/custom/modeling_gpt_oss.py``:
  - New ``_resolve_moe_dist_info()`` helper reads the active ``DistConfig`` (preferred) or falls back to torch.distributed.
  - ``GptOssExperts._register_mxfp4_trtllm_gen_params`` allocates ``E_local = num_experts // moe_ep_size`` experts, stores ``_local_expert_offset = moe_ep_rank * E_local``.
  - ``GptOssMLP.forward`` passes ``e._local_expert_offset`` (was hardcoded ``0``) to the trtllm-gen op.
  - ``GptOssForCausalLM.__init__`` snapshots ``_resolve_moe_dist_info()`` into a closure variable and passes ``dist_info_fn=lambda: _dist_info`` to the hook factory. Binds the hook's slicing decision to the same topology the parameters were registered against at ``__init__`` time.
Test:
* ``test_llm_api_autodeploy.py``: add a 4th tuple element ``moe_topology`` (``None`` / ``"tp"`` / ``"ep"``) to ``MODEL_PARAMS`` and a new ``120b-ep2`` parametrize entry. When non-``None``, the test passes an inline ``transforms`` override with the matching ``dist_mapping`` plus ``shard_layers: ["mha", "moe"]``.
Validated (full 1319-sample GSM8K, threshold 87.10%):
* TP=1: 89.99%, moe_tp=1r0, moe_ep=1r0, shape (128, 5888, 1536).
* TP=2: 88.55%, moe_tp=2r{0,1}, moe_ep=1r0, shape (128, 3072, 1536).
* EP=2: 88.02%, moe_tp=1r0, moe_ep=2r{0,1}, shape (64, 5888, 1536).
Real EP: per-rank fc1_bias_abs_max differs (3.234 vs 2.578) and local_expert_offset is rank-dependent (0 / 64). Per-rank weight footprint ~40 GB (vs ~75 GB for TP=2's intermediate-halved-but-full-experts layout).
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
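The ContextVar plumbing and the EP-local expert arithmetic described above can be sketched with the standard library alone. The names ``use_dist_config``, ``get_active_dist_config`` and ``_ACTIVE_DIST_CONFIG`` follow the commit message; the ``DistConfig`` dataclass here is a simplified stand-in with only the four MoE fields, and ``local_expert_range`` is a hypothetical helper for the ``E_local`` / offset formulas.

```python
import contextvars
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Iterator, Optional, Tuple


@dataclass(frozen=True)
class DistConfig:
    # Simplified stand-in: only the MoE topology fields used below.
    moe_tp_size: int = 1
    moe_tp_rank: int = 0
    moe_ep_size: int = 1
    moe_ep_rank: int = 0


# A ContextVar keeps the value isolated per thread / asyncio task.
_ACTIVE_DIST_CONFIG = contextvars.ContextVar("active_dist_config", default=None)


@contextmanager
def use_dist_config(cfg: DistConfig) -> Iterator[DistConfig]:
    token = _ACTIVE_DIST_CONFIG.set(cfg)
    try:
        yield cfg
    finally:
        # Restore whatever was active before entering the block.
        _ACTIVE_DIST_CONFIG.reset(token)


def get_active_dist_config() -> Optional[DistConfig]:
    return _ACTIVE_DIST_CONFIG.get()


def local_expert_range(num_experts: int, cfg: DistConfig) -> Tuple[int, int]:
    """Return (E_local, local_expert_offset) for this EP rank."""
    e_local = num_experts // cfg.moe_ep_size
    return e_local, cfg.moe_ep_rank * e_local


with use_dist_config(DistConfig(moe_ep_size=2, moe_ep_rank=1)):
    inside = local_expert_range(128, get_active_dist_config())
outside = get_active_dist_config()
```

For 128 experts and EP=2 this gives 64 local experts with offsets 0 and 64, matching the validated EP=2 shape and the reported ``local_expert_offset`` values.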
Mirrors the same one-liner fix already in ``gpt_oss_120b.yaml``. The HF
``config.json`` for gpt-oss-{20b,120b} omits the ``torch_dtype`` /
``dtype`` field, so transformers 5.x's ``_from_config`` (used by AD's
meta-device build path) falls back to fp32. With fp32 activations,
trtllm attention's FMHA path is disabled (it only supports fp16/bf16)
and the unfused-MHA workspace explodes:
    Attention workspace size is not enough, increase the size from
    268435456 bytes to 9928387479808 bytes
    CUDA out of memory. Tried to allocate 9246.53 GiB.
Pinning ``model_kwargs.dtype: bfloat16`` lets the FMHA path stay active.
Validated:
* gpt-oss-20b GSM8K full 1319 samples: 85.82% (ref 85.823, PASSED).
* Modeling-side trtllm-gen hook works on 20b too:
``prepped 24/24 layers (moe_tp=1r0, moe_ep=1r0)``,
``fc1_w_shape=(32, 5888, 1536)`` — 32 experts × 24 decoder layers
consistent with the 20b config (120b is 128 × 36).
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
…elpers Modeling-side __init__ code no longer reads the active DistConfig via the contextvars-backed get_active_dist_config (that path moved to the transform sharding load hook + FuseMXFP4Moe), so the helpers in dist_config.py have no callers. Removes _ACTIVE_DIST_CONFIG / get_active_dist_config / use_dist_config and their dead imports. build_model.py is unchanged. Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
…ments from base Brings back the architecture summary, AD-canonical-ops list, and inline forward annotations from the 3ae0b70 base that got dropped during the sharding-IR rewrite, while keeping the new sharding-hint sections of the docstring + the existing code. Also trims the now-redundant lm_head / registration comments (covered by the module docstring or stale). Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
Covers the new bias-aware fusion added to FuseGemms vs the 3ae0b70 base:
* All-bias siblings fuse into one linear with stacked bias (concat dim=0).
* Mixed bias / no-bias siblings on the same parent get bucketed separately (one fused with-bias linear + one fused no-bias linear).
Existing FusableModel3's stale "no bias support yet" note is updated to reflect the new bucketing behavior.
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
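The bucketing rule can be illustrated with a small grouping function. This is a sketch under assumptions: ``LinearNode`` and ``bucket_siblings`` are hypothetical names, and the real transform works on FX graph nodes rather than dataclasses.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class LinearNode:
    name: str
    parent: str        # the shared input activation
    out_features: int
    has_bias: bool


def bucket_siblings(nodes: List[LinearNode]) -> Dict[Tuple[str, bool], List[LinearNode]]:
    """Group linears by (parent, has_bias); keep only groups worth fusing."""
    buckets: Dict[Tuple[str, bool], List[LinearNode]] = defaultdict(list)
    for node in nodes:
        buckets[(node.parent, node.has_bias)].append(node)
    return {key: group for key, group in buckets.items() if len(group) > 1}


nodes = [
    LinearNode("q_proj", "x", 64, True),
    LinearNode("k_proj", "x", 16, True),
    LinearNode("gate", "x", 128, False),
    LinearNode("up", "x", 128, False),
]
fused = bucket_siblings(nodes)
# Fused output width per bucket: weights (and biases, if any) stack on dim 0.
fused_widths = {key: sum(n.out_features for n in group) for key, group in fused.items()}
```

Here the mixed siblings of ``x`` yield one with-bias bucket (width 80) and one no-bias bucket (width 256).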
…ly override Pipeline trace confirms the only module-level dtype walk is ``QuantConfigReader.post_process_model``'s ``model.to(new_dtype)``, which fires *before* PATTERN_MATCHER. At that point ``_dtype_protected_params`` is unset (FuseMXFP4Moe sets it later) so the override degenerates to a normal ``nn.Module._apply``. No subsequent ``gm.to(dtype)`` exists. Removes the override + both transform-side ``_dtype_protected_params`` setters + the matching docstrings. Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
…-out lines
* Remove "match PT test_w4_1gpu" trailing comment on GSM8K_MAX_OUTPUT_LEN.
* Remove the MODEL_PARAMS entry-format docstring; the parametrize names and the if-elif moe_topology dispatch below are already self-describing.
* Remove the commented-out ``marks=pytest.mark.skip_less_device(4)``.
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
… only for MoE topology Source of truth for ``world_size`` is the registry (``_get_registry_yaml_extra``'s 2nd return value — defaults to 1 when yaml doesn't carry an explicit ``world_size_N.yaml``); MODEL_PARAMS' 3rd column is a thin ``world_size_override`` used only for the ``120b-tp2`` / ``120b-ep2`` cases that exercise MoE-TP / MoE-EP on top of the same yaml. Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
Consolidates the prep-helper invariants (previously in test_prepare_trtllm_gen_moe_mxfp4_weights.py) and the unified op's act_dtype-dispatch contract into a single file at tests/unittest/auto_deploy/singlegpu/custom_ops/moe/test_trtllm_quant_mxfp4_trtllm_gen_moe.py.
Coverage:
* fc1 / fc2 bias rows must follow the SAME TMA permute as the weights (gated-act-gemm + epilogue-tile reorder for w3/w1; epilogue-tile reorder only for w2). This guards against the gpt-oss-120b GSM8K 2% bug.
* Byte-identical match against PT's MXFP4 reference loader.
* ``act_dtype="bf16"`` and ``act_dtype="mxfp8"`` both run end-to-end on Blackwell+; invalid ``act_dtype`` raises ``ValueError``.
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
Pins the POST_LOAD_FUSION contract of ``FuseMXFP4Moe``:
* Raw HF MXFP4 buffers (``gate_up_proj_{blocks,scales,bias}`` /
``down_proj_{blocks,scales,bias}``) are deleted and replaced by the six
prepared ``*_trtllm`` params on the experts module.
* The ``trtllm_quant_mxfp4_trtllm_gen_moe_fused`` op's weight/bias arg
slots (4..9) are re-pointed at the new prepared get_attr nodes.
* ``moe_tp_size > 1`` divides ONLY ``fc2_bias_trtllm`` by ``moe_tp_size``
(so the post-AR sum reproduces the unsharded bias); all other prepared
tensors match the TP=1 prep output byte-for-byte.
* Re-running on an already-prepped graph is a no-op (idempotent skip).
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
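The reason only ``fc2_bias_trtllm`` is divided by ``moe_tp_size`` follows from the post-all-reduce sum: every rank adds its bias shard once, so the shards must add up to the original bias. A toy check in plain Python, with ``shard_fc2_bias`` as a hypothetical helper and made-up per-rank partial outputs:

```python
from typing import List


def shard_fc2_bias(bias: List[float], moe_tp_size: int) -> List[float]:
    """Scale the bias so that summing it over all TP ranks restores it."""
    return [b / moe_tp_size for b in bias]


moe_tp_size = 2
bias = [0.5, -2.0, 4.0]
# Per-rank partial matmul outputs (before bias), chosen for illustration.
partial_out = [[1.0, 2.0, 3.0], [0.25, -1.0, 0.5]]

bias_shard = shard_fc2_bias(bias, moe_tp_size)
rank_out = [[p + b for p, b in zip(partial, bias_shard)] for partial in partial_out]

# All-reduce (sum over ranks), then compare with the unsharded computation.
reduced = [sum(col) for col in zip(*rank_out)]
expected = [sum(col) + b for col, b in zip(zip(*partial_out), bias)]
```

If the bias were left undivided, the reduced output would contain ``moe_tp_size`` copies of it.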
…g_transform_executor disables ``apply_sharding_hints`` is the only sharding pass actually used here; the explicit ``enabled: false`` lines for ``detect_sharding`` and ``sharding_transform_executor`` are no-ops (those passes are off by default in this pipeline) and just add noise. Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
- 20B now inherits the same MXFP4/sharding/fuse transforms as 120B (it was missing them despite being MXFP4 too).
- world_size moves out of the model yaml and is supplied by the registry's world_size_N.yaml overlay.
- models.yaml, cookbook, and supported-models.md all point at the unified config.
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
get_sm_version() is already @lru_cache(maxsize=1), so the manual _SM_VERSION cache adds nothing. The try/except fallback to 0 was dead defensive code: this branch only triggers on CUDA bf16 tensors, where torch.cuda.get_device_properties(0) cannot fail. Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
…XFP4 prep _tp_slice_intermediate_axis() pre-pads I to i_padded_tp before slicing, and _get_weight_alignment() guarantees the alignment is a multiple of tp_size, so the helper already handles non-tp-divisible intermediate sizes (the original I is never reused downstream — only per_rank_i is). The guard rejected exactly the shapes the helper was designed to support. Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
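The pad-then-slice behavior can be checked against the shapes reported earlier in this PR. The sketch below is an assumption-laden reconstruction, not the helper itself: it assumes the effective row alignment is ``128 * tp_size`` (128 is the weight alignment quoted for the prep helper; the scaling by ``tp_size`` is inferred from the reported shapes), and ``per_rank_intermediate`` is a hypothetical name.

```python
def round_up(x: int, multiple: int) -> int:
    return ((x + multiple - 1) // multiple) * multiple


def per_rank_intermediate(i: int, tp_size: int, base_alignment: int = 128) -> int:
    """Pad I so it splits evenly, then return the per-rank slice length."""
    alignment = base_alignment * tp_size   # ASSUMPTION: inferred, see lead-in
    i_padded_tp = round_up(i, alignment)   # pad BEFORE slicing
    return i_padded_tp // tp_size


# gpt-oss intermediate size; fc1 stacks gate and up, hence the factor 2.
fc1_rows_tp1 = 2 * per_rank_intermediate(2880, tp_size=1)
fc1_rows_tp2 = 2 * per_rank_intermediate(2880, tp_size=2)
# An intermediate size that tp_size does not divide still slices cleanly.
odd_case = per_rank_intermediate(2881, tp_size=2)
```

Under these assumptions the results reproduce the reported fc1 row counts of 5888 (TP=1) and 3072 (TP=2), and a non-divisible ``I`` needs no guard because only the padded length is sliced.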
…llers already contiguous where needed).
- _shuffle_per_expert: drop per-expert loop; batched torch.index_select on dim=1 (permute derived once on stacked[0], _PERMUTE_CACHE is shape-keyed).
- default.yaml: comment fuse_mxfp4_moe.expect_mem_change with alignment-padding rationale.
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
- _register_mxfp4_expert_params (Triton): torch.zeros -> torch.empty with device=gu_w.device.
- _apply_trtllm: raw_specs + make_swiglu_param_tensors now use device=gu_w_t.device.
- Avoids materializing giant CPU buffers on meta-device builds (GPT-OSS-120B); load hook overwrites bytes anyway.
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
This reverts commit ef50383. Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
… only for HF-loaded params Previous "Keep MXFP4 placeholders on existing param device" change (ef50383) also routed make_swiglu_param_tensors through param_device, which is meta on the normal build. swiglu_alpha/beta/limit (1.702/1.0/7.0) are NOT in HF safetensors, so meta tensors silently dropped the values and tanked GSM8K. Restore the memory-saving device reuse for raw HF buffers (blocks/scales/bias) and keep SwiGLU constants on CPU with real values; add a comment so it isn't re-broken. Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
Replace three inline ((x + a - 1) // a) * a expressions with tensorrt_llm.math_utils.pad_up. Same arithmetic, less to misread. Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
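For reference, the arithmetic being replaced is an integer round-up to the next multiple. The function below is a local stand-in with the same arithmetic as the inline expression, written for illustration; it is not imported from ``tensorrt_llm.math_utils``, whose exact signature is not shown in this PR.

```python
def pad_up(x: int, a: int) -> int:
    """Round x up to the nearest multiple of a (local stand-in)."""
    return ((x + a - 1) // a) * a


# gpt-oss shapes: I = H = 2880, two FP4 values packed per byte.
padded_rows = pad_up(2880, 128)        # intermediate rows at 128 alignment
packed_cols = pad_up(2880 // 2, 256)   # packed hidden bytes at 256 alignment
already_aligned = pad_up(256, 256)     # multiples are left unchanged
```

The two shape values agree with the TP=1 prepared weight shape reported earlier in this PR, ``(128, 5888, 1536)``, where 5888 is twice the padded row count.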
…t inline
Pull the GPT-OSS sharding invariants into gpt_oss.yaml so the model
registry is the single source of truth:
- detect_sharding.enabled=false
- sharding_transform_executor.enabled=false
Both were inlined in TestGPTOSS.test_mxfp4_gsm8k only for the
tp2/ep2 parametrize cases; with them in yaml, trtllm-serve via this
config now uses the same apply_sharding_hints-only sharding path as
the test.
Test inline keeps only the per-parametrize dist_mapping override; the
already-duplicated apply_sharding_hints.{enabled, requires_shape_prop,
shard_layers} keys (also present in yaml) are dropped. pydantic-settings
deep-merges init kwargs into yaml-sourced transforms, so the effective
config is unchanged across all 4 parametrize cases.
Pattern matches _IR_SHARDING_TRANSFORMS used by the existing IR-sharding
tests (TestNemotronSuperV3_IR, TestQwen3_5_MoE_IR).
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
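The effect of deep-merging the test's inline override into the yaml-sourced transforms can be sketched with a generic recursive merge. This is an illustration of the merge semantics described above, not pydantic-settings' implementation; ``deep_merge`` is a hypothetical helper and the dicts contain only keys and values quoted in this PR.

```python
from copy import deepcopy
from typing import Any, Dict


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict where nested dicts are merged key by key."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Settings that now live in gpt_oss.yaml.
yaml_transforms = {
    "detect_sharding": {"enabled": False},
    "sharding_transform_executor": {"enabled": False},
    "apply_sharding_hints": {"enabled": True, "shard_layers": ["mha", "moe"]},
}
# What the test still passes inline for the tp2 case.
inline_override = {
    "apply_sharding_hints": {"dist_mapping": {"tp": 2, "moe_tp": 2, "moe_ep": 1}},
}

effective = deep_merge(yaml_transforms, inline_override)
```

The override only adds ``dist_mapping``; the yaml-sourced ``enabled`` and ``shard_layers`` keys survive, which is why dropping the duplicated inline keys leaves the effective config unchanged.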
…-window models
- PR NVIDIA#13745 made trtllm `get_cache_initializers` propagate `sliding_window`, so models with non-uniform windows (e.g. gpt-oss-120b: 128/4096) form >1 KV pool.
- trtllm enforces a single uniform pool (`requires_uniform_kv_caches`) → crash at cache_init: "KV resources are not uniform".
- Temporary fix: trtllm-only revert of the pool-alloc change (handler `sliding_window=0` → single full-seq pool, SWA layers over-allocate KV); the sliding-window mask is still applied via the op's own `sliding_window` arg. Validated gsm8k[120b]=90.4.
- Re-enabling multi-pool needs trtllm-native VSWA: per-pool `block_offsets` tables + real `pool_mapping` routing. The degenerate per-layer-pointer path can't address multiple pools, and flashinfer-style host-sliced views corrupt the cyclic-window trtllm kernel (naive attempt → gsm8k 6%).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
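Why a host-sliced view and a cyclic-window layout disagree can be shown with a generic ring buffer. The sketch below illustrates cyclic indexing in general (slot = position modulo window); it is not the trtllm kernel's actual memory layout, which this PR does not show, and both function names are hypothetical.

```python
from typing import List, Optional


def cyclic_slot(position: int, window: int) -> int:
    """Slot a token occupies when the window is addressed cyclically."""
    return position % window


def write_tokens(num_tokens: int, window: int) -> List[Optional[int]]:
    """Write token positions 0..num_tokens-1 into a window-sized ring."""
    slots: List[Optional[int]] = [None] * window
    for pos in range(num_tokens):
        slots[cyclic_slot(pos, window)] = pos  # overwrites the oldest entry
    return slots


ring = write_tokens(num_tokens=6, window=4)
# What a chronological "last `window` tokens" slice would assume instead.
chronological = list(range(6 - 4, 6))
```

The ring holds the same four tokens as the chronological slice but in a different order, so a consumer that expects one layout and is handed the other reads the wrong entries.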
- Lazy-import _get_weight_alignment so fused_moe_mxfp4 imports without tensorrt_llm; match_dense_moe_pattern/quantize_mxfp4_moe/fuse_mxfp4_moe re-register in standalone (fixes KeyError cascade in llmc standalone tests).
- Exclude trtllm-gen-only MXFP4 tests from the standalone package.
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
- linear.py imported get_sm_version from tensorrt_llm._utils (added in the bf16->cublas_mm sm>=100 routing); standalone has no tensorrt_llm, so the central linear op module silently skipped registration.
- Caused 18 collection errors + 199 failures in llmc standalone tests.
- Use ..._compat.get_sm_version (works in both TRT-LLM and standalone modes).
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
…n trtllm attention
Enable multiple KV cache memory pools in AutoDeploy's trtllm attention so non-uniform sliding-window models (e.g. gpt-oss) work on the trtllm backend. The window-group machinery already existed for triton/flashinfer; this removes the three trtllm-specific blockers:
- Gate: requires_uniform_kv_caches now returns False, so the unified KVCacheManager may host more than one pool for trtllm.
- Per-group block_offsets: the trtllm planner keeps an address-stable block_offsets buffer per KV window group (keyed by the group's cache_loc input pointer) so per-group prepare_trtllm_metadata invocations no longer clobber a single shared buffer.
- Cyclic-window staging: the trtllm kernel masks the sliding window internally via cyclic indexing, so the executor passes the full per-window block table and global KV length (mirroring the PyTorch backend) instead of host-slicing. A new AttentionDescriptor.kernel_handles_cyclic_swa() capability (True only for trtllm) is plumbed through the kvcache transform and CachedSequenceInterface.
Adds unit coverage for per-group buffers, cyclic-view staging, the gate, the backend plumbing, and a forward-level two-pool trtllm test.
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
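The per-group buffer idea can be sketched as a dictionary of preallocated buffers that are only ever updated in place. This is a minimal illustration in plain Python: ``BlockOffsetsPlanner`` is a hypothetical class, lists stand in for device tensors, and an integer stands in for the ``cache_loc`` input pointer used as the key.

```python
from typing import Dict, List


class BlockOffsetsPlanner:
    """Keeps one fixed-size block_offsets buffer per KV window group."""

    def __init__(self, max_entries: int) -> None:
        self._max_entries = max_entries
        self._buffers: Dict[int, List[int]] = {}

    def buffer_for(self, group_key: int) -> List[int]:
        # Allocate once per group; later calls return the same object,
        # which is what keeps the buffer address stable.
        if group_key not in self._buffers:
            self._buffers[group_key] = [0] * self._max_entries
        return self._buffers[group_key]

    def prepare(self, group_key: int, block_table: List[int]) -> List[int]:
        buf = self.buffer_for(group_key)
        if len(block_table) > len(buf):
            raise ValueError("block table exceeds preallocated buffer")
        buf[: len(block_table)] = block_table  # in-place update, no realloc
        return buf


planner = BlockOffsetsPlanner(max_entries=4)
swa = planner.prepare(0x1000, [1, 2, 3])   # e.g. the 128-window group
full = planner.prepare(0x2000, [7, 8])     # e.g. the 4096-window group
swa_again = planner.prepare(0x1000, [4, 5, 6])
```

Preparing the second group leaves the first group's buffer untouched, and re-preparing a group reuses the same buffer object, which is the property a captured CUDA graph would rely on.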
…ss branch
Reverts the temporary single-pool workaround (commit 0679581) now that the trtllm backend supports multiple KV pools (per-group block_offsets + cyclic full-table staging): restores sliding_window propagation in get_cache_initializers so non-uniform-window models form one pool per window. Also fixes the VSWA metadata test backend name after rebase (triton_paged -> triton).
Validated on gpt-oss-20b (AD-trtllm, GB200, gpt_oss.yaml):
- gsm8k(200) multi-pool = 85.500, single-pool = 85.500 (ref 85.823): no accuracy regression.
- Same 118.96GB KV budget: multi-pool exposes 6,271,552 KV tokens vs single-pool 524,288 (~12x), since the 12 SWA layers use a 128-window pool instead of over-allocating the full 4096 window.
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
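The "~12x" figure is the ratio of the two reported KV token capacities. A short check using only the numbers quoted above:

```python
# KV token capacities reported for the same 118.96GB KV budget.
multi_pool_tokens = 6_271_552
single_pool_tokens = 524_288

capacity_ratio = multi_pool_tokens / single_pool_tokens
```

The ratio comes out just under 12, consistent with the "~12x" in the commit message; it describes static cache capacity, not throughput.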
Author
|
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" |
d9f7f0f to
9210625
Compare
Summary
Enables multiple KV cache memory pools in AutoDeploy's `trtllm` attention backend, then re-enables it on this gpt-oss branch by reverting the temporary single-pool workaround (commit 0679581675). Targets `taylor/gpt-oss-0511_rebase_0511` so it composes with the gpt-oss accuracy CI test + optimizations already on that branch. Implements NVIDIA#14828.
Two commits:
- `requires_uniform_kv_caches` → False, so `_identify_managed_kv_resources` no longer raises with >1 pool (AD trtllm uses per-layer KV views, so `block_offset_multiplier` is uniformly `kv_factor=2`).
- `block_offsets`: the planner keeps an address-stable buffer per KV window group (keyed by the group's `cache_loc` input ptr) so per-group `prepare_trtllm_metadata` invocations don't clobber a single shared buffer. CUDA-graph safe.
- `AttentionDescriptor.kernel_handles_cyclic_swa()` (True only for trtllm) is plumbed through the kvcache transform -> `CachedSequenceInterface` -> executor. triton/flashinfer keep host-slicing.
- `sliding_window` propagation in `get_cache_initializers` so non-uniform-window models (gpt-oss: 128/4096) form one pool per window again.
Validation (gpt-oss-20b, AutoDeploy trtllm, 1xGB200, `gpt_oss.yaml`)
Built with `auto-dev install -s`. gsm8k via the same `tensorrt_llm.evaluate.GSM8K` evaluator as `TestGPTOSS` (200 samples, `reasoning_effort=low`), A/B vs the single-pool workaround. Figures as reported in the revert commit message:
- gsm8k(200): multi-pool 85.500, single-pool 85.500 (ref 85.823).
- KV tokens under the same 118.96GB KV budget: multi-pool 6,271,552, single-pool 524,288 (~12x).
Caveat: the ~12x is a static KV-capacity measurement from the cache-manager `resize_kv_cache` init logs, not a throughput/latency benchmark. Accuracy is a real end-to-end measurement; realized serving speedup was not benchmarked here.
Unit tests
Per-group `block_offsets` (non-clobbering + stable ptr), cyclic-view staging, `kernel_handles_cyclic_swa` plumbing, the multi-pool gate, and a forward-level two-pool trtllm test; all pass on the rebased branch (`tests/unittest/auto_deploy/singlegpu/...`).
CI
AutoDeploy change; run with the AutoDeploy stages:
🤖 Generated with Claude Code