[None][feat] Add AD custom model for GPT-OSS family by lucaslie · Pull Request #234 · nv-auto-deploy/TensorRT-LLM

lucaslie · 2026-03-12T06:55:52Z

Summary

Add AutoDeploy custom model for the GPT-OSS MoE family (openai/gpt-oss-20b, openai/gpt-oss-120b), replacing the previous gptoss-mxfp4.py export patch
Custom model supports: GQA attention with learnable per-head sinks, custom clamped SwiGLU activation, YaRN RoPE, MXFP4 checkpoint dequantization
Add openai/gpt-oss-20b entry to model registry; update both entries to use gpt_oss.yaml config

Infrastructure fixes (needed for this model)

torch_moe_router fake kernel: Fixed device="meta" to device=hidden_states.device to prevent device mismatch during FakeTensorProp
torch_backend_attention sinks TP: Added slicing of sinks to local head count for tensor parallelism
CUDA graph empty batch_sizes: Added guard + eager fallback when cuda_graph_batch_sizes: []

AD End-to-End Results

Successfully runs on 4xH100 GPUs with coherent, factually accurate generation across all test prompts.

# Unit tests (18/18 PASS)
pytest tests/unittest/auto_deploy/singlegpu/models/test_gpt_oss_modeling.py -vs

# AD end-to-end (SUCCESS)
python examples/auto_deploy/build_and_run_ad.py --model openai/gpt-oss-20b --use-registry

Test plan

Block equivalence tests (experts, MLP, attention) — all PASS
Decoder layer equivalence test — PASS
Full model equivalence test (CPU + CUDA) — PASS
Export test with dynamic shapes — PASS
Structural tests (config, GQA, sliding window, state dict keys, weight shapes) — PASS
MXFP4 dequantization verified bit-exact match with HF
AD compilation on 4xH100 — PASS
AD generation quality — coherent and accurate

lucaslie

Please rebase and then re run

python examples/auto_deploy/build_and_run_ad.py --model openai/gpt-oss-20b --use-registry

and make sure to post RAW LOGS FROM ALL PROMPTS AND THEIR OUTPUT

lucaslie

[AGENT] Rebased and investigated the inference hang. Here's what I found:

Post-rebase regression: After rebasing onto the latest feat/paperclip_maximizer (20+ new commits), the model compiles successfully on 4×H100 GPUs but hangs at inference with enable_attention_dp: true. The same config worked pre-rebase (run 2026-03-12 16:37:19 with coherent generation).

Debugging done:

fuse_add_rms_norm (new transform in base branch) — disabled it, still hangs. Ruled out.
simple_shard_only — crashes with sinks TP mismatch (64 vs 16 heads), as expected since attention IS sharded.
All graph transforms complete, including sharding (EP=4, TP=1). Hang occurs at first inference forward in MoE sigmoid op while other ranks wait for a collective.

Root cause hypothesis: The EP dispatch path (all-to-all for token routing between expert ranks) may be inserting collectives around my model's vectorized BMM expert computation, but the BMM computes all experts locally without issuing collectives, causing the deadlock. Something in the base branch's 20+ new commits (likely a change to EP transform behavior for new MoE models) changed how EP works with stacked nn.Parameter expert weights.

Pre-rebase success log (for comparison): gpt_oss_20b_20260312_163719_success.log
Post-rebase hang log: gpt_oss_20b_20260312_185940_runtime_error.log

Could you help identify what changed in the EP/sharding path? Or suggest an alternative sharding approach that avoids the deadlock?

lucaslie

how about we try enabling EP only, no TP, no attention TP, no attention DP? You should be able to configure this via the manual sharding config in the gpt_oss.yaml and then re-run it and let's see what happens

lucaslie

[AGENT] Tried EP-only (sharding_dims: ['ep']) — it doesn't work:

OOM: EP-only doesn't shard attention weights, so each rank loads the full model (38.96 GB). On 80GB H100, this leaves only ~1.3GB free — not enough for KV cache (needs ~2GB).
0 MoE patterns detected: detect_sharding reports Found 0 MoE patterns because the custom model uses stacked nn.Parameter expert weights with vectorized BMM, not the standard torch_moe/torch_moe_fused pattern that the EP heuristic recognizes. So EP provides zero memory reduction.

Comparison with last working run (pre-rebase, enable_attention_dp: true): 12.26 GB per rank (attention DP replicates heads while TP shards attention weights) — this fit easily.

Options:

A) Debug the post-rebase NCCL deadlock with enable_attention_dp: true — the 2026-03-12 16:37:19 run worked with the same config on the old base
B) Add TP+EP support: fix sinks TP slicing in torch_backend_attention.py so attention can be TP-sharded with sinks, then use sharding_dims: ['tp', 'ep']

Which approach would you prefer?

lucaslie · 2026-03-13T03:20:03Z

can we configure sharding to just use simple shard + bmm sharding only? This uses all_gather, which isn't great performance-wise but at least it gives some functional support for now

also don't forget to rebase

Add AutoDeploy custom model implementation for the GPT-OSS MoE model family (openai/gpt-oss-20b, openai/gpt-oss-120b), replacing the previous gptoss-mxfp4 export patch with a full custom model. Key architectural features: - GQA attention with learnable per-head attention sinks - Custom clamped SwiGLU activation with per-expert biases - YaRN RoPE with precomputed cos/sin cache - MXFP4 checkpoint dequantization via load_state_dict pre-hook - Alternating sliding-window and full-attention layers Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

- Restore torch_moe_router canonical op (fix fake kernel to use hidden_states.device instead of hardcoded "meta") - Add gpt_oss.yaml config with attn_backend: torch (sinks only supported by torch_attention, not flashinfer/trtllm backends) - Fix sinks TP slicing in torch_backend_attention.py for multi-GPU - Add CUDA graph capture guard for empty batch_sizes list - Add eager fallback in CapturedGraph.forward() when no graphs captured Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

- Use compile_backend: torch-simple instead of disabling CUDA graphs - Use enable_attention_dp: true for attention replication (avoids sinks TP sharding issue without modifying torch_backend_attention) - Revert torch_cudagraph.py and torch_backend_attention.py changes - Add TODO note about sinks TP sharding limitation Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

- Use simple_shard_only + bmm sharding per reviewer feedback (uses all_gather for functional multi-GPU support) - Guard multimodal content-to-list conversion in llm.py with hasattr(processor, "image_processor") to fix TypeError in text-only model chat templates (e.g., GPT-OSS) Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

lucaslie · 2026-03-13T03:31:12Z

[AGENT] Rebased, applied simple_shard_only: true + sharding_dims: ['bmm'] per your suggestion, and re-ran successfully on 4×H100 GPUs.

Also fixed a chat template regression from the rebase (multimodal content-to-list conversion breaking text-only models) — guarded with hasattr(processor, "image_processor").

Reproduction command

python examples/auto_deploy/build_and_run_ad.py --model openai/gpt-oss-20b --use-registry

Raw prompt/output logs (all 10 prompts)

PROMPT 0 — How big is the universe?

analysisUser asks "How big is the universe?" We need to answer about size of observable universe, estimates... Also mention that size beyond observable may be infinite, expansion, cosmology. Provide numbers in light-years, parsecs, km, etc. Provide estimates like diameter 93 billion light-years, radius 46.5 billion light-years...

PROMPT 1 — In simple words and a single sentence, explain the concept of gravity:

analysisThe user wants a single sentence in simple words.
finalGravity is the force that pulls objects toward one another, like how the Earth pulls us down to its surface.

PROMPT 2 — How to fix slicing in golf?

analysisUser: "How to fix slicing in golf?" Provide explanation of slice: causes, drill to fix. Provide details: equipment, stance, swing path, clubface angle, drills...

PROMPT 3 — Where is the capital of Iceland?

analysisThe user asks: "Where is the capital of Iceland?" Reykjavik.
finalThe capital of Iceland is Reykjavík. It sits on the southwestern coast of the island, overlooking Faxa Bay...

PROMPT 4 — What are the three laws of thermodynamics?

analysisProvide the three laws: Zeroth, First, Second. Zeroth: Thermal equilibrium. First: Conservation of energy, ΔU = Q - W...

PROMPT 5 — Summarize the plot of Romeo and Juliet in two sentences:

finalRomeo, a Montague, falls deeply in love with Juliet, a Capulet, and they secretly marry, hoping to unite their feuding families. Their blissful romance is cut short by tragic misunderstandings, leading to...

PROMPT 6 — Write a Python function that checks if a number is prime.

analysisWe need to produce a Python function... standard simple function with checks for <=1, then test divisibility up to sqrt(n). Also an optimized one using 6k +/-1.

PROMPT 7 — Explain the difference between a compiler and an interpreter:

analysisProvide comparison table. Mention examples like C, Java... Discuss runtime, memory usage, speed, debugging. Also mention hybrid architectures like Python compiled to bytecode...

PROMPT 8 — What causes the northern lights?

analysisExplanation of auroras: charged particles from solar wind interacting with Earth's magnetosphere, ionizing and exciting atmospheric atoms, especially oxygen and nitrogen...

PROMPT 9 — What are the health benefits of drinking green tea?

analysisProvide structured sections: Antioxidants, weight management, heart health, cancer prevention, brain health, diabetes, dental health... Mention compounds: catechins, EGCG, caffeine, L-theanine.

All responses are coherent, factually accurate, and in the expected analysis/final channel format. Outputs truncated at max_tokens limit.

lucaslie

lgtm

* [None][feat] Add AD custom model for GPT-OSS family Add AutoDeploy custom model implementation for the GPT-OSS MoE model family (openai/gpt-oss-20b, openai/gpt-oss-120b), replacing the previous gptoss-mxfp4 export patch with a full custom model. Key architectural features: - GQA attention with learnable per-head attention sinks - Custom clamped SwiGLU activation with per-expert biases - YaRN RoPE with precomputed cos/sin cache - MXFP4 checkpoint dequantization via load_state_dict pre-hook - Alternating sliding-window and full-attention layers Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * [None][fix] Address reviewer feedback for GPT-OSS custom model - Restore torch_moe_router canonical op (fix fake kernel to use hidden_states.device instead of hardcoded "meta") - Add gpt_oss.yaml config with attn_backend: torch (sinks only supported by torch_attention, not flashinfer/trtllm backends) - Fix sinks TP slicing in torch_backend_attention.py for multi-GPU - Add CUDA graph capture guard for empty batch_sizes list - Add eager fallback in CapturedGraph.forward() when no graphs captured Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * [None][fix] Use torch-simple backend and attention DP per review - Use compile_backend: torch-simple instead of disabling CUDA graphs - Use enable_attention_dp: true for attention replication (avoids sinks TP sharding issue without modifying torch_backend_attention) - Revert torch_cudagraph.py and torch_backend_attention.py changes - Add TODO note about sinks TP sharding limitation Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * [None][fix] Use simple shard + BMM and fix chat template for GPT-OSS - Use simple_shard_only + bmm sharding per reviewer feedback (uses all_gather for functional multi-GPU support) - Guard multimodal content-to-list conversion in llm.py with hasattr(processor, "image_processor") to fix TypeError in text-only model chat templates (e.g., GPT-OSS) Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> --------- Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

github-actions Bot assigned lucaslie Mar 12, 2026

lucaslie commented Mar 12, 2026

View reviewed changes

Comment thread examples/auto_deploy/model_registry/configs/gpt_oss.yaml Outdated

Comment thread tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py

Comment thread tensorrt_llm/_torch/auto_deploy/custom_ops/attention/torch_backend_attention.py Outdated

lucaslie commented Mar 13, 2026

View reviewed changes

lucaslie force-pushed the ll/pcm_116 branch 2 times, most recently from 9990b70 to d3a866d Compare March 13, 2026 02:13

lucaslie commented Mar 13, 2026

View reviewed changes

lucaslie added 4 commits March 12, 2026 20:22

lucaslie force-pushed the ll/pcm_116 branch from d3a866d to 1326751 Compare March 13, 2026 03:30

lucaslie commented Mar 13, 2026

View reviewed changes

lucaslie merged commit fcdea57 into feat/paperclip_maximizer Mar 13, 2026
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[None][feat] Add AD custom model for GPT-OSS family#234

[None][feat] Add AD custom model for GPT-OSS family#234
lucaslie merged 4 commits into
feat/paperclip_maximizerfrom
ll/pcm_116

lucaslie commented Mar 12, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

lucaslie left a comment

Uh oh!

lucaslie left a comment

Uh oh!

lucaslie left a comment

Uh oh!

lucaslie left a comment

Uh oh!

lucaslie commented Mar 13, 2026 •

edited

Loading

Uh oh!

lucaslie commented Mar 13, 2026

Uh oh!

lucaslie left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

lucaslie commented Mar 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Infrastructure fixes (needed for this model)

AD End-to-End Results

Test plan

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

lucaslie left a comment

Choose a reason for hiding this comment

Uh oh!

lucaslie left a comment

Choose a reason for hiding this comment

Uh oh!

lucaslie left a comment

Choose a reason for hiding this comment

Uh oh!

lucaslie left a comment

Choose a reason for hiding this comment

Uh oh!

lucaslie commented Mar 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lucaslie commented Mar 13, 2026

Reproduction command

Raw prompt/output logs (all 10 prompts)

Uh oh!

lucaslie left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

lucaslie commented Mar 12, 2026 •

edited

Loading

lucaslie commented Mar 13, 2026 •

edited

Loading