[None][feat] Add AD custom model for GPT-OSS family#234
Conversation
lucaslie
left a comment
There was a problem hiding this comment.
Please rebase and then re run
python examples/auto_deploy/build_and_run_ad.py --model openai/gpt-oss-20b --use-registry
and make sure to post RAW LOGS FROM ALL PROMPTS AND THEIR OUTPUT
9990b70 to
d3a866d
Compare
lucaslie
left a comment
There was a problem hiding this comment.
[AGENT] Rebased and investigated the inference hang. Here's what I found:
Post-rebase regression: After rebasing onto the latest feat/paperclip_maximizer (20+ new commits), the model compiles successfully on 4×H100 GPUs but hangs at inference with enable_attention_dp: true. The same config worked pre-rebase (run 2026-03-12 16:37:19 with coherent generation).
Debugging done:
fuse_add_rms_norm(new transform in base branch) — disabled it, still hangs. Ruled out.simple_shard_only— crashes with sinks TP mismatch (64 vs 16 heads), as expected since attention IS sharded.- All graph transforms complete, including sharding (EP=4, TP=1). Hang occurs at first inference forward in MoE sigmoid op while other ranks wait for a collective.
Root cause hypothesis: The EP dispatch path (all-to-all for token routing between expert ranks) may be inserting collectives around my model's vectorized BMM expert computation, but the BMM computes all experts locally without issuing collectives, causing the deadlock. Something in the base branch's 20+ new commits (likely a change to EP transform behavior for new MoE models) changed how EP works with stacked nn.Parameter expert weights.
Pre-rebase success log (for comparison): gpt_oss_20b_20260312_163719_success.log
Post-rebase hang log: gpt_oss_20b_20260312_185940_runtime_error.log
Could you help identify what changed in the EP/sharding path? Or suggest an alternative sharding approach that avoids the deadlock?
lucaslie
left a comment
There was a problem hiding this comment.
how about we try enabling EP only, no TP, no attention TP, no attention DP? You should be able to configure this via the manual sharding config in the gpt_oss.yaml and then re-run it and let's see what happens
lucaslie
left a comment
There was a problem hiding this comment.
[AGENT] Tried EP-only (sharding_dims: ['ep']) — it doesn't work:
-
OOM: EP-only doesn't shard attention weights, so each rank loads the full model (38.96 GB). On 80GB H100, this leaves only ~1.3GB free — not enough for KV cache (needs ~2GB).
-
0 MoE patterns detected:
detect_shardingreportsFound 0 MoE patternsbecause the custom model uses stackednn.Parameterexpert weights with vectorized BMM, not the standardtorch_moe/torch_moe_fusedpattern that the EP heuristic recognizes. So EP provides zero memory reduction.
Comparison with last working run (pre-rebase, enable_attention_dp: true): 12.26 GB per rank (attention DP replicates heads while TP shards attention weights) — this fit easily.
Options:
- A) Debug the post-rebase NCCL deadlock with
enable_attention_dp: true— the 2026-03-12 16:37:19 run worked with the same config on the old base - B) Add TP+EP support: fix sinks TP slicing in
torch_backend_attention.pyso attention can be TP-sharded with sinks, then usesharding_dims: ['tp', 'ep']
Which approach would you prefer?
|
can we configure sharding to just use simple shard + bmm sharding only? This uses all_gather, which isn't great performance-wise but at least it gives some functional support for now also don't forget to rebase |
Add AutoDeploy custom model implementation for the GPT-OSS MoE model family (openai/gpt-oss-20b, openai/gpt-oss-120b), replacing the previous gptoss-mxfp4 export patch with a full custom model. Key architectural features: - GQA attention with learnable per-head attention sinks - Custom clamped SwiGLU activation with per-expert biases - YaRN RoPE with precomputed cos/sin cache - MXFP4 checkpoint dequantization via load_state_dict pre-hook - Alternating sliding-window and full-attention layers Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
- Restore torch_moe_router canonical op (fix fake kernel to use hidden_states.device instead of hardcoded "meta") - Add gpt_oss.yaml config with attn_backend: torch (sinks only supported by torch_attention, not flashinfer/trtllm backends) - Fix sinks TP slicing in torch_backend_attention.py for multi-GPU - Add CUDA graph capture guard for empty batch_sizes list - Add eager fallback in CapturedGraph.forward() when no graphs captured Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
- Use compile_backend: torch-simple instead of disabling CUDA graphs - Use enable_attention_dp: true for attention replication (avoids sinks TP sharding issue without modifying torch_backend_attention) - Revert torch_cudagraph.py and torch_backend_attention.py changes - Add TODO note about sinks TP sharding limitation Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
- Use simple_shard_only + bmm sharding per reviewer feedback (uses all_gather for functional multi-GPU support) - Guard multimodal content-to-list conversion in llm.py with hasattr(processor, "image_processor") to fix TypeError in text-only model chat templates (e.g., GPT-OSS) Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
|
[AGENT] Rebased, applied Also fixed a chat template regression from the rebase (multimodal content-to-list conversion breaking text-only models) — guarded with Reproduction commandpython examples/auto_deploy/build_and_run_ad.py --model openai/gpt-oss-20b --use-registryRaw prompt/output logs (all 10 prompts)PROMPT 0 — How big is the universe?
PROMPT 1 — In simple words and a single sentence, explain the concept of gravity:
PROMPT 2 — How to fix slicing in golf?
PROMPT 3 — Where is the capital of Iceland?
PROMPT 4 — What are the three laws of thermodynamics?
PROMPT 5 — Summarize the plot of Romeo and Juliet in two sentences:
PROMPT 6 — Write a Python function that checks if a number is prime.
PROMPT 7 — Explain the difference between a compiler and an interpreter:
PROMPT 8 — What causes the northern lights?
PROMPT 9 — What are the health benefits of drinking green tea?
All responses are coherent, factually accurate, and in the expected |
* [None][feat] Add AD custom model for GPT-OSS family Add AutoDeploy custom model implementation for the GPT-OSS MoE model family (openai/gpt-oss-20b, openai/gpt-oss-120b), replacing the previous gptoss-mxfp4 export patch with a full custom model. Key architectural features: - GQA attention with learnable per-head attention sinks - Custom clamped SwiGLU activation with per-expert biases - YaRN RoPE with precomputed cos/sin cache - MXFP4 checkpoint dequantization via load_state_dict pre-hook - Alternating sliding-window and full-attention layers Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * [None][fix] Address reviewer feedback for GPT-OSS custom model - Restore torch_moe_router canonical op (fix fake kernel to use hidden_states.device instead of hardcoded "meta") - Add gpt_oss.yaml config with attn_backend: torch (sinks only supported by torch_attention, not flashinfer/trtllm backends) - Fix sinks TP slicing in torch_backend_attention.py for multi-GPU - Add CUDA graph capture guard for empty batch_sizes list - Add eager fallback in CapturedGraph.forward() when no graphs captured Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * [None][fix] Use torch-simple backend and attention DP per review - Use compile_backend: torch-simple instead of disabling CUDA graphs - Use enable_attention_dp: true for attention replication (avoids sinks TP sharding issue without modifying torch_backend_attention) - Revert torch_cudagraph.py and torch_backend_attention.py changes - Add TODO note about sinks TP sharding limitation Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * [None][fix] Use simple shard + BMM and fix chat template for GPT-OSS - Use simple_shard_only + bmm sharding per reviewer feedback (uses all_gather for functional multi-GPU support) - Guard multimodal content-to-list conversion in llm.py with hasattr(processor, "image_processor") to fix TypeError in text-only model chat templates (e.g., GPT-OSS) Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> --------- Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Summary
openai/gpt-oss-20b,openai/gpt-oss-120b), replacing the previousgptoss-mxfp4.pyexport patchopenai/gpt-oss-20bentry to model registry; update both entries to usegpt_oss.yamlconfigInfrastructure fixes (needed for this model)
device="meta"todevice=hidden_states.deviceto prevent device mismatch during FakeTensorPropcuda_graph_batch_sizes: []AD End-to-End Results
Successfully runs on 4xH100 GPUs with coherent, factually accurate generation across all test prompts.
Test plan