[NVIDIA] GPTOSS GB200 DISAGG Configurations + Assign EP explicitly for AGG #387
…ith the master file refactor MR 2. Add GB200 DISAGG configs Signed-off-by: Jatin Gangani <jgangani@dc2-container-xterm-014.prd.it.nvidia.com>
- Explicitly assign EP=TP for DP attention AGG candidates. EP was defaulted=1 during multinode refactor Signed-off-by: Jatin Gangani <jgangani@dc2-container-xterm-014.prd.it.nvidia.com>
Summary of Changes
This pull request introduces significant updates to the configuration and benchmarking infrastructure for GPTOSS models on GB200 systems. It primarily focuses on enabling and optimizing disaggregated inference configurations using Dynamo-TRT, alongside refining existing aggregated configurations by explicitly managing Expert Parallelism for Data Parallel attention. These changes aim to improve the flexibility and correctness of model deployment and benchmarking across various parallelism strategies.
Code Review
This pull request adds new disaggregated configurations for GPT-OSS on GB200 and updates existing configurations to explicitly set the expert parallelism (EP) size for DP attention, which is a good fix. The changes look mostly good, but I have a few suggestions to improve maintainability and robustness.
Specifically, I've pointed out some opportunities to reduce duplication in the YAML configuration using anchors, noted a fragile dependency on a personal git branch in a new benchmark script, and suggested some minor consistency improvements in shell scripts. I've also left reminders for placeholders like TODOs and XXX for the PR link that should be addressed before merging.
- Addressed few gemini code review comments Signed-off-by: Jatin Gangani <jgangani@dc2-container-xterm-014.prd.it.nvidia.com>
- "DECODE_NODES=8"

gptoss-fp4-gb200-dynamo-trt:
  image: jwillthomson/dynamo-trtllm-1.2.0rc2-min-tokens-fix-v2
plz use official images
Updated with official image with the latest commit.
b243a20 to b53f844
Signed-off-by: Jatin Gangani <jgangani@dc2-container-xterm-014.prd.it.nvidia.com>
b53f844 to 5c93796
ntasks_per_node=4

gen_nodes=$(((DECODE_TP + 3)/4 * DECODE_NUM_WORKERS))
total_nodes=$((PREFILL_NUM_WORKERS + gen_nodes))
Node calculation formula over-allocates resources for multi-worker configs
The gen_nodes formula (DECODE_TP + 3)/4 * DECODE_NUM_WORKERS allocates one node per worker when TP < 4, but the YAML configuration expects workers to share nodes. For example, the "D:4xTP2" config has num-worker: 4, tp: 2, and DECODE_NODES=2 in additional-settings. The formula calculates gen_nodes=4 (one node per worker), but only 2 nodes are needed (8 GPUs total fits on 2 nodes). This causes the sbatch request to allocate 5 total nodes instead of the expected 3, wasting cluster resources.
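The arithmetic in this comment can be checked with a small sketch. The function names are hypothetical and do not appear in the PR. `gen_nodes_packed` is only one possible alternative: it assumes the launcher can place several decode workers on the same 4-GPU node, which this thread does not confirm, and it does not check that a single worker stays within one node.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; names are not from the PR.
GPUS_PER_NODE=4  # matches ntasks_per_node=4 in the script under review

# Formula as written in the script: round up per worker, then multiply.
gen_nodes_per_worker() {  # args: DECODE_TP DECODE_NUM_WORKERS
  echo $(( ($1 + 3) / 4 * $2 ))
}

# Assumed alternative: total the GPUs across workers, then round up once.
gen_nodes_packed() {      # args: DECODE_TP DECODE_NUM_WORKERS
  echo $(( ($1 * $2 + GPUS_PER_NODE - 1) / GPUS_PER_NODE ))
}

# The "D:4xTP2" case from the comment: tp=2, num-worker=4.
gen_nodes_per_worker 2 4  # prints 4
gen_nodes_packed 2 4      # prints 2
```

For configurations where TP is a multiple of 4, both functions return the same node count, so the difference only shows up when workers are smaller than a node.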
- gptoss-fp4-b200-trt
description:
- Explicitly add EP=TP for DP attention configs. Multinode Refactor inadvertently changed default EP=1
- Add GPTOSS DISAGG configurations for GB200 and B200
not disagg for B200 right?
also pls specify "GPTOSS DISAGG for 1k1k and 8k1k"
trying to make these slightly more detailed since they are now displayed on inferencemax dot ai
Thanks for the catch. Removed B200 from DISAGG comment in latest commit.
@jgangani left some comments
cquil11 left a comment
lgtm once comments are addressed!
Signed-off-by: Jatin Gangani <jgangani@dc2-container-xterm-014.prd.it.nvidia.com>
Signed-off-by: Jatin Gangani <jgangani@dc2-container-xterm-014.prd.it.nvidia.com>
@cquil11 can you please merge this?
Note
Introduces multi-node DISAGG benchmarking for GPT-OSS on GB200 and fixes explicit EP settings for DP-attention on B200 TRT.
- Adds gptoss-fp4-gb200-dynamo-trt to nvidia-master.yaml with 1k1k and 8k1k search spaces (prefill/decode worker counts, TP/EP, DP-attn, conc-lists, and token/batch/mem settings)
- Updates the gptoss-fp4-b200-trt search-space to explicitly set ep: tp for DP-attn configs and adjust concurrency ranges
- Updates benchmarks/gptoss_fp4_b200_trt_slurm.sh to conditionally configure MoE AllToAll: disable when EP_SIZE=1, use MNNVL when EP_SIZE>1
- Adds benchmarks/gptoss_fp4_gb200_dynamo-trt_slurm.sh to clone/run Dynamo TRT DISAGG sweeps and submit SLURM jobs
- Extends runners/launch_gb200-nv.sh for dynamo-trt (GPT-OSS model path/served name) and broadens result directory matching
- Updates perf-changelog.yaml

Written by Cursor Bugbot for commit b011864.
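The conditional MoE AllToAll rule described in the note (disabled when EP_SIZE=1, MNNVL when EP_SIZE>1) could look roughly like the sketch below. The function name and the "disabled" label are hypothetical. The actual config keys used in benchmarks/gptoss_fp4_b200_trt_slurm.sh are not shown in this thread, so this only illustrates the branching, not the real script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: maps an EP size to an AllToAll mode label.
# The labels are illustrative; the real script's config keys are not shown here.
select_moe_alltoall() {  # arg: EP_SIZE
  if [ "$1" -gt 1 ]; then
    echo "MNNVL"      # expert parallelism in use: enable MNNVL AllToAll
  else
    echo "disabled"   # EP_SIZE=1: no cross-rank expert traffic to route
  fi
}

EP_SIZE="${EP_SIZE:-1}"  # assumed default, mirroring the EP=1 default discussed in the PR
echo "MoE AllToAll: $(select_moe_alltoall "$EP_SIZE")"
```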