[NVIDIA] Enable GPTOSS GB200 DISAGG #232
thanks for this contribution @jgangani Can you explain what this means? are all of the datapoints just 4 gpus for prefill only and then 4 gpus for decode only? if not, can u explain the parallelism config & the conc for each datapoint?
./submit_disagg.sh mtp=off tp 1 1 1 512 20000 "0.9" 0 0 "128 256 512"
./submit_disagg.sh mtp=off tp 1 1 2 1024 20000 "0.9" 0 0 "64 128 256"
./submit_disagg.sh mtp=off tep 1 1 2 1024 20000 "0.9" 0 0 "64 256"
./submit_disagg.sh mtp=off tp 1 1 4 2048 20000 "0.9" 0 0 "8 16 32 64 128"
./submit_disagg.sh mtp=off tp 1 1 8 2048 20000 "0.9" 0 0 "1 2 4 8 16"
Pull Request Overview
This PR enables disaggregation support for the GPT-OSS 120B model on GB200 hardware. The changes add GPT-OSS as a supported model alongside the existing DeepSeek-R1 configurations, implementing model-specific benchmark configurations and updating workflows to handle the new model.
Key Changes:
- Added GPT-OSS model detection and configuration in the GB200 launch script
- Implemented GPT-OSS-specific benchmark configurations with 8k/1k input/output sequence lengths
- Updated workflows to support GPT-OSS model selection and dynamic model prefix generation
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| runners/launch_gb200-nv.sh | Added GPT-OSS model path configuration, branch checkout logic, and model-specific benchmark parameters for disaggregation testing |
| .github/workflows/gb200-tests.yml | Added GPT-OSS to model options and implemented dynamic model prefix mapping for cleaner experiment naming |
| .github/workflows/full-sweep-8k1k-scheduler.yml | Added GPT-OSS configuration matrix entry and updated result collection dependencies |
| .github/workflows/benchmark-multinode-tmpl.yml | Updated filename parsing patterns to match new hyphenated format (gpus-N, ctx-N, gen-N) and added MODEL environment variable |
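
For readers unfamiliar with the hyphenated naming mentioned for benchmark-multinode-tmpl.yml, the following is a minimal sketch of how `gpus-N`, `ctx-N`, and `gen-N` fields might be pulled out of a result filename. The filename and the `get_field` helper are hypothetical and are not taken from the workflow itself; only the three field names come from the summary above.

```shell
# Hypothetical result filename; only the gpus-N / ctx-N / gen-N
# fields are grounded in the review summary.
fname="gptoss_gpus-8_ctx-1_gen-4.json"

# get_field KEY NAME: print the digits that follow "KEY-" in NAME.
# Hypothetical helper written with POSIX sed for portability.
get_field() {
    printf '%s\n' "$2" | sed -n "s/.*$1-\([0-9][0-9]*\).*/\1/p"
}

gpus=$(get_field gpus "$fname")
ctx=$(get_field ctx "$fname")
gen=$(get_field gen "$fname")

echo "gpus=$gpus ctx=$ctx gen=$gen"
```

A key that is missing from the name yields an empty string, so a caller would want to check for that before using the value.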
if [[ $MODEL == *"gpt-oss"* ]]; then
    # GPT-OSS specific benchmark configurations
    if [ "$isl" = "8192" ] && [ "$osl" = "1024" ]; then
Unnecessary blank line with trailing whitespace. Remove this line or the trailing spaces.
  # Find all result subdirectories in this logs directory
- RESULT_SUBDIRS=$(find "$LOGS_DIR" -name "ctx*_gen*_[td]ep*_batch*_eplb*_mtp*" -type d)
+ RESULT_SUBDIRS=$(find "$LOGS_DIR" -name "ctx*_gen*_*_batch*_eplb*_mtp*" -type d)
The find pattern 'ctx*_gen*_*_batch*_eplb*_mtp*' uses a generic wildcard (*) in the middle, which may match unintended directory names. Consider using a more specific pattern like 'ctx*_gen*_[td]ep*_batch*_eplb*_mtp*' or 'ctx*_gen*_{tp,tep,dep}_batch*_eplb*_mtp*' to match only valid parallelism strategies (tp/tep/dep).
- RESULT_SUBDIRS=$(find "$LOGS_DIR" -name "ctx*_gen*_*_batch*_eplb*_mtp*" -type d)
+ RESULT_SUBDIRS=$(find "$LOGS_DIR" -name "ctx*_gen*_{tp,tep,dep}_batch*_eplb*_mtp*" -type d)
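
One caveat worth checking before adopting the `{tp,tep,dep}` form: brace expansion is performed by the shell, not by `find`, and it does not happen inside double quotes, so `-name` would receive the braces literally. A minimal sketch of an alternative that spells out each strategy with `-o` is shown below. The directory names are hypothetical stand-ins created in a scratch directory, not names taken from the actual logs, and the `tp*`/`tep*`/`dep*` globs are an assumption about the real naming.

```shell
# Scratch tree with hypothetical directory names (assumed, not from the PR logs).
LOGS_DIR=$(mktemp -d)
mkdir -p "$LOGS_DIR/ctx1_gen4_tp4_batch64_eplb0_mtp0" \
         "$LOGS_DIR/ctx1_gen2_tep2_batch64_eplb0_mtp0" \
         "$LOGS_DIR/ctx1_gen1_dep8_batch64_eplb0_mtp0" \
         "$LOGS_DIR/ctx1_gen1_misc_batch64_eplb0_mtp0"

# Match only tp/tep/dep by OR-ing one -name test per strategy.
RESULT_SUBDIRS=$(find "$LOGS_DIR" -type d \( \
    -name "ctx*_gen*_tp*_batch*_eplb*_mtp*" -o \
    -name "ctx*_gen*_tep*_batch*_eplb*_mtp*" -o \
    -name "ctx*_gen*_dep*_batch*_eplb*_mtp*" \) | sort)

echo "$RESULT_SUBDIRS"

# Remove the scratch tree.
rm -rf "$LOGS_DIR"
```

In this sketch the `misc` directory is skipped while the three strategy directories are kept, which is the narrowing the review comment asks for.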
    "framework": "dynamo-sglang",
    "mtp": "off",
}
# GPTOSS
[nitpick] Inconsistent comment formatting. The DeepSeek comment on line 93 uses '# DSR1' while this uses '# GPTOSS' with different indentation. Align the comment indentation with line 93 for consistency.
- # GPTOSS
+ # GPTOSS

also @jgangani please merge this into the main branch/release candidate instead of doing a side branch: ai-dynamo/dynamo@release/0.5.1-rc0.20251105...jthomson04/gpt-oss-disagg-slurm

Following is the order:

Yes, that was the goal. I wanted to test out the MR before merging this into the release branch. Will update.

@jgangani thanks! Can u please enable 1k/8k and 1k/1k on gptoss gb200 in this PR too? Thanks!

@functionstackx Switched to dynamo release branch.
I am working on 1k1k DISAGG pareto configs next. 1k8k DISAGG will probably be on par with AGG since it is predominantly doing just decode. Hence, I recommend we merge this MR first. Does that make sense?

if u can submit gb200 agg for 1k/8k in this PR too

we're gonna hold off on this til #251 gets merged this week

@jgangani so sorry brother but can you please rebase with main following the convention set forth in https://github.com/InferenceMAX/InferenceMAX/pull/251 ?

Yes, I am working on it. Will open another MR based off post-251 merge.

@jgangani hi! where are we on this?

GB200 DISAGG for 8k1k is ready with refactored code. I can create an MR right away if need be. Still working through 1k1k config exploration. I will need a few more days for 1k1k.

@jgangani ok, no worries

happy new year @jgangani , what is the eta on this?

Happy new year! @cquil11 I have the branch ready for 1k1k. However, GB200 runners are not picking up the jobs; they get stuck at "Cleaning up resources" for hours and then get canceled. Can you take a quick look to see if you have a fix? https://github.com/InferenceMAX/InferenceMAX/actions/runs/20614644592/job/59388613154

nvm. Rebase seems to have fixed it. There was an MR to remove sudo from benchmark_multinode yaml.

This MR enables disaggregation for GPTOSS on GB200.
Modified files to add GPTOSS to Disagg runners and workflow.
Successful tests here:
https://github.com/InferenceMAX/InferenceMAX/actions/runs/19353241086/job/55369372877