[NVIDIA] Enable GPTOSS GB200 DISAGG by jgangani · Pull Request #232 · SemiAnalysisAI/InferenceX

jgangani · 2025-11-14T20:51:51Z

This MR enables disaggregation for GPTOSS on GB200.

The changes add GPTOSS to the disagg runners and workflow.

Successful tests here:
https://github.com/InferenceMAX/InferenceMAX/actions/runs/19353241086/job/55369372877

functionstackx · 2025-11-14T21:39:41Z

thanks for this contribution @jgangani

Can you explain what this means? Are all of the datapoints just 4 gpus for prefill only and then 4 gpus for decode only? If not, can u explain the parallelism config & the conc for each datapoint?

./submit_disagg.sh mtp=off tp 1 1 1 512 20000 "0.9" 0 0 "128 256 512"
./submit_disagg.sh mtp=off tp 1 1 2 1024 20000 "0.9" 0 0 "64 128 256"
./submit_disagg.sh mtp=off tep 1 1 2 1024 20000 "0.9" 0 0 "64 256"
./submit_disagg.sh mtp=off tp 1 1 4 2048 20000 "0.9" 0 0 "8 16 32 64 128"
./submit_disagg.sh mtp=off tp 1 1 8 2048 20000 "0.9" 0 0 "1 2 4 8 16"

Copilot

Pull Request Overview

This PR enables disaggregation support for the GPT-OSS 120B model on GB200 hardware. The changes add GPT-OSS as a supported model alongside the existing DeepSeek-R1 configurations, implementing model-specific benchmark configurations and updating workflows to handle the new model.

Key Changes:

- Added GPT-OSS model detection and configuration in the GB200 launch script
- Implemented GPT-OSS-specific benchmark configurations with 8k/1k input/output sequence lengths
- Updated workflows to support GPT-OSS model selection and dynamic model prefix generation

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| runners/launch_gb200-nv.sh | Added GPT-OSS model path configuration, branch checkout logic, and model-specific benchmark parameters for disaggregation testing |
| .github/workflows/gb200-tests.yml | Added GPT-OSS to model options and implemented dynamic model prefix mapping for cleaner experiment naming |
| .github/workflows/full-sweep-8k1k-scheduler.yml | Added GPT-OSS configuration matrix entry and updated result collection dependencies |
| .github/workflows/benchmark-multinode-tmpl.yml | Updated filename parsing patterns to match new hyphenated format (gpus-N, ctx-N, gen-N) and added MODEL environment variable |


Copilot · 2025-11-14T21:50:27Z

+        if [[ $MODEL == *"gpt-oss"* ]]; then
+            # GPT-OSS specific benchmark configurations
+            if [ "$isl" = "8192" ] && [ "$osl" = "1024" ]; then
+


Unnecessary blank line with trailing whitespace. Remove this line or the trailing spaces.

Suggested change

Copilot · 2025-11-14T21:50:27Z


    # Find all result subdirectories in this logs directory
-    RESULT_SUBDIRS=$(find "$LOGS_DIR" -name "ctx*_gen*_[td]ep*_batch*_eplb*_mtp*" -type d)
+    RESULT_SUBDIRS=$(find "$LOGS_DIR" -name "ctx*_gen*_*_batch*_eplb*_mtp*" -type d)


The find pattern 'ctx*_gen*_*_batch*_eplb*_mtp*' uses a generic wildcard (*) in the middle which may match unintended directory names. Consider using a more specific pattern like 'ctx*_gen*_[td]ep*_batch*_eplb*_mtp*' or 'ctx*_gen*_{tp,tep,dep}_batch*_eplb*_mtp*' to match only valid parallelism strategies (tp/tep/dep).

Suggested change

-    RESULT_SUBDIRS=$(find "$LOGS_DIR" -name "ctx*_gen*_*_batch*_eplb*_mtp*" -type d)
+    RESULT_SUBDIRS=$(find "$LOGS_DIR" -name "ctx*_gen*_{tp,tep,dep}_batch*_eplb*_mtp*" -type d)
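
One caveat about the suggested pattern: `find -name` matches shell-style globs (`*`, `?`, `[...]`) and does not perform brace expansion, and the shell does not expand braces inside double quotes either, so `{tp,tep,dep}` would be matched literally. A minimal sketch of one alternative, using grouped `-name` tests joined with `-o`. The sample directory names are hypothetical and only illustrate the shape implied by the patterns in this thread.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical sample directory names, for illustration only.
LOGS_DIR=$(mktemp -d)
mkdir -p "$LOGS_DIR/ctx1_gen1_tp4_batch2048_eplb0_mtp0" \
         "$LOGS_DIR/ctx1_gen1_tep2_batch1024_eplb0_mtp0" \
         "$LOGS_DIR/ctx1_gen1_dep8_batch512_eplb0_mtp0" \
         "$LOGS_DIR/ctx1_gen1_misc_batch1_eplb0_mtp0"

# Braces stay literal inside a quoted -name pattern, so nothing matches here.
BRACE_MATCHES=$(find "$LOGS_DIR" -type d \
    -name "ctx*_gen*_{tp,tep,dep}*_batch*_eplb*_mtp*" | wc -l | tr -d ' ')

# Explicit alternation: one -name test per strategy, grouped and OR-ed.
RESULT_SUBDIRS=$(find "$LOGS_DIR" -type d \( \
    -name "ctx*_gen*_tp*_batch*_eplb*_mtp*" -o \
    -name "ctx*_gen*_tep*_batch*_eplb*_mtp*" -o \
    -name "ctx*_gen*_dep*_batch*_eplb*_mtp*" \) | sort)

echo "brace pattern matched: $BRACE_MATCHES"
echo "$RESULT_SUBDIRS"

rm -rf "$LOGS_DIR"
```

With these sample names, the brace pattern matches no directories, while the grouped form returns the `tp`, `tep`, and `dep` directories and skips the `misc` one.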

Copilot · 2025-11-14T21:50:28Z

                          "framework": "dynamo-sglang",
                          "mtp": "off",
                      }
+                      # GPTOSS


[nitpick] Inconsistent comment formatting. The DeepSeek comment on line 93 uses '# DSR1' while this uses '# GPTOSS' with different indentation. Align the comment indentation with line 93 for consistency.

Suggested change

# GPTOSS

# GPTOSS

functionstackx · 2025-11-14T22:13:11Z

also @jgangani please merge this in main branch/release candidate instead of doing a side branch ai-dynamo/dynamo@release/0.5.1-rc0.20251105...jthomson04/gpt-oss-disagg-slurm

jgangani · 2025-11-14T22:31:48Z

> thanks for this contribution @jgangani
>
> Can you explain what this means? Are all of the datapoints just 4 gpus for prefill only and then 4 gpus for decode only? If not, can u explain the parallelism config & the conc for each datapoint?
>
> ./submit_disagg.sh mtp=off tp 1 1 1 512 20000 "0.9" 0 0 "128 256 512"
> ./submit_disagg.sh mtp=off tp 1 1 2 1024 20000 "0.9" 0 0 "64 128 256"
> ./submit_disagg.sh mtp=off tep 1 1 2 1024 20000 "0.9" 0 0 "64 256"
> ./submit_disagg.sh mtp=off tp 1 1 4 2048 20000 "0.9" 0 0 "8 16 32 64 128"
> ./submit_disagg.sh mtp=off tp 1 1 8 2048 20000 "0.9" 0 0 "1 2 4 8 16"

The arguments are in this order:
<gen_server_config> <ctx_num> <gen_num_servers> <gen_tp_size> <gen_bs> <gen_max_num_tokens>
1 GPU for prefill; 2/4/8 GPUs for decode.
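
To make the mapping concrete, here is a minimal sketch that labels the positional arguments of one of the calls above and derives the GPU count for that datapoint. It assumes the order given in this comment, one GPU per prefill (ctx) worker, and that the final quoted list holds the concurrencies. The three arguments in between (`"0.9" 0 0`) are not described in this thread, so the sketch leaves them unnamed. `describe_config` is a hypothetical helper, not part of `submit_disagg.sh`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: labels the arguments of a submit_disagg.sh call.
# Order assumed from this thread:
#   mtp=<on|off> <gen_server_config> <ctx_num> <gen_num_servers>
#   <gen_tp_size> <gen_bs> <gen_max_num_tokens> ... <concurrency list>
describe_config() {
    local mtp=$1 gen_server_config=$2 ctx_num=$3 gen_num_servers=$4
    local gen_tp_size=$5 gen_bs=$6 gen_max_num_tokens=$7
    # Arguments 8-10 ("0.9" 0 0) are not explained in the thread; skipped here.
    local concurrencies=${11}

    # Assumption: each prefill (ctx) worker uses a single GPU.
    local prefill_gpus=$(( ctx_num * 1 ))
    local decode_gpus=$(( gen_num_servers * gen_tp_size ))

    echo "${mtp} ${gen_server_config}: prefill_gpus=${prefill_gpus}" \
         "decode_gpus=${decode_gpus} gen_bs=${gen_bs}" \
         "gen_max_num_tokens=${gen_max_num_tokens} conc=[${concurrencies}]"
}

describe_config mtp=off tp 1 1 4 2048 20000 "0.9" 0 0 "8 16 32 64 128"
```

Under these assumptions, the `tp 1 1 4 2048 ...` call corresponds to 1 prefill GPU and 4 decode GPUs, swept over concurrencies 8 through 128.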

jgangani · 2025-11-14T22:38:33Z

> also @jgangani please merge this in main branch/release candidate instead of doing a side branch ai-dynamo/dynamo@release/0.5.1-rc0.20251105...jthomson04/gpt-oss-disagg-slurm

Yes, that was the goal. I wanted to test the MR before merging this into the release branch. Will update.

functionstackx · 2025-11-14T23:20:22Z

@jgangani thanks! Can u please enable 1k/8k and 1k/1k on gptoss gb200 in this PR too? Thanks!

jgangani · 2025-11-15T03:02:18Z

@functionstackx Switched to dynamo release branch.

jgangani · 2025-11-16T19:05:40Z

> @jgangani thanks! Can u please enable 1k/8k and 1k/1k on gptoss gb200 in this PR too? Thanks!

I am working on 1k1k DISAGG pareto configs next. 1k8k DISAGG will probably be on par with AGG, since it is predominantly decode. Hence, I recommend we merge this MR first. Does that make sense?
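
As a rough illustration of why 1k/8k is decode-dominated, the sketch below computes the share of tokens per request that are generated rather than read as input. It assumes 1k = 1024 and 8k = 8192 tokens, and it counts tokens only; it is not a measurement of time or GPU load. `decode_share_pct` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Share of all processed tokens that are generated (decode) tokens.
# Token counts only; this says nothing about time spent per phase.
decode_share_pct() {
    local isl=$1 osl=$2
    echo $(( 100 * osl / (isl + osl) ))   # integer percent, rounded down
}

echo "1k/8k: $(decode_share_pct 1024 8192)%"
echo "1k/1k: $(decode_share_pct 1024 1024)%"
echo "8k/1k: $(decode_share_pct 8192 1024)%"
```

By this count, roughly 88% of the tokens in a 1k/8k request are decode tokens, against 50% for 1k/1k and about 11% for 8k/1k.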

functionstackx · 2025-11-16T20:13:43Z

if u can submit gb200 agg for 1k/8k in this PR too

cquil11 · 2025-12-03T21:48:11Z

we're gonna hold off on this til #251 gets merged this week

cquil11 · 2025-12-07T21:11:13Z

@jgangani so sorry brother but can you please rebase with main following the convention set forth in https://github.com/InferenceMAX/InferenceMAX/pull/251 ?

jgangani · 2025-12-07T21:45:23Z

Yes, I am working on it. Will open another MR based off post-251 merge.

cquil11 · 2025-12-17T00:12:29Z

@jgangani hi! where are we on this?

jgangani · 2025-12-17T07:44:31Z

> @jgangani hi! where are we on this?

GB200 DISAGG for 8k1k is ready with refactored code. I can create an MR right away if need be. Still working through 1k1k config exploration. I will need a few more days for 1k1k.

cquil11 · 2025-12-18T21:17:55Z

@jgangani ok, no worries

functionstackx · 2026-01-05T16:08:34Z

happy new year @jgangani , what is the eta on this?

jgangani · 2026-01-05T17:59:28Z

> happy new year @jgangani , what is the eta on this?

Happy new year! @cquil11 I have the branch ready for 1k1k; however, the GB200 runners are not picking up the jobs. They get stuck at "Cleaning up resources" for hours and then get canceled. Can you take a quick look to see if you have a fix? https://github.com/InferenceMAX/InferenceMAX/actions/runs/20614644592/job/59388613154

jgangani · 2026-01-05T22:30:44Z

> > happy new year @jgangani , what is the eta on this?
>
> Happy new year! @cquil11 I have the branch ready for 1k1k; however, the GB200 runners are not picking up the jobs. They get stuck at "Cleaning up resources" for hours and then get canceled. Can you take a quick look to see if you have a fix? https://github.com/InferenceMAX/InferenceMAX/actions/runs/20614644592/job/59388613154

nvm. Rebase seems to have fixed it. There was an MR to remove sudo from benchmark_multinode yaml.

Jatin Gangani added 14 commits November 14, 2025 11:58

0de1704 gptoss test
a636eaf update container
6fead87 typo fix
5ed73e4 update log dir name
f784686 test
0fb16c8 test
7bbda7f typo
12094b2 typo1
8c7a9be regex update
7153d95 regex update2
21c91ab update regex 3
92c751a enable all configs
e1e9475 test1
f877df1 reenable all configs

jgangani requested review from cquil11, functionstackx, kedarpotdar-nv and yunzhoul-nv November 14, 2025 20:51

jgangani requested a review from a team as a code owner November 14, 2025 20:51

functionstackx requested a review from Copilot November 14, 2025 21:49

Copilot AI reviewed Nov 14, 2025


49b2a73 switch to dynamo release branch

functionstackx added the NVIDIA label Dec 7, 2025

functionstackx temporarily deployed to fork-pr-validation December 7, 2025 21:18 — with GitHub Actions Inactive

functionstackx added this to InferenceMAX Board Dec 7, 2025

functionstackx moved this to In Progress in InferenceMAX Board Dec 7, 2025

cquil11 mentioned this pull request Jan 6, 2026

[NVIDIA] GPTOSS GB200 DISAGG Configurations + Assign EP explicitly for AGG #387

Merged

jgangani closed this Jan 6, 2026

github-project-automation Bot moved this from In Progress to Done in InferenceMAX Board Jan 6, 2026

functionstackx deleted the jgangani_gptoss_disagg branch January 11, 2026 19:50

cquil11 changed the title ~~Enable GPTOSS GB200 DISAGG~~ [NVIDIA] Enable GPTOSS GB200 DISAGG Apr 8, 2026

