Commit ed5867f
Use local MODEL_PATH for B200 DGXC single-node bench scripts (#1317)
* Use local MODEL_PATH for B200 DGXC single-node bench scripts
Hoist the MODEL_PATH override out of the multinode-only block so single-node
launches on the B200 DGXC cluster also load models from /lustre/fsw/models
instead of pulling through the HF hub cache. Single-node now mounts
MODEL_PATH into the container and exports MODEL=$MODEL_PATH so existing
bench scripts pick up the local directory via --model-path.
Guard `hf download "$MODEL"` in 89 single-node bench scripts using the
leading-slash check that already exists in agentic/*.sh and
dsv4_fp4_b300_sglang*.sh, so the download is skipped when MODEL is a local
path. Print the /lustre/fsw/models listing in the error path when an
unknown MODEL_PREFIX/PRECISION combo is requested. Preserve the
multinode dsv4-only-with-dynamo-vllm constraint as an explicit guard
since hoisting dropped the framework filter from the path-resolution
branch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
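The leading-slash guard described in the first commit can be sketched roughly as follows. This is a minimal illustration, not the code from the bench scripts: it assumes `MODEL` holds either a hub repo id or an absolute local path, and the function names `is_local_model_path` and `maybe_download_model` are hypothetical (the real scripts may simply inline the check).

```shell
#!/usr/bin/env bash
# Hypothetical helpers illustrating the leading-slash check; names are
# assumptions, not taken from the repository.

is_local_model_path() {
  # An absolute path starts with "/"; a hub repo id such as org/name does not.
  [[ "$1" == /* ]]
}

maybe_download_model() {
  local model="$1"
  if is_local_model_path "$model"; then
    # MODEL was overridden with MODEL_PATH, so nothing needs to be fetched.
    echo "skip download: $model is a local path"
  else
    # Same command the bench scripts run when MODEL is a hub repo id.
    hf download "$model"
  fi
}
```

With `MODEL=/lustre/fsw/models/<dir>` exported by the launcher, `maybe_download_model "$MODEL"` only prints the skip message; with a repo id it falls through to `hf download`.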
* Remove stale b200-dgxc_10..16 runners from b200 pool
The B200 DGXC cluster only has nodes 00-09; entries 10-16 in the b200
runner pool were stale and would cause jobs to queue against runners
that don't exist.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Update b200-multinode runner pool to slurm_7..9
The actual b200 multinode pool is b200-dgxc-slurm_7, _8, _9 — _6 was
stale.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Switch B200 DGXC MODEL_PATH resolver to /raid/models + cover all configs
Lustre-over-TCP at /lustre/fsw/models is ~98% full and slower than the
per-node local RAID array. Point MODEL_PATH at /raid/models/* instead.
Also add resolver branches for every (model-prefix, precision) combo that
nvidia-master.yaml declares as single-node runner: b200 — qwen3.5
(bf16/fp8/fp4), glm5 (fp8/fp4), kimik2.5 (int4/fp4), minimaxm2.5
(fp8/fp4), and gptoss (fp4). Previously these 16 configs would hard-fail
at `exit 1` when they landed on a b200-dgxc runner because the override
table only covered dsr1 and dsv4.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
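A resolver of the kind described above, mapping a (model-prefix, precision) pair to a directory and listing the models root when the combo is unknown, might look like the sketch below. The function name `resolve_model_path`, the root-directory argument, and every directory name in the `case` arms are hypothetical placeholders; only a few combos are shown and the real table lives in the runner scripts.

```shell
#!/usr/bin/env bash
# Sketch of a (MODEL_PREFIX, PRECISION) -> MODEL_PATH resolver.
# All directory names below are made-up placeholders, not the real layout.

resolve_model_path() {
  local root="$1" prefix="$2" precision="$3" dir
  case "${prefix}-${precision}" in
    dsr1-fp4)    dir="dsr1-fp4-example" ;;
    qwen3.5-fp8) dir="qwen3.5-fp8-example" ;;
    gptoss-fp4)  dir="gptoss-fp4-example" ;;
    *)
      # Error path: report the unknown combo and show what is available.
      echo "No MODEL_PATH for ${prefix}/${precision}; contents of ${root}:" >&2
      ls -1 "$root" >&2 || true
      return 1
      ;;
  esac
  printf '%s/%s\n' "$root" "$dir"
}
```

A launcher could then call `MODEL_PATH="$(resolve_model_path /lustre/fsw/models "$MODEL_PREFIX" "$PRECISION")" || exit 1`, so that an unmapped combo fails early with the directory listing on stderr.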
* Revert /raid model paths back to /lustre/fsw/models
/raid/models/* is populated on only a subset of dgxc compute nodes today
(survey via srun showed 6 of 10 gpu-2 nodes had the dsr1 dir). The
preview verification run hit `enroot-mount: failed to mount
/raid/models/dsr1-0528-nvfp4-v2 ... No such file or directory` on
gpu-2-2.
Point all 13 resolver branches back at /lustre/fsw/models/* until /raid
staging is reliable across the fleet.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 2d3a3f3 commit ed5867f
91 files changed
Lines changed: 163 additions & 124 deletions
File tree
- .github/configs
- benchmarks/single_node
- runners