Commit ed5867f

cquil11 and claude authored
Use local MODEL_PATH for B200 DGXC single-node bench scripts (#1317)
* Use local MODEL_PATH for B200 DGXC single-node bench scripts

  Hoist the MODEL_PATH override out of the multinode-only block so single-node launches on the B200 DGXC cluster also load models from /lustre/fsw/models instead of pulling through the HF hub cache. Single-node now mounts MODEL_PATH into the container and exports MODEL=$MODEL_PATH so existing bench scripts pick up the local directory via --model-path.

  Guard `hf download "$MODEL"` in 89 single-node bench scripts using the leading-slash check that already exists in agentic/*.sh and dsv4_fp4_b300_sglang*.sh, so the download is skipped when MODEL is a local path.

  Print the /lustre/fsw/models listing in the error path when an unknown MODEL_PREFIX/PRECISION combo is requested.

  Preserve the multinode dsv4-only-with-dynamo-vllm constraint as an explicit guard since hoisting dropped the framework filter from the path-resolution branch.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Remove stale b200-dgxc_10..16 runners from b200 pool

  The B200 DGXC cluster only has nodes 00-09; entries 10-16 in the b200 runner pool were stale and would cause jobs to queue against runners that don't exist.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Update b200-multinode runner pool to slurm_7..9

  The actual b200 multinode pool is b200-dgxc-slurm_7, _8, _9; _6 was stale.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Switch B200 DGXC MODEL_PATH resolver to /raid/models + cover all configs

  Lustre-over-TCP at /lustre/fsw/models is ~98% full and slower than the per-node local RAID array. Point MODEL_PATH at /raid/models/* instead.

  Also add resolver branches for every (model-prefix, precision) combo that nvidia-master.yaml declares as single-node runner: b200. These are qwen3.5 (bf16/fp8/fp4), glm5 (fp8/fp4), kimik2.5 (int4/fp4), minimaxm2.5 (fp8/fp4), and gptoss (fp4). Previously these 16 configs would hard-fail at `exit 1` when they landed on a b200-dgxc runner because the override table only covered dsr1 and dsv4.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Revert /raid model paths back to /lustre/fsw/models

  /raid/models/* is populated on only a subset of dgxc compute nodes today (survey via srun showed 6 of 10 gpu-2 nodes had the dsr1 dir). The preview verification run hit `enroot-mount: failed to mount /raid/models/dsr1-0528-nvfp4-v2 ... No such file or directory` on gpu-2-2. Point all 13 resolver branches back at /lustre/fsw/models/* until /raid staging is reliable across the fleet.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 2d3a3f3 commit ed5867f
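
The MODEL_PATH resolver described in the commit message is not part of the visible diff on this page. As a rough illustration only, assuming it is a `case` table keyed on MODEL_PREFIX and PRECISION, it might look like the sketch below. The function name `resolve_model_path`, the `MODELS_ROOT` variable, and the `example-*` directory names are hypothetical and do not come from the commit.

```shell
#!/bin/sh
# Hypothetical sketch of a (prefix, precision) -> local model directory table.
# MODELS_ROOT and the example-* directory names are illustrative assumptions.
MODELS_ROOT="${MODELS_ROOT:-/lustre/fsw/models}"

resolve_model_path() {
    # $1 = model prefix (e.g. dsr1), $2 = precision (e.g. fp4)
    case "$1/$2" in
        dsr1/fp4) echo "${MODELS_ROOT}/example-dsr1-fp4" ;;
        dsr1/fp8) echo "${MODELS_ROOT}/example-dsr1-fp8" ;;
        *)
            # Unknown combo: report it and list what is staged, then fail.
            echo "Unknown MODEL_PREFIX/PRECISION combo: $1/$2" >&2
            echo "Contents of ${MODELS_ROOT}:" >&2
            ls -1 "${MODELS_ROOT}" >&2 || true
            return 1
            ;;
    esac
}

# Resolve a known combo and export MODEL so bench scripts see a local path.
MODEL_PATH="$(resolve_model_path dsr1 fp4)" && export MODEL="$MODEL_PATH"
echo "MODEL=$MODEL"
```

Because the resolved value starts with `/`, a bench script that checks for a leading slash would treat it as a local directory and skip the hub download.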

91 files changed

Lines changed: 163 additions & 124 deletions

.github/configs/runners.yaml

Lines changed: 1 addition & 8 deletions
@@ -79,17 +79,10 @@ b200:
   - 'b200-dgxc_07'
   - 'b200-dgxc_08'
   - 'b200-dgxc_09'
-  - 'b200-dgxc_10'
-  - 'b200-dgxc_11'
-  - 'b200-dgxc_12'
-  - 'b200-dgxc_13'
-  - 'b200-dgxc_14'
-  - 'b200-dgxc_15'
-  - 'b200-dgxc_16'
 b200-multinode:
-  - 'b200-dgxc-slurm_6'
   - 'b200-dgxc-slurm_7'
   - 'b200-dgxc-slurm_8'
+  - 'b200-dgxc-slurm_9'
 mi300x:
   - 'mi300x-amds_00'
   - 'mi300x-amds_01'

benchmarks/single_node/dsr1_fp4_b200.sh

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ if [[ -n "$SLURM_JOB_ID" ]]; then
     echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
 fi
 
-hf download "$MODEL"
+if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi
 
 nvidia-smi
 
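
The guard added in this hunk, and repeated in the files that follow, treats any MODEL value beginning with `/` as a local directory. A minimal sketch of that decision is shown below. It uses a POSIX `case` pattern rather than the `[[ ... != /* ]]` test from the diff so that it also runs under plain `sh`. The helper name `should_download` and the sample model values are hypothetical, and the real `hf download` call is replaced by an `echo`.

```shell
#!/bin/sh
# should_download MODEL: exit 0 when MODEL does not start with "/"
# (treated as a hub ID), exit 1 when it is an absolute local path.
should_download() {
    case "$1" in
        /*) return 1 ;;
        *)  return 0 ;;
    esac
}

# Sample values are placeholders, not real model names.
for model in "some-org/some-model" "/lustre/fsw/models/some-model"; do
    if should_download "$model"; then
        echo "would run: hf download $model"
    else
        echo "skipping download, local path: $model"
    fi
done
```

Note that only absolute paths are recognised: a relative path such as `./models/x` does not start with `/`, so this check would still treat it as a hub ID.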

benchmarks/single_node/dsr1_fp4_b200_trt.sh

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@ fi
 
 echo "TP: $TP, CONC: $CONC, ISL: $ISL, OSL: $OSL, EP_SIZE: $EP_SIZE, DP_ATTENTION: $DP_ATTENTION"
 
-hf download "$MODEL"
+if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi
 
 # ========= Determine other parameters based on ISL, OSL, CONC =========
 CUDA_GRAPH_MAX_BATCH_SIZE=$CONC

benchmarks/single_node/dsr1_fp4_b200_trt_mtp.sh

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@ fi
 
 echo "TP: $TP, CONC: $CONC, ISL: $ISL, OSL: $OSL, EP_SIZE: $EP_SIZE, DP_ATTENTION: $DP_ATTENTION"
 
-hf download "$MODEL"
+if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi
 
 # ========= Determine MOE_BACKEND and MTP based on DP_ATTENTION =========
 MOE_BACKEND="TRTLLM"

benchmarks/single_node/dsr1_fp4_b300.sh

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@ if [[ -n "$SLURM_JOB_ID" ]]; then
     echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
 fi
 
-hf download "$MODEL"
+if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi
 
 nvidia-smi
 

benchmarks/single_node/dsr1_fp4_mi355x.sh

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ if [[ -n "$SLURM_JOB_ID" ]]; then
     echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
 fi
 
-hf download "$MODEL"
+if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi
 
 export SGLANG_USE_AITER=1
 export ROCM_QUICK_REDUCE_QUANTIZATION=INT4

benchmarks/single_node/dsr1_fp8_b200.sh

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@ fi
 
 nvidia-smi
 
-hf download "$MODEL"
+if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi
 
 export SGL_ENABLE_JIT_DEEPGEMM=false
 export SGLANG_ENABLE_FLASHINFER_GEMM=true

benchmarks/single_node/dsr1_fp8_b200_mtp.sh

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@ fi
 
 nvidia-smi
 
-hf download "$MODEL"
+if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi
 
 export SGLANG_ENABLE_JIT_DEEPGEMM=false
 

benchmarks/single_node/dsr1_fp8_b200_trt.sh

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@ fi
 
 echo "TP: $TP, CONC: $CONC, ISL: $ISL, OSL: $OSL, EP_SIZE: $EP_SIZE, DP_ATTENTION: $DP_ATTENTION"
 
-hf download "$MODEL"
+if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi
 
 # temporary, avoids risk of OOM error
 export TLLM_OVERRIDE_LAYER_NUM=61

benchmarks/single_node/dsr1_fp8_b200_trt_mtp.sh

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@ fi
 
 echo "TP: $TP, CONC: $CONC, ISL: $ISL, OSL: $OSL, EP_SIZE: $EP_SIZE, DP_ATTENTION: $DP_ATTENTION"
 
-hf download "$MODEL"
+if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi
 
 # ========= Determine other parameters based on ISL, OSL, CONC =========
 MOE_BACKEND="TRTLLM"
