Commit d1d2c82

Use $EP_SIZE variable instead of hardcoded 8 and add ep: 8 to nvidia-master.yaml for B200 SGLang configs
Co-authored-by: functionstackx <47992694+functionstackx@users.noreply.github.com>
1 parent d6f38f2 commit d1d2c82

3 files changed

Lines changed: 8 additions & 8 deletions

.github/configs/nvidia-master.yaml

Lines changed: 6 additions & 6 deletions
@@ -10,17 +10,17 @@ dsr1-fp4-b200-sglang:
     osl: 1024
     search-space:
     - { tp: 4, conc-start: 4, conc-end: 128 }
-    - { tp: 8, conc-start: 4, conc-end: 128 }
+    - { tp: 8, ep: 8, conc-start: 4, conc-end: 128 }
   - isl: 1024
     osl: 8192
     search-space:
     - { tp: 4, conc-start: 4, conc-end: 128 }
-    - { tp: 8, conc-start: 4, conc-end: 128 }
+    - { tp: 8, ep: 8, conc-start: 4, conc-end: 128 }
   - isl: 8192
     osl: 1024
     search-space:
     - { tp: 4, conc-start: 4, conc-end: 128 }
-    - { tp: 8, conc-start: 4, conc-end: 16 }
+    - { tp: 8, ep: 8, conc-start: 4, conc-end: 16 }
 
 dsr1-fp4-b200-trt:
   image: nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2
@@ -83,15 +83,15 @@ dsr1-fp8-b200-sglang:
   - isl: 1024
     osl: 1024
     search-space:
-    - { tp: 8, conc-start: 4, conc-end: 64 }
+    - { tp: 8, ep: 8, conc-start: 4, conc-end: 64 }
   - isl: 1024
     osl: 8192
     search-space:
-    - { tp: 8, conc-start: 4, conc-end: 64 }
+    - { tp: 8, ep: 8, conc-start: 4, conc-end: 64 }
   - isl: 8192
     osl: 1024
     search-space:
-    - { tp: 8, conc-start: 4, conc-end: 64 }
+    - { tp: 8, ep: 8, conc-start: 4, conc-end: 64 }
 
 dsr1-fp8-b200-trt:
   image: nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2

benchmarks/dsr1_fp4_b200_docker.sh

Lines changed: 1 addition & 1 deletion
@@ -21,6 +21,6 @@ PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path $MODEL --host 0.
 --tensor-parallel-size=$TP --data-parallel-size=1 \
 --cuda-graph-max-bs 256 --max-running-requests 256 --mem-fraction-static 0.85 --kv-cache-dtype fp8_e4m3 \
 --chunked-prefill-size 16384 \
---ep-size 8 --quantization modelopt_fp4 --enable-flashinfer-allreduce-fusion --scheduler-recv-interval $SCHEDULER_RECV_INTERVAL \
+--ep-size $EP_SIZE --quantization modelopt_fp4 --enable-flashinfer-allreduce-fusion --scheduler-recv-interval $SCHEDULER_RECV_INTERVAL \
 --enable-symm-mem --disable-radix-cache --attention-backend trtllm_mla --moe-runner-backend flashinfer_trtllm --stream-interval 10

benchmarks/dsr1_fp8_b200_docker.sh

Lines changed: 1 addition & 1 deletion
@@ -34,4 +34,4 @@ PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path=$MODEL --host=0.
 --cuda-graph-max-bs 128 --max-running-requests 128 \
 --mem-fraction-static 0.82 --kv-cache-dtype fp8_e4m3 --chunked-prefill-size 32768 --max-prefill-tokens 32768 \
 --enable-flashinfer-allreduce-fusion --scheduler-recv-interval $SCHEDULER_RECV_INTERVAL --disable-radix-cache \
---attention-backend trtllm_mla --stream-interval 30 --moe-runner-backend flashinfer_trtllm --quantization fp8
+--attention-backend trtllm_mla --stream-interval 30 --ep-size $EP_SIZE --moe-runner-backend flashinfer_trtllm --quantization fp8
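
Note on unset EP_SIZE: both scripts now expand $EP_SIZE unquoted, and the tp: 4 entries in nvidia-master.yaml carry no ep key. This commit does not show how the harness maps ep to EP_SIZE, so the following is only a minimal sketch of a guard the scripts could run before launching the server. It assumes the harness leaves EP_SIZE unset or empty when an entry has no ep, and that falling back to 1 (believed to match SGLang's default for --ep-size) is acceptable. The helper name resolve_ep_size is hypothetical and not part of this repository.

```shell
#!/bin/sh
# Hypothetical helper (not in the repo): print a usable expert-parallel size.
# Falls back to 1 when the argument is unset or empty (assumed default).
resolve_ep_size() {
  _ep="${1:-1}"
  case "$_ep" in
    # Reject non-digits, zero, and values with a leading zero.
    *[!0-9]*|0*)
      echo "invalid EP_SIZE: $_ep" >&2
      return 1
      ;;
  esac
  echo "$_ep"
}

# Normalise EP_SIZE once, before it is spliced into the launch command.
EP_SIZE="$(resolve_ep_size "${EP_SIZE:-}")" || exit 1
echo "--ep-size $EP_SIZE"
```

With a guard like this, an entry without ep would expand to --ep-size 1 rather than leaving --ep-size without a value, which would otherwise cause the next flag to be read as its argument.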
