1- # - config-keys:
2- # - 70b-fp8-*-vllm
3- # description:
4- # - 'Add compilation-config ''{"custom_ops": ["-rms_norm", "-quant_fp8", "-silu_and_mul"]}'' as extra config to all benchmarks/70b_fp8_mi*.sh scripts'
5- # - "6-7% uplift for llama for 6/8 configs"
6- # pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/95
7-
8- - config-keys :
9- - gptoss-fp4-*-trt
10- description :
11- - " Upgrade GPT-OSS TRT images from 'release:1.1.0rc2.post2' to '1.2.0rc0.post1'"
12- - " Add NCCL_GRAPH_REGISTER=0 to benchmarks/gptoss_fp4_b200_trt_slurm.sh"
13- - " Change kv_cache_config.dtype from 'auto' to 'fp8' in benchmarks/gptoss_fp4_b200_trt_slurm.sh"
14- - " Remove MOE_BACKEND=CUTLASS, now just defaults to TRTLLM"
15- pr-link : https://github.com/InferenceMAX/InferenceMAX/pull/110
16-
17- - config-keys :
18- - gptoss*
19- - dsr1*
20- description :
21- - " Remove Llama 70B runs to make room for multi-node disagg prefill+wideEP on h100/h200/b200/mi300/mi325/mi355"
22- pr-link : https://github.com/InferenceMAX/InferenceMAX/pull/149
23-
24- - config-keys :
25- - gptoss-fp4-b200-vllm
26- - gptoss-fp4-h100-vllm
27- - gptoss-fp4-h200-vllm
28- description :
29- - " Upgrade vLLM from 0.10.2 to 0.11.0 for GPT-OSS NVIDIA single-node configs"
30- - ' Add compilation-config '' {"cudagraph_mode":"PIECEWISE"}'' since vLLM 0.11.0 now defaults to FULL_AND_PIECEWISE'
31- pr-link : https://github.com/InferenceMAX/InferenceMAX/pull/159
32-
33- - config-keys :
34- - dsr1*
35- description :
36- - " Fix bug where 1k8k and 8k1k full sweeps had incorrect max-model-len for DeepSeek"
37- pr-link : https://github.com/InferenceMAX/InferenceMAX/pull/163
38-
39- - config-keys :
40- - dsr1-fp4-b200-sglang
41- - dsr1-fp8-b200-sglang
42- - dsr1-fp8-h200-sglang
43- description :
44- - " Consolidate H200 and B200 SGLang configurations to use unified v0.5.5-cu129-amd64 image tag"
45- - " Update deprecated SGLang server arguments to current equivalents"
46- - " Replace --enable-ep-moe with --ep-size $EP_SIZE"
47- - " Replace --enable-flashinfer-trtllm-moe with --moe-runner-backend flashinfer_trtllm"
48- - " Add -e EP_SIZE to Docker run commands in launch scripts"
49- - " Set ep:4 for all tp:4 entries, ep:8 for all tp:8 entries"
50- pr-link : https://github.com/InferenceMAX/InferenceMAX/pull/204
51-
52- - config-keys :
53- - gptoss-fp4-mi355x-vllm
54- - gptoss-fp4-b200-vllm
55- description :
56- - " Extend concurrency to 128 for gptoss mi355x/b200 vllm configurations"
57- pr-link : https://github.com/InferenceMAX/InferenceMAX/pull/209
58-
59- - config-keys :
60- - gptoss-fp4-b200-trt
61- description :
62- - " Extend concurrency to 128 for gptoss b200 TRT configurations"
63- pr-link : https://github.com/InferenceMAX/InferenceMAX/pull/233
64-
65- - config-keys :
66- - " *gb200-dynamo-sglang"
67- description :
68- - " Introduce improvements in GB200 SGLang DSR1 submission"
69- pr-link : https://github.com/InferenceMAX/InferenceMAX/pull/257
70-
71- - config-keys :
72- - dsr1-fp8-h200-trt
73- description :
74- - " Update TRT image from nvcr.io#nvidia/tensorrt-llm/release:1.2.0rc0.post1 to nvcr.io#nvidia/tensorrt-llm/release:1.2.0rc2"
75- - " Increase concurrency for some configurations"
76- pr-link : https://github.com/InferenceMAX/InferenceMAX/pull/266
77-
78 1 - config-keys :
79 2 - gptoss-fp4-b200-vllm
80 3 - gptoss-fp4-h100-vllm
81 4 - gptoss-fp4-h200-vllm
82 5 description :
83 6 - " Update vLLM image for NVIDIA configs from vLLM 0.11.0 to vLLM 0.11.2"
84- - " Add kv-cache-dtype: fp8 to benchmarks/gptoss_fp4_b200_docker.sh"
85- pr-link : https://github.com/InferenceMAX/InferenceMAX/pull/273
86-
87- - config-keys :
88- - dsr1-fp4-mi355x-sglang
89- description :
90- - " Update MI355x Deepseek-R1 FP4 SGLang Image to upstream v0.5.6.post1"
91- pr-link : https://github.com/InferenceMAX/InferenceMAX/pull/330
92-
93- - config-keys :
94- - gptoss-fp4-b200-trt
95- description :
96- - " Add benchmark script for GPTOSS FP4 B200 TRT-LLM"
97- pr-link : https://github.com/InferenceMAX/InferenceMAX/pull/256
7+ - " Adds kv-cache-dtype: fp8 to benchmarks/gptoss_fp4_b200_docker.sh"
8+ pr-link : https://github.com/InferenceMAX/InferenceMAX/pull/273
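
Several `config-keys` entries in this changelog use wildcards (`gptoss-fp4-*-trt`, `dsr1*`, `*gb200-dynamo-sglang`). The diff does not show how the repository resolves them, so the following is only a minimal sketch, assuming shell-style glob semantics. The helper name `expand_config_keys` and the list of known config names are hypothetical and exist only for illustration.

```python
from fnmatch import fnmatchcase


def expand_config_keys(patterns, known_configs):
    """Return the known config names matched by any changelog pattern.

    Assumption: patterns are shell-style globs, as the '*' in the
    changelog suggests. The real matching logic is not shown in the diff.
    """
    matched = []
    for name in known_configs:
        # strip() tolerates stray whitespace inside quoted patterns
        if any(fnmatchcase(name, pattern.strip()) for pattern in patterns):
            matched.append(name)
    return matched


# Hypothetical config names; only some of them appear in the changelog.
known = [
    "gptoss-fp4-b200-trt",
    "gptoss-fp4-b200-vllm",
    "gptoss-fp4-h100-vllm",
    "dsr1-fp8-h200-trt",
    "dsr1-fp4-gb200-dynamo-sglang",
]

print(expand_config_keys(["gptoss-fp4-*-trt"], known))
print(expand_config_keys(["dsr1*"], known))
```

A check like this could be run before merging a changelog entry to confirm that each pattern selects at least one config and does not select unintended ones.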