1 | 1 | - config-keys: |
2 | 2 | - 70b-fp8-*-vllm |
3 | | - description: | |
4 | | - - Add compilation-config: '{"custom_ops": ["-rms_norm", "-quant_fp8", "-silu_and_mul"]}' as |
5 | | - extra config to all benchmarks/70b_fp8_mi*.sh scripts |
6 | | - - 6-7% uplift for llama for 6/8 configs |
7 | | - PR: https://github.com/InferenceMAX/InferenceMAX/pull/95 |
| 3 | + description: |
| 4 | + - 'Add compilation-config ''{"custom_ops": ["-rms_norm", "-quant_fp8", "-silu_and_mul"]}'' as extra config to all benchmarks/70b_fp8_mi*.sh scripts' |
| 5 | + - "6-7% uplift for llama for 6/8 configs" |
| 6 | + pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/95 |
| 7 | + |
8 | 8 | - config-keys: |
9 | 9 | - gptoss-fp4-*-trt |
10 | | - description: | |
11 | | - - Upgrade GPT-OSS TRT images from 'release:1.1.0rc2.post2' to '1.2.0rc0.post1' |
12 | | - - Add NCCL_GRAPH_REGISTER=0 to benchmarks/gptoss_fp4_b200_trt_slurm.sh |
13 | | - - Change kv_cache_config.dtype from 'auto' to 'fp8' in benchmarks/gptoss_fp4_b200_trt_slurm.sh |
14 | | - - Remove MOE_BACKEND=CUTLASS, now just defaults to TRTLLM |
15 | | - PR: https://github.com/InferenceMAX/InferenceMAX/pull/110 |
| 10 | + description: |
| 11 | + - "Upgrade GPT-OSS TRT images from 'release:1.1.0rc2.post2' to '1.2.0rc0.post1'" |
| 12 | + - "Add NCCL_GRAPH_REGISTER=0 to benchmarks/gptoss_fp4_b200_trt_slurm.sh" |
| 13 | + - "Change kv_cache_config.dtype from 'auto' to 'fp8' in benchmarks/gptoss_fp4_b200_trt_slurm.sh" |
| 14 | + - "Remove MOE_BACKEND=CUTLASS, now just defaults to TRTLLM" |
| 15 | + pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/110 |
| 16 | + |
16 | 17 | - config-keys: |
17 | 18 | - gptoss* |
18 | 19 | - dsr1* |
19 | | - description: | |
20 | | - - Remove Llama 70B runs to make room for multi-node disagg prefill+wideEP on |
21 | | - h100/h200/b200/mi300/mi325/mi355 |
22 | | - PR: https://github.com/InferenceMAX/InferenceMAX/pull/149 |
| 20 | + description: |
| 21 | + - "Remove Llama 70B runs to make room for multi-node disagg prefill+wideEP on h100/h200/b200/mi300/mi325/mi355" |
| 22 | + pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/149 |
| 23 | + |
23 | 24 | - config-keys: |
24 | 25 | - gptoss-fp4-b200-vllm |
25 | 26 | - gptoss-fp4-h100-vllm |
26 | 27 | - gptoss-fp4-h200-vllm |
27 | | - description: | |
28 | | - - Upgrade vLLM from 0.10.2 to 0.11.0 for GPT-OSS NVIDIA single-node configs |
29 | | - - Adds compilation-config: '{"cudagraph_mode":"PIECEWISE"} accordingly since vLLM 0.11.0 |
30 | | - requires now defaults to FULL_AND_PIECEWISE |
31 | | - PR: https://github.com/InferenceMAX/InferenceMAX/pull/159 |
| 28 | + description: |
| 29 | + - "Upgrade vLLM from 0.10.2 to 0.11.0 for GPT-OSS NVIDIA single-node configs" |
| 30 | + - 'Add compilation-config ''{"cudagraph_mode":"PIECEWISE"}'' since vLLM 0.11.0 now defaults to FULL_AND_PIECEWISE' |
| 31 | + pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/159 |
| 32 | + |
32 | 33 | - config-keys: |
33 | 34 | - dsr1* |
34 | | - description: | |
35 | | - - Fixes bug where 1k8k and 8k1k full sweeps had incorrect max-model-len for DeepSeek |
36 | | - PR: https://github.com/InferenceMAX/InferenceMAX/pull/163 |
| 35 | + description: |
| 36 | + - "Fix bug where 1k8k and 8k1k full sweeps had incorrect max-model-len for DeepSeek" |
| 37 | + pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/163 |
| 38 | + |
37 | 39 | - config-keys: |
38 | 40 | - dsr1-fp4-b200-sglang |
39 | 41 | - dsr1-fp8-b200-sglang |
40 | 42 | - dsr1-fp8-h200-sglang |
41 | | - description: | |
42 | | - - Consolidates H200 and B200 SGLang configurations to use unified v0.5.5-cu129-amd64 |
43 | | - image tag and updates deprecated SGLang server arguments to their current equivalents. |
44 | | - - --enable-flashinfer-trtllm-moe & --enable-ep-moe is no longer available in sglang so we needed to change it |
45 | | - - ep: 4 for all tp: 4 entries (3 occurrences in dsr1-fp4-b200-sglang) |
46 | | - - ep: 8 for all tp: 8 entries (6 occurrences across dsr1-fp4-b200-sglang and dsr1-fp8-b200-sglang) |
47 | | - - dsr1_fp4_b200_docker.sh: Replaced --enable-ep-moe with --ep-size $EP_SIZE and --enable-flashinfer-trtllm-moe with |
48 | | - --moe-runner-backend flashinfer_trtllm |
49 | | - - dsr1_fp8_b200_docker.sh: Replaced --enable-flashinfer-trtllm-moe with --moe-runner-backend flashinfer_trtllm and |
50 | | - added --ep-size $EP_SIZE |
51 | | - - launch_b200-nvd.sh: Added -e EP_SIZE to Docker run command to pass environment variable to container |
52 | | - - launch_b200-tg.sh: Added -e EP_SIZE to Docker run command to pass environment variable to container |
53 | | - PR: https://github.com/InferenceMAX/InferenceMAX/pull/204 |
| 43 | + description: |
| 44 | + - "Consolidate H200 and B200 SGLang configurations to use unified v0.5.5-cu129-amd64 image tag" |
| 45 | + - "Update deprecated SGLang server arguments to current equivalents" |
| 46 | + - "Replace --enable-ep-moe with --ep-size $EP_SIZE" |
| 47 | + - "Replace --enable-flashinfer-trtllm-moe with --moe-runner-backend flashinfer_trtllm" |
| 48 | + - "Add -e EP_SIZE to Docker run commands in launch scripts" |
| 49 | + - "Set ep:4 for all tp:4 entries, ep:8 for all tp:8 entries" |
| 50 | + pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/204 |
| 51 | + |
54 | 52 | - config-keys: |
55 | 53 | - gptoss-fp4-mi355x-vllm |
56 | 54 | - gptoss-fp4-b200-vllm |
57 | | - description: | |
58 | | - - Extend concurrency to 128 for gptoss mi355x/b200 vllm configurations |
59 | | - PR: https://github.com/InferenceMAX/InferenceMAX/pull/209 |
| 55 | + description: |
| 56 | + - "Extend concurrency to 128 for gptoss mi355x/b200 vllm configurations" |
| 57 | + pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/209 |
| 58 | + |
60 | 59 | - config-keys: |
61 | 60 | - gptoss-fp4-b200-trt |
62 | | - description: | |
63 | | - - Extend concurrency to 128 for gptoss b200 TRT configurations |
64 | | - PR: https://github.com/InferenceMAX/InferenceMAX/pull/233 |
| 61 | + description: |
| 62 | + - "Extend concurrency to 128 for gptoss b200 TRT configurations" |
| 63 | + pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/233 |
| 64 | + |
65 | 65 | - config-keys: |
66 | | - - "*gb200-sglang" |
67 | | - description: | |
68 | | - - Introducing some improvements in GB200 SGLang DSR1 submission |
69 | | - PR: https://github.com/InferenceMAX/InferenceMAX/pull/257 |
| 66 | + - "*gb200-dynamo-sglang" |
| 67 | + description: |
| 68 | + - "Introduce improvements in GB200 SGLang DSR1 submission" |
| 69 | + pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/257 |
| 70 | + |
70 | 71 | - config-keys: |
71 | 72 | - dsr1-fp8-h200-trt |
72 | | - description: | |
73 | | - - Update TRT image from nvcr.io#nvidia/tensorrt-llm/release:1.2.0rc0.post1 to nvcr.io#nvidia/tensorrt-llm/release:1.2.0rc2 |
74 | | - - Increase concurrency for some configurations |
75 | | - PR: https://github.com/InferenceMAX/InferenceMAX/pull/266 |
| 73 | + description: |
| 74 | + - "Update TRT image from nvcr.io#nvidia/tensorrt-llm/release:1.2.0rc0.post1 to nvcr.io#nvidia/tensorrt-llm/release:1.2.0rc2" |
| 75 | + - "Increase concurrency for some configurations" |
| 76 | + pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/266 |
| 77 | + |
76 | 78 | - config-keys: |
77 | 79 | - gptoss-fp4-b200-vllm |
78 | 80 | - gptoss-fp4-h100-vllm |
79 | 81 | - gptoss-fp4-h200-vllm |
80 | | - description: | |
81 | | - - Update vLLM image for NVIDIA configs from vLLM 0.11.0 to vLLM 0.11.2 |
82 | | - - Adds kv-cache-dtype: fp8 to benchmarks/gptoss_fp4_b200_docker.sh |
83 | | - PR: https://github.com/InferenceMAX/InferenceMAX/pull/273 |
| 82 | + description: |
| 83 | + - "Update vLLM image for NVIDIA configs from vLLM 0.11.0 to vLLM 0.11.2" |
| 84 | + - "Add kv-cache-dtype: fp8 to benchmarks/gptoss_fp4_b200_docker.sh" |
| 85 | + pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/273 |
| 86 | + |
84 | 87 | - config-keys: |
85 | 88 | - dsr1-fp4-mi355x-sglang |
86 | | - description: | |
87 | | - - Updating MI355x Deepseek-R1 FP4 SGLang Image to upstream v0.5.6.post1 |
88 | | - PR: https://github.com/InferenceMAX/InferenceMAX/pull/330 |
| 89 | + description: |
| 90 | + - "Update MI355x Deepseek-R1 FP4 SGLang Image to upstream v0.5.6.post1" |
| 91 | + pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/330 |
| 92 | + |
89 | 93 | - config-keys: |
90 | 94 | - gptoss-fp4-b200-trt |
91 | | - description: | |
92 | | - - Add benchmark script for GPTOSS FP4 B200 TRT-LLM |
93 | | - PR: https://github.com/InferenceMAX/InferenceMAX/pull/256 |
| 95 | + description: |
| 96 | + - "Add benchmark script for GPTOSS FP4 B200 TRT-LLM" |
| 97 | + pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/256 |
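
The new entry shape above (a `config-keys` list that may contain wildcards, a `description` list of strings, and a `pr-link`) can be sketched as a lookup. This is a minimal illustration only: it assumes the wildcards are shell-style globs, which the diff does not state, and the `entries_for` helper and the concrete config names in the usage lines are hypothetical, not part of the repository.

```python
from fnmatch import fnmatchcase

# A few entries copied from the diff, written as plain dicts so the
# sketch needs no YAML parser. Only the schema fields are used.
changelog = [
    {
        "config-keys": ["70b-fp8-*-vllm"],
        "description": ["6-7% uplift for llama for 6/8 configs"],
        "pr-link": "https://github.com/InferenceMAX/InferenceMAX/pull/95",
    },
    {
        "config-keys": ["gptoss*", "dsr1*"],
        "description": [
            "Remove Llama 70B runs to make room for multi-node disagg "
            "prefill+wideEP on h100/h200/b200/mi300/mi325/mi355"
        ],
        "pr-link": "https://github.com/InferenceMAX/InferenceMAX/pull/149",
    },
    {
        "config-keys": ["*gb200-dynamo-sglang"],
        "description": ["Introduce improvements in GB200 SGLang DSR1 submission"],
        "pr-link": "https://github.com/InferenceMAX/InferenceMAX/pull/257",
    },
]


def entries_for(config_name, entries):
    """Return the entries whose config-keys match config_name.

    Assumption: keys are shell-style glob patterns (hypothetical;
    the real tooling may match them differently).
    """
    return [
        entry
        for entry in entries
        if any(fnmatchcase(config_name, pattern) for pattern in entry["config-keys"])
    ]


def pr_links_for(config_name, entries):
    """Return the pr-link of every entry that applies to config_name."""
    return [entry["pr-link"] for entry in entries_for(config_name, entries)]
```

For example, with the hypothetical name `dsr1-fp4-gb200-dynamo-sglang`, `pr_links_for` would return the links for pull/149 and pull/257, since both `dsr1*` and `*gb200-dynamo-sglang` match under the glob assumption.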