
Commit 38429b3

Yadan-Wei (Yadan Wei) authored
Benchmark Refactor (#5861)
* fix: vllm benchmark runner-scale-sets GPU isolation and concurrency
  - max-parallel: 1 to prevent parallel jobs on shared p4d nodes
  - Use nvidia-smi GPU UUIDs instead of --gpus all (pod sees 4 of 8 GPUs)
  - Use download-model action with flock-based caching/eviction
  - Do not docker rmi on shared nodes (breaks parallel pod containers)
  - Kill lock PID on cleanup to allow model eviction
* fix: use output tokens/s instead of total tokens/s for throughput metric. vllm's JSON 'tokens_per_second' is total (input+output), not output-only. For benchmarking, output tokens/s is the correct metric since input tokens are just prefill. Compute output_tokens_per_second from num_requests * output_len / elapsed_time and enrich the JSON for the report.
* fix: adjust benchmark thresholds from total tokens/s to output tokens/s. Scale min_throughput by output_len/(input_len+output_len):
  - input=512, output=128: x0.2 (gpt-oss-20b, qwen3.5-9b, llama-3.3-70b, etc.)
  - input=512, output=256: x0.333 (qwen3-coder-next-fp8, qwen3-32b)
* fix: use --random-input-len/--random-output-len for the random dataset. vllm prefers --random-input-len over --input-len when using the random dataset; the old --input-len/--output-len were silently ignored, causing vllm to use its defaults (1024/128) instead of config values.
* fix: use download-model action for codebuild-fleet, fix cleanup
  - codebuild-fleet now uses the download-model action (ETag caching, skips re-download if the model exists and matches)
  - Remove rm -rf /dlc-models from cleanup (let the cache persist)
  - Release the lock PID on cleanup for both job types
* fix: add --ipc=host --shm-size=10g to the codebuild-fleet container. Aligns with the sglang model tests; required for NCCL shared-memory communication on multi-GPU instances.
* fix: remove max-parallel: 1; GPU isolation via pod UUIDs + flock is sufficient.
* fix: codebuild-fleet runner contention with a single g6e.12xlarge instance
  - Add max-parallel: 1 so matrix jobs queue instead of all requesting the same fleet simultaneously (only 1 g6e.12xlarge available)
  - Add strategy.job-index to the runner label so each job gets a unique runner identity, preventing CodeBuild from reusing a finished runner label that other jobs are still waiting on
* fix: remove max-parallel: 1 from codebuild-fleet, rely on job-index for runner identity. Models on different fleets (g6xl, g6exl, g6e12xl) can run in parallel; strategy.job-index in the runner label ensures each matrix job gets its own CodeBuild runner, preventing the hanging issue.
* fix: remove GPU cleanup from the runner-scale-sets job. docker stop/rm sees ALL containers on the shared host (not just this pod's), and nvidia-smi --gpu-reset affects GPUs used by other pods. GPU isolation is handled by passing only this pod's GPU UUIDs to the container.
* fix: revert to the standard codebuild runner label, use max-parallel: 1. CodeBuild requires the exact label format codebuild-runner-<run_id>-<run_attempt>; adding strategy.job-index broke runner provisioning. Use max-parallel: 1 to serialize matrix jobs on the single g6e.12xlarge fleet instance instead.
* fix: use a per-fleet concurrency group instead of max-parallel: 1. Jobs on the same fleet (e.g. 4 models on g6e12xl) queue and run one at a time; jobs on different fleets (g6xl, g6exl, g6e12xl) run in parallel. cancel-in-progress: false ensures queued jobs are not dropped.
* fix: revert to max-parallel: 1; concurrency groups cancel pending jobs. GitHub concurrency groups only allow 1 active + 1 pending per group, cancelling the rest. max-parallel: 1 properly queues all matrix jobs and runs them sequentially without cancelling any.
* fix: adjust qwen3.5-9b min_throughput to 20 output tokens/s. Measured output: 24.22 output tokens/s on g6.xlarge (1x L4); set the threshold to 20 with ~17% margin.
* fix: get the runner from a config lookup, not filename parsing. The JSON filename is throughput_{model}.json with no runner suffix; the runner info is in the config (fleet or runner-scale-sets), which load_model_config already resolves.
* fix: parse the runner from the filename using known model names. The filename is throughput_{model}_{runner}.json; both model and runner contain hyphens/underscores, so rsplit on _ fails. Match against known model names (longest first) to split correctly.
* fix: restore ARTIFACT_PREFIX in throughput output filenames. Throughput JSON/log filenames should use ${ARTIFACT_PREFIX} (model_runner), not ${MODEL_NAME}, matching the latency files and enabling the report to parse the runner from the filename.
* fix: revert to simple rsplit parsing; an underscore only joins model and runner.
* fix: increase max-parallel to 2 for codebuild-fleet benchmarks
* remove parallel
* add parallel restriction
* add back file

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Co-authored-by: Yadan Wei <yadanwei@amazon.com>
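The threshold rescaling described in the commit message can be sanity-checked in a few lines. A minimal sketch (the helper name is mine; the values come from the config diff in this commit):

```python
def rescale_min_throughput(old_min: int, input_len: int, output_len: int) -> int:
    """Convert a total-tokens/s threshold to an output-tokens/s threshold.

    Input tokens are prefill only, so the output share of total throughput
    is output_len / (input_len + output_len).
    """
    return round(old_min * output_len / (input_len + output_len))

# Values matching the config diff below:
print(rescale_min_throughput(6000, 512, 128))  # 1200  (x0.2)
print(rescale_min_throughput(3400, 512, 256))  # 1133  (x0.333)
print(rescale_min_throughput(280, 512, 256))   # 93
```

Note that qwen3.5-9b (180 to 20) does not follow this formula; per a later commit in the squash it was set empirically from a measured 24.22 output tokens/s.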
1 parent 830adb9 commit 38429b3

4 files changed

Lines changed: 65 additions & 75 deletions

File tree

.github/config/vllm-model-tests.yml

Lines changed: 12 additions & 12 deletions
```diff
@@ -28,7 +28,7 @@ benchmark:
       output_len: 128
       num_prompts: 64
       batch_size: 4
-      min_throughput: 6000
+      min_throughput: 1200
       min_rps: 5
 
     - name: "qwen3.5-9b"
@@ -39,7 +39,7 @@ benchmark:
       output_len: 128
       num_prompts: 64
       batch_size: 4
-      min_throughput: 180
+      min_throughput: 20
       min_rps: 0.15
 
     - name: "llama-3.3-70b"
@@ -50,7 +50,7 @@ benchmark:
       output_len: 128
       num_prompts: 32
       batch_size: 2
-      min_throughput: 400
+      min_throughput: 80
       min_rps: 0.35
 
     # https://github.com/vllm-project/vllm/issues/32637
@@ -64,7 +64,7 @@ benchmark:
     #   output_len: 128
     #   num_prompts: 64
     #   batch_size: 4
-    #   min_throughput: 100
+    #   min_throughput: 20
     #   min_rps: 1
 
     - name: "qwen3.5-35b-a3b-fp8"
@@ -77,7 +77,7 @@ benchmark:
       output_len: 128
       num_prompts: 64
       batch_size: 4
-      min_throughput: 400
+      min_throughput: 80
       min_rps: 0.35
 
     # A100 is compute capability 8.0 — FP8 requires 8.9+ (H100/L40S).
@@ -90,7 +90,7 @@ benchmark:
       output_len: 128
       num_prompts: 64
       batch_size: 4
-      min_throughput: 100
+      min_throughput: 20
       min_rps: 0.2
 
     - name: "qwen3-coder-next-fp8"
@@ -101,7 +101,7 @@ benchmark:
       output_len: 256
       num_prompts: 32
       batch_size: 2
-      min_throughput: 280
+      min_throughput: 93
       min_rps: 0.25
 
   runner-scale-sets:
@@ -112,7 +112,7 @@ benchmark:
       output_len: 256
       num_prompts: 32
       batch_size: 2
-      min_throughput: 3400
+      min_throughput: 1133
       min_rps: 3
 
     - name: "qwen3.5-35b-a3b-fp8"
@@ -124,7 +124,7 @@ benchmark:
       output_len: 128
       num_prompts: 64
       batch_size: 4
-      min_throughput: 400
+      min_throughput: 80
       min_rps: 0.35
 
     - name: "qwen3.5-27b-fp8"
@@ -135,7 +135,7 @@ benchmark:
       output_len: 128
       num_prompts: 64
       batch_size: 4
-      min_throughput: 100
+      min_throughput: 20
       min_rps: 0.2
 
     - name: "qwen3-coder-next-fp8"
@@ -145,7 +145,7 @@ benchmark:
       output_len: 256
       num_prompts: 32
       batch_size: 2
-      min_throughput: 280
+      min_throughput: 93
       min_rps: 0.25
 
     - name: "llama-3.3-70b"
@@ -155,7 +155,7 @@ benchmark:
       output_len: 128
       num_prompts: 32
       batch_size: 2
-      min_throughput: 400
+      min_throughput: 80
       min_rps: 0.35
 
     # upstream
```

.github/workflows/vllm-benchmark.yml

Lines changed: 23 additions & 44 deletions
```diff
@@ -61,6 +61,8 @@ jobs:
     needs: [load-benchmarks]
     strategy:
       fail-fast: false
+      # only 1 g6e12xl instance but 4 models need it; the action only schedules once for the same label
+      max-parallel: 2
       matrix:
         include: ${{ fromJson(needs.load-benchmarks.outputs.codebuild-fleet-matrix) }}
     runs-on:
@@ -92,22 +94,17 @@ jobs:
           nvidia-smi
 
       - name: Download model from S3
-        run: |
-          MODEL_DIR="/dlc-models/${{ matrix.name }}"
-          mkdir -p "${MODEL_DIR}"
-          aws s3 cp "${{ matrix.s3_path }}" "/dlc-models/${{ matrix.name }}.tar.gz"
-          tar xzf "/dlc-models/${{ matrix.name }}.tar.gz" -C "${MODEL_DIR}"
-          rm -f "/dlc-models/${{ matrix.name }}.tar.gz"
-          SUBDIRS=("${MODEL_DIR}"/*)
-          if [ ${#SUBDIRS[@]} -eq 1 ] && [ -d "${SUBDIRS[0]}" ]; then
-            mv "${SUBDIRS[0]}"/* "${MODEL_DIR}"/
-            rmdir "${SUBDIRS[0]}"
-          fi
+        uses: ./.github/actions/download-model
+        id: model
+        with:
+          s3-path: ${{ matrix.s3_path }}
+          model-name: ${{ matrix.name }}
 
       - name: Start container
         run: |
           docker pull ${{ env.IMAGE_URI }}
           CONTAINER_ID=$(docker run -d -it --gpus all --entrypoint /bin/bash \
+            --ipc=host --shm-size=10g \
             -v /dlc-models:/models \
             ${{ env.IMAGE_URI }})
           echo "CONTAINER_ID=$CONTAINER_ID" >> $GITHUB_ENV
@@ -149,16 +146,14 @@ jobs:
         run: |
           docker stop ${CONTAINER_ID} 2>/dev/null || true
           docker rm -f ${CONTAINER_ID} 2>/dev/null || true
-          docker rmi ${{ env.IMAGE_URI }} 2>/dev/null || true
-          rm -rf /dlc-models
+          kill ${{ steps.model.outputs.lock-pid }} 2>/dev/null || true
 
   benchmark-runner-scale-sets:
     name: benchmark (${{ matrix.name }} / gpu-efa-runners)
     if: ${{ fromJson(needs.load-benchmarks.outputs.runner-scale-sets-matrix)[0] != null }}
     needs: [load-benchmarks]
     strategy:
       fail-fast: false
-      max-parallel: 1
       matrix:
         include: ${{ fromJson(needs.load-benchmarks.outputs.runner-scale-sets-matrix) }}
     runs-on: gpu-efa-runners
@@ -172,47 +167,29 @@ jobs:
           aws-account-id: ${{ env.ACCOUNT_ID }}
           aws-region: ${{ env.REGION }}
 
-      - name: GPU cleanup and status
-        run: |
-          echo "=== Pre-cleanup GPU state ==="
-          nvidia-smi
-          echo ""
-          echo "=== Stopping stale containers ==="
-          docker ps -q | xargs -r docker stop 2>/dev/null || true
-          docker ps -aq | xargs -r docker rm -f 2>/dev/null || true
-          echo "=== Clearing GPU memory ==="
-          nvidia-smi --gpu-reset 2>/dev/null || true
-          echo ""
-          echo "=== Post-cleanup GPU state ==="
-          nvidia-smi
-
       - name: Download model from S3
-        run: |
-          MODEL_DIR="/dlc-models/${{ matrix.name }}"
-          mkdir -p "${MODEL_DIR}"
-          aws s3 cp "${{ matrix.s3_path }}" "/dlc-models/${{ matrix.name }}.tar.gz"
-          tar xzf "/dlc-models/${{ matrix.name }}.tar.gz" -C "${MODEL_DIR}"
-          rm -f "/dlc-models/${{ matrix.name }}.tar.gz"
-          SUBDIRS=("${MODEL_DIR}"/*)
-          if [ ${#SUBDIRS[@]} -eq 1 ] && [ -d "${SUBDIRS[0]}" ]; then
-            mv "${SUBDIRS[0]}"/* "${MODEL_DIR}"/
-            rmdir "${SUBDIRS[0]}"
-          fi
+        uses: ./.github/actions/download-model
+        id: model
+        with:
+          s3-path: ${{ matrix.s3_path }}
+          model-name: ${{ matrix.name }}
 
       - name: Start container
         run: |
+          # Get GPU UUIDs visible to this pod (k8s assigns a subset of host GPUs)
+          POD_GPUS=$(nvidia-smi --query-gpu=uuid --format=csv,noheader | paste -sd,)
+          echo "Pod GPU UUIDs: ${POD_GPUS}"
           docker pull ${{ env.IMAGE_URI }}
-          CONTAINER_ID=$(docker run -d -it --gpus all --entrypoint /bin/bash \
+          CONTAINER_ID=$(docker run -d -it --gpus "\"device=${POD_GPUS}\"" --entrypoint /bin/bash \
             --ipc=host --shm-size=10g \
             ${{ env.IMAGE_URI }})
           echo "CONTAINER_ID=$CONTAINER_ID" >> $GITHUB_ENV
 
      - name: Copy files into container
         run: |
           docker exec ${CONTAINER_ID} mkdir -p /models
-          docker cp /dlc-models/${{ matrix.name }} ${CONTAINER_ID}:/models/${{ matrix.name }}
+          docker cp ${{ steps.model.outputs.model-dir }} ${CONTAINER_ID}:/models/${{ matrix.name }}
           docker cp scripts/vllm/benchmark/vllm_benchmark_test.sh ${CONTAINER_ID}:/models/
-          rm -rf /dlc-models
 
       - name: Run benchmark
         run: |
@@ -242,13 +219,15 @@ jobs:
           path: benchmark_results/
           retention-days: 30
 
+      # Do NOT docker rmi on shared runner-scale-sets nodes: multiple pods
+      # share the same host Docker daemon, and removing an image could break
+      # a parallel job's container. Image cleanup is handled by a DaemonSet.
       - name: Cleanup
         if: always()
         run: |
           docker stop ${CONTAINER_ID} 2>/dev/null || true
           docker rm -f ${CONTAINER_ID} 2>/dev/null || true
-          docker rmi ${{ env.IMAGE_URI }} 2>/dev/null || true
-          rm -rf /dlc-models
+          kill ${{ steps.model.outputs.lock-pid }} 2>/dev/null || true
 
   benchmark-report:
     name: benchmark-report
```
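The flock-based caching behind `steps.model.outputs.lock-pid` lives in the download-model action, which is not part of this diff. A rough sketch of the locking protocol it implies, written in Python for illustration (all names are mine; the real action may differ): each job holds a shared lock on the cached model for its lifetime, and eviction requires an exclusive lock, which only succeeds once every job's lock holder (the PID that cleanup kills) has released it.

```python
import fcntl
import os
import tempfile

def acquire_model_lock(model_dir: str) -> int:
    """Take a shared flock on the model cache; hold it for the job's lifetime."""
    os.makedirs(model_dir, exist_ok=True)
    fd = os.open(os.path.join(model_dir, ".lock"), os.O_CREAT | os.O_RDWR)
    fcntl.flock(fd, fcntl.LOCK_SH)  # shared: several jobs may read the cache
    return fd

def try_evict(model_dir: str) -> bool:
    """Return True if the cache may be evicted (no job holds the lock)."""
    fd = os.open(os.path.join(model_dir, ".lock"), os.O_CREAT | os.O_RDWR)
    try:
        # exclusive + non-blocking: fails while any shared lock is held
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return False
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
    return True

cache = tempfile.mkdtemp()
fd = acquire_model_lock(cache)
print(try_evict(cache))  # False: a job still holds the shared lock
os.close(fd)             # what killing the lock PID effectively does
print(try_evict(cache))  # True: the cache may now be evicted
```

This is why the Cleanup steps above `kill` the lock PID: a crashed or cancelled job would otherwise pin the cached model forever.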

scripts/vllm/benchmark/benchmark_report.py

Lines changed: 6 additions & 8 deletions
```diff
@@ -11,12 +11,8 @@
 
 
 def _parse_artifact_name(filename, prefix):
-    """Parse model name and runner type from artifact filename.
-
-    Filename format: {prefix}_{model}_{runner}.json
-    """
+    """Parse model name and runner from filename like throughput_qwen3.5-9b_x86-g6xl-runner.json."""
     base = os.path.basename(filename).replace(f"{prefix}_", "", 1).replace(".json", "")
-    # Runner type is the last segment after the final underscore
     parts = base.rsplit("_", 1)
     if len(parts) == 2:
         return parts[0], parts[1]
@@ -50,21 +46,23 @@ def main(results_dir):
 
     print("## Throughput\n")
     print(
-        "| Model | Runner | TP | Input Len | Output Len | Prompts | Tokens/s | Requests/s | Elapsed (s) |"
+        "| Model | Runner | TP | Input Len | Output Len | Prompts | Output Tokens/s | Total Tokens/s | Requests/s | Elapsed (s) |"
     )
     print(
-        "|-------|--------|----|-----------|------------|---------|----------|------------|-------------|"
+        "|-------|--------|----|-----------|------------|---------|-----------------|----------------|------------|-------------|"
     )
     for f in sorted(glob.glob(f"{results_dir}/**/throughput_*.json", recursive=True)):
         name, runner = _parse_artifact_name(f, "throughput")
         c = models.get(name, {})
         tp = get_tp(c.get("extra_args", ""))
         with open(f) as fh:
             r = json.load(fh)
+        output_tps = r.get("output_tokens_per_second", 0)
         print(
             f"| {name} | {runner} | {tp} "
             f"| {c.get('input_len', '')} | {c.get('output_len', '')} "
-            f"| {c.get('num_prompts', '')} | {r['tokens_per_second']:.2f} "
+            f"| {c.get('num_prompts', '')} | {output_tps:.2f} "
+            f"| {r['tokens_per_second']:.2f} "
             f"| {r['requests_per_second']:.2f} | {r['elapsed_time']:.2f} |"
         )
 
```
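The simplified parser above works only because, per the final squash commits, an underscore appears in the artifact name solely between model and runner (model names use dots and hyphens). A standalone reproduction; the no-match fallback is my assumption, since the diff truncates after the happy path:

```python
import os

def parse_artifact_name(filename: str, prefix: str):
    """Split '{prefix}_{model}_{runner}.json' on the last underscore."""
    base = os.path.basename(filename).replace(f"{prefix}_", "", 1).replace(".json", "")
    parts = base.rsplit("_", 1)
    if len(parts) == 2:
        return parts[0], parts[1]
    return base, "unknown"  # assumed fallback; not shown in the diff

print(parse_artifact_name("results/throughput_qwen3.5-9b_x86-g6xl-runner.json", "throughput"))
# ('qwen3.5-9b', 'x86-g6xl-runner')
```

If a runner name ever contained an underscore, `rsplit("_", 1)` would split it in the wrong place, which is what the intermediate "match against known model names" commit worked around before the naming was fixed.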

scripts/vllm/benchmark/vllm_benchmark_test.sh

Lines changed: 24 additions & 11 deletions
```diff
@@ -46,28 +46,41 @@ echo "=== Running throughput benchmark ==="
 vllm bench throughput \
     --model "${MODEL_DIR}" \
     --dataset-name random \
-    --input-len "${INPUT_LEN}" \
-    --output-len "${OUTPUT_LEN}" \
+    --random-input-len "${INPUT_LEN}" \
+    --random-output-len "${OUTPUT_LEN}" \
     --num-prompts "${NUM_PROMPTS}" \
     --output-json "${RESULTS_DIR}/throughput_${ARTIFACT_PREFIX}.json" \
-    ${EXTRA_ARGS}
+    ${EXTRA_ARGS} 2>&1 | tee "${RESULTS_DIR}/throughput_${ARTIFACT_PREFIX}.log"
 
 echo ""
 echo "=== Throughput results ==="
-cat "${RESULTS_DIR}/throughput_${ARTIFACT_PREFIX}.json"
 
-# Validate throughput
+# Parse output tokens/s and requests/s from vllm stdout:
+# Throughput: 0.18 requests/s, 204.92 total tokens/s, 22.77 output tokens/s
 python3 -c "
-import json, sys
+import json, re, sys
+
+log = open('${RESULTS_DIR}/throughput_${ARTIFACT_PREFIX}.log').read()
+m = re.search(r'([\d.]+)\s+requests/s,\s+([\d.]+)\s+total tokens/s,\s+([\d.]+)\s+output tokens/s', log)
+if not m:
+    print('ERROR: could not parse throughput line from vllm output')
+    sys.exit(1)
+
+rps, total_tps, output_tps = float(m.group(1)), float(m.group(2)), float(m.group(3))
+
+# Enrich JSON with parsed values
 with open('${RESULTS_DIR}/throughput_${ARTIFACT_PREFIX}.json') as f:
     r = json.load(f)
-tps = r['tokens_per_second']
-rps = r['requests_per_second']
-print(f'Output tokens/s: {tps:.2f} (min: ${MIN_THROUGHPUT})')
+r['output_tokens_per_second'] = output_tps
+with open('${RESULTS_DIR}/throughput_${ARTIFACT_PREFIX}.json', 'w') as f:
+    json.dump(r, f, indent=4)
+
+print(f'Total tokens/s: {total_tps:.2f} (input+output)')
+print(f'Output tokens/s: {output_tps:.2f} (min: ${MIN_THROUGHPUT})')
 print(f'Requests/s: {rps:.2f} (min: ${MIN_RPS})')
 ok = True
-if tps < ${MIN_THROUGHPUT}:
-    print(f'FAIL: tokens/s {tps:.2f} < ${MIN_THROUGHPUT}')
+if output_tps < ${MIN_THROUGHPUT}:
+    print(f'FAIL: output tokens/s {output_tps:.2f} < ${MIN_THROUGHPUT}')
     ok = False
 if rps < ${MIN_RPS}:
     print(f'FAIL: requests/s {rps:.2f} < ${MIN_RPS}')
```