* fix: vllm benchmark runner-scale-sets GPU isolation and concurrency
- max-parallel: 1 to prevent parallel jobs on shared p4d nodes
- Use nvidia-smi GPU UUIDs instead of --gpus all: the pod is allocated only 4 of the host's 8 GPUs, but --gpus all goes through the shared host daemon and would expose all 8 (see the sketch below)
- Use download-model action with flock-based caching/eviction
- Do not docker rmi on shared nodes (breaks parallel pod containers)
- Kill lock PID on cleanup to allow model eviction
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
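A minimal sketch of the UUID-based isolation described above, assuming the NVIDIA device plugin makes only the pod's allocated GPUs visible to nvidia-smi inside the pod; the image name and the nested docker invocation are illustrative:

```python
import subprocess

# Inside the pod, nvidia-smi reports only the GPUs allocated to this pod,
# even though the shared p4d host has 8.
uuids = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=uuid", "--format=csv,noheader"],
    text=True,
).split()

# Pass exactly those UUIDs to the benchmark container. The literal inner
# quotes keep docker's CSV option parser from splitting the list at commas.
device_arg = '--gpus="device=' + ",".join(uuids) + '"'
subprocess.run(
    ["docker", "run", "--rm", device_arg, "vllm-benchmark:latest"],  # illustrative image
    check=True,
)
```

Unlike --gpus all, which would hand the container every GPU on the node, this confines the benchmark to the pod's own devices.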
* fix: use output tokens/s instead of total tokens/s for throughput metric
vllm's JSON 'tokens_per_second' is total (input+output), not output-only.
For benchmarking, output tokens/s is the correct metric since input tokens
are just prefill. Compute output_tokens_per_second from num_requests *
output_len / elapsed_time and enrich the JSON for the report.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
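A sketch of the enrichment step, assuming the benchmark JSON exposes the request count and wall-clock time under these illustrative field names (not vllm's exact schema):

```python
import json

def enrich_with_output_throughput(path: str, output_len: int) -> None:
    """Add an output-only tokens/s figure next to vllm's total tokens/s."""
    with open(path) as f:
        result = json.load(f)

    # vllm's 'tokens_per_second' counts prefill (input) plus generated
    # tokens; the report should compare generated tokens only.
    num_requests = result["num_requests"]   # illustrative field name
    elapsed = result["elapsed_time"]        # illustrative field name
    result["output_tokens_per_second"] = num_requests * output_len / elapsed

    with open(path, "w") as f:
        json.dump(result, f, indent=2)
```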
* fix: adjust benchmark thresholds from total tokens/s to output tokens/s
Scale min_throughput by output_len/(input_len+output_len):
- input=512,output=128: ×0.2 (gpt-oss-20b, qwen3.5-9b, llama-3.3-70b, etc.)
- input=512,output=256: ×0.333 (qwen3-coder-next-fp8, qwen3-32b)
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
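The scale factor is simply the output share of total tokens; a quick check of the two cases above (my arithmetic, not part of the change):

```python
def output_share(input_len: int, output_len: int) -> float:
    # Fraction of all tokens that are generated (decode) tokens.
    return output_len / (input_len + output_len)

assert round(output_share(512, 128), 3) == 0.2    # the x0.2 configs
assert round(output_share(512, 256), 3) == 0.333  # the x0.333 configs
```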
* fix: use --random-input-len/--random-output-len for random dataset
With the random dataset, vllm reads --random-input-len/--random-output-len
rather than --input-len/--output-len. The old flags were silently
ignored, so vllm fell back to its defaults (1024/128) instead of the
configured values.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
* fix: use download-model action for codebuild-fleet, fix cleanup
- codebuild-fleet now uses download-model action (ETag caching, skip
re-download if model exists and matches)
- Remove rm -rf /dlc-models from cleanup (let cache persist)
- Release lock PID on cleanup for both job types
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
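A minimal sketch of the flock-guarded caching the download-model action implies, assuming a per-model lock file and a stored ETag; the paths, the ETag plumbing, and the download callable are illustrative:

```python
import fcntl
import os

def ensure_model(model_dir: str, remote_etag: str, download) -> None:
    """Download the model only when the cached copy's ETag is stale.

    `download` is the caller-supplied fetch step (hypothetical here).
    """
    lock_path = model_dir + ".lock"
    etag_path = os.path.join(model_dir, ".etag")
    with open(lock_path, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # serialize concurrent jobs
        lock.write(str(os.getpid()))      # recorded so cleanup can release/evict
        lock.flush()
        cached = open(etag_path).read() if os.path.exists(etag_path) else None
        if cached != remote_etag:
            download(model_dir)
            with open(etag_path, "w") as f:
                f.write(remote_etag)
        # the flock is released when the with-block closes the handle
```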
* fix: add --ipc=host --shm-size=10g to codebuild-fleet container
Align with sglang model tests. Required for NCCL shared memory
communication on multi-GPU instances.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
* fix: remove max-parallel: 1, GPU isolation via pod UUIDs + flock is sufficient
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
* fix: codebuild-fleet runner contention with single g6e12xl instance
- Add max-parallel: 1 so matrix jobs queue instead of all requesting
the same fleet simultaneously (only 1 g6e.12xlarge available)
- Add strategy.job-index to runner label so each job gets a unique
runner identity, preventing CodeBuild from reusing a finished runner
label that other jobs are still waiting on
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
* fix: remove max-parallel: 1 from codebuild-fleet, rely on job-index for runner identity
Models on different fleets (g6xl, g6exl, g6e12xl) can run in parallel.
The strategy.job-index in the runner label ensures each matrix job gets
its own CodeBuild runner, preventing the hanging issue.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
* fix: remove GPU cleanup from runner-scale-sets job
docker stop/rm sees ALL containers on the shared host (not just this
pod's), and nvidia-smi --gpu-reset affects GPUs used by other pods.
GPU isolation is handled by passing only this pod's GPU UUIDs to the
container.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
* fix: revert to standard codebuild runner label, use max-parallel: 1
CodeBuild requires exact label format codebuild-runner-<run_id>-<run_attempt>.
Adding strategy.job-index broke runner provisioning. Use max-parallel: 1
to serialize matrix jobs on the single g6e12xl fleet instance instead.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
* fix: use per-fleet concurrency group instead of max-parallel: 1
Jobs on the same fleet (e.g. 4 models on g6e12xl) queue and run one
at a time. Jobs on different fleets (g6xl, g6exl, g6e12xl) run in
parallel. cancel-in-progress: false ensures queued jobs are not dropped.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
* fix: revert to max-parallel: 1, concurrency groups cancel pending jobs
GitHub concurrency groups only allow 1 active + 1 pending per group,
cancelling the rest. max-parallel: 1 properly queues all matrix jobs
and runs them sequentially without cancelling any.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
* fix: adjust qwen3.5-9b min_throughput to 20 output tokens/s
Actual output: 24.22 output tokens/s on g6.xlarge (1x L4). Set
threshold to 20 with ~17% margin.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
* fix: get runner from config lookup, not filename parsing
The JSON filename is throughput_{model}.json with no runner suffix.
The runner info is in the config (fleet or runner-scale-sets), which
load_model_config already resolves.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
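A sketch of the config-based lookup; the mapping below is an illustrative stand-in for the real benchmark config files that load_model_config reads:

```python
# Illustrative stand-in for the real configs; the actual load_model_config
# resolves fleet vs. runner-scale-sets from the repo's config files.
MODEL_CONFIGS = {
    "gpt-oss-20b": {"runner": "runner-scale-sets"},
    "qwen3-32b": {"runner": "fleet"},
}

def load_model_config(model_name: str) -> dict:
    return MODEL_CONFIGS[model_name]

def runner_for(model_name: str) -> str:
    # throughput_{model}.json carries no runner suffix, so the runner
    # comes from the config, not from filename parsing.
    return load_model_config(model_name)["runner"]
```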
* fix: parse runner from filename using known model names
Filename is throughput_{model}_{runner}.json. Both model and runner
contain hyphens/underscores, so rsplit on _ fails. Match against
known model names (longest first) to split correctly.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
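A sketch of the longest-first match, assuming the report knows the full set of model names (the examples reuse names from earlier in this log):

```python
def split_model_runner(stem: str, known_models: list[str]) -> tuple[str, str]:
    """Split 'throughput_{model}_{runner}' when both parts can contain
    underscores, by matching known model names longest-first."""
    body = stem.removeprefix("throughput_")
    for model in sorted(known_models, key=len, reverse=True):
        if body.startswith(model + "_"):
            return model, body[len(model) + 1:]
    raise ValueError(f"no known model matches {stem!r}")

assert split_model_runner(
    "throughput_qwen3-coder-next-fp8_g6e12xl",
    ["qwen3-32b", "qwen3-coder-next-fp8"],
) == ("qwen3-coder-next-fp8", "g6e12xl")
```

Sorting longest-first matters when one model name is a prefix of another.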
* fix: restore ARTIFACT_PREFIX in throughput output filenames
Throughput JSON/log filenames should use ${ARTIFACT_PREFIX}
(model_runner) not ${MODEL_NAME}, matching the latency files
and enabling the report to parse the runner from the filename.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
* fix: revert to simple rsplit parsing, underscore only joins model and runner
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
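With ARTIFACT_PREFIX restored and the joining underscore guaranteed to be the only one (model and runner names themselves use hyphens), the parse collapses to a one-liner; a small sketch:

```python
def split_model_runner(stem: str) -> tuple[str, str]:
    # "throughput_{model}_{runner}": the last underscore is the separator.
    model, runner = stem.removeprefix("throughput_").rsplit("_", 1)
    return model, runner

assert split_model_runner("throughput_llama-3.3-70b_g6e12xl") == (
    "llama-3.3-70b",
    "g6e12xl",
)
```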
* fix: increase max-parallel to 2 for codebuild-fleet benchmarks
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
* remove parallel restriction
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
* add parallel restriction
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
* add back file
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
---------
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Co-authored-by: Yadan Wei <yadanwei@amazon.com>