* fix: vllm benchmark runner-scale-sets GPU isolation and concurrency
- max-parallel: 1 to prevent parallel jobs on shared p4d nodes
- Use nvidia-smi GPU UUIDs instead of --gpus all: the pod is allocated only 4 of the host's 8 GPUs, but --gpus all goes through the shared host daemon and would expose all 8 (see the sketch below)
- Use download-model action with flock-based caching/eviction
- Do not docker rmi on shared nodes (breaks parallel pod containers)
- Kill lock PID on cleanup to allow model eviction
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
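A minimal sketch of the UUID-based isolation described above, assuming the NVIDIA device plugin makes only the pod's allocated GPUs visible to nvidia-smi inside the pod; the image name and the nested docker invocation are illustrative:

```python
import subprocess

# Inside the pod, nvidia-smi reports only the GPUs allocated to this pod,
# even though the shared p4d host has 8.
uuids = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=uuid", "--format=csv,noheader"],
    text=True,
).split()

# Pass exactly those UUIDs to the benchmark container. The literal inner
# quotes keep docker's CSV option parser from splitting the list at commas.
device_arg = '--gpus="device=' + ",".join(uuids) + '"'
subprocess.run(
    ["docker", "run", "--rm", device_arg, "vllm-benchmark:latest"],  # illustrative image
    check=True,
)
```

Unlike --gpus all, which would hand the container every GPU on the node, this confines the benchmark to the pod's own devices.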
* fix: use output tokens/s instead of total tokens/s for throughput metric
vllm's JSON 'tokens_per_second' is total (input+output), not output-only.
For benchmarking, output tokens/s is the correct metric since input tokens
are just prefill. Compute output_tokens_per_second from num_requests *
output_len / elapsed_time and enrich the JSON for the report.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
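A sketch of the enrichment step, assuming the benchmark JSON exposes the request count and wall-clock time under these illustrative field names (not vllm's exact schema):

```python
import json

def enrich_with_output_throughput(path: str, output_len: int) -> None:
    """Add an output-only tokens/s figure next to vllm's total tokens/s."""
    with open(path) as f:
        result = json.load(f)

    # vllm's 'tokens_per_second' counts prefill (input) plus generated
    # tokens; the report should compare generated tokens only.
    num_requests = result["num_requests"]   # illustrative field name
    elapsed = result["elapsed_time"]        # illustrative field name
    result["output_tokens_per_second"] = num_requests * output_len / elapsed

    with open(path, "w") as f:
        json.dump(result, f, indent=2)
```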
* fix: adjust benchmark thresholds from total tokens/s to output tokens/s
Scale min_throughput by output_len/(input_len+output_len):
- input=512,output=128: ×0.2 (gpt-oss-20b, qwen3.5-9b, llama-3.3-70b, etc.)
- input=512,output=256: ×0.333 (qwen3-coder-next-fp8, qwen3-32b)
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
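The scale factor is simply the output share of total tokens; a quick check of the two cases above (my arithmetic, not part of the change):

```python
def output_share(input_len: int, output_len: int) -> float:
    # Fraction of all tokens that are generated (decode) tokens.
    return output_len / (input_len + output_len)

assert round(output_share(512, 128), 3) == 0.2    # the x0.2 configs
assert round(output_share(512, 256), 3) == 0.333  # the x0.333 configs
```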
* fix: use --random-input-len/--random-output-len for random dataset
With the random dataset, vllm reads --random-input-len/--random-output-len
rather than --input-len/--output-len. The old flags were silently
ignored, so vllm fell back to its defaults (1024/128) instead of the
configured values.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
* fix: use download-model action for codebuild-fleet, fix cleanup
- codebuild-fleet now uses download-model action (ETag caching, skip
re-download if model exists and matches)
- Remove rm -rf /dlc-models from cleanup (let cache persist)
- Release lock PID on cleanup for both job types
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
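A minimal sketch of the flock-guarded caching the download-model action implies, assuming a per-model lock file and a stored ETag; the paths, the ETag plumbing, and the download callable are illustrative:

```python
import fcntl
import os

def ensure_model(model_dir: str, remote_etag: str, download) -> None:
    """Download the model only when the cached copy's ETag is stale.

    `download` is the caller-supplied fetch step (hypothetical here).
    """
    lock_path = model_dir + ".lock"
    etag_path = os.path.join(model_dir, ".etag")
    with open(lock_path, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # serialize concurrent jobs
        lock.write(str(os.getpid()))      # recorded so cleanup can release/evict
        lock.flush()
        cached = open(etag_path).read() if os.path.exists(etag_path) else None
        if cached != remote_etag:
            download(model_dir)
            with open(etag_path, "w") as f:
                f.write(remote_etag)
        # the flock is released when the with-block closes the handle
```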
* fix: add --ipc=host --shm-size=10g to codebuild-fleet container
Align with sglang model tests. Required for NCCL shared memory
communication on multi-GPU instances.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
* fix: remove max-parallel: 1, GPU isolation via pod UUIDs + flock is sufficient
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
* fix: codebuild-fleet runner contention with single g6e12xl instance
- Add max-parallel: 1 so matrix jobs queue instead of all requesting
the same fleet simultaneously (only 1 g6e.12xlarge available)
- Add strategy.job-index to runner label so each job gets a unique
runner identity, preventing CodeBuild from reusing a finished runner
label that other jobs are still waiting on
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
* fix: remove max-parallel: 1 from codebuild-fleet, rely on job-index for runner identity
Models on different fleets (g6xl, g6exl, g6e12xl) can run in parallel.
The strategy.job-index in the runner label ensures each matrix job gets
its own CodeBuild runner, preventing the hanging issue.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
* fix: remove GPU cleanup from runner-scale-sets job
docker stop/rm sees ALL containers on the shared host (not just this
pod's), and nvidia-smi --gpu-reset affects GPUs used by other pods.
GPU isolation is handled by passing only this pod's GPU UUIDs to the
container.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
* fix: revert to standard codebuild runner label, use max-parallel: 1
CodeBuild requires exact label format codebuild-runner-<run_id>-<run_attempt>.
Adding strategy.job-index broke runner provisioning. Use max-parallel: 1
to serialize matrix jobs on the single g6e12xl fleet instance instead.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
* fix: use per-fleet concurrency group instead of max-parallel: 1
Jobs on the same fleet (e.g. 4 models on g6e12xl) queue and run one
at a time. Jobs on different fleets (g6xl, g6exl, g6e12xl) run in
parallel. cancel-in-progress: false ensures queued jobs are not dropped.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
* fix: revert to max-parallel: 1, concurrency groups cancel pending jobs
GitHub concurrency groups only allow 1 active + 1 pending per group,
cancelling the rest. max-parallel: 1 properly queues all matrix jobs
and runs them sequentially without cancelling any.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
* fix: adjust qwen3.5-9b min_throughput to 20 output tokens/s
Actual output: 24.22 output tokens/s on g6.xlarge (1x L4). Set
threshold to 20 with ~17% margin.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
* fix: get runner from config lookup, not filename parsing
The JSON filename is throughput_{model}.json with no runner suffix.
The runner info is in the config (fleet or runner-scale-sets), which
load_model_config already resolves.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
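A sketch of the config-based lookup; the mapping below is an illustrative stand-in for the real benchmark config files that load_model_config reads:

```python
# Illustrative stand-in for the real configs; the actual load_model_config
# resolves fleet vs. runner-scale-sets from the repo's config files.
MODEL_CONFIGS = {
    "gpt-oss-20b": {"runner": "runner-scale-sets"},
    "qwen3-32b": {"runner": "fleet"},
}

def load_model_config(model_name: str) -> dict:
    return MODEL_CONFIGS[model_name]

def runner_for(model_name: str) -> str:
    # throughput_{model}.json carries no runner suffix, so the runner
    # comes from the config, not from filename parsing.
    return load_model_config(model_name)["runner"]
```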
* fix: parse runner from filename using known model names
Filename is throughput_{model}_{runner}.json. Both model and runner
contain hyphens/underscores, so rsplit on _ fails. Match against
known model names (longest first) to split correctly.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
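A sketch of the longest-first match, assuming the report knows the full set of model names (the examples reuse names from earlier in this log):

```python
def split_model_runner(stem: str, known_models: list[str]) -> tuple[str, str]:
    """Split 'throughput_{model}_{runner}' when both parts can contain
    underscores, by matching known model names longest-first."""
    body = stem.removeprefix("throughput_")
    for model in sorted(known_models, key=len, reverse=True):
        if body.startswith(model + "_"):
            return model, body[len(model) + 1:]
    raise ValueError(f"no known model matches {stem!r}")

assert split_model_runner(
    "throughput_qwen3-coder-next-fp8_g6e12xl",
    ["qwen3-32b", "qwen3-coder-next-fp8"],
) == ("qwen3-coder-next-fp8", "g6e12xl")
```

Sorting longest-first matters when one model name is a prefix of another.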
* fix: restore ARTIFACT_PREFIX in throughput output filenames
Throughput JSON/log filenames should use ${ARTIFACT_PREFIX}
(model_runner) not ${MODEL_NAME}, matching the latency files
and enabling the report to parse the runner from the filename.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
* fix: revert to simple rsplit parsing, underscore only joins model and runner
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
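With ARTIFACT_PREFIX restored and the joining underscore guaranteed to be the only one (model and runner names themselves use hyphens), the parse collapses to a one-liner; a small sketch:

```python
def split_model_runner(stem: str) -> tuple[str, str]:
    # "throughput_{model}_{runner}": the last underscore is the separator.
    model, runner = stem.removeprefix("throughput_").rsplit("_", 1)
    return model, runner

assert split_model_runner("throughput_llama-3.3-70b_g6e12xl") == (
    "llama-3.3-70b",
    "g6e12xl",
)
```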
* fix: increase max-parallel to 2 for codebuild-fleet benchmarks
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
* remove parallel restriction
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
* add parallel restriction
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
* add back file
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
---------
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Co-authored-by: Yadan Wei <yadanwei@amazon.com>