Add vLLM Llama-70B all-reduce reproducer #535

Draft
micmelesse wants to merge 3 commits into ROCm:main from micmelesse:micmelesse/reproducer/all_reduce
@micmelesse commented Jun 16, 2026

Contributor

Motivation

This PR adds a benchmark and tests for iterating on the vLLM iris integration for Llama 70B. It includes two Dockerfiles that bake a baseline environment and an experimental environment. You can then run two scripts: one checks correctness, the other checks performance. The attached README has instructions.

The main difference is that the baseline pins commits from the main branches of vLLM, aiter, and iris, while the experimental environment pins the branches that carry the iris integration work, which is enabled by setting VLLM_ROCM_USE_AITER_COMMS=1.
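As a rough illustration of the toggle described above, a per-arm switch could look like the sketch below. Only `VLLM_ROCM_USE_AITER_COMMS=1` for the experimental arm comes from this description; the `configure_arm` helper, the arm names `baseline` and `experimental`, and leaving the variable unset for the baseline are assumptions, and the Dockerfiles may already handle this themselves.

```shell
#!/bin/sh
# Sketch only: per-arm environment toggle.
# Assumption: the baseline arm leaves VLLM_ROCM_USE_AITER_COMMS unset.
set -eu

# configure_arm is a hypothetical helper, not something shipped in this PR.
configure_arm() {
  case "$1" in
    baseline)
      # Baseline: keep the stock communication path.
      unset VLLM_ROCM_USE_AITER_COMMS
      ;;
    experimental)
      # Experimental: enable the iris integration work.
      export VLLM_ROCM_USE_AITER_COMMS=1
      ;;
    *)
      echo "unknown arm: $1" >&2
      return 1
      ;;
  esac
}

ARM="${ARM:-baseline}"
configure_arm "$ARM"
echo "arm=$ARM VLLM_ROCM_USE_AITER_COMMS=${VLLM_ROCM_USE_AITER_COMMS:-<unset>}"
```

Running it with `ARM=experimental` prints the variable as `1`; any other arm name is rejected rather than silently treated as baseline.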

Technical Details

Test Plan

Test Result

Submission Checklist

@mawad-amd mawad-amd requested a review from aamarnat June 16, 2026 13:34
@aamarnat

Collaborator

bench.sh runs the measured vllm bench serve cold: --num-warmups defaults to 0, so the only request sent before timing starts is the single "Initial test run" readiness request. The first measured requests then absorb NCCL communicator init and other first-call costs. On my MI300X runs, a cold measured arm was the main driver of an inflated baseline, which showed up as a ~3.9× artifact; adding warmup collapsed it to ~1.05–1.08×.
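For anyone who wants to check a ratio like the ones quoted above on their own runs, a minimal sketch follows. It assumes the saved result files contain a numeric `mean_ttft_ms` key and that the arms are labelled `baseline` and `experimental`; both are assumptions to verify against the JSON your vLLM version writes. `json_num` and `ratio` are hypothetical helpers, and the `sed` extraction is a stand-in for a real JSON parser such as `jq`.

```shell
#!/bin/sh
# Sketch only: compare one metric between two result files.
set -eu

# Print the first numeric value stored under key $2 in file $1.
# Assumes the key appears as "key": <number>; not a general JSON parser.
json_num() {
  sed -n 's/.*"'"$2"'": *\([0-9.][0-9.]*\).*/\1/p' "$1" | head -n 1
}

# Print $1 / $2 with two decimals; fail if the denominator is zero.
ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { if (b + 0 == 0) exit 1; printf "%.2f\n", a / b }'
}

OUTDIR="${OUTDIR:-.}"
FIELD="${FIELD:-mean_ttft_ms}"   # assumed key name

# Only compare when both (assumed) result files are present.
if [ -f "$OUTDIR/result_baseline.json" ] && [ -f "$OUTDIR/result_experimental.json" ]; then
  base="$(json_num "$OUTDIR/result_baseline.json" "$FIELD")"
  expt="$(json_num "$OUTDIR/result_experimental.json" "$FIELD")"
  echo "$FIELD baseline/experimental = $(ratio "$base" "$expt")"
fi
```

Comparing the same ratio with and without warmup is what separates a cold-start artifact from a real difference between arms.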

Since vllm bench serve already has built-in warmup (and excludes those requests from the reported metrics), this is a one-flag fix on the existing command:

timeout 1200 vllm bench serve \
    --host "$HOST" --port "$PORT" --model "$MODEL" \
    --dataset-name random --random-input-len "$INPUT_LEN" --random-output-len "$OUTPUT_LEN" \
    --max-concurrency "$CONCURRENCY" --num-prompts "$NUM_PROMPTS" \
    --num-warmups "$CONCURRENCY" \
    --percentile-metrics ttft,tpot,itl,e2el \
    --ignore-eos \
    --save-result --result-filename "$OUTDIR/result_${ARM}.json" --label "$ARM"

The baked per-arm images already prevent cross-arm cache contamination, so this is purely about removing cold-start cost from the single measured arm; TTFT is the metric most affected. One nuance: built-in warmup uses a single fixed test input, so it primes the engine/NCCL/comms path but won't touch every cudagraph batch-size bucket at concurrency 64; if you want those warmed too, a short pre-pass over the random dataset does it. Even so, --num-warmups delivers most of the value for one flag.
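A short pre-pass over the random dataset might look like the sketch below. It is not part of bench.sh as posted: it reuses only flags from the command above, drops `--save-result` so nothing from the warmup is recorded, and sends a small batch of random prompts before the measured run. `WARMUP_PROMPTS`, `DRY_RUN`, `warmup_pass`, and the default values are hypothetical, and there is no guarantee that a given prompt count reaches every cudagraph bucket.

```shell
#!/bin/sh
# Sketch only: unmeasured warmup pre-pass to run before the timed benchmark.
set -eu

HOST="${HOST:-127.0.0.1}"
PORT="${PORT:-8000}"
MODEL="${MODEL:-your-model-id}"     # placeholder, set to the served model
INPUT_LEN="${INPUT_LEN:-1024}"      # placeholder value
OUTPUT_LEN="${OUTPUT_LEN:-128}"     # placeholder value
CONCURRENCY="${CONCURRENCY:-64}"
# Hypothetical knob: how many random prompts to send while warming up.
WARMUP_PROMPTS="${WARMUP_PROMPTS:-$((CONCURRENCY * 2))}"
# DRY_RUN=1 (the default here) only prints the command; set DRY_RUN=0 to run it.
DRY_RUN="${DRY_RUN:-1}"

warmup_pass() {
  runner=""
  if [ "$DRY_RUN" = "1" ]; then
    runner="echo"
  fi
  # Same shape as the measured command, minus --save-result and metrics flags.
  $runner vllm bench serve \
    --host "$HOST" --port "$PORT" --model "$MODEL" \
    --dataset-name random \
    --random-input-len "$INPUT_LEN" --random-output-len "$OUTPUT_LEN" \
    --max-concurrency "$CONCURRENCY" --num-prompts "$WARMUP_PROMPTS" \
    --ignore-eos
}

warmup_pass
```

Because the pre-pass uses the same input and output lengths and the same concurrency as the measured run, it exercises batch shapes close to the ones the benchmark will hit, which a single fixed warmup input does not.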
