Add vLLM Llama-70B all-reduce reproducer by micmelesse · Pull Request #535 · ROCm/iris

micmelesse · 2026-06-16T08:06:08Z

Motivation

This PR adds a benchmark and tests for iterating on the vLLM iris integration for Llama-70B. Two Dockerfiles bake a baseline environment and an experimental environment. You can then run two scripts: one checks correctness and the other checks performance. The attached README has instructions.

The main difference is that the baseline pins commits from the main branches of vllm, aiter, and iris, while the experimental image pins the branches carrying the iris integration work, which is enabled by setting VLLM_ROCM_USE_AITER_COMMS=1.
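As a rough sketch of how the two arms differ at launch time, the helper below prints the extra environment assignment for a given arm. The function name `arm_env` and the arm names `baseline` and `experimental` are hypothetical, chosen for illustration; only `VLLM_ROCM_USE_AITER_COMMS=1` comes from the description above, and the actual scripts in this PR may set it differently.

```shell
# Hypothetical helper (not taken from the PR): print the environment
# assignment that distinguishes one arm from the other.
arm_env() {
    case "$1" in
        # Baseline: leave the variable unset so the main-branch code path is used.
        baseline)     ;;
        # Experimental: enable the iris integration work, as the description states.
        experimental) echo "VLLM_ROCM_USE_AITER_COMMS=1" ;;
        *)            echo "unknown arm: $1" >&2; return 1 ;;
    esac
}

# Example use (assumes ARM and MODEL are already set by the caller):
#   env $(arm_env "$ARM") vllm serve "$MODEL"
```

Keeping the difference between arms in one place makes it easier to confirm that nothing else changes between the baseline and experimental runs.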

Technical Details

Test Plan

Test Result

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

aamarnat · 2026-06-25T03:19:22Z

bench.sh runs the measured vllm bench serve cold: --num-warmups defaults to 0, so the only thing before timing is the single "Initial test run" readiness request. The first measured requests then absorb NCCL communicator init and other first-call costs. On my MI300X runs, a cold measured arm was the main driver of an inflated baseline, which produced a ~3.9× artifact; adding warmup collapsed it to ~1.05–1.08×.
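To make the ratios concrete, here is a small sketch of how such a figure could be computed, assuming the quoted ratio is the baseline latency divided by the experimental latency for the same metric. The function name `speedup` is hypothetical, and the numbers in the usage comments are placeholders for illustration, not measurements from these runs.

```shell
# Hypothetical helper: ratio of a baseline latency to an experimental latency.
# $1 = baseline metric in ms, $2 = experimental metric in ms.
speedup() {
    awk -v b="$1" -v e="$2" 'BEGIN {
        if (e <= 0) exit 1          # refuse to divide by zero or a negative value
        printf "%.2f\n", b / e
    }'
}

# Placeholder inputs only:
#   speedup 390 100   prints 3.90 (a cold, inflated baseline)
#   speedup 105 100   prints 1.05 (both arms warmed)
```

A cold baseline inflates the numerator only, which is why the apparent gain shrinks so sharply once both arms are measured warm.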

Since vllm bench serve already has built-in warmup (and excludes those requests from the reported metrics), this is a one-flag fix on the existing command:

timeout 1200 vllm bench serve \
    --host "$HOST" --port "$PORT" --model "$MODEL" \
    --dataset-name random --random-input-len "$INPUT_LEN" --random-output-len "$OUTPUT_LEN" \
    --max-concurrency "$CONCURRENCY" --num-prompts "$NUM_PROMPTS" \
    --num-warmups "$CONCURRENCY" \
    --percentile-metrics ttft,tpot,itl,e2el \
    --ignore-eos \
    --save-result --result-filename "$OUTDIR/result_${ARM}.json" --label "$ARM"

The baked per-arm images already prevent cross-arm cache contamination, so this is purely about removing cold start from the single measured arm; TTFT is the metric most affected. One nuance: built-in warmup uses a single fixed test input, so it primes the engine/NCCL/comms path but won't touch every cudagraph batch-size bucket at concurrency 64; if you want those warmed too, a short pre-pass over the random dataset does it. --num-warmups gets most of the value for one flag.
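The pre-pass mentioned above could look roughly like the sketch below: an unmeasured run over the same random dataset, issued before the measured command and with no result file saved. It reuses the variables and flags from the command above; `warm_prepass`, `WARM_PROMPTS`, and `DRY_RUN` are hypothetical names introduced for illustration, and the default of twice the concurrency is an arbitrary assumption rather than a tuned value.

```shell
# Sketch of an unmeasured pre-pass (hypothetical; not part of bench.sh).
# Assumes HOST, PORT, MODEL, INPUT_LEN, OUTPUT_LEN and CONCURRENCY are set.
warm_prepass() {
    # Number of warm-up prompts; defaults to 2x the concurrency (assumed value).
    warm_prompts="${WARM_PROMPTS:-$(( CONCURRENCY * 2 ))}"

    # Same traffic shape as the measured run, but without --save-result,
    # so nothing from this pass lands in the reported metrics.
    set -- vllm bench serve \
        --host "$HOST" --port "$PORT" --model "$MODEL" \
        --dataset-name random \
        --random-input-len "$INPUT_LEN" --random-output-len "$OUTPUT_LEN" \
        --max-concurrency "$CONCURRENCY" --num-prompts "$warm_prompts" \
        --ignore-eos

    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"              # print the command instead of running it
    else
        "$@" > /dev/null       # discard the pre-pass report
    fi
}
```

Calling `warm_prepass` once before the measured command sends requests at the same concurrency as the measured run, which should exercise more batch-size buckets than a single fixed warmup input. Setting `DRY_RUN=1` prints the command for inspection without contacting the server.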

Add vLLM Llama-70B all-reduce reproducer

4d69352

mawad-amd requested a review from aamarnat June 16, 2026 13:34

micmelesse added 2 commits June 18, 2026 12:02

benchmark/llama70b: rename TEST knob to EVAL (gsm8k accuracy gate)

59d2b97

benchmark/llama70b: rename output dir traces -> output

1640eff

micmelesse wants to merge 3 commits into ROCm:main from micmelesse:micmelesse/reproducer/all_reduce
