Add vLLM Llama-70B all-reduce reproducer #535
timeout 1200 vllm bench serve \
--host "$HOST" --port "$PORT" --model "$MODEL" \
--dataset-name random --random-input-len "$INPUT_LEN" --random-output-len "$OUTPUT_LEN" \
--max-concurrency "$CONCURRENCY" --num-prompts "$NUM_PROMPTS" \
--num-warmups "$CONCURRENCY" \
--percentile-metrics ttft,tpot,itl,e2el \
--ignore-eos \
--save-result --result-filename "$OUTDIR/result_${ARM}.json" --label "$ARM"

The baked per-arm images already prevent cross-arm cache contamination, so this is purely about removing cold-start cost from the single measured arm; TTFT is the metric most affected. One nuance: the built-in warmup uses a single fixed test input, so it primes the engine/NCCL/comms path but won't touch every cudagraph batch-size bucket at a concurrency of 64. If you want those warmed too, a short pre-pass over the random dataset does it.
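A rough sketch of what such a pre-pass could look like, reusing the variables from the command above. `warmup_prepass`, `WARMUP_PROMPTS`, `WARMUP_SEED`, and `VLLM_BIN` are hypothetical names introduced here, not part of the PR. The sketch passes a non-default `--seed` on the assumption that the measured run keeps the default seed, so the warm-up prompts do not coincide with the measured ones if prefix caching is enabled. It omits `--save-result`, so nothing from the pre-pass lands in `$OUTDIR`.

```shell
# Warm-up pre-pass over the random dataset (sketch, not the PR's script).
# Hypothetical knobs:
#   WARMUP_PROMPTS  number of warm-up requests (default: 2 x CONCURRENCY)
#   WARMUP_SEED     seed for the warm-up prompts (default: 1)
#   VLLM_BIN        command to invoke; set to "echo vllm" for a dry run
warmup_prepass() {
  # Two waves at the measured concurrency unless overridden.
  prompts="${WARMUP_PROMPTS:-$((2 * CONCURRENCY))}"
  # VLLM_BIN is left unquoted on purpose so "echo vllm" splits into two words.
  ${VLLM_BIN:-vllm} bench serve \
    --host "$HOST" --port "$PORT" --model "$MODEL" \
    --dataset-name random \
    --random-input-len "$INPUT_LEN" --random-output-len "$OUTPUT_LEN" \
    --max-concurrency "$CONCURRENCY" --num-prompts "$prompts" \
    --seed "${WARMUP_SEED:-1}" \
    --ignore-eos
}
```

Calling `warmup_prepass > /dev/null` right before the measured command would keep the pre-pass out of the logs; whether two waves are enough to exercise every batch-size bucket is an assumption worth checking against the server's graph capture sizes.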
Motivation
This PR adds a benchmark and tests for iterating on the vLLM Iris integration for Llama 70B. It includes two Dockerfiles that bake a baseline and an experimental environment. You can then run two scripts: one to check correctness and another to check performance. The attached README has instructions.
The main difference is that the baseline pins commits from the main branches of vLLM, AITER, and Iris, while the experimental environment pins the branches carrying the Iris integration work, which is enabled by setting VLLM_ROCM_USE_AITER_COMMS=1.
Technical Details
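As a minimal illustration of the arm switch described in the Motivation, the sketch below launches a server for one arm and sets the flag only for the experimental one. `serve_arm`, `TP`, and `VLLM_BIN` are hypothetical names, and setting the variable to 0 for the baseline is an assumption; in the PR itself the two environments are baked into separate images and driven by the scripts documented in the README.

```shell
# Launch the server for one arm (sketch; the PR uses baked images instead).
# Usage: serve_arm baseline | serve_arm experimental
# Hypothetical knobs: TP (tensor-parallel size), VLLM_BIN (command to invoke).
serve_arm() {
  case "$1" in
    baseline)     comms=0 ;;  # assumption: 0 leaves the stock all-reduce path
    experimental) comms=1 ;;  # flag named in the PR description
    *) echo "unknown arm: $1" >&2; return 2 ;;
  esac
  # The variable is set only for this one command invocation.
  VLLM_ROCM_USE_AITER_COMMS="$comms" \
    ${VLLM_BIN:-vllm} serve "$MODEL" \
    --host "$HOST" --port "$PORT" \
    --tensor-parallel-size "$TP"
}
```

Scoping the variable to the single `vllm serve` invocation, rather than exporting it, keeps one shell session from leaking the experimental setting into a later baseline run.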
Test Plan
Test Result
Submission Checklist