feat(scenarios): multi-variant WVA benchmark by biranofer · Pull Request #1451 · llm-d/llm-d-benchmark

biranofer · 2026-06-02T22:18:08Z

What

Adds a multi-variant benchmark scenario for the
Workload Variant Autoscaler (WVA):
one model deployed as two Deployments with differing variantCost,
both registered into the same InferencePool / EPP.

Addresses #1425 — "Add multi-variants benchmark for testing WVA behaviour."

Files

| Path | Role |
|---|---|
| `config/scenarios/guides/two-variant-wva.yaml` | Primary-variant standup (Llama-3.1-8B / H100, cost 10.0) with WVA + HPA per stack |
| `config/specification/guides/two-variant-wva.yaml.j2` | Spec wrapper |
| `config/scenarios/guides/two-variant-wva-v2-config.yaml` | Standalone ConfigMap that flips WVA into V2 saturation mode (the cost-aware analyzer) |
| `config/scenarios/guides/variants/v2-cost-only.yaml` | Variant override (cost-only secondary at cost 5.0) |
| `tools/add_variant.py` | Helper that builds the secondary `Deployment` / `VariantAutoscaling` / `HPA` from a variant override yaml |
| `docs/multi-variant-benchmark.md` | End-to-end recipe |
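
To illustrate the "variant override" idea from the table, here is a minimal sketch of how a secondary variant could be derived from the primary by overlaying a small override. This is not the code in `tools/add_variant.py`; the `deep_merge` helper, the field names, and the `hpa` values are assumptions made for illustration, and the real scenario and override schemas may differ.

```python
import copy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where override values win and nested dicts merge recursively."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical field names; only the model, accelerator, and costs come from the PR text.
primary = {
    "name": "primary",
    "model": "Llama-3.1-8B",
    "accelerator": "H100",
    "variantCost": 10.0,
    "hpa": {"minReplicas": 1, "maxReplicas": 5},  # illustrative bounds
}

# A cost-only override changes the name and the cost, nothing else.
cost_only_override = {"name": "v2", "variantCost": 5.0}

secondary = deep_merge(primary, cost_only_override)
print(secondary)
```

Everything the override does not mention (model, accelerator, replica bounds) carries over unchanged, which is what makes the secondary a cost-only variant of the same deployment shape.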

Why draft

Blocking dependency on a WVA chart release that includes
llm-d/llm-d-workload-variant-autoscaler#1198
(drop model_name filter from the cache_config_info query). Without
that fix, WVA's V2 analyzer falls back to the per-step batch budget
(~6.5K tokens/pod vs ~412K real KV on Llama-3.1-8B / H100),
under-counting capacity ~50× and making the cost-aware demonstration
ineffective.
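
As a quick check of the scale of the under-count, the ratio can be computed from the two rounded figures quoted above. Both inputs are approximations, so the result is only an order-of-magnitude estimate; with these numbers it comes out a little above 60x, in the same range as the ~50x cited.

```python
# Rounded per-pod figures quoted in the PR description.
batch_budget_tokens = 6_500    # per-step batch budget used as the fallback
real_kv_tokens = 412_000       # real KV capacity on Llama-3.1-8B / H100

undercount = real_kv_tokens / batch_budget_tokens
print(f"capacity under-counted by about {undercount:.0f}x")
```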

The fix landed on WVA main on 2026-05-27. The currently-tagged WVA
chart versions (v0.6.0, v0.7.0) predate it. This PR pins the
chart at v0.7.0 to align with current upstream conventions; once a
chart release that includes #1198 is cut, the tag: in the scenario
yaml will be bumped and the PR converted to ready-for-review.

Follow-ups (not in this PR)

- WVA chart-side change to expose analyzerName /
  scaleUpThreshold / scaleDownBoundary in the chart's values
  surface. Once that lands, the standalone
  two-variant-wva-v2-config.yaml ConfigMap and the "Step 2 — enable
  V2" instruction in the doc become unnecessary (analyzerName would
  be set inline in the scenario yaml's wva.capacityScaling.default).
- HPA-EPP comparison scenario (separate PR) that toggles the same
  topology between WVA-driven scaling and a per-Deployment HPA-EPP
  baseline, so we can quantify WVA's cost-per-successful-request
  advantage with the same workload and infra.
- Shape-diversity variants (e.g. v2-tp2.yaml with TP=2 / 2 GPUs)
  for "different hardware shape" demonstrations on a single-hardware
  cluster.
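
One possible way to define the cost-per-successful-request metric mentioned in the HPA-EPP follow-up is sketched below. The definition (replica-seconds weighted by variantCost, divided by successful requests) is an assumption, not something this PR specifies, and the replica-seconds and request counts are made-up illustration values; only the two cost values come from the PR.

```python
def cost_per_successful_request(replica_seconds, variant_cost, successful_requests):
    """Sum replica-seconds weighted by each variant's cost, per successful request."""
    if successful_requests <= 0:
        raise ValueError("successful_requests must be positive")
    total_cost = sum(
        seconds * variant_cost[name] for name, seconds in replica_seconds.items()
    )
    return total_cost / successful_requests


variant_cost = {"primary": 10.0, "v2": 5.0}       # costs from the PR
replica_seconds = {"primary": 600, "v2": 1800}    # hypothetical run totals
result = cost_per_successful_request(replica_seconds, variant_cost, 3000)
print(result)
```

Computing the same quantity for a WVA-driven run and an HPA-EPP baseline under an identical workload would give a directly comparable pair of numbers.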

Testing

End-to-end on an OpenShift cluster with H100 GPUs:

NS=<your-namespace>
llmdbenchmark --spec guides/two-variant-wva standup -p $NS
kubectl apply -n $NS -f config/scenarios/guides/two-variant-wva-v2-config.yaml
python tools/add_variant.py -n $NS \
    --config config/scenarios/guides/variants/v2-cost-only.yaml
llmdbenchmark --spec guides/two-variant-wva run \
    -p $NS -l guidellm -w prefill_heavy.yaml

Verified end-to-end with a WVA main-built image (post-#1198) on
Llama-3.1-8B at rate=5: primary held at minReplicas=1 throughout, v2
absorbed the load curve up to ~5 replicas — the cost-aware behavior the
issue calls for. With the upstream v0.6.0/v0.7.0 chart (without
#1198), both variants run away to maxReplicas instead.
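
The observed behavior (primary pinned at its minimum while v2 absorbs the load) matches what a simple cheapest-first allocation would produce. The sketch below is a toy model of that idea only; it is not WVA's actual analyzer, and the `Variant` class, `cheapest_first` function, and replica bounds are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    name: str
    cost: float
    min_replicas: int
    max_replicas: int


def cheapest_first(variants, total_replicas):
    """Start each variant at its minimum, then fill the cheapest variants first."""
    plan = {v.name: v.min_replicas for v in variants}
    remaining = total_replicas - sum(plan.values())
    for v in sorted(variants, key=lambda v: v.cost):
        if remaining <= 0:
            break
        extra = min(remaining, v.max_replicas - plan[v.name])
        plan[v.name] += extra
        remaining -= extra
    return plan


# Costs come from the PR; the replica bounds are illustrative assumptions.
variants = [
    Variant("primary", cost=10.0, min_replicas=1, max_replicas=5),
    Variant("v2", cost=5.0, min_replicas=1, max_replicas=5),
]

for demand in (2, 4, 6, 8):
    print(demand, cheapest_first(variants, demand))
```

In this toy model the primary only grows once v2 has reached its maximum, so any demand up to six replicas leaves the primary at one.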

Adds a two-variant scenario that deploys one model as two Deployments at different variantCost values, both registering into the same InferencePool / EPP. Exercises WVA's cost-aware optimizer, which steers scale-up toward the cheaper variant first under saturation. Addresses llm-d#1425.

Files:
config/scenarios/guides/two-variant-wva.yaml - primary standup
config/specification/guides/two-variant-wva.yaml.j2 - spec wrapper
config/scenarios/guides/two-variant-wva-v2-config.yaml - V2 saturation cm
config/scenarios/guides/variants/v2-cost-only.yaml - secondary at cost 5.0
tools/add_variant.py - secondary Deployment/VA/HPA generator
docs/multi-variant-benchmark.md - end-to-end recipe

Opening as DRAFT due to release dependency on llm-d/llm-d-workload-variant-autoscaler#1198 (drop model_name filter from the cache_config_info query). Without that fix, V2 falls back to the per-step batch budget (~6.5K tokens/pod vs ~412K real KV on Llama-3.1-8B / H100), under-counting capacity ~50x and the cost-aware demonstration collapses. Fix landed on WVA main 2026-05-27 but is not in tagged WVA chart releases through v0.7.0.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The "Topology B" label was an internal shorthand and isn't established in the llm-d / llm-d-benchmark / WVA upstream docs. Replace with the plain functional description ("one shared InferencePool/EPP fed by two Deployments") so reviewers don't need to know an undefined naming convention.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

biranofer and others added 2 commits June 3, 2026 01:17

biranofer changed the title ~~feat(scenarios): multi-variant WVA benchmark (Topology B)~~ feat(scenarios): multi-variant WVA benchmark Jun 3, 2026

biranofer wants to merge 2 commits into llm-d:main from biranofer:feat/multi-variant-benchmark
