feat(scenarios): multi-variant WVA benchmark#1451
Draft
biranofer wants to merge 2 commits into
Adds a two-variant scenario that deploys one model as two Deployments at different variantCost values, both registering into the same InferencePool / EPP. Exercises WVA's cost-aware optimizer, which steers scale-up toward the cheaper variant first under saturation. Addresses llm-d#1425.

Files:
  config/scenarios/guides/two-variant-wva.yaml - primary standup
  config/specification/guides/two-variant-wva.yaml.j2 - spec wrapper
  config/scenarios/guides/two-variant-wva-v2-config.yaml - V2 saturation cm
  config/scenarios/guides/variants/v2-cost-only.yaml - secondary at cost 5.0
  tools/add_variant.py - secondary Deployment/VA/HPA generator
  docs/multi-variant-benchmark.md - end-to-end recipe

Opening as DRAFT due to release dependency on llm-d/llm-d-workload-variant-autoscaler#1198 (drop model_name filter from the cache_config_info query). Without that fix, V2 falls back to the per-step batch budget (~6.5K tokens/pod vs ~412K real KV on Llama-3.1-8B / H100), under-counting capacity ~50x, and the cost-aware demonstration collapses. Fix landed on WVA main 2026-05-27 but is not in tagged WVA chart releases through v0.7.0.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "Topology B" label was an internal shorthand and isn't established
in the llm-d / llm-d-benchmark / WVA upstream docs. Replace with the
plain functional description ("one shared InferencePool/EPP fed by two
Deployments") so reviewers don't need to know an undefined naming
convention.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
What
Adds a multi-variant benchmark scenario for the Workload Variant Autoscaler (WVA): one model deployed as two Deployments with differing variantCost, both registered into the same InferencePool / EPP.

Addresses #1425: "Add multi-variants benchmark for testing WVA behaviour."
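The cheapest-first scale-up that this scenario is meant to exercise can be pictured with a small toy model. This is only a sketch under stated assumptions, not WVA's optimizer: the `Variant` class and `scale_up_cheapest_first` function are hypothetical names, the primary's cost of 10.0 and both `max_replicas` values are assumed for illustration, and only the secondary's cost of 5.0 comes from this PR.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    name: str
    cost: float        # stands in for variantCost
    replicas: int      # current replica count
    max_replicas: int  # assumed upper bound for this toy model


def scale_up_cheapest_first(variants, extra_replicas):
    """Hand out extra replicas to the cheapest variant with headroom first."""
    plan = {v.name: v.replicas for v in variants}
    remaining = extra_replicas
    for v in sorted(variants, key=lambda v: v.cost):
        if remaining <= 0:
            break
        # Never grant more than the variant's remaining headroom.
        granted = max(0, min(remaining, v.max_replicas - v.replicas))
        plan[v.name] += granted
        remaining -= granted
    return plan


# Hypothetical values: only the secondary's cost of 5.0 comes from the PR;
# the primary's cost and both max_replicas values are assumed.
variants = [
    Variant("primary", cost=10.0, replicas=1, max_replicas=8),
    Variant("v2", cost=5.0, replicas=1, max_replicas=8),
]
plan = scale_up_cheapest_first(variants, extra_replicas=4)
print(plan)  # {'primary': 1, 'v2': 5}
```

In this toy model the more expensive variant only grows once the cheaper one has run out of headroom, which is the shape of behavior the benchmark looks for.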
Files
config/scenarios/guides/two-variant-wva.yaml - primary standup
config/specification/guides/two-variant-wva.yaml.j2 - spec wrapper
config/scenarios/guides/two-variant-wva-v2-config.yaml - V2 saturation ConfigMap
config/scenarios/guides/variants/v2-cost-only.yaml - secondary variant at cost 5.0
tools/add_variant.py - generates the secondary Deployment/VariantAutoscaling/HPA from a variant override yaml
docs/multi-variant-benchmark.md - end-to-end recipe

Why draft
Blocking dependency on a WVA chart release that includes llm-d/llm-d-workload-variant-autoscaler#1198 (drop model_name filter from the cache_config_info query). Without that fix, WVA's V2 analyzer falls back to the per-step batch budget (~6.5K tokens/pod vs ~412K real KV on Llama-3.1-8B / H100), under-counting capacity ~50× and making the cost-aware demonstration ineffective.
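To make the size of the under-count concrete, the arithmetic can be sketched as follows. The two capacity figures are the rounded values quoted above; with those roundings the ratio comes out near 63, the same order of magnitude as the ~50× cited. The `replicas_needed` sizing rule and the demand figure are hypothetical and are not taken from WVA's analyzer.

```python
import math

# Rounded figures quoted above (tokens per pod on Llama-3.1-8B / H100).
FALLBACK_BUDGET_TOKENS = 6_500   # per-step batch budget used as the fallback
REAL_KV_TOKENS = 412_000         # real KV-cache capacity

undercount = REAL_KV_TOKENS / FALLBACK_BUDGET_TOKENS


def replicas_needed(demand_tokens, capacity_per_pod):
    """Toy sizing rule: enough pods to hold the demand, at least one."""
    return max(1, math.ceil(demand_tokens / capacity_per_pod))


# Hypothetical aggregate demand, chosen only for illustration.
demand = 1_000_000
print(round(undercount))                                # 63
print(replicas_needed(demand, REAL_KV_TOKENS))          # 3
print(replicas_needed(demand, FALLBACK_BUDGET_TOKENS))  # 154
```

Under a sizing rule like this one, any realistic replica cap is reached long before the fallback estimate is satisfied, which would leave a cost-aware preference with little room to show.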
The fix landed on WVA main on 2026-05-27. The currently-tagged WVA chart versions (v0.6.0, v0.7.0) predate it. This PR pins the chart at v0.7.0 to align with current upstream conventions; once a chart release that includes #1198 is cut, the tag: in the scenario yaml will be bumped and the PR converted to ready-for-review.
Follow-ups (not in this PR)
- [...] analyzerName/scaleUpThreshold/scaleDownBoundary in the chart's values surface. Once that lands, the standalone two-variant-wva-v2-config.yaml ConfigMap and the "Step 2 — enable V2" instruction in the doc become unnecessary (analyzerName would be set inline in the scenario yaml's wva.capacityScaling.default).
- [...] topology between WVA-driven scaling and a per-Deployment HPA-EPP baseline, so we can quantify WVA's cost-per-successful-request advantage with the same workload and infra.
- [...] v2-tp2.yaml with TP=2 / 2 GPUs) for "different hardware shape" demonstrations on a single-hardware cluster.
Testing
End-to-end on an OpenShift cluster with H100 GPUs:
Verified end-to-end with a WVA main-built image (post-#1198) on Llama-3.1-8B at rate=5: primary held at minReplicas=1 throughout, and v2 absorbed the load curve up to ~5 replicas, which is the cost-aware behavior the issue calls for. With the upstream v0.6.0 / v0.7.0 chart (without #1198), both variants run away to maxReplicas instead.
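The two outcomes described above can be told apart mechanically from sampled replica counts. The check below is a minimal sketch, assuming the per-variant replica counts have already been collected into lists; `classify_run`, the sample traces, and the `max_replicas=8` value are hypothetical and not part of this PR.

```python
def classify_run(primary_replicas, v2_replicas, min_replicas, max_replicas):
    """Label a run from per-sample replica counts of the two variants."""
    both_hit_max = (
        max(primary_replicas) >= max_replicas
        and max(v2_replicas) >= max_replicas
    )
    if both_hit_max:
        return "runaway"
    primary_flat = all(r == min_replicas for r in primary_replicas)
    if primary_flat and max(v2_replicas) > min_replicas:
        return "cost-aware"
    return "inconclusive"


# Hypothetical traces shaped like the two outcomes reported above.
with_fix = classify_run([1, 1, 1, 1, 1], [1, 2, 4, 5, 3], 1, 8)
without_fix = classify_run([1, 4, 8, 8, 8], [1, 5, 8, 8, 8], 1, 8)
print(with_fix)     # cost-aware
print(without_fix)  # runaway
```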