feat(scenarios): multi-variant WVA benchmark #1451

Draft: biranofer wants to merge 2 commits into llm-d:main from biranofer:feat/multi-variant-benchmark

biranofer commented on Jun 2, 2026

What

Adds a multi-variant benchmark scenario for the
Workload Variant Autoscaler (WVA):
one model deployed as two Deployments with differing variantCost,
both registered into the same InferencePool / EPP.

Addresses #1425, "Add multi-variants benchmark for testing WVA behaviour."

Files

| Path | Role |
| --- | --- |
| config/scenarios/guides/two-variant-wva.yaml | Primary-variant standup (Llama-3.1-8B / H100, cost 10.0) with WVA + HPA per stack |
| config/specification/guides/two-variant-wva.yaml.j2 | Spec wrapper |
| config/scenarios/guides/two-variant-wva-v2-config.yaml | Standalone ConfigMap that flips WVA into V2 saturation mode (the cost-aware analyzer) |
| config/scenarios/guides/variants/v2-cost-only.yaml | Variant override (cost-only secondary at cost 5.0) |
| tools/add_variant.py | Helper that builds the secondary Deployment / VariantAutoscaling / HPA from a variant override yaml |
| docs/multi-variant-benchmark.md | End-to-end recipe |

Why draft

Blocking dependency on a WVA chart release that includes
llm-d/llm-d-workload-variant-autoscaler#1198
(drop model_name filter from the cache_config_info query). Without
that fix, WVA's V2 analyzer falls back to the per-step batch budget
(~6.5K tokens/pod vs ~412K real KV on Llama-3.1-8B / H100),
under-counting capacity ~50× and making the cost-aware demonstration
ineffective.
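
To get a feel for why the fallback matters, the sketch below plugs the rounded per-pod figures quoted above into a deliberately simple sizing model. The function `replicas_needed`, the ceiling-division rule, and the 400K-token demand are illustrative assumptions, not WVA's actual analyzer logic.

```python
import math

# Rounded per-pod capacities quoted in this PR description (tokens).
FALLBACK_BUDGET_TOKENS = 6_500   # per-step batch budget used as the fallback
REAL_KV_TOKENS = 412_000         # real KV capacity on Llama-3.1-8B / H100


def replicas_needed(demand_tokens: int, capacity_per_pod: int) -> int:
    """Hypothetical sizing rule: enough pods to hold the demand, at least one."""
    return max(1, math.ceil(demand_tokens / capacity_per_pod))


# Assumed demand, chosen only for illustration.
demand = 400_000

print("pods with real KV capacity:", replicas_needed(demand, REAL_KV_TOKENS))
print("pods with fallback budget:", replicas_needed(demand, FALLBACK_BUDGET_TOKENS))
```

Under this toy model, a demand that a single pod could hold appears to need dozens of pods once capacity is under-counted, which would push any realistic maxReplicas to its limit regardless of cost.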

The fix landed on WVA main on 2026-05-27. The currently-tagged WVA
chart versions (v0.6.0, v0.7.0) predate it. This PR pins the
chart at v0.7.0 to align with current upstream conventions; once a
chart release that includes #1198 is cut, the tag: in the scenario
yaml will be bumped and the PR converted to ready-for-review.

Follow-ups (not in this PR)

  1. WVA chart-side change to expose analyzerName /
    scaleUpThreshold / scaleDownBoundary in the chart's values
    surface. Once that lands, the standalone
    two-variant-wva-v2-config.yaml ConfigMap and the "Step 2 — enable
    V2" instruction in the doc become unnecessary (analyzerName would
    be set inline in the scenario yaml's wva.capacityScaling.default).
  2. HPA-EPP comparison scenario (separate PR) that toggles the same
    topology between WVA-driven scaling and a per-Deployment HPA-EPP
    baseline, so we can quantify WVA's cost-per-successful-request
    advantage with the same workload and infra.
  3. Shape-diversity variants (e.g. v2-tp2.yaml with TP=2 / 2 GPUs)
    for "different hardware shape" demonstrations on a single-hardware
    cluster.

Testing

End-to-end on an OpenShift cluster with H100 GPUs:

NS=<your-namespace>
llmdbenchmark --spec guides/two-variant-wva standup -p $NS
kubectl apply -n $NS -f config/scenarios/guides/two-variant-wva-v2-config.yaml
python tools/add_variant.py -n $NS \
    --config config/scenarios/guides/variants/v2-cost-only.yaml
llmdbenchmark --spec guides/two-variant-wva run \
    -p $NS -l guidellm -w prefill_heavy.yaml

Verified end-to-end with a WVA main-built image (post-#1198) on
Llama-3.1-8B at rate=5: primary held at minReplicas=1 throughout, v2
absorbed the load curve up to ~5 replicas — the cost-aware behavior the
issue calls for. With the upstream v0.6.0/v0.7.0 chart (without
#1198), both variants run away to maxReplicas instead.
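
The observed split (primary pinned at its minimum, v2 absorbing the load) is what a cheapest-first rule would produce. The following is a minimal sketch of such a rule, not WVA's optimizer: the `Variant` class, the `allocate` function, `max_replicas=8`, the v2 minimum of 1, and the demand of 6 replicas are all hypothetical; only the costs (10.0 and 5.0) and the primary's minReplicas=1 come from this PR.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    name: str
    cost: float        # plays the role of variantCost
    min_replicas: int
    max_replicas: int  # hypothetical ceiling for this sketch


def allocate(variants: list[Variant], total_needed: int) -> dict[str, int]:
    """Toy greedy rule: start at each minimum, then add replicas cheapest-first."""
    alloc = {v.name: v.min_replicas for v in variants}
    remaining = total_needed - sum(alloc.values())
    for v in sorted(variants, key=lambda v: v.cost):
        if remaining <= 0:
            break
        extra = min(remaining, v.max_replicas - alloc[v.name])
        alloc[v.name] += extra
        remaining -= extra
    return alloc


variants = [
    Variant("primary", cost=10.0, min_replicas=1, max_replicas=8),
    Variant("v2", cost=5.0, min_replicas=1, max_replicas=8),
]
print(allocate(variants, total_needed=6))  # {'primary': 1, 'v2': 5}
```

The more expensive primary only grows in this sketch once the cheaper variant has reached its ceiling.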

biranofer and others added 2 commits June 3, 2026 01:17
Adds a two-variant scenario that deploys one model as two Deployments at
different variantCost values, both registering into the same
InferencePool / EPP. Exercises WVA's cost-aware optimizer, which steers
scale-up toward the cheaper variant first under saturation.

Addresses llm-d#1425.

Files:
  config/scenarios/guides/two-variant-wva.yaml - primary standup
  config/specification/guides/two-variant-wva.yaml.j2 - spec wrapper
  config/scenarios/guides/two-variant-wva-v2-config.yaml - V2 saturation cm
  config/scenarios/guides/variants/v2-cost-only.yaml - secondary at cost 5.0
  tools/add_variant.py - secondary Deployment/VA/HPA generator
  docs/multi-variant-benchmark.md - end-to-end recipe

Opening as DRAFT due to release dependency on
llm-d/llm-d-workload-variant-autoscaler#1198 (drop model_name filter from
the cache_config_info query). Without that fix, V2 falls back to the
per-step batch budget (~6.5K tokens/pod vs ~412K real KV on Llama-3.1-8B /
H100), under-counting capacity ~50x, and the cost-aware demonstration
collapses. The fix landed on WVA main on 2026-05-27 but is not in tagged WVA
chart releases through v0.7.0.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The "Topology B" label was an internal shorthand and isn't established
in the llm-d / llm-d-benchmark / WVA upstream docs. Replace with the
plain functional description ("one shared InferencePool/EPP fed by two
Deployments") so reviewers don't need to know an undefined naming
convention.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
biranofer changed the title from "feat(scenarios): multi-variant WVA benchmark (Topology B)" to "feat(scenarios): multi-variant WVA benchmark" on Jun 3, 2026