
Add Cilium ClusterMesh scale-test scenario #1157

Draft

skosuri1 wants to merge 46 commits into main from skosuri/clustermesh-scale

Conversation

skosuri1 commented Apr 28, 2026

Draft PR for the Cilium ClusterMesh scale-test scenario on AKS. Phase 1 + Phase 2 validated end-to-end on consecutive green runs; pre-merge cleanup pending (see end).

What this adds

A new perf-eval scenario, clustermesh-scale, that scale-tests Cilium ClusterMesh on AKS end-to-end through Telescope.

The scenario provisions N AKS clusters in separate VNets (with peering), joins them into a Cilium ClusterMesh via Azure Fleet Manager's ClusterMeshProfile, runs a ClusterLoader2 workload on every cluster, and aggregates per-cluster results into a single JSONL keyed by source/target cluster.

Reference: #1053 (CNL) — same structural patterns, but #1053 is single-cluster. This PR introduces the multi-cluster execution + aggregation path.
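To make the aggregation step concrete, here is a minimal sketch of the idea (field and helper names are illustrative assumptions; the real logic lives in scale.py and the collect steps):

```python
import json
from pathlib import Path

def aggregate_reports(report_dir: Path, out_file: Path) -> None:
    """Merge per-cluster JSONL reports into one JSONL, tagging each record
    with the cluster it came from. Field names are illustrative only."""
    with out_file.open("w") as out:
        for cluster_dir in sorted(p for p in report_dir.iterdir() if p.is_dir()):
            for jsonl in sorted(cluster_dir.glob("*.jsonl")):
                for line in jsonl.read_text().splitlines():
                    if not line.strip():
                        continue
                    record = json.loads(line)
                    record["source_cluster"] = cluster_dir.name  # e.g. mesh-1, mesh-2
                    out.write(json.dumps(record) + "\n")

# Usage (hypothetical layout):
# aggregate_reports(Path("report"), Path("report/aggregated.jsonl"))
```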

Work is split into phases:

  • Phase 1: the vertical slice with all wiring (terraform modules, Python harness, topology/engine YAMLs, pipeline YAML, cross-cluster data-path smoke test) at N=2.
  • Phase 2: full observability (Cilium + clustermesh-apiserver + etcd + logs per spec testing.txt lines 34–38, 131–135) and the first real scenario (Cross-Cluster Event Throughput, scenario #1 in the spec).
  • Phase 3: runs at 5 / 10 / 20 clusters, bumps the baseline cluster size to 20 nodes (spec line 24), and builds Kusto dashboards.
  • Phase 4: the remaining six scenarios.
  • Phase 5: polish, tests, and the official pipeline request.

Vertical-slice ordering is intentional: every later phase reuses Phase 1's plumbing, so debugging at N=2 is far cheaper than at N=20.

Phase 1 contents:

  • New terraform submodules fleet/ and vnet-peering/, plus a --pod-subnet-id flag in aks-cli/.
  • Scenario tfvars under scenarios/perf-eval/clustermesh-scale/ with a vendored fleet-2.0.4 wheel.
  • A per-cluster prompool extra_node_pool (Standard_D8s_v3 × 1, label prometheus=true) for Prometheus isolation, mirroring the prompool pattern in cnl-azurecni-overlay-cilium.
  • --max-pods=110 on the default pool to fit the 200-pod workload (Phase 2).
  • A single-cluster scale.py harness with --cluster-name for per-cluster JSONL attribution.
  • A PodMonitor scraping ports 9963 (etcd) and 9964 (kvstoremesh) on the clustermesh-apiserver pod.
  • Topology and engine YAMLs that discover clusters via Azure resource tags, fan CL2 out sequentially across clusters (parallel fan-out is Phase 3), and aggregate the per-cluster JSONLs (see the sketch below).
  • A cross-cluster data-path smoke test (global Service curl from mesh-2 to mesh-1) gating CL2, so we don't ship a "green" Phase 1 with the control plane healthy but the data plane broken.
  • A conditional Fleet-wheel install in setup-tests.yml, gated on the scenario name (no impact on other scenarios).
  • A pipeline YAML in pipelines/perf-eval/Network Benchmark/clustermesh-scale.yml, running weekly in eastus2euap with manual triggers always available.
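The sequential fan-out amounts to something like the following sketch (the subcommand name and kubeconfig layout are assumptions for illustration; only --cluster-name is the actual flag added in this PR):

```python
import subprocess

def run_cl2_sequentially(clusters: list[str], kubeconfig_dir: str) -> None:
    """Run the CL2 harness once per cluster, one cluster at a time.
    Parallel fan-out with bounded concurrency is deferred to Phase 3."""
    for cluster in clusters:
        subprocess.run(
            [
                "python3", "scale.py", "execute",          # subcommand name is an assumption
                "--cluster-name", cluster,                 # real flag: per-cluster JSONL attribution
                "--kubeconfig", f"{kubeconfig_dir}/{cluster}.yaml",  # layout is an assumption
            ],
            check=True,  # fail fast so a broken cluster surfaces immediately
        )

# Clusters are discovered from Azure resource tags by the topology YAML;
# hard-coded here only for illustration:
# run_cl2_sequentially(["mesh-1", "mesh-2"], "/tmp/kubeconfigs")
```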

External prerequisite: the pipeline service principal needs Microsoft.Authorization/roleAssignments/write at subscription scope, granted as Role Based Access Control Administrator with an ABAC condition restricting assignments to the Network Contributor role GUID. The Cilium-managed clustermesh-apiserver provisions an internal load balancer in each cluster's VNet, and the pipeline needs Network Contributor on that VNet to mutate it.
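A hedged example of what granting that prerequisite can look like, driven through the az CLI from Python (the object ID, subscription, role-definition GUID, and condition string are placeholders/assumptions, not values from this PR):

```python
# Illustrative only: grant the pipeline SP "Role Based Access Control
# Administrator" at subscription scope, with an ABAC condition limiting
# which role definitions it may assign.
import subprocess

SP_OBJECT_ID = "<pipeline-sp-object-id>"
SUBSCRIPTION_ID = "<subscription-id>"
NETWORK_CONTRIBUTOR_ROLE_GUID = "<network-contributor-role-definition-guid>"

condition = (
    "((!(ActionMatches{'Microsoft.Authorization/roleAssignments/write'})) OR "
    "(@Request[Microsoft.Authorization/roleAssignments:RoleDefinitionId] "
    f"ForAnyOfAnyValues:GuidEquals {{{NETWORK_CONTRIBUTOR_ROLE_GUID}}}))"
)

subprocess.run(
    [
        "az", "role", "assignment", "create",
        "--assignee-object-id", SP_OBJECT_ID,
        "--assignee-principal-type", "ServicePrincipal",
        "--role", "Role Based Access Control Administrator",
        "--scope", f"/subscriptions/{SUBSCRIPTION_ID}",
        "--condition", condition,
        "--condition-version", "2.0",
    ],
    check=True,
)
```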

Phase 2 contents:

  • Scale-scenario #1 (Cross-Cluster Event Throughput): a new event-throughput.yaml CL2 config and supporting modules that deploy N namespaces × M deployments × R replicas (default 5/4/10 = 200 pods) across the mesh, then exercise create / warmup / rolling-restart burst / settle / delete to drive measurable kvstore traffic.
  • Reused cilium.yaml and control-plane.yaml from PR #1053 (cilium agent/operator CPU/mem, container restarts, API responsiveness, pod startup latency).
  • New clustermesh-metrics.yaml for always-on mesh measurements: remote clusters connected, remote cluster failures, kvstore events rate (aggregate and per-type by scope label: Identities/Services/Endpoints), kvstore operation duration p50/p90/p99, watch queue depth, identity count.
  • New clustermesh-throughput.yaml for scenario-specific measurements: event backlog rate, global services count, kvstore op-duration p95 split.
  • New etcd-metrics.yaml covering the embedded etcd inside clustermesh-apiserver: watch count, slow watchers, pending events, MVCC keys, compaction keys and duration, backend write latency. Sourced via the existing PodMonitor on port 9963, so no new scrape target is needed and all of spec testing.txt line 34 (Cilium / clustermesh-apiserver / etcd) is covered with one PodMonitor.
  • Pod logs (clustermesh-apiserver × 3 containers, cilium-agent, cilium-operator) archived to $report_dir/logs per spec line 35.
  • Network bytes per component (Tx/Rx) added to cilium.yaml per spec line 38.
  • A junit-aware success gate in execute.yml that distinguishes CL2 logic failures from infra failures (see the sketch below).
  • Python unit tests and mock-data fixtures covering single-cluster, multi-cluster aggregation, and failure paths.
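A rough sketch of the junit-aware gate idea (the real gate is implemented in execute.yml; the report layout and exit-code convention below are assumptions):

```python
# 0 = success, 1 = CL2 logic failure, 2 = infra failure (no results at all).
import sys
import xml.etree.ElementTree as ET
from pathlib import Path

def gate(report_dir: Path) -> int:
    junit_files = list(report_dir.glob("**/junit.xml"))
    if not junit_files:
        print("infra failure: CL2 produced no junit.xml")
        return 2
    failures = 0
    for junit in junit_files:
        root = ET.parse(junit).getroot()
        for suite in root.iter("testsuite"):
            failures += int(suite.get("failures", 0)) + int(suite.get("errors", 0))
    if failures:
        print(f"CL2 logic failure: {failures} failing test cases")
        return 1
    print("success")
    return 0

if __name__ == "__main__":
    sys.exit(gate(Path(sys.argv[1])))
```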

Fleet members with labels and ClusterMeshProfile have no native Terraform support today, so both go through terraform_data + local-exec az calls, following the aks-cli/main.tf precedent. The az fleet 2.0.4 extension also exposes no detach/remove-member API, so destroying a clustermesh hits a chicken-and-egg problem: member delete is rejected while the member is in any profile, and clustermeshprofile delete is rejected while members exist. The destroy provisioner on terraform_data.clustermeshprofile breaks this by relabeling members off the profile selector (az fleet member update --labels REPLACES the labels map, dropping mesh=true), re-applying the profile, polling list-members until the applied set drains to 0 (10-minute budget, with a periodic re-apply nudge in case the first apply was a no-op), and finally deleting the profile with a 30×5s backstop retry. Tested across multiple consecutive runs; the post-timeout backstop covers cases where the Fleet RP's list-members view lags the actual deletable state.
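For readers who want the drain-and-delete flow at a glance, here is an illustrative Python rendering of that provisioner logic (the actual implementation is a local-exec script; command and argument shapes for the vendored fleet 2.0.4 extension are approximations, only the flow mirrors the PR):

```python
import json
import subprocess
import time

def az(*args: str) -> str:
    return subprocess.run(["az", *args], check=True,
                          capture_output=True, text=True).stdout

def destroy_clustermesh_profile(group: str, fleet: str, profile: str,
                                members: list[str]) -> None:
    # 1. Relabel members off the profile selector. `az fleet member update
    #    --labels` REPLACES the labels map, so omitting mesh=true drops it.
    for member in members:
        az("fleet", "member", "update", "-g", group, "--fleet-name", fleet,
           "-n", member, "--labels", "scenario=clustermesh-scale")

    # 2. Re-apply the profile, then poll until its applied member set drains
    #    to 0 (10-minute budget). The real script also re-applies periodically
    #    in case the first apply was a no-op.
    deadline = time.time() + 600
    while time.time() < deadline:
        applied = json.loads(az("fleet", "clustermeshprofile", "list-members",
                                "-g", group, "--fleet-name", fleet, "-n", profile))
        if not applied:
            break
        time.sleep(15)

    # 3. Delete the profile, with a 30 x 5s backstop retry to cover Fleet RP
    #    view lag after the timeout.
    for _ in range(30):
        try:
            az("fleet", "clustermeshprofile", "delete", "-g", group,
               "--fleet-name", fleet, "-n", profile, "--yes")
            return
        except subprocess.CalledProcessError:
            time.sleep(5)
```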

Known limitations / deferred to Phase 3+:

  • Cross-cluster propagation latency (spec line 54) is approximated by the p99 of cilium_kvstoremesh_kvstore_operations_duration_seconds_bucket. A synthetic probe (pod created on A → visible on B at time T) is not implemented because CL2's per-cluster execution model doesn't natively support cross-cluster timing. See the query sketch after this list.
  • Etcd Compaction Duration histogram returns "no samples" on short runs because etcd is configured with --auto-compaction-retention=1h; the metric is wired and will populate during Phase 3 long runs.
  • Per-cluster CL2 runs sequentially, not in parallel. Parallel fan-out (with bounded concurrency) is deferred to Phase 3 — at N=20 with per-cluster Prometheus, the AzDO agent would be CPU/RAM bound.
  • Cluster size is 3 nodes (2 default + 1 prompool) for Phase 1/2 harness validation; Phase 3 bumps to the spec's 20-node baseline (line 24).
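As referenced in the first bullet above, a hedged sketch of the p99 approximation queried through the Prometheus HTTP API (the Prometheus endpoint and 5m rate window are assumptions):

```python
import requests

PROMETHEUS = "http://localhost:9090"
QUERY = (
    "histogram_quantile(0.99, sum by (le) ("
    "rate(cilium_kvstoremesh_kvstore_operations_duration_seconds_bucket[5m])))"
)

resp = requests.get(f"{PROMETHEUS}/api/v1/query",
                    params={"query": QUERY}, timeout=30)
resp.raise_for_status()
for sample in resp.json()["data"]["result"]:
    print(sample["metric"], sample["value"])
```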

Known infra flakes (accepted, not fixed):

  • AKS RP race between --enable-acns extension reconcile and prompool add → OperationNotAllowed: PutExtensionAddonHandler.PUT in progress. Recurs ~1 in 5 runs; AzDO RetryHelper absorbs in 2-3 attempts.
  • Azure VMSS LB-sync 412 PreconditionFailed on the clustermesh-apiserver Service (concurrent VMSS modification race). Rare; observed once, resolved by next build.

Changes:

ClusterMesh / Fleet / VNet peering setup

  • modules/terraform/azure/fleet/{main.tf, variables.tf, outputs.tf, versions.tf}
  • modules/terraform/azure/vnet-peering/{main.tf, variables.tf, outputs.tf}
  • scenarios/perf-eval/clustermesh-scale/terraform-inputs/azure-2.tfvars
  • modules/python/clusterloader2/clustermesh-scale/config/modules/clustermesh.yaml
  • modules/python/clusterloader2/clustermesh-scale/config/modules/clustermesh/podmonitor.yaml

Terraform / AKS provisioning

  • modules/terraform/azure/main.tf
  • modules/terraform/azure/variables.tf
  • modules/terraform/azure/aks-cli/main.tf
  • scenarios/perf-eval/clustermesh-scale/terraform-test-inputs/azure-2.json

Mesh validation + cross-cluster data path

  • steps/topology/clustermesh-scale/validate-resources.yml

Telescope pipeline / step conventions

  • pipelines/perf-eval/Network Benchmark/clustermesh-scale.yml
  • pipelines/system/new-pipeline-test.yml (dev-only — will revert before merge)
  • jobs/competitive-test.yml (added skip_publish parameter for dev runs)
  • steps/setup-tests.yml (only edit to a shared file)
  • steps/topology/clustermesh-scale/{execute-clusterloader2.yml, collect-clusterloader2.yml}
  • steps/engine/clusterloader2/clustermesh-scale/{execute.yml, collect.yml}

ClusterLoader2 harness + multi-cluster aggregation

  • modules/python/clusterloader2/clustermesh-scale/scale.py
  • modules/python/clusterloader2/clustermesh-scale/__init__.py
  • modules/python/clusterloader2/utils.py (added prometheus_memory_request CLI flag)
  • modules/python/clusterloader2/clustermesh-scale/config/config.yaml
  • modules/python/clusterloader2/clustermesh-scale/config/event-throughput.yaml
  • modules/python/clusterloader2/clustermesh-scale/config/modules/scale-test.yaml
  • modules/python/clusterloader2/clustermesh-scale/config/modules/scale-test-deployment.yaml
  • modules/python/clusterloader2/clustermesh-scale/config/modules/event-throughput-{deployment,service,workload}.yaml

CL2 measurement modules (Phase 2)

  • modules/python/clusterloader2/clustermesh-scale/config/modules/measurements/cilium.yaml
  • modules/python/clusterloader2/clustermesh-scale/config/modules/measurements/control-plane.yaml
  • modules/python/clusterloader2/clustermesh-scale/config/modules/measurements/clustermesh-metrics.yaml
  • modules/python/clusterloader2/clustermesh-scale/config/modules/measurements/clustermesh-throughput.yaml
  • modules/python/clusterloader2/clustermesh-scale/config/modules/measurements/etcd-metrics.yaml

Tests

  • modules/python/tests/test_clustermesh_scale.py
  • modules/python/tests/mock_data/clustermesh-scale/report/{mesh-1,mesh-2,mesh-fail}/{junit.xml, GenericPrometheusQuery_*.json}

Vendored binary

  • scenarios/perf-eval/clustermesh-scale/vendor/fleet-2.0.4-py3-none-any.whl

Pre-merge cleanup pending:

  • Strip DEBUG-DUMP block from steps/topology/clustermesh-scale/validate-resources.yml
  • Revert pipelines/system/new-pipeline-test.yml to its original placeholder

skosuri1 (Author) commented:

@microsoft-github-policy-service agree company="Microsoft"

