
Add Cilium ClusterMesh scale-test scenario #1157

Draft

skosuri1 wants to merge 46 commits into main from skosuri/clustermesh-scale

Conversation

skosuri1 commented Apr 28, 2026

Draft PR for the Cilium ClusterMesh scale-test scenario on AKS. Phase 1 + Phase 2 validated end-to-end on consecutive green runs; pre-merge cleanup pending (see end).

What this adds

A new perf-eval scenario, clustermesh-scale, that scale-tests Cilium ClusterMesh on AKS end-to-end through Telescope.

The scenario provisions N AKS clusters in separate VNets (with peering), joins them into a Cilium ClusterMesh via Azure Fleet Manager's ClusterMeshProfile, runs a ClusterLoader2 workload on every cluster, and aggregates per-cluster results into a single JSONL keyed by source/target cluster.

Reference: #1053 (CNL) — same structural patterns, but #1053 is single-cluster. This PR introduces the multi-cluster execution + aggregation path.
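To make the aggregation step concrete, here is a minimal sketch of the idea (field and helper names are illustrative assumptions; the real logic lives in scale.py and the collect steps):

```python
import json
from pathlib import Path

def aggregate_reports(report_dir: Path, out_file: Path) -> None:
    """Merge per-cluster JSONL reports into one JSONL, tagging each record
    with the cluster it came from. Field names are illustrative only."""
    with out_file.open("w") as out:
        for cluster_dir in sorted(p for p in report_dir.iterdir() if p.is_dir()):
            for jsonl in sorted(cluster_dir.glob("*.jsonl")):
                for line in jsonl.read_text().splitlines():
                    if not line.strip():
                        continue
                    record = json.loads(line)
                    record["source_cluster"] = cluster_dir.name  # e.g. mesh-1, mesh-2
                    out.write(json.dumps(record) + "\n")

# Usage (hypothetical layout):
# aggregate_reports(Path("report"), Path("report/aggregated.jsonl"))
```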

Work is split into phases:

  • Phase 1: the vertical slice with all wiring (terraform modules, Python harness, topology/engine YAMLs, pipeline YAML, cross-cluster data-path smoke test) at N=2.
  • Phase 2: full observability (Cilium + clustermesh-apiserver + etcd + logs per spec testing.txt lines 34–38, 131–135) and the first real scenario (Cross-Cluster Event Throughput, scenario #1 in the spec).
  • Phase 3: runs at 5 / 10 / 20 clusters, bumps the baseline cluster size to 20 nodes (spec line 24), and builds Kusto dashboards.
  • Phase 4: the remaining six scenarios.
  • Phase 5: polish, tests, and the official pipeline request.

Vertical-slice ordering is intentional: every later phase reuses Phase 1's plumbing, so debugging at N=2 is far cheaper than at N=20.

Phase 1 contents:

  • New terraform submodules fleet/ and vnet-peering/, plus a --pod-subnet-id flag in aks-cli/.
  • Scenario tfvars under scenarios/perf-eval/clustermesh-scale/ with a vendored fleet-2.0.4 wheel.
  • A per-cluster prompool extra_node_pool (Standard_D8s_v3 × 1, label prometheus=true) for Prometheus isolation, mirroring the prompool pattern in cnl-azurecni-overlay-cilium.
  • --max-pods=110 on the default pool to fit the 200-pod workload (Phase 2).
  • A single-cluster scale.py harness with --cluster-name for per-cluster JSONL attribution.
  • A PodMonitor scraping ports 9963 (etcd) and 9964 (kvstoremesh) on the clustermesh-apiserver pod.
  • Topology and engine YAMLs that discover clusters via Azure resource tags, fan CL2 out sequentially across clusters (parallel fan-out is Phase 3), and aggregate the per-cluster JSONLs (see the sketch below).
  • A cross-cluster data-path smoke test (global Service curl from mesh-2 to mesh-1) gating CL2, so we don't ship a "green" Phase 1 with the control plane healthy but the data plane broken.
  • A conditional Fleet-wheel install in setup-tests.yml, gated on the scenario name (no impact on other scenarios).
  • A pipeline YAML in pipelines/perf-eval/Network Benchmark/clustermesh-scale.yml, running weekly in eastus2euap with manual triggers always available.
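The sequential fan-out amounts to something like the following sketch (the subcommand name and kubeconfig layout are assumptions for illustration; only --cluster-name is the actual flag added in this PR):

```python
import subprocess

def run_cl2_sequentially(clusters: list[str], kubeconfig_dir: str) -> None:
    """Run the CL2 harness once per cluster, one cluster at a time.
    Parallel fan-out with bounded concurrency is deferred to Phase 3."""
    for cluster in clusters:
        subprocess.run(
            [
                "python3", "scale.py", "execute",          # subcommand name is an assumption
                "--cluster-name", cluster,                 # real flag: per-cluster JSONL attribution
                "--kubeconfig", f"{kubeconfig_dir}/{cluster}.yaml",  # layout is an assumption
            ],
            check=True,  # fail fast so a broken cluster surfaces immediately
        )

# Clusters are discovered from Azure resource tags by the topology YAML;
# hard-coded here only for illustration:
# run_cl2_sequentially(["mesh-1", "mesh-2"], "/tmp/kubeconfigs")
```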

External prerequisite: the pipeline service principal needs Microsoft.Authorization/roleAssignments/write at subscription scope, granted as Role Based Access Control Administrator with an ABAC condition restricting assignments to the Network Contributor role GUID. The Cilium-managed clustermesh-apiserver provisions an internal load balancer in each cluster's VNet, and the pipeline needs Network Contributor on that VNet to mutate it.
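A hedged example of what granting that prerequisite can look like, driven through the az CLI from Python (the object ID, subscription, role-definition GUID, and condition string are placeholders/assumptions, not values from this PR):

```python
# Illustrative only: grant the pipeline SP "Role Based Access Control
# Administrator" at subscription scope, with an ABAC condition limiting
# which role definitions it may assign.
import subprocess

SP_OBJECT_ID = "<pipeline-sp-object-id>"
SUBSCRIPTION_ID = "<subscription-id>"
NETWORK_CONTRIBUTOR_ROLE_GUID = "<network-contributor-role-definition-guid>"

condition = (
    "((!(ActionMatches{'Microsoft.Authorization/roleAssignments/write'})) OR "
    "(@Request[Microsoft.Authorization/roleAssignments:RoleDefinitionId] "
    f"ForAnyOfAnyValues:GuidEquals {{{NETWORK_CONTRIBUTOR_ROLE_GUID}}}))"
)

subprocess.run(
    [
        "az", "role", "assignment", "create",
        "--assignee-object-id", SP_OBJECT_ID,
        "--assignee-principal-type", "ServicePrincipal",
        "--role", "Role Based Access Control Administrator",
        "--scope", f"/subscriptions/{SUBSCRIPTION_ID}",
        "--condition", condition,
        "--condition-version", "2.0",
    ],
    check=True,
)
```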

Phase 2 contents:

  • Scale-scenario #1 (Cross-Cluster Event Throughput): a new event-throughput.yaml CL2 config and supporting modules that deploy N namespaces × M deployments × R replicas (default 5/4/10 = 200 pods) across the mesh, then exercise create / warmup / rolling-restart burst / settle / delete to drive measurable kvstore traffic.
  • Reused cilium.yaml and control-plane.yaml from PR #1053 (cilium agent/operator CPU/mem, container restarts, API responsiveness, pod startup latency).
  • New clustermesh-metrics.yaml for always-on mesh measurements: remote clusters connected, remote cluster failures, kvstore events rate (aggregate and per-type by scope label: Identities/Services/Endpoints), kvstore operation duration p50/p90/p99, watch queue depth, identity count.
  • New clustermesh-throughput.yaml for scenario-specific measurements: event backlog rate, global services count, kvstore op-duration p95 split.
  • New etcd-metrics.yaml covering the embedded etcd inside clustermesh-apiserver: watch count, slow watchers, pending events, MVCC keys, compaction keys and duration, backend write latency. Sourced via the existing PodMonitor on port 9963, so no new scrape target is needed and all of spec testing.txt line 34 (Cilium / clustermesh-apiserver / etcd) is covered with one PodMonitor.
  • Pod logs (clustermesh-apiserver × 3 containers, cilium-agent, cilium-operator) archived to $report_dir/logs per spec line 35.
  • Network bytes per component (Tx/Rx) added to cilium.yaml per spec line 38.
  • A junit-aware success gate in execute.yml that distinguishes CL2 logic failures from infra failures (see the sketch below).
  • Python unit tests and mock-data fixtures covering single-cluster, multi-cluster aggregation, and failure paths.
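A rough sketch of the junit-aware gate idea (the real gate is implemented in execute.yml; the report layout and exit-code convention below are assumptions):

```python
# 0 = success, 1 = CL2 logic failure, 2 = infra failure (no results at all).
import sys
import xml.etree.ElementTree as ET
from pathlib import Path

def gate(report_dir: Path) -> int:
    junit_files = list(report_dir.glob("**/junit.xml"))
    if not junit_files:
        print("infra failure: CL2 produced no junit.xml")
        return 2
    failures = 0
    for junit in junit_files:
        root = ET.parse(junit).getroot()
        for suite in root.iter("testsuite"):
            failures += int(suite.get("failures", 0)) + int(suite.get("errors", 0))
    if failures:
        print(f"CL2 logic failure: {failures} failing test cases")
        return 1
    print("success")
    return 0

if __name__ == "__main__":
    sys.exit(gate(Path(sys.argv[1])))
```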

Fleet members with labels and ClusterMeshProfile have no native Terraform support today, so both go through terraform_data + local-exec az calls, following the aks-cli/main.tf precedent. The az fleet 2.0.4 extension also exposes no detach/remove-member API, so destroying a clustermesh hits a chicken-and-egg problem: member delete is rejected while the member is in any profile, and clustermeshprofile delete is rejected while members exist. The destroy provisioner on terraform_data.clustermeshprofile breaks this by relabeling members off the profile selector (az fleet member update --labels REPLACES the labels map, dropping mesh=true), re-applying the profile, polling list-members until the applied set drains to 0 (10-minute budget, with a periodic re-apply nudge in case the first apply was a no-op), and finally deleting the profile with a 30×5s backstop retry. Tested across multiple consecutive runs; the post-timeout backstop covers cases where the Fleet RP's list-members view lags the actual deletable state.
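For readers who want the drain-and-delete flow at a glance, here is an illustrative Python rendering of that provisioner logic (the actual implementation is a local-exec script; command and argument shapes for the vendored fleet 2.0.4 extension are approximations, only the flow mirrors the PR):

```python
import json
import subprocess
import time

def az(*args: str) -> str:
    return subprocess.run(["az", *args], check=True,
                          capture_output=True, text=True).stdout

def destroy_clustermesh_profile(group: str, fleet: str, profile: str,
                                members: list[str]) -> None:
    # 1. Relabel members off the profile selector. `az fleet member update
    #    --labels` REPLACES the labels map, so omitting mesh=true drops it.
    for member in members:
        az("fleet", "member", "update", "-g", group, "--fleet-name", fleet,
           "-n", member, "--labels", "scenario=clustermesh-scale")

    # 2. Re-apply the profile, then poll until its applied member set drains
    #    to 0 (10-minute budget). The real script also re-applies periodically
    #    in case the first apply was a no-op.
    deadline = time.time() + 600
    while time.time() < deadline:
        applied = json.loads(az("fleet", "clustermeshprofile", "list-members",
                                "-g", group, "--fleet-name", fleet, "-n", profile))
        if not applied:
            break
        time.sleep(15)

    # 3. Delete the profile, with a 30 x 5s backstop retry to cover Fleet RP
    #    view lag after the timeout.
    for _ in range(30):
        try:
            az("fleet", "clustermeshprofile", "delete", "-g", group,
               "--fleet-name", fleet, "-n", profile, "--yes")
            return
        except subprocess.CalledProcessError:
            time.sleep(5)
```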

Known limitations / deferred to Phase 3+:

  • Cross-cluster propagation latency (spec line 54) is approximated by the p99 of cilium_kvstoremesh_kvstore_operations_duration_seconds_bucket. A synthetic probe (pod created on A → visible on B at time T) is not implemented because CL2's per-cluster execution model doesn't natively support cross-cluster timing. See the query sketch after this list.
  • Etcd Compaction Duration histogram returns "no samples" on short runs because etcd is configured with --auto-compaction-retention=1h; the metric is wired and will populate during Phase 3 long runs.
  • Per-cluster CL2 runs sequentially, not in parallel. Parallel fan-out (with bounded concurrency) is deferred to Phase 3 — at N=20 with per-cluster Prometheus, the AzDO agent would be CPU/RAM bound.
  • Cluster size is 3 nodes (2 default + 1 prompool) for Phase 1/2 harness validation; Phase 3 bumps to the spec's 20-node baseline (line 24).
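As referenced in the first bullet above, a hedged sketch of the p99 approximation queried through the Prometheus HTTP API (the Prometheus endpoint and 5m rate window are assumptions):

```python
import requests

PROMETHEUS = "http://localhost:9090"
QUERY = (
    "histogram_quantile(0.99, sum by (le) ("
    "rate(cilium_kvstoremesh_kvstore_operations_duration_seconds_bucket[5m])))"
)

resp = requests.get(f"{PROMETHEUS}/api/v1/query",
                    params={"query": QUERY}, timeout=30)
resp.raise_for_status()
for sample in resp.json()["data"]["result"]:
    print(sample["metric"], sample["value"])
```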

Known infra flakes (accepted, not fixed):

  • AKS RP race between --enable-acns extension reconcile and prompool add → OperationNotAllowed: PutExtensionAddonHandler.PUT in progress. Recurs ~1 in 5 runs; AzDO RetryHelper absorbs in 2-3 attempts.
  • Azure VMSS LB-sync 412 PreconditionFailed on the clustermesh-apiserver Service (concurrent VMSS modification race). Rare; observed once, resolved by next build.

Changes:

ClusterMesh / Fleet / VNet peering setup

  • modules/terraform/azure/fleet/{main.tf, variables.tf, outputs.tf, versions.tf}
  • modules/terraform/azure/vnet-peering/{main.tf, variables.tf, outputs.tf}
  • scenarios/perf-eval/clustermesh-scale/terraform-inputs/azure-2.tfvars
  • modules/python/clusterloader2/clustermesh-scale/config/modules/clustermesh.yaml
  • modules/python/clusterloader2/clustermesh-scale/config/modules/clustermesh/podmonitor.yaml

Terraform / AKS provisioning

  • modules/terraform/azure/main.tf
  • modules/terraform/azure/variables.tf
  • modules/terraform/azure/aks-cli/main.tf
  • scenarios/perf-eval/clustermesh-scale/terraform-test-inputs/azure-2.json

Mesh validation + cross-cluster data path

  • steps/topology/clustermesh-scale/validate-resources.yml

Telescope pipeline / step conventions

  • pipelines/perf-eval/Network Benchmark/clustermesh-scale.yml
  • pipelines/system/new-pipeline-test.yml (dev-only — will revert before merge)
  • jobs/competitive-test.yml (added skip_publish parameter for dev runs)
  • steps/setup-tests.yml (only edit to a shared file)
  • steps/topology/clustermesh-scale/{execute-clusterloader2.yml, collect-clusterloader2.yml}
  • steps/engine/clusterloader2/clustermesh-scale/{execute.yml, collect.yml}

ClusterLoader2 harness + multi-cluster aggregation

  • modules/python/clusterloader2/clustermesh-scale/scale.py
  • modules/python/clusterloader2/clustermesh-scale/__init__.py
  • modules/python/clusterloader2/utils.py (added prometheus_memory_request CLI flag)
  • modules/python/clusterloader2/clustermesh-scale/config/config.yaml
  • modules/python/clusterloader2/clustermesh-scale/config/event-throughput.yaml
  • modules/python/clusterloader2/clustermesh-scale/config/modules/scale-test.yaml
  • modules/python/clusterloader2/clustermesh-scale/config/modules/scale-test-deployment.yaml
  • modules/python/clusterloader2/clustermesh-scale/config/modules/event-throughput-{deployment,service,workload}.yaml

CL2 measurement modules (Phase 2)

  • modules/python/clusterloader2/clustermesh-scale/config/modules/measurements/cilium.yaml
  • modules/python/clusterloader2/clustermesh-scale/config/modules/measurements/control-plane.yaml
  • modules/python/clusterloader2/clustermesh-scale/config/modules/measurements/clustermesh-metrics.yaml
  • modules/python/clusterloader2/clustermesh-scale/config/modules/measurements/clustermesh-throughput.yaml
  • modules/python/clusterloader2/clustermesh-scale/config/modules/measurements/etcd-metrics.yaml

Tests

  • modules/python/tests/test_clustermesh_scale.py
  • modules/python/tests/mock_data/clustermesh-scale/report/{mesh-1,mesh-2,mesh-fail}/{junit.xml, GenericPrometheusQuery_*.json}

Vendored binary

  • scenarios/perf-eval/clustermesh-scale/vendor/fleet-2.0.4-py3-none-any.whl

Pre-merge cleanup pending:

  • Strip DEBUG-DUMP block from steps/topology/clustermesh-scale/validate-resources.yml
  • Revert pipelines/system/new-pipeline-test.yml to its original placeholder

skosuri1 (Author) commented:

@microsoft-github-policy-service agree company="Microsoft"

