
ClusterMesh scale: Phase 3 — scale tiers + parallel CL2 fan-out#1168

Draft
skosuri1 wants to merge 26 commits into skosuri/clustermesh-scale from skosuri/clustermesh-scale-2

Conversation


@skosuri1 skosuri1 commented May 6, 2026

Stacked on top of #1157 (skosuri/clustermesh-scale). Do not merge until #1157 has merged; review/merge order matters.

This PR continues the ClusterMesh scale-test scenario with Phase 3 work — moving from harness validation (2 small clusters) to real scale measurement across cluster-count tiers.

Phase 3 Deliverables

  • 20-node baseline cluster size (spec line 24). Current clusters are 3 nodes (2 default + 1 prompool) — sized for harness validation, not real scale measurement.
  • Cluster-count tiers: add azure-5.tfvars, azure-10.tfvars, and azure-20.tfvars, plus corresponding pipeline matrix entries. For each tier: validate quota, validate the peering count (N·(N-1) in separate-VNet mode — 380 at N=20), tune CL2 timeouts, and document breaking points.
  • Parallel CL2 fan-out: replace the sequential per-cluster CL2 runs with bounded concurrency (default 4). This requires async wrapping of utils.run_cl2_command (currently synchronous, modules/python/clusterloader2/utils.py:66-72) and confirming the AzDO agent has CPU/RAM headroom for N concurrent CL2 runs plus Prometheus.
  • etcd PodMonitor capacity check at 20 clusters: 28 watchers per cluster × 20 = 560 watchers; verify Prom scrape budget holds.
  • Scaling-curve dashboards from cluster-attributed results (Kusto).
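The tier-validation arithmetic above (peerings and watcher budget) is easy to encode as a pre-flight check. A minimal sketch — function names are illustrative, not from the repo:

```python
def expected_peerings(n: int) -> int:
    """Directional VNet peerings in separate-VNet mode: each of the
    n clusters peers with every other one, giving n * (n - 1)."""
    return n * (n - 1)


def expected_watchers(n: int, watchers_per_cluster: int = 28) -> int:
    """etcd PodMonitor watchers the Prometheus scrape budget must sustain
    (28 per cluster, per the deliverable above)."""
    return n * watchers_per_cluster


if __name__ == "__main__":
    for tier in (5, 10, 20):
        print(tier, expected_peerings(tier), expected_watchers(tier))
```

At N=20 this reproduces the numbers in the PR description: 380 peerings and 560 watchers.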
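For the parallel fan-out, one way to bound concurrency without rewriting the synchronous CL2 entry point is an asyncio semaphore plus `asyncio.to_thread`. A sketch under assumptions — `run_cl2_command` here is a stand-in for the real blocking call in `modules/python/clusterloader2/utils.py`, and the default of 4 mirrors the deliverable above:

```python
import asyncio
import time


def run_cl2_command(cluster: str) -> str:
    # Hypothetical stand-in for the real synchronous CL2 invocation;
    # it blocks for the duration of the CL2 run.
    time.sleep(0.05)
    return f"{cluster}: ok"


async def fan_out(clusters: list[str], max_parallel: int = 4) -> list[str]:
    """Run CL2 against each cluster with at most max_parallel in flight."""
    sem = asyncio.Semaphore(max_parallel)

    async def run_one(cluster: str) -> str:
        async with sem:
            # to_thread leaves the sync function untouched while freeing
            # the event loop to start runs on other clusters.
            return await asyncio.to_thread(run_cl2_command, cluster)

    # gather preserves input order, so results line up with clusters.
    return await asyncio.gather(*(run_one(c) for c in clusters))


if __name__ == "__main__":
    results = asyncio.run(fan_out([f"cluster-{i}" for i in range(10)]))
    print(len(results))
```

The semaphore gives the bounded concurrency; the agent-headroom question (CPU/RAM for N concurrent CL2 runs plus Prometheus) still has to be checked empirically.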

Out of Scope (deferred to later phases / pre-merge of #1157)

skosuri and others added 26 commits May 6, 2026 13:59
…idn't fix root cause); fix n5 condition syntax
… referenced it but variables.tf didn't declare)
