feat(fluxcd): add flux-shard-operator for tenant helm-controller sharding#2821
feat(fluxcd): add flux-shard-operator for tenant helm-controller sharding#2821Andrei Kvapil (kvaps) wants to merge 17 commits into
Conversation
Tenant-granular placement for sharding the management-cluster tenant helm-controller: tenant key extraction from HelmRelease metadata, greedy least-loaded assignment weighted by HelmRelease count (with the N-tenants-over-N-shards 1-per-shard guarantee), minimal-movement rescale, threshold-gated rebalance with per-tenant cooldown hook, pinned tenants, and the autosizing recommendation formula. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <andrei.kvapil@aenix.io>
Three leader-elected/served parts on metadata-only watches, so the operator never decodes the helm-controller status-patch firehose: - placement controller: singleton full-sync reconciler that rebuilds the tenant index from HelmRelease labels (restart-safe), records the assignment on the tenant namespace, self-heals HR labels and paces reassignments so a backfill never floods a target shard; - shard provisioner: reconciles helm-controller-shard<i> Deployments cloned from the flux-aio helm-controller container and sanitised (no host networking, no localhost cross-container wiring, per-shard watch-label-selector), prunes drained extra shards and retires the legacy flux-tenants Deployment once the tenants bucket is empty; - CREATE-only mutating webhook: stamps the tenant's recorded shard on newborn HelmReleases so each one is born on the correct shard; every miss is permissive and falls back to the placement controller. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <andrei.kvapil@aenix.io>
Manager wiring modeled on lineage-controller-webhook: leader-elected controllers, webhook served on all replicas, metadata-only caches for HelmReleases (shard-key label selector) and namespaces, Deployments cached from the flux namespace only. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <andrei.kvapil@aenix.io>
New cozy-fluxcd system package (Deployment with 2 replicas, webhook cert via cert-manager, CREATE-only MutatingWebhookConfiguration with failurePolicy Ignore scoped to HelmReleases born on the legacy tenants bucket), registered in the platform system bundle and the root image build. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <andrei.kvapil@aenix.io>
The installer no longer ships the hand-rolled tenant helm-controller shard: flux-shard-operator provisions operator-managed shard Deployments cloned from flux-aio at runtime instead, and deletes a running flux-tenants Deployment only after the legacy tenants bucket drains. flux-aio itself stays unchanged and keeps owning everything without a shard key; the Makefile lines that synced the image/version into the removed manifest go away with it. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <andrei.kvapil@aenix.io>
Defensively removes a leftover flux-tenants Deployment only when no HelmRelease carries sharding.fluxcd.io/key=tenants anymore (the flux-shard-operator backfill owns the live drain), and stamps cozystack-version 45 mirroring migration 43's labeled manifest. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <andrei.kvapil@aenix.io>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (5)
🚧 Files skipped from review as they are similar to previous changes (4)
📝 WalkthroughWalkthroughAdds a new flux-shard-operator: placement and provisioner controllers, CREATE-only mutating webhook, operator entrypoint, Helm chart and packaging, tests, and a migration that conditionally retires the legacy flux-tenants Deployment when drained. ChangesFlux Shard Operator
Sequence DiagramsequenceDiagram
participant Client as Kubernetes API
participant Manager as controller-runtime Manager
participant Placement as PlacementReconciler
participant Provisioner as ShardSetReconciler
participant Webhook as ShardWebhook
Client->>Manager: Start watches (HelmReleases, Namespaces, Deployments)
Client->>Placement: HelmRelease/Namespace change event
Placement->>Client: List HelmRelease metadata
Placement->>Client: List Namespaces (metadata)
Placement->>Client: List managed shard Deployments
Placement->>Placement: ComputePlacement with rebalance
Placement->>Client: Patch Namespace shard label
Placement->>Client: Patch HelmRelease shard labels
Client->>Provisioner: Flux source Deployment change
Provisioner->>Client: Get source Deployment
Provisioner->>Provisioner: Compute effective shard count
Provisioner->>Client: Server-side apply shard Deployments
Provisioner->>Client: Delete drained extra shards
Provisioner->>Client: Delete legacy flux-tenants when drained
Client->>Webhook: Admission CREATE HelmRelease
Webhook->>Client: Get tenant Namespace
Webhook->>Webhook: Resolve tenant shard
Webhook-->>Client: JSON Patch to stamp ShardKeyLabel
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
…s on shard readiness E2E caught the cloned shard crashlooping on Talos: the installer injects the node-local KubePrism endpoint (KUBERNETES_SERVICE_HOST=localhost, KUBERNETES_SERVICE_PORT=7445) into hostNetwork workloads, and the clone carried it into a non-hostNetwork pod where localhost is dead. Sanitise those env vars away so shards use the in-cluster defaults. Defense in depth for the same failure mode: the placement controller now defers reassignments whose target shard Deployment has no ready replica, so a backfill can never hand tenants to a helm-controller that is not actually running, and reacts to shard readiness changes directly. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <andrei.kvapil@aenix.io>
--shard-count now accepts "auto" (the new default) or an explicit positive integer. Auto mode drives the shard count from the tenant HelmRelease count — K = clamp(ceil(H/100), 1, min(16, T)) — applied with hysteresis anchored on the currently provisioned shards (scale up eagerly past 120 HR/shard, scale down lazily under 60 HR/shard) so K does not flap around sizing boundaries. The per-shard target is 100 HelmReleases (down from the 150 in the design doc) to leave headroom for existing tenants growing their HelmRelease count without an immediate reshard. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <andrei.kvapil@aenix.io>
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request implements a sharding architecture for Flux HelmReleases to enhance cluster stability and resource isolation. By distributing tenant workloads across multiple managed helm-controller shards, the system mitigates the impact of noisy tenants. The solution includes a dedicated operator that provisions shard deployments, handles tenant-to-shard placement, and uses a mutating webhook to ensure consistent shard assignment for all new HelmReleases. Highlights
New Features🧠 You can now enable Memory (public preview) to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on Gemini (@gemini-code-assist) comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces the flux-shard-operator to horizontally shard tenant HelmReleases across multiple helm-controller instances, isolating noisy tenants and reducing cache-lag blast radius. It includes the operator implementation, a placement controller, a mutating webhook, and integration manifests for Cozystack. The reviewer's feedback focuses on enhancing the operator's security posture by adding pod- and container-level security contexts, correcting the default --metrics-bind-address flag to properly disable metrics when empty, simplifying nested template blocks in the workload deployment, and fixing a minor formatting issue in the README.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| labels: | ||
| app: flux-shard-operator | ||
| spec: | ||
| serviceAccountName: flux-shard-operator |
There was a problem hiding this comment.
To improve security, it's best practice to define a securityContext at the pod level to run as a non-root user and set a specific fsGroup. This restricts pod permissions and ensures files created in volumes have defined ownership, adhering to the principle of least privilege.
serviceAccountName: flux-shard-operator
securityContext:
runAsNonRoot: true
runAsUser: 65534
fsGroup: 65534
seccompProfile:
type: RuntimeDefault| app: flux-shard-operator | ||
| containers: | ||
| - name: flux-shard-operator | ||
| image: "{{ .Values.fluxShardOperator.image }}" |
There was a problem hiding this comment.
For better security hardening, you should explicitly prevent privilege escalation and drop all capabilities for the container. This can be done by adding a securityContext to the container spec.
image: "{{ .Values.fluxShardOperator.image }}"
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL| flag.StringVar(&metricsAddr, "metrics-bind-address", "0", "The address the metrics endpoint binds to. "+ | ||
| "Use :8443 for HTTPS or :8080 for HTTP, or leave as 0 to disable the metrics service.") |
There was a problem hiding this comment.
The help text for --metrics-bind-address states that a value of "0" will disable the metrics service. However, with controller-runtime, a bind address of "0" causes the server to listen on a random port, not disable it. To disable the metrics server, an empty string "" should be used.
To align with the documented intent of being disabled by default, I suggest changing the default value to "" and updating the help text accordingly.
| flag.StringVar(&metricsAddr, "metrics-bind-address", "0", "The address the metrics endpoint binds to. "+ | |
| "Use :8443 for HTTPS or :8080 for HTTP, or leave as 0 to disable the metrics service.") | |
| flag.StringVar(&metricsAddr, "metrics-bind-address", "", "The address the metrics endpoint binds to. "+ | |
| "Use :8443 for HTTPS or :8080 for HTTP. Leave empty to disable the metrics service.") |
| | `fluxShardOperator.image` | Container image | `ghcr.io/cozystack/cozystack/flux-shard-operator:*` | | ||
| | `fluxShardOperator.debug` | Enable debug logging | `false` | | ||
| | `fluxShardOperator.replicas` | Operator replica count | `2` | | ||
| | `fluxShardOperator.shardCount` | Number of helm-controller shards: `auto` sizes from the HelmRelease count, an integer pins it | `auto` | |
There was a problem hiding this comment.
There's a minor formatting issue with trailing whitespace in the table, which can affect rendering on some platforms.
| | `fluxShardOperator.shardCount` | Number of helm-controller shards: `auto` sizes from the HelmRelease count, an integer pins it | `auto` | | |
| | `fluxShardOperator.shardCount` | Number of helm-controller shards: `auto` sizes from the HelmRelease count, an integer pins it | `auto` | |
| {{- with .Values.fluxShardOperator.shard.resources }} | ||
| {{- with .requests }} | ||
| {{- with .cpu }} | ||
| - --shard-cpu-request={{ . }} | ||
| {{- end }} | ||
| {{- with .memory }} | ||
| - --shard-memory-request={{ . }} | ||
| {{- end }} | ||
| {{- end }} | ||
| {{- with .limits }} | ||
| {{- with .cpu }} | ||
| - --shard-cpu-limit={{ . }} | ||
| {{- end }} | ||
| {{- with .memory }} | ||
| - --shard-memory-limit={{ . }} | ||
| {{- end }} | ||
| {{- end }} | ||
| {{- end }} |
There was a problem hiding this comment.
The nested with statements for passing shard resource arguments can be simplified for improved readability and maintainability. Using a more direct path for each value makes the template easier to understand.
{{- with .Values.fluxShardOperator.shard.resources.requests.cpu }}
- --shard-cpu-request={{ . }}
{{- end }}
{{- with .Values.fluxShardOperator.shard.resources.requests.memory }}
- --shard-memory-request={{ . }}
{{- end }}
{{- with .Values.fluxShardOperator.shard.resources.limits.cpu }}
- --shard-cpu-limit={{ . }}
{{- end }}
{{- with .Values.fluxShardOperator.shard.resources.limits.memory }}
- --shard-memory-limit={{ . }}
{{- end }}There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cmd/flux-shard-operator/main.go`:
- Line 191: The cache is currently filtering HelmReleases by
fso.HelmReleaseMeta(): {Label: labels.NewSelector().Add(*shardKeyExists)} which
drops CREATEs that never got the sharding.fluxcd.io/key (e.g. when
failurePolicy: Ignore); remove that label selector so the HelmRelease cache
isn't label-filtered (or replace with labels.Everything()/nil) and then update
the gather/list in internal/fluxshardoperator/controller.go to query
HelmReleases without a label selector (include unlabeled resources) so placement
fallback can see and stamp HelmReleases missing the shard key.
In `@internal/fluxshardoperator/controller.go`:
- Around line 294-299: The code updates r.lastMoved[tenantNS] and increments
movesCounter before calling r.stamp, so failures in r.stamp still mark the
tenant as moved and inflate metrics; move the updates so they occur only after
r.stamp returns nil: call r.stamp(ctx, v, target) first, handle/append any error
to errs if non-nil, and only when r.stamp succeeds then set
r.lastMoved[tenantNS] = r.now(), do movesCounter.Inc(), and call
logger.Info("reassigning tenant", ...) so metrics and cooldown reflect
successful patches.
In `@internal/fluxshardoperator/placement.go`:
- Around line 145-146: rebalance currently treats every tenant key in in.Pinned
as a valid pin and skips them, which preserves pins that reference non-existent
shards; update the rebalance logic (function rebalance in placement.go) to
consult the same validation used by ComputePlacement so only pins that point to
existing/valid shard IDs are honored — i.e., when iterating tenants in
in.Pinned, verify the pin maps to a known shard (the same validity check used in
ComputePlacement) and skip only those valid pins, allowing tenants with
invalid/non-existent shard pins to participate in rebalancing.
In `@internal/fluxshardoperator/provisioner.go`:
- Around line 301-303: The current assignment hc.Resources = cfg.ShardResources
replaces the entire resource block and drops inherited values; instead merge
per-field so empty/nil fields inherit cloned values: when applying
cfg.ShardResources, set hc.Resources.Requests = cfg.ShardResources.Requests only
if cfg.ShardResources.Requests != nil, and set hc.Resources.Limits =
cfg.ShardResources.Limits only if cfg.ShardResources.Limits != nil (preserving
existing hc.Resources fields otherwise); ensure hc.Resources is initialized
before merging and perform deep copies as needed to avoid shared mutable
structures (referencing hc.Resources and cfg.ShardResources).
In
`@packages/system/flux-shard-operator/templates/mutatingwebhookconfiguration.yaml`:
- Line 6: The templated annotation value for cert-manager.io/inject-ca-from is
unquoted and can break YAML linters; update the mutatingwebhookconfiguration
template so the annotation key cert-manager.io/inject-ca-from has its templated
value quoted (for example wrap {{ .Release.Namespace }}/flux-shard-operator in
double quotes) to ensure YAML tooling and parsers accept the file while keeping
the same template expression.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 995b8bc9-6ed1-40b9-a317-c11e38d81d9b
📒 Files selected for processing (36)
Makefilecmd/flux-shard-operator/main.gogo.modinternal/fluxinstall/manifests/fluxcd-tenants.yamlinternal/fluxshardoperator/config.gointernal/fluxshardoperator/controller.gointernal/fluxshardoperator/gvk.gointernal/fluxshardoperator/metrics.gointernal/fluxshardoperator/placement.gointernal/fluxshardoperator/placement_test.gointernal/fluxshardoperator/provisioner.gointernal/fluxshardoperator/provisioner_test.gointernal/fluxshardoperator/tenant.gointernal/fluxshardoperator/tenant_test.gointernal/fluxshardoperator/webhook.gointernal/fluxshardoperator/webhook_test.gopackages/core/flux-aio/Makefilepackages/core/platform/images/migrations/migrations/44packages/core/platform/sources/flux-shard-operator.yamlpackages/core/platform/templates/bundles/system.yamlpackages/core/platform/values.yamlpackages/system/flux-shard-operator/Chart.yamlpackages/system/flux-shard-operator/Makefilepackages/system/flux-shard-operator/README.mdpackages/system/flux-shard-operator/images/flux-shard-operator/Dockerfilepackages/system/flux-shard-operator/templates/certmanager.yamlpackages/system/flux-shard-operator/templates/mutatingwebhookconfiguration.yamlpackages/system/flux-shard-operator/templates/pdb.yamlpackages/system/flux-shard-operator/templates/rbac-bind.yamlpackages/system/flux-shard-operator/templates/rbac.yamlpackages/system/flux-shard-operator/templates/sa.yamlpackages/system/flux-shard-operator/templates/service.yamlpackages/system/flux-shard-operator/templates/workload.yamlpackages/system/flux-shard-operator/tests/webhook_test.yamlpackages/system/flux-shard-operator/tests/workload_test.yamlpackages/system/flux-shard-operator/values.yaml
💤 Files with no reviewable changes (2)
- internal/fluxinstall/manifests/fluxcd-tenants.yaml
- packages/core/flux-aio/Makefile
| // Metadata-only stubs: the operator never decodes HelmRelease | ||
| // specs/statuses, and the cache only holds tenant HRs (the | ||
| // shard key label exists on every one of them). | ||
| fso.HelmReleaseMeta(): {Label: labels.NewSelector().Add(*shardKeyExists)}, |
There was a problem hiding this comment.
Do not label-filter HelmRelease cache if webhook misses must be backfilled.
With failurePolicy: Ignore, CREATE requests can persist without sharding.fluxcd.io/key. This cache selector drops those HelmReleases, so placement fallback cannot see or stamp them.
Suggested fix
- shardKeyExists, err := labels.NewRequirement(fso.ShardKeyLabel, selection.Exists, nil)
- if err != nil {
- setupLog.Error(err, "building HelmRelease label selector")
- os.Exit(1)
- }
...
- fso.HelmReleaseMeta(): {Label: labels.NewSelector().Add(*shardKeyExists)},
+ fso.HelmReleaseMeta(): {},Also align internal/fluxshardoperator/controller.go gather list call to include unlabeled HelmReleases:
- if err := r.List(ctx, hrs, client.HasLabels{ShardKeyLabel}); err != nil {
+ if err := r.List(ctx, hrs); err != nil {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@cmd/flux-shard-operator/main.go` at line 191, The cache is currently
filtering HelmReleases by fso.HelmReleaseMeta(): {Label:
labels.NewSelector().Add(*shardKeyExists)} which drops CREATEs that never got
the sharding.fluxcd.io/key (e.g. when failurePolicy: Ignore); remove that label
selector so the HelmRelease cache isn't label-filtered (or replace with
labels.Everything()/nil) and then update the gather/list in
internal/fluxshardoperator/controller.go to query HelmReleases without a label
selector (include unlabeled resources) so placement fallback can see and stamp
HelmReleases missing the shard key.
myasnikovdaniil
left a comment
There was a problem hiding this comment.
Reviewed against the #2817 design doc. The two blocking findings from the design review are properly fixed here: the scoping is now sharding.fluxcd.io/key-based (cache selector key-exists, webhook objectSelector key In [tenants]), TenantNamespaceForHR exactly mirrors the chart's tenant.name helper including nested tenants, and the migration can't deadlock or orphan on the helm path (backfill covers all legacy HRs, retireLegacy + migration 44 share the same drain guard). The KUBERNETES_SERVICE_HOST/PORT strip in the sanitiser is a great catch, and the placement/provisioner/webhook unit coverage is solid.
Requesting changes for one functional bug and two rollout concerns (inline):
- apps API UPDATE path reverts the shard label — see comment on
webhook.go. This reintroduces the orphaning class the drain guard exists to prevent, on the most common write path in the product. shardCount: autodefault skips the staged rollout the design doc prescribed — see comment onvalues.yaml.- Fresh-install bootstrap gap (no inline anchor since the file is deleted): with
internal/fluxinstall/manifests/fluxcd-tenants.yamlgone, new installs have no tenant helm-controller until cert-manager → flux-shard-operator → shard0 cloned → shard0 Ready (first assignment is gated onreadyShards[target]inapply()). What used to be one installer-applied static manifest is now a 4-link runtime chain on the tenant plane's critical path, and a failure anywhere in it (cert-manager webhook, operator image pull, clone source) stalls every tenant HelmRelease on a fresh cluster. The upgrade path is handled nicely (flux-tenants keeps running until drained) — first installs need at least a documented story, ideally a fallback.
Two non-blocking notes:
- The design doc lists capping
remediation.retriesas a prerequisite rollout step, andpkg/registry/apps/application/rest.gostill hardcodesRetries: -1in the conversion. Fine if it lands separately, but the storm source this design contains is still live — would be good to link the follow-up. - Minor asymmetry: the provisioner sizes auto-K from all key-labeled HRs while placement sums only tenant-attributable ones, so placement-K ≤ provisioner-K. The error direction is safe (worst case an idle shard), just noting it's intentional-looking but undocumented.
| // helm-controller status patches are a firehose, so UPDATE is never | ||
| // intercepted, and a webhook outage must degrade to the catch-all path (the | ||
| // HelmRelease keeps its legacy "tenants" key until the placement controller | ||
| // relabels it) instead of blocking creation. |
There was a problem hiding this comment.
CREATE-only leaves a label-reversion hole on the apps API UPDATE path.
pkg/registry/apps/application/rest.go Update flow rebuilds the HelmRelease from the Application: convertApplicationToHelmRelease sets labels from prefixed user labels only (rest.go:1546) — the live sharding.fluxcd.io/key=shard<i> is dropped — then mergeMaps(r.releaseConfig.Labels, ...) re-stamps key: tenants from the ApplicationDefinition, and r.c.Update writes the full object.
Result: every app update through the dashboard/API bounces the HR off its shard back to tenants — synthetic informer DELETE on the owning shard, unowned until the placement controller relabels, and the just-submitted spec change waits for that relabel before any controller reconciles it. Worse: if the operator is down after flux-tenants retirement, every updated app strands on key=tenants with no reconciler at all — the exact orphaning class retireLegacy's drain guard prevents on the migration path.
Two possible fixes:
- (a) preserve the live
sharding.fluxcd.io/keyin the REST update handler (it's in-repo; cleanest), or - (b) add
UPDATEto this webhook with the sameobjectSelector— the status-patch-firehose argument doesn't apply: status patches on correctly-sharded HRs matchkey In [tenants]on neither old nor new object, so they're never sent to the webhook. Only the bounded migration window (HRs still ontenants) would generate UPDATE calls, andfailurePolicy: Ignore+ a cheap handler makes that acceptable.
There was a problem hiding this comment.
Fixed via option (a): Update now fetches the live HelmRelease and carries sharding.fluxcd.io/key over when rebuilding the object from the Application, so the API update path can no longer bounce a HelmRelease off its shard. Pinned by TestUpdate_PreservesShardKeyLabel. The webhook stays CREATE-only; its doc comment now spells out why UPDATE interception is unnecessary: the REST handler preserves the label, and helm-controller applies child HelmReleases via server-side apply, which leaves labels owned by other field managers untouched.
| ## over. "auto" (default) sizes from the tenant HelmRelease count | ||
| ## (~100 HR per shard, capped by tenant count, with hysteresis); set a | ||
| ## positive integer to pin it explicitly, e.g. shardCount: 7. | ||
| shardCount: auto |
There was a problem hiding this comment.
auto as the day-one default skips the staged rollout the design doc prescribed.
PLAN.md (#2817): ship at shardCount: 1 (≈ today's behaviour) to de-risk the runtime swap first; v1 surfaces the autosizing recommendation (cozy_flux_shard_recommended_count exists for exactly this), v2 enforces once telemetry is trusted. This PR jumps straight to enforcement: the largest fleet cluster goes from 1 shard to ceil(946/100) = 10 immediately on upgrade, exercising the leave-selector relabel semantics at full scale on day one — and the e2e the design doc itself called a TODO before production reliance ("relabel never triggers uninstall") isn't in this PR either.
Suggest defaulting to 1 for the release that performs the swap, flipping to auto one release later (or after the e2e lands). The hysteresis/anchoring logic itself looks right.
There was a problem hiding this comment.
We're keeping auto as the default: a default of 1 ships the feature disabled for everyone, and turning it on would require either a defaults-changing release or a per-cluster override — exactly the config drift we want to avoid. On the specific risks:
- The mass relabel happens at any K: with
shardCount: 1every HelmRelease still movestenants→shard0, so the leave-selector path is exercised on day one regardless;autoonly changes the number of targets. - Leaving a watch selector cannot trigger an uninstall: helm-controller uninstalls only through its finalizer on actual object deletion; an HR leaving the selector merely drops out of that shard's informer. Label-based shard handoff is the upstream-supported mechanism — this operator follows the design of weaveworks/flux-shard-controller.
- Moves are paced (5 tenants / 15s) and only target shards with a ready replica, so the day-one backfill on a large fleet is a bounded trickle, not a burst.
- The observation phase is now real: the metrics endpoint is wired and scraped, so
cozy_flux_shard_recommended_countand per-shard load are visible whileautoacts.
Happy to add the "relabel never triggers uninstall" e2e as a follow-up to close the design-doc TODO.
|
|
||
| if kubectl get deployment flux-tenants -n cozy-fluxcd >/dev/null 2>&1; then | ||
| remaining=$(kubectl get helmreleases.helm.toolkit.fluxcd.io -A \ | ||
| -l sharding.fluxcd.io/key=tenants -o name | head -n 1) |
There was a problem hiding this comment.
kubectl ... -o name | head -n 1 under set -euo pipefail: when the output exceeds the pipe buffer (~64KB), head exiting early SIGPIPEs kubectl → exit 141 → the whole migration aborts. ~950 HRs ≈ 38KB of names, so the biggest known cluster is already at half the limit.
Safer without early-close, e.g.:
remaining=$(kubectl get helmreleases.helm.toolkit.fluxcd.io -A \
-l sharding.fluxcd.io/key=tenants -o name | wc -l)
if [ "$remaining" -eq 0 ]; thenThere was a problem hiding this comment.
Fixed — the drain check counts with wc -l, no early-close in the pipe.
| - name: flux-shard-operator | ||
| image: "{{ .Values.fluxShardOperator.image }}" | ||
| args: | ||
| - --leader-elect |
There was a problem hiding this comment.
The README advertises six cozy_flux_shard_* metrics, but the metrics endpoint is off as shipped: --metrics-bind-address defaults to "0" (disabled) in main.go and is never set here, and there's no scrape annotation/PodMonitor either. Either wire --metrics-bind-address + scraping, or drop the Telemetry section from the README until it's reachable.
There was a problem hiding this comment.
Wired: metrics are served plain-HTTP on :8080, the pod carries prometheus.io/scrape annotations, and the README Telemetry section documents the endpoint. (Side note: controller-runtime's "0" does disable the server — but the point stood, nothing was scraping it; now it's on by default.)
| app: flux-shard-operator | ||
| containers: | ||
| - name: flux-shard-operator | ||
| image: "{{ .Values.fluxShardOperator.image }}" |
There was a problem hiding this comment.
Nit: the operator container has no resources and no securityContext (scratch image ⇒ runs as UID 0). Worth matching the hardening of the cloned shard pods (runAsNonRoot, drop ALL, readOnlyRootFilesystem) and giving it modest requests.
There was a problem hiding this comment.
Done: runAsNonRoot (65534) + fsGroup for the serving cert, seccomp RuntimeDefault, dropped capabilities, read-only rootfs, and modest requests/limits.
| continue | ||
| } | ||
| budget-- | ||
| r.lastMoved[tenantNS] = r.now() |
There was a problem hiding this comment.
Nit: the move budget, lastMoved cooldown, and movesCounter are all consumed before stamp() runs; if the patch fails the move is retried next reconcile but counted twice and the cooldown timestamp refreshed. Harmless in practice — just noting it's a known asymmetry, or move the bookkeeping after a successful stamp().
There was a problem hiding this comment.
Done — cooldown/counter/log now run only after a successful stamp, so a failed patch neither delays the retry nor double-counts.
The Update handler rebuilds the HelmRelease from the Application, which re-stamped sharding.fluxcd.io/key with the ApplicationDefinition default and bounced the HelmRelease off its operator-assigned shard on every update. Carry the live label over from the current HelmRelease instead. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <andrei.kvapil@aenix.io>
The README advertised cozy_flux_shard_* metrics while the endpoint was disabled as shipped: serve them plain-HTTP on :8080 with scrape annotations so the autosizing recommendation is actually observable. Run the operator non-root with a read-only rootfs, dropped capabilities and modest resources, quote the inject-ca-from annotation for YAML tooling, and document the fresh-install bootstrap chain. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <andrei.kvapil@aenix.io>
head -n 1 closing the pipe early under pipefail aborts the migration once the HelmRelease name list outgrows the pipe buffer; count with wc -l instead. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <andrei.kvapil@aenix.io>
A failed patch no longer refreshes the rebalance cooldown or inflates the moves counter, so retries are not delayed and not double-counted. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <andrei.kvapil@aenix.io>
ComputePlacement ignores pins to non-existent shards, but rebalance treated any pinned tenant as sticky, so an invalid pin blocked expected moves. Honor only pins placement itself honors. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <andrei.kvapil@aenix.io>
Replacing the whole resources block dropped cloned values the override did not name (e.g. the inherited cpu limit when only a memory limit is configured), contradicting the documented inherit-when-empty semantics. Also document the intentional provisioner-vs-placement sizing asymmetry. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <andrei.kvapil@aenix.io>
|
All review items addressed — replies inline, summary below. Fresh-install bootstrap (item 3). We'd rather not reintroduce a static manifest fallback. The chain (cert-manager → operator → shard0 cloned → shard0 Ready) is the same install-order dependency class as other cert-manager-gated components we already install at runtime (capi-operator, capi-providers), and a parallel static shard would mean two sources of truth for the tenant helm-controller — plus the yq image/version sync this PR removes. Runtime-provisioned shards are also how the upstream weaveworks/flux-shard-controller works. Instead, the README now documents the bootstrap chain and how to debug each link; since assignments only target ready shards, a broken link surfaces as HelmReleases staying on Bot findings triaged. Fixed: rebalance no longer treats pins to non-existent shards as sticky; shard resources merge per-field instead of replacing the block; Follow-ups (separate PRs): cap |
|
I have not had a chance to review this in detail, but reviewing the summary details and the discussion, I think this looks good! The right analysis here. I will follow this PR for updates and watch the documentation when it merges, to look for gaps. |
Aleksei Sviridkin (lexfrei)
left a comment
There was a problem hiding this comment.
NOT LGTM — the cloned shard helm-controller pods inherit flux-aio's required pod anti-affinity, which leaves them unschedulable on single-node clusters and stalls the whole tenant plane there.
Business context: spread tenant HelmReleases across multiple helm-controller shards so one noisy tenant cannot degrade the others (design #2817).
I verified the previously-requested changes are addressed: the apps-API Update path now preserves sharding.fluxcd.io/key (rest.go, pinned by TestUpdate_PreservesShardKeyLabel); the fresh-install bootstrap chain is now documented in the chart README; migration 44 counts with wc -l (no SIGPIPE) and stamps the version ConfigMap with the no-delete label; metrics are wired and the pod is hardened. One new blocker below.
Blockers
B1: shard pods inherit flux-aio's required podAntiAffinity → unschedulable on single-node clusters
BuildShardDeployment deep-copies the flux-aio pod spec and sanitises hostNetwork, dnsPolicy, tolerations, volumes and init containers — but never clears affinity. The flux-aio flux Deployment carries a required pod anti-affinity against app.kubernetes.io/name=flux (internal/fluxinstall/manifests/fluxcd.yaml), so every helm-controller-shard<i> pod refuses to schedule onto the node already running the flux pod.
Evidence: the legacy flux-tenants Deployment this operator replaces had only nodeAffinity (os=linux) and no pod anti-affinity, so this is a regression, not parity. The unit test can't catch it either — the fluxAIODeployment() fixture omits affinity entirely, so TestBuildShardDeployment asserts nothing about it.
Impact: on any cluster where shard pods can only land on the flux-aio node (single-node sandbox/edge installs, or one schedulable node), every shard pod stays Pending; observeShards then reports no ready shard, apply() defers every assignment, and tenant HelmReleases stay on key=tenants. On a fresh single-node install — where flux-tenants is already retired — nothing reconciles the tenant plane at all. Multi-node clusters mask it (shards land elsewhere), which is why multi-node e2e won't surface it.
Fix: sanitise the inherited affinity in BuildShardDeployment the same way host networking and tolerations are — affinity = nil, or strip just podAntiAffinity (keeping nodeAffinity os=linux is harmless). Then add the real anti-affinity to the test fixture and assert it's stripped, so the fixture stays faithful to the manifest.
Non-blocking follow-ups
-
remediation.retries: -1is still hard-coded inpkg/registry/apps/application/rest.go. Sharding bounds the blast radius but the remediation-storm source the design names as a prerequisite to cap is still live. Fine to land separately — worth linking the follow-up. -
The "relabel never triggers uninstall" e2e is deferred. The design doc flags it as a TODO before production reliance, and this PR ships
shardCount: auto(enforcing at fleet scale on day one) without it. Worth tracking. -
The CREATE-only justification comment in
webhook.gostates helm-controller applies child HelmReleases via server-side apply, "which leaves labels owned by other field managers untouched." helm-controller v1.5.0 applies Helm releases via the Helm SDK three-way merge, not Kubernetes SSA, and the tenant chart templates hard-codesharding.fluxcd.io/key: tenantson child HelmReleases (e.g.packages/apps/tenant/templates/etcd.yaml). On an actual tenant-chart upgrade the label can revert totenantsand rely on the placement controller's straggler relabel — which the description acknowledges. Worth making the comment match the self-heal mechanism, since the SSA claim is the stated reason UPDATE interception is unnecessary. -
With
replicas: 2and leader-elected controllers, the placement gauges are populated only by the leader, but the pod-levelprometheus.io/scrapeannotation scrapes both pods, so the non-leader emits thecozy_flux_shard_*gauges at zero. Minor — worth scraping only the leader or documenting themaxaggregation.
| podSpec.InitContainers = nil | ||
| podSpec.HostNetwork = false | ||
| podSpec.DNSPolicy = corev1.DNSClusterFirst | ||
| podSpec.Tolerations = nil |
There was a problem hiding this comment.
affinity is not sanitised here. The cloned flux-aio pod spec carries a required podAntiAffinity against app.kubernetes.io/name=flux (fluxcd.yaml), so shard pods can't co-locate with the flux-aio pod. On a single schedulable node every shard stays Pending and the tenant plane stalls (no ready shard → assignments deferred → HelmReleases stuck on key=tenants). The legacy flux-tenants had no anti-affinity. Clear podSpec.Affinity (or just PodAntiAffinity) alongside the other sanitisation.
There was a problem hiding this comment.
Fixed in 3646e0e — BuildShardDeployment now clears podSpec.Affinity.PodAntiAffinity alongside the other sanitisation, so the inherited required anti-affinity on app.kubernetes.io/name=flux no longer keeps shards off the flux-aio node. A benign nodeAffinity (e.g. os=linux) is preserved, so single-node clusters can now schedule the shard pods.
| }, | ||
| Spec: appsv1.DeploymentSpec{ | ||
| Template: corev1.PodTemplateSpec{ | ||
| Spec: corev1.PodSpec{ |
There was a problem hiding this comment.
The fluxAIODeployment() fixture omits Affinity, so the suite can't catch the inherited required podAntiAffinity from the real manifest. Add the anti-affinity here and assert BuildShardDeployment strips it.
There was a problem hiding this comment.
Done in 3646e0e — the fluxAIODeployment() fixture now carries flux-aio's nodeAffinity + required podAntiAffinity, and TestBuildShardDeployment asserts the podAntiAffinity is stripped while the nodeAffinity survives. The assertion fails without the provisioner change (verified).
…d pods BuildShardDeployment clones the flux-aio pod spec, which carries a required podAntiAffinity on app.kubernetes.io/name=flux. Inherited onto a shard pod it forbids scheduling on the flux-aio node, so on a single schedulable node every shard stays Pending and the tenant plane stalls (no ready shard -> assignments deferred -> HelmReleases stuck on key=tenants). Drop the podAntiAffinity during sanitisation while keeping any benign nodeAffinity (e.g. os=linux). The fixture now carries flux-aio's affinity and TestBuildShardDeployment asserts the strip. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Myasnikov Daniil <myasnikovdaniil2001@gmail.com>
The shardCount description and the shard.resources value overflowed their columns, pushing the internal separators out of line. Reflow every column to its widest cell so the raw table renders cleanly. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Myasnikov Daniil <myasnikovdaniil2001@gmail.com>
Resolve go.mod conflict: keep main's go 1.26.4 and golang.org/x/* bumps, and keep gomodules.xyz/jsonpatch/v2 v2.4.0 as a direct dependency (imported by the flux-shard-operator admission webhook) instead of main's indirect entry. go mod tidy leaves go.sum consistent. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Myasnikov Daniil <myasnikovdaniil2001@gmail.com>
What this PR does
Implements the tenant Flux sharding design from #2817: a new
flux-shard-operatorsystem package that spreads tenant HelmReleases across multiple helm-controller shards, so a noisy tenant (e.g. a HelmRelease stuck in infinite remediation) cannot degrade the others.shardCounthelm-controller Deployments cloned from the flux-aiofluxDeployment and sanitised, so the image and feature-gates stay version-synced automatically. flux-aio itself is unchanged and keeps owning everything without a shard key.failurePolicy: Ignore) stamps the tenant's shard onto every HelmRelease at admission, so each one is born on the correct shard regardless of creation path; any miss degrades gracefully to the placement controller.flux-tenantsDeployment is retired: the operator drains it (backfilltenants→shard<i>) and deletes it once empty; migration 44 removes a leftover only under the same guard.Ships with
shardCount: auto(the default): the operator sizes the shard count from the tenant HelmRelease count —K = clamp(ceil(H/100), 1, min(16, T))with hysteresis (scale up past 120 HR/shard, down under 60) — so small clusters stay at one shard (≈ today's behaviour) and large fleets shard out automatically. An integer value pins the count explicitly.One deviation from the design doc: the webhook/controller scope is the
sharding.fluxcd.io/keylabel rather thaninternal.cozystack.io/tenantmodule— every tenant HelmRelease is born withkey=tenants(stamped by ApplicationDefinitionrelease.labelsand chart templates), so the webhook rewrites it and the placement controller continuously relabels stragglers;tenantmodulecovers only the 6 tenant chart modules.Screenshots
Not a UI change.
Release note
Summary by CodeRabbit