Skip to content

feat(fluxcd): add flux-shard-operator for tenant helm-controller sharding#2821

Open
Andrei Kvapil (kvaps) wants to merge 17 commits into
mainfrom
feat/flux-shard-operator-impl
Open

feat(fluxcd): add flux-shard-operator for tenant helm-controller sharding#2821
Andrei Kvapil (kvaps) wants to merge 17 commits into
mainfrom
feat/flux-shard-operator-impl

Conversation

@kvaps

@kvaps Andrei Kvapil (kvaps) commented Jun 3, 2026

Copy link
Copy Markdown
Member

What this PR does

Implements the tenant Flux sharding design from #2817: a new flux-shard-operator system package that spreads tenant HelmReleases across multiple helm-controller shards, so a noisy tenant (e.g. a HelmRelease stuck in infinite remediation) cannot degrade the others.

  • Placement controller owns the tenant→shard assignment: the unit of placement is the tenant (all of its HelmReleases share one shard label), distribution is greedy least-loaded weighted by HelmRelease count, with pinned tenants, threshold-gated rebalance, paced moves, and never moving deleting tenants. Watches are metadata-only, so the controller does not decode the helm-controller status-patch firehose. Assignments are recorded on tenant namespaces; HelmRelease labels remain the source of truth on restarts.
  • Shard runtime is provisioned by the operator: shardCount helm-controller Deployments cloned from the flux-aio flux Deployment and sanitised, so the image and feature-gates stay version-synced automatically. flux-aio itself is unchanged and keeps owning everything without a shard key.
  • CREATE-only mutating webhook (failurePolicy: Ignore) stamps the tenant's shard onto every HelmRelease at admission, so each one is born on the correct shard regardless of creation path; any miss degrades gracefully to the placement controller.
  • The hand-rolled flux-tenants Deployment is retired: the operator drains it (backfill tenantsshard<i>) and deletes it once empty; migration 44 removes a leftover only under the same guard.

Ships with shardCount: auto (the default): the operator sizes the shard count from the tenant HelmRelease count — K = clamp(ceil(H/100), 1, min(16, T)) with hysteresis (scale up past 120 HR/shard, down under 60) — so small clusters stay at one shard (≈ today's behaviour) and large fleets shard out automatically. An integer value pins the count explicitly.

One deviation from the design doc: the webhook/controller scope is the sharding.fluxcd.io/key label rather than internal.cozystack.io/tenantmodule — every tenant HelmRelease is born with key=tenants (stamped by ApplicationDefinition release.labels and chart templates), so the webhook rewrites it and the placement controller continuously relabels stragglers; tenantmodule covers only the 6 tenant chart modules.

Screenshots

Not a UI change.

Release note

feat(fluxcd): tenant HelmReleases are now reconciled by operator-managed helm-controller shards; the flux-shard-operator balances tenants across `shardCount` instances and retires the legacy flux-tenants deployment

Summary by CodeRabbit

  • New Features
    • Introduced flux-shard-operator to shard helm-controller workloads with deterministic tenant placement, optional rebalance controls, and autosized shard count.
    • Added a CREATE-only mutating webhook to stamp the correct shard label on tenant HelmReleases.
  • Bug Fixes
    • Application updates now preserve existing flux-shard label assignments to prevent unwanted shard re-stamping.
  • Chores
    • Added platform packaging for the new component, including an image build step.
    • Removed the legacy tenants deployment and added a migration to retire it safely.
  • Tests
    • Expanded unit and chart tests for placement, provisioning, webhook behavior, and configuration parsing.
  • Documentation
    • Added operator documentation describing configuration and metrics.

Andrei Kvapil (kvaps) and others added 6 commits June 3, 2026 19:47
Tenant-granular placement for sharding the management-cluster tenant
helm-controller: tenant key extraction from HelmRelease metadata,
greedy least-loaded assignment weighted by HelmRelease count (with the
N-tenants-over-N-shards 1-per-shard guarantee), minimal-movement
rescale, threshold-gated rebalance with per-tenant cooldown hook,
pinned tenants, and the autosizing recommendation formula.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <andrei.kvapil@aenix.io>
Three leader-elected/served parts on metadata-only watches, so the
operator never decodes the helm-controller status-patch firehose:

- placement controller: singleton full-sync reconciler that rebuilds
  the tenant index from HelmRelease labels (restart-safe), records the
  assignment on the tenant namespace, self-heals HR labels and paces
  reassignments so a backfill never floods a target shard;
- shard provisioner: reconciles helm-controller-shard<i> Deployments
  cloned from the flux-aio helm-controller container and sanitised
  (no host networking, no localhost cross-container wiring, per-shard
  watch-label-selector), prunes drained extra shards and retires the
  legacy flux-tenants Deployment once the tenants bucket is empty;
- CREATE-only mutating webhook: stamps the tenant's recorded shard on
  newborn HelmReleases so each one is born on the correct shard; every
  miss is permissive and falls back to the placement controller.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <andrei.kvapil@aenix.io>
Manager wiring modeled on lineage-controller-webhook: leader-elected
controllers, webhook served on all replicas, metadata-only caches for
HelmReleases (shard-key label selector) and namespaces, Deployments
cached from the flux namespace only.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <andrei.kvapil@aenix.io>
New cozy-fluxcd system package (Deployment with 2 replicas, webhook
cert via cert-manager, CREATE-only MutatingWebhookConfiguration with
failurePolicy Ignore scoped to HelmReleases born on the legacy tenants
bucket), registered in the platform system bundle and the root image
build.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <andrei.kvapil@aenix.io>
The installer no longer ships the hand-rolled tenant helm-controller
shard: flux-shard-operator provisions operator-managed shard
Deployments cloned from flux-aio at runtime instead, and deletes a
running flux-tenants Deployment only after the legacy tenants bucket
drains. flux-aio itself stays unchanged and keeps owning everything
without a shard key; the Makefile lines that synced the image/version
into the removed manifest go away with it.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <andrei.kvapil@aenix.io>
Defensively removes a leftover flux-tenants Deployment only when no
HelmRelease carries sharding.fluxcd.io/key=tenants anymore (the
flux-shard-operator backfill owns the live drain), and stamps
cozystack-version 45 mirroring migration 43's labeled manifest.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <andrei.kvapil@aenix.io>
@coderabbitai

coderabbitai Bot commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7d7f9cca-733c-40d1-a8fa-bb26c8730865

📥 Commits

Reviewing files that changed from the base of the PR and between 0421d71 and dc94272.

📒 Files selected for processing (5)
  • Makefile
  • go.mod
  • packages/core/platform/templates/bundles/system.yaml
  • packages/core/platform/values.yaml
  • pkg/registry/apps/application/rest.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • Makefile
  • pkg/registry/apps/application/rest.go
  • packages/core/platform/values.yaml
  • go.mod

📝 Walkthrough

Walkthrough

Adds a new flux-shard-operator: placement and provisioner controllers, CREATE-only mutating webhook, operator entrypoint, Helm chart and packaging, tests, and a migration that conditionally retires the legacy flux-tenants Deployment when drained.

Changes

Flux Shard Operator

Layer / File(s) Summary
Operator configuration and parsing
internal/fluxshardoperator/config.go, gvk.go, metrics.go
Sharding label/naming constants, Config struct with auto/explicit shard modes, shard name/index parsing helpers, GVK constants and PartialObjectMetadata constructors, and Prometheus metric definitions.
Tenant metadata derivation
internal/fluxshardoperator/tenant.go, tenant_test.go
Helper to determine tenant namespace ownership from object metadata/labels, including nested tenant HelmRelease resolution and table-driven test coverage.
Placement algorithm and autosizing
internal/fluxshardoperator/placement.go, placement_test.go
Deterministic tenant→shard assignment with pinning, bounded rebalance respecting threshold, shard autosizing from HelmRelease counts, hysteresis-based effective count, and comprehensive unit tests for balancing, scaling, pins, and cooldown gates.
Placement reconciler
internal/fluxshardoperator/controller.go
PlacementReconciler gathers tenant HelmRelease/Namespace metadata, observes shard Deployment readiness, computes desired placement with per-tenant cooldown tracking, paces label updates via merge-patch, reports load/telemetry gauges, and self-heals drift.
Provisioner reconciler and shard builder
internal/fluxshardoperator/provisioner.go, provisioner_test.go
ShardSetReconciler provisions per-shard helm-controller Deployments via BuildShardDeployment (clones and sanitizes container args/env/volumes/resources), computes effective shard count, prunes drained shards, conditionally retires legacy flux-tenants, with comprehensive tests for sanitization and resource override merging.
Admission webhook
internal/fluxshardoperator/webhook.go, webhook_test.go
CREATE-only mutating webhook that resolves tenant namespace, reads shard assignment, stamps label via RFC6901-escaped JSON Patch when appropriate, and includes tests for mutation, nested tenants, no-ops, and non-Create operations.
Operator entrypoint and manager wiring
cmd/flux-shard-operator/main.go, go.mod
Main program parsing flags (shard count, concurrency, rebalance threshold, pinned tenants, resource overrides), configures controller-runtime manager with custom cache (metadata-only shard-key-labeled HelmRelease and flux Deployment), wires reconcilers/webhook, and promotes jsonpatch dependency.
Helm chart packaging
packages/system/flux-shard-operator/Chart.yaml, Makefile, values.yaml, README.md
Chart metadata, build/test Make targets, Helm values for image/replicas/sharding settings, and operator README documenting placement behavior, autosizing, and metrics.
Chart manifests and infrastructure
packages/system/flux-shard-operator/templates/*, images/*, tests/*
Dockerfile multi-stage build, Deployment/Service/webhook/cert-manager/RBAC/PDB templates, helm-unittest suites for webhook and workload assertions.
Platform integration and migration
packages/core/platform/sources/flux-shard-operator.yaml, templates/bundles/system.yaml, values.yaml, images/migrations/migrations/44
Adds PackageSource with dependencies, includes operator in system bundle, bumps migration targetVersion to 45, and adds guarded migration that deletes legacy flux-tenants Deployment only when drained.
Build and flux-aio updates
Makefile, packages/core/flux-aio/Makefile
Root Makefile invokes operator image build, flux-aio Makefile removes legacy flux-tenants post-processing steps.
Application registry update
pkg/registry/apps/application/rest.go, rest_shard_label_test.go
REST update preserves live flux-shard operator label from backing HelmRelease to prevent shard reversion during Application updates, with unit test validating preservation.

Sequence Diagram

sequenceDiagram
    participant Client as Kubernetes API
    participant Manager as controller-runtime Manager
    participant Placement as PlacementReconciler
    participant Provisioner as ShardSetReconciler
    participant Webhook as ShardWebhook

    Client->>Manager: Start watches (HelmReleases, Namespaces, Deployments)
    Client->>Placement: HelmRelease/Namespace change event
    Placement->>Client: List HelmRelease metadata
    Placement->>Client: List Namespaces (metadata)
    Placement->>Client: List managed shard Deployments
    Placement->>Placement: ComputePlacement with rebalance
    Placement->>Client: Patch Namespace shard label
    Placement->>Client: Patch HelmRelease shard labels

    Client->>Provisioner: Flux source Deployment change
    Provisioner->>Client: Get source Deployment
    Provisioner->>Provisioner: Compute effective shard count
    Provisioner->>Client: Server-side apply shard Deployments
    Provisioner->>Client: Delete drained extra shards
    Provisioner->>Client: Delete legacy flux-tenants when drained

    Client->>Webhook: Admission CREATE HelmRelease
    Webhook->>Client: Get tenant Namespace
    Webhook->>Webhook: Resolve tenant shard
    Webhook-->>Client: JSON Patch to stamp ShardKeyLabel
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

area/build, area/testing

Suggested reviewers

  • androndo
  • lllamnyp
  • IvanHunters
  • sircthulhu
  • myasnikovdaniil

Poem

Tenants split across the shards with care,
A webhook stamps with labels rare,
Controllers dance, placement flows,
Shard by shard, the system grows.
Rabbits cheers this operator spree! 🐰

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 38.89% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: introducing flux-shard-operator for tenant helm-controller sharding.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/flux-shard-operator-impl

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions Bot added area/platform Issues or PRs related to platform infrastructure (bundle, flux, talos, installer) kind/feature Categorizes issue or PR as related to a new feature size/XXL This PR changes 1000+ lines, ignoring generated files labels Jun 3, 2026
Andrei Kvapil (kvaps) and others added 2 commits June 4, 2026 00:04
…s on shard readiness

E2E caught the cloned shard crashlooping on Talos: the installer injects
the node-local KubePrism endpoint (KUBERNETES_SERVICE_HOST=localhost,
KUBERNETES_SERVICE_PORT=7445) into hostNetwork workloads, and the clone
carried it into a non-hostNetwork pod where localhost is dead. Sanitise
those env vars away so shards use the in-cluster defaults.

Defense in depth for the same failure mode: the placement controller now
defers reassignments whose target shard Deployment has no ready replica,
so a backfill can never hand tenants to a helm-controller that is not
actually running, and reacts to shard readiness changes directly.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <andrei.kvapil@aenix.io>
--shard-count now accepts "auto" (the new default) or an explicit
positive integer. Auto mode drives the shard count from the tenant
HelmRelease count — K = clamp(ceil(H/100), 1, min(16, T)) — applied
with hysteresis anchored on the currently provisioned shards (scale up
eagerly past 120 HR/shard, scale down lazily under 60 HR/shard) so K
does not flap around sizing boundaries.

The per-shard target is 100 HelmReleases (down from the 150 in the
design doc) to leave headroom for existing tenants growing their
HelmRelease count without an immediate reshard.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <andrei.kvapil@aenix.io>
@gemini-code-assist

Copy link
Copy Markdown
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request implements a sharding architecture for Flux HelmReleases to enhance cluster stability and resource isolation. By distributing tenant workloads across multiple managed helm-controller shards, the system mitigates the impact of noisy tenants. The solution includes a dedicated operator that provisions shard deployments, handles tenant-to-shard placement, and uses a mutating webhook to ensure consistent shard assignment for all new HelmReleases.

Highlights

  • Operator Implementation: Introduced the flux-shard-operator to manage tenant HelmRelease sharding across multiple helm-controller instances.
  • Placement Logic: Implemented a greedy least-loaded placement controller that distributes tenants based on HelmRelease count.
  • Admission Webhook: Added a mutating webhook to automatically stamp the correct shard label on new HelmReleases at creation time.
  • Legacy Cleanup: Retired the legacy flux-tenants deployment and added a migration to ensure clean removal.
New Features

🧠 You can now enable Memory (public preview) to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment Gemini (@gemini-code-assist) Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on Gemini (@gemini-code-assist) comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the flux-shard-operator to horizontally shard tenant HelmReleases across multiple helm-controller instances, isolating noisy tenants and reducing cache-lag blast radius. It includes the operator implementation, a placement controller, a mutating webhook, and integration manifests for Cozystack. The reviewer's feedback focuses on enhancing the operator's security posture by adding pod- and container-level security contexts, correcting the default --metrics-bind-address flag to properly disable metrics when empty, simplifying nested template blocks in the workload deployment, and fixing a minor formatting issue in the README.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

labels:
app: flux-shard-operator
spec:
serviceAccountName: flux-shard-operator

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

To improve security, it's best practice to define a securityContext at the pod level to run as a non-root user and set a specific fsGroup. This restricts pod permissions and ensures files created in volumes have defined ownership, adhering to the principle of least privilege.

      serviceAccountName: flux-shard-operator
      securityContext:
        runAsNonRoot: true
        runAsUser: 65534
        fsGroup: 65534
        seccompProfile:
          type: RuntimeDefault

app: flux-shard-operator
containers:
- name: flux-shard-operator
image: "{{ .Values.fluxShardOperator.image }}"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

For better security hardening, you should explicitly prevent privilege escalation and drop all capabilities for the container. This can be done by adding a securityContext to the container spec.

        image: "{{ .Values.fluxShardOperator.image }}"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
              - ALL

Comment on lines +76 to +77
flag.StringVar(&metricsAddr, "metrics-bind-address", "0", "The address the metrics endpoint binds to. "+
"Use :8443 for HTTPS or :8080 for HTTP, or leave as 0 to disable the metrics service.")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The help text for --metrics-bind-address states that a value of "0" will disable the metrics service. However, with controller-runtime, a bind address of "0" causes the server to listen on a random port, not disable it. To disable the metrics server, an empty string "" should be used.

To align with the documented intent of being disabled by default, I suggest changing the default value to "" and updating the help text accordingly.

Suggested change
flag.StringVar(&metricsAddr, "metrics-bind-address", "0", "The address the metrics endpoint binds to. "+
"Use :8443 for HTTPS or :8080 for HTTP, or leave as 0 to disable the metrics service.")
flag.StringVar(&metricsAddr, "metrics-bind-address", "", "The address the metrics endpoint binds to. "+
"Use :8443 for HTTPS or :8080 for HTTP. Leave empty to disable the metrics service.")

| `fluxShardOperator.image` | Container image | `ghcr.io/cozystack/cozystack/flux-shard-operator:*` |
| `fluxShardOperator.debug` | Enable debug logging | `false` |
| `fluxShardOperator.replicas` | Operator replica count | `2` |
| `fluxShardOperator.shardCount` | Number of helm-controller shards: `auto` sizes from the HelmRelease count, an integer pins it | `auto` |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

low

There's a minor formatting issue with trailing whitespace in the table, which can affect rendering on some platforms.

Suggested change
| `fluxShardOperator.shardCount` | Number of helm-controller shards: `auto` sizes from the HelmRelease count, an integer pins it | `auto` |
| `fluxShardOperator.shardCount` | Number of helm-controller shards: `auto` sizes from the HelmRelease count, an integer pins it | `auto` |

Comment on lines +52 to +69
{{- with .Values.fluxShardOperator.shard.resources }}
{{- with .requests }}
{{- with .cpu }}
- --shard-cpu-request={{ . }}
{{- end }}
{{- with .memory }}
- --shard-memory-request={{ . }}
{{- end }}
{{- end }}
{{- with .limits }}
{{- with .cpu }}
- --shard-cpu-limit={{ . }}
{{- end }}
{{- with .memory }}
- --shard-memory-limit={{ . }}
{{- end }}
{{- end }}
{{- end }}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

low

The nested with statements for passing shard resource arguments can be simplified for improved readability and maintainability. Using a more direct path for each value makes the template easier to understand.

        {{- with .Values.fluxShardOperator.shard.resources.requests.cpu }}
        - --shard-cpu-request={{ . }}
        {{- end }}
        {{- with .Values.fluxShardOperator.shard.resources.requests.memory }}
        - --shard-memory-request={{ . }}
        {{- end }}
        {{- with .Values.fluxShardOperator.shard.resources.limits.cpu }}
        - --shard-cpu-limit={{ . }}
        {{- end }}
        {{- with .Values.fluxShardOperator.shard.resources.limits.memory }}
        - --shard-memory-limit={{ . }}
        {{- end }}

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cmd/flux-shard-operator/main.go`:
- Line 191: The cache is currently filtering HelmReleases by
fso.HelmReleaseMeta(): {Label: labels.NewSelector().Add(*shardKeyExists)} which
drops CREATEs that never got the sharding.fluxcd.io/key (e.g. when
failurePolicy: Ignore); remove that label selector so the HelmRelease cache
isn't label-filtered (or replace with labels.Everything()/nil) and then update
the gather/list in internal/fluxshardoperator/controller.go to query
HelmReleases without a label selector (include unlabeled resources) so placement
fallback can see and stamp HelmReleases missing the shard key.

In `@internal/fluxshardoperator/controller.go`:
- Around line 294-299: The code updates r.lastMoved[tenantNS] and increments
movesCounter before calling r.stamp, so failures in r.stamp still mark the
tenant as moved and inflate metrics; move the updates so they occur only after
r.stamp returns nil: call r.stamp(ctx, v, target) first, handle/append any error
to errs if non-nil, and only when r.stamp succeeds then set
r.lastMoved[tenantNS] = r.now(), do movesCounter.Inc(), and call
logger.Info("reassigning tenant", ...) so metrics and cooldown reflect
successful patches.

In `@internal/fluxshardoperator/placement.go`:
- Around line 145-146: rebalance currently treats every tenant key in in.Pinned
as a valid pin and skips them, which preserves pins that reference non-existent
shards; update the rebalance logic (function rebalance in placement.go) to
consult the same validation used by ComputePlacement so only pins that point to
existing/valid shard IDs are honored — i.e., when iterating tenants in
in.Pinned, verify the pin maps to a known shard (the same validity check used in
ComputePlacement) and skip only those valid pins, allowing tenants with
invalid/non-existent shard pins to participate in rebalancing.

In `@internal/fluxshardoperator/provisioner.go`:
- Around line 301-303: The current assignment hc.Resources = cfg.ShardResources
replaces the entire resource block and drops inherited values; instead merge
per-field so empty/nil fields inherit cloned values: when applying
cfg.ShardResources, set hc.Resources.Requests = cfg.ShardResources.Requests only
if cfg.ShardResources.Requests != nil, and set hc.Resources.Limits =
cfg.ShardResources.Limits only if cfg.ShardResources.Limits != nil (preserving
existing hc.Resources fields otherwise); ensure hc.Resources is initialized
before merging and perform deep copies as needed to avoid shared mutable
structures (referencing hc.Resources and cfg.ShardResources).

In
`@packages/system/flux-shard-operator/templates/mutatingwebhookconfiguration.yaml`:
- Line 6: The templated annotation value for cert-manager.io/inject-ca-from is
unquoted and can break YAML linters; update the mutatingwebhookconfiguration
template so the annotation key cert-manager.io/inject-ca-from has its templated
value quoted (for example wrap {{ .Release.Namespace }}/flux-shard-operator in
double quotes) to ensure YAML tooling and parsers accept the file while keeping
the same template expression.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 995b8bc9-6ed1-40b9-a317-c11e38d81d9b

📥 Commits

Reviewing files that changed from the base of the PR and between adb9989 and 4ae3458.

📒 Files selected for processing (36)
  • Makefile
  • cmd/flux-shard-operator/main.go
  • go.mod
  • internal/fluxinstall/manifests/fluxcd-tenants.yaml
  • internal/fluxshardoperator/config.go
  • internal/fluxshardoperator/controller.go
  • internal/fluxshardoperator/gvk.go
  • internal/fluxshardoperator/metrics.go
  • internal/fluxshardoperator/placement.go
  • internal/fluxshardoperator/placement_test.go
  • internal/fluxshardoperator/provisioner.go
  • internal/fluxshardoperator/provisioner_test.go
  • internal/fluxshardoperator/tenant.go
  • internal/fluxshardoperator/tenant_test.go
  • internal/fluxshardoperator/webhook.go
  • internal/fluxshardoperator/webhook_test.go
  • packages/core/flux-aio/Makefile
  • packages/core/platform/images/migrations/migrations/44
  • packages/core/platform/sources/flux-shard-operator.yaml
  • packages/core/platform/templates/bundles/system.yaml
  • packages/core/platform/values.yaml
  • packages/system/flux-shard-operator/Chart.yaml
  • packages/system/flux-shard-operator/Makefile
  • packages/system/flux-shard-operator/README.md
  • packages/system/flux-shard-operator/images/flux-shard-operator/Dockerfile
  • packages/system/flux-shard-operator/templates/certmanager.yaml
  • packages/system/flux-shard-operator/templates/mutatingwebhookconfiguration.yaml
  • packages/system/flux-shard-operator/templates/pdb.yaml
  • packages/system/flux-shard-operator/templates/rbac-bind.yaml
  • packages/system/flux-shard-operator/templates/rbac.yaml
  • packages/system/flux-shard-operator/templates/sa.yaml
  • packages/system/flux-shard-operator/templates/service.yaml
  • packages/system/flux-shard-operator/templates/workload.yaml
  • packages/system/flux-shard-operator/tests/webhook_test.yaml
  • packages/system/flux-shard-operator/tests/workload_test.yaml
  • packages/system/flux-shard-operator/values.yaml
💤 Files with no reviewable changes (2)
  • internal/fluxinstall/manifests/fluxcd-tenants.yaml
  • packages/core/flux-aio/Makefile

// Metadata-only stubs: the operator never decodes HelmRelease
// specs/statuses, and the cache only holds tenant HRs (the
// shard key label exists on every one of them).
fso.HelmReleaseMeta(): {Label: labels.NewSelector().Add(*shardKeyExists)},

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not label-filter HelmRelease cache if webhook misses must be backfilled.

With failurePolicy: Ignore, CREATE requests can persist without sharding.fluxcd.io/key. This cache selector drops those HelmReleases, so placement fallback cannot see or stamp them.

Suggested fix
-	shardKeyExists, err := labels.NewRequirement(fso.ShardKeyLabel, selection.Exists, nil)
-	if err != nil {
-		setupLog.Error(err, "building HelmRelease label selector")
-		os.Exit(1)
-	}
...
-				fso.HelmReleaseMeta(): {Label: labels.NewSelector().Add(*shardKeyExists)},
+				fso.HelmReleaseMeta(): {},

Also align internal/fluxshardoperator/controller.go gather list call to include unlabeled HelmReleases:

-	if err := r.List(ctx, hrs, client.HasLabels{ShardKeyLabel}); err != nil {
+	if err := r.List(ctx, hrs); err != nil {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/flux-shard-operator/main.go` at line 191, The cache is currently
filtering HelmReleases by fso.HelmReleaseMeta(): {Label:
labels.NewSelector().Add(*shardKeyExists)} which drops CREATEs that never got
the sharding.fluxcd.io/key (e.g. when failurePolicy: Ignore); remove that label
selector so the HelmRelease cache isn't label-filtered (or replace with
labels.Everything()/nil) and then update the gather/list in
internal/fluxshardoperator/controller.go to query HelmReleases without a label
selector (include unlabeled resources) so placement fallback can see and stamp
HelmReleases missing the shard key.

Comment thread internal/fluxshardoperator/controller.go Outdated
Comment thread internal/fluxshardoperator/placement.go Outdated
Comment thread internal/fluxshardoperator/provisioner.go Outdated
Comment thread packages/system/flux-shard-operator/templates/mutatingwebhookconfiguration.yaml Outdated

@myasnikovdaniil myasnikovdaniil left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed against the #2817 design doc. The two blocking findings from the design review are properly fixed here: the scoping is now sharding.fluxcd.io/key-based (cache selector key-exists, webhook objectSelector key In [tenants]), TenantNamespaceForHR exactly mirrors the chart's tenant.name helper including nested tenants, and the migration can't deadlock or orphan on the helm path (backfill covers all legacy HRs, retireLegacy + migration 44 share the same drain guard). The KUBERNETES_SERVICE_HOST/PORT strip in the sanitiser is a great catch, and the placement/provisioner/webhook unit coverage is solid.

Requesting changes for one functional bug and two rollout concerns (inline):

  1. apps API UPDATE path reverts the shard label — see comment on webhook.go. This reintroduces the orphaning class the drain guard exists to prevent, on the most common write path in the product.
  2. shardCount: auto default skips the staged rollout the design doc prescribed — see comment on values.yaml.
  3. Fresh-install bootstrap gap (no inline anchor since the file is deleted): with internal/fluxinstall/manifests/fluxcd-tenants.yaml gone, new installs have no tenant helm-controller until cert-manager → flux-shard-operator → shard0 cloned → shard0 Ready (first assignment is gated on readyShards[target] in apply()). What used to be one installer-applied static manifest is now a 4-link runtime chain on the tenant plane's critical path, and a failure anywhere in it (cert-manager webhook, operator image pull, clone source) stalls every tenant HelmRelease on a fresh cluster. The upgrade path is handled nicely (flux-tenants keeps running until drained) — first installs need at least a documented story, ideally a fallback.

Two non-blocking notes:

  • The design doc lists capping remediation.retries as a prerequisite rollout step, and pkg/registry/apps/application/rest.go still hardcodes Retries: -1 in the conversion. Fine if it lands separately, but the storm source this design contains is still live — would be good to link the follow-up.
  • Minor asymmetry: the provisioner sizes auto-K from all key-labeled HRs while placement sums only tenant-attributable ones, so placement-K ≤ provisioner-K. The error direction is safe (worst case an idle shard), just noting it's intentional-looking but undocumented.

// helm-controller status patches are a firehose, so UPDATE is never
// intercepted, and a webhook outage must degrade to the catch-all path (the
// HelmRelease keeps its legacy "tenants" key until the placement controller
// relabels it) instead of blocking creation.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CREATE-only leaves a label-reversion hole on the apps API UPDATE path.

pkg/registry/apps/application/rest.go Update flow rebuilds the HelmRelease from the Application: convertApplicationToHelmRelease sets labels from prefixed user labels only (rest.go:1546) — the live sharding.fluxcd.io/key=shard<i> is dropped — then mergeMaps(r.releaseConfig.Labels, ...) re-stamps key: tenants from the ApplicationDefinition, and r.c.Update writes the full object.

Result: every app update through the dashboard/API bounces the HR off its shard back to tenants — synthetic informer DELETE on the owning shard, unowned until the placement controller relabels, and the just-submitted spec change waits for that relabel before any controller reconciles it. Worse: if the operator is down after flux-tenants retirement, every updated app strands on key=tenants with no reconciler at all — the exact orphaning class retireLegacy's drain guard prevents on the migration path.

Two possible fixes:

  • (a) preserve the live sharding.fluxcd.io/key in the REST update handler (it's in-repo; cleanest), or
  • (b) add UPDATE to this webhook with the same objectSelector — the status-patch-firehose argument doesn't apply: status patches on correctly-sharded HRs match key In [tenants] on neither old nor new object, so they're never sent to the webhook. Only the bounded migration window (HRs still on tenants) would generate UPDATE calls, and failurePolicy: Ignore + a cheap handler makes that acceptable.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed via option (a): Update now fetches the live HelmRelease and carries sharding.fluxcd.io/key over when rebuilding the object from the Application, so the API update path can no longer bounce a HelmRelease off its shard. Pinned by TestUpdate_PreservesShardKeyLabel. The webhook stays CREATE-only; its doc comment now spells out why UPDATE interception is unnecessary: the REST handler preserves the label, and helm-controller applies child HelmReleases via server-side apply, which leaves labels owned by other field managers untouched.

## over. "auto" (default) sizes from the tenant HelmRelease count
## (~100 HR per shard, capped by tenant count, with hysteresis); set a
## positive integer to pin it explicitly, e.g. shardCount: 7.
shardCount: auto

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

auto as the day-one default skips the staged rollout the design doc prescribed.

PLAN.md (#2817): ship at shardCount: 1 (≈ today's behaviour) to de-risk the runtime swap first; v1 surfaces the autosizing recommendation (cozy_flux_shard_recommended_count exists for exactly this), v2 enforces once telemetry is trusted. This PR jumps straight to enforcement: the largest fleet cluster goes from 1 shard to ceil(946/100) = 10 immediately on upgrade, exercising the leave-selector relabel semantics at full scale on day one — and the e2e the design doc itself called a TODO before production reliance ("relabel never triggers uninstall") isn't in this PR either.

Suggest defaulting to 1 for the release that performs the swap, flipping to auto one release later (or after the e2e lands). The hysteresis/anchoring logic itself looks right.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're keeping auto as the default: a default of 1 ships the feature disabled for everyone, and turning it on would require either a defaults-changing release or a per-cluster override — exactly the config drift we want to avoid. On the specific risks:

  • The mass relabel happens at any K: with shardCount: 1 every HelmRelease still moves tenantsshard0, so the leave-selector path is exercised on day one regardless; auto only changes the number of targets.
  • Leaving a watch selector cannot trigger an uninstall: helm-controller uninstalls only through its finalizer on actual object deletion; an HR leaving the selector merely drops out of that shard's informer. Label-based shard handoff is the upstream-supported mechanism — this operator follows the design of weaveworks/flux-shard-controller.
  • Moves are paced (5 tenants / 15s) and only target shards with a ready replica, so the day-one backfill on a large fleet is a bounded trickle, not a burst.
  • The observation phase is now real: the metrics endpoint is wired and scraped, so cozy_flux_shard_recommended_count and per-shard load are visible while auto acts.

Happy to add the "relabel never triggers uninstall" e2e as a follow-up to close the design-doc TODO.


if kubectl get deployment flux-tenants -n cozy-fluxcd >/dev/null 2>&1; then
remaining=$(kubectl get helmreleases.helm.toolkit.fluxcd.io -A \
-l sharding.fluxcd.io/key=tenants -o name | head -n 1)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kubectl ... -o name | head -n 1 under set -euo pipefail: when the output exceeds the pipe buffer (~64KB), head exiting early SIGPIPEs kubectl → exit 141 → the whole migration aborts. ~950 HRs ≈ 38KB of names, so the biggest known cluster is already at half the limit.

Safer without early-close, e.g.:

remaining=$(kubectl get helmreleases.helm.toolkit.fluxcd.io -A \
  -l sharding.fluxcd.io/key=tenants -o name | wc -l)
if [ "$remaining" -eq 0 ]; then

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — the drain check counts with wc -l, no early-close in the pipe.

- name: flux-shard-operator
image: "{{ .Values.fluxShardOperator.image }}"
args:
- --leader-elect

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The README advertises six cozy_flux_shard_* metrics, but the metrics endpoint is off as shipped: --metrics-bind-address defaults to "0" (disabled) in main.go and is never set here, and there's no scrape annotation/PodMonitor either. Either wire --metrics-bind-address + scraping, or drop the Telemetry section from the README until it's reachable.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wired: metrics are served plain-HTTP on :8080, the pod carries prometheus.io/scrape annotations, and the README Telemetry section documents the endpoint. (Side note: controller-runtime's "0" does disable the server — but the point stood, nothing was scraping it; now it's on by default.)

app: flux-shard-operator
containers:
- name: flux-shard-operator
image: "{{ .Values.fluxShardOperator.image }}"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: the operator container has no resources and no securityContext (scratch image ⇒ runs as UID 0). Worth matching the hardening of the cloned shard pods (runAsNonRoot, drop ALL, readOnlyRootFilesystem) and giving it modest requests.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done: runAsNonRoot (65534) + fsGroup for the serving cert, seccomp RuntimeDefault, dropped capabilities, read-only rootfs, and modest requests/limits.

continue
}
budget--
r.lastMoved[tenantNS] = r.now()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: the move budget, lastMoved cooldown, and movesCounter are all consumed before stamp() runs; if the patch fails the move is retried next reconcile but counted twice and the cooldown timestamp refreshed. Harmless in practice — just noting it's a known asymmetry, or move the bookkeeping after a successful stamp().

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — cooldown/counter/log now run only after a successful stamp, so a failed patch neither delays the retry nor double-counts.

Andrei Kvapil (kvaps) and others added 6 commits June 11, 2026 13:44
The Update handler rebuilds the HelmRelease from the Application, which
re-stamped sharding.fluxcd.io/key with the ApplicationDefinition default
and bounced the HelmRelease off its operator-assigned shard on every
update. Carry the live label over from the current HelmRelease instead.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <andrei.kvapil@aenix.io>
The README advertised cozy_flux_shard_* metrics while the endpoint was
disabled as shipped: serve them plain-HTTP on :8080 with scrape
annotations so the autosizing recommendation is actually observable.
Run the operator non-root with a read-only rootfs, dropped capabilities
and modest resources, quote the inject-ca-from annotation for YAML
tooling, and document the fresh-install bootstrap chain.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <andrei.kvapil@aenix.io>
head -n 1 closing the pipe early under pipefail aborts the migration
once the HelmRelease name list outgrows the pipe buffer; count with
wc -l instead.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <andrei.kvapil@aenix.io>
A failed patch no longer refreshes the rebalance cooldown or inflates
the moves counter, so retries are not delayed and not double-counted.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <andrei.kvapil@aenix.io>
ComputePlacement ignores pins to non-existent shards, but rebalance
treated any pinned tenant as sticky, so an invalid pin blocked expected
moves. Honor only pins placement itself honors.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <andrei.kvapil@aenix.io>
Replacing the whole resources block dropped cloned values the override
did not name (e.g. the inherited cpu limit when only a memory limit is
configured), contradicting the documented inherit-when-empty semantics.
Also document the intentional provisioner-vs-placement sizing
asymmetry.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <andrei.kvapil@aenix.io>
@kvaps

Copy link
Copy Markdown
Member Author

All review items addressed — replies inline, summary below.

Fresh-install bootstrap (item 3). We'd rather not reintroduce a static manifest fallback. The chain (cert-manager → operator → shard0 cloned → shard0 Ready) is the same install-order dependency class as other cert-manager-gated components we already install at runtime (capi-operator, capi-providers), and a parallel static shard would mean two sources of truth for the tenant helm-controller — plus the yq image/version sync this PR removes. Runtime-provisioned shards are also how the upstream weaveworks/flux-shard-controller works. Instead, the README now documents the bootstrap chain and how to debug each link; since assignments only target ready shards, a broken link surfaces as HelmReleases staying on tenants — visible and recoverable, never destructive.

Bot findings triaged. Fixed: rebalance no longer treats pins to non-existent shards as sticky; shard resources merge per-field instead of replacing the block; inject-ca-from quoted. Rejected the cache-selector finding: every tenant HelmRelease is born with sharding.fluxcd.io/key (stamped by ApplicationDefinition release.labels and chart templates, independently of the webhook), and an HR without the label is reconciled by flux-aio (!sharding.fluxcd.io/key), so the label-scoped cache misses nothing this operator is responsible for.

Follow-ups (separate PRs): cap remediation.retries (the storm source this design contains), e2e "relabel never triggers uninstall".

@kingdonb

Copy link
Copy Markdown
Member

I have not had a chance to review this in detail, but reviewing the summary details and the discussion, I think this looks good! The right analysis here. I will follow this PR for updates and watch the documentation when it merges, to look for gaps.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOT LGTM — the cloned shard helm-controller pods inherit flux-aio's required pod anti-affinity, which leaves them unschedulable on single-node clusters and stalls the whole tenant plane there.

Business context: spread tenant HelmReleases across multiple helm-controller shards so one noisy tenant cannot degrade the others (design #2817).

I verified the previously-requested changes are addressed: the apps-API Update path now preserves sharding.fluxcd.io/key (rest.go, pinned by TestUpdate_PreservesShardKeyLabel); the fresh-install bootstrap chain is now documented in the chart README; migration 44 counts with wc -l (no SIGPIPE) and stamps the version ConfigMap with the no-delete label; metrics are wired and the pod is hardened. One new blocker below.

Blockers

B1: shard pods inherit flux-aio's required podAntiAffinity → unschedulable on single-node clusters

BuildShardDeployment deep-copies the flux-aio pod spec and sanitises hostNetwork, dnsPolicy, tolerations, volumes and init containers — but never clears affinity. The flux-aio flux Deployment carries a required pod anti-affinity against app.kubernetes.io/name=flux (internal/fluxinstall/manifests/fluxcd.yaml), so every helm-controller-shard<i> pod refuses to schedule onto the node already running the flux pod.

Evidence: the legacy flux-tenants Deployment this operator replaces had only nodeAffinity (os=linux) and no pod anti-affinity, so this is a regression, not parity. The unit test can't catch it either — the fluxAIODeployment() fixture omits affinity entirely, so TestBuildShardDeployment asserts nothing about it.

Impact: on any cluster where shard pods can only land on the flux-aio node (single-node sandbox/edge installs, or one schedulable node), every shard pod stays Pending; observeShards then reports no ready shard, apply() defers every assignment, and tenant HelmReleases stay on key=tenants. On a fresh single-node install — where flux-tenants is already retired — nothing reconciles the tenant plane at all. Multi-node clusters mask it (shards land elsewhere), which is why multi-node e2e won't surface it.

Fix: sanitise the inherited affinity in BuildShardDeployment the same way host networking and tolerations are — affinity = nil, or strip just podAntiAffinity (keeping nodeAffinity os=linux is harmless). Then add the real anti-affinity to the test fixture and assert it's stripped, so the fixture stays faithful to the manifest.

Non-blocking follow-ups

  1. remediation.retries: -1 is still hard-coded in pkg/registry/apps/application/rest.go. Sharding bounds the blast radius but the remediation-storm source the design names as a prerequisite to cap is still live. Fine to land separately — worth linking the follow-up.

  2. The "relabel never triggers uninstall" e2e is deferred. The design doc flags it as a TODO before production reliance, and this PR ships shardCount: auto (enforcing at fleet scale on day one) without it. Worth tracking.

  3. The CREATE-only justification comment in webhook.go states helm-controller applies child HelmReleases via server-side apply, "which leaves labels owned by other field managers untouched." helm-controller v1.5.0 applies Helm releases via the Helm SDK three-way merge, not Kubernetes SSA, and the tenant chart templates hard-code sharding.fluxcd.io/key: tenants on child HelmReleases (e.g. packages/apps/tenant/templates/etcd.yaml). On an actual tenant-chart upgrade the label can revert to tenants and rely on the placement controller's straggler relabel — which the description acknowledges. Worth making the comment match the self-heal mechanism, since the SSA claim is the stated reason UPDATE interception is unnecessary.

  4. With replicas: 2 and leader-elected controllers, the placement gauges are populated only by the leader, but the pod-level prometheus.io/scrape annotation scrapes both pods, so the non-leader emits the cozy_flux_shard_* gauges at zero. Minor — worth scraping only the leader or documenting the max aggregation.

podSpec.InitContainers = nil
podSpec.HostNetwork = false
podSpec.DNSPolicy = corev1.DNSClusterFirst
podSpec.Tolerations = nil

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

affinity is not sanitised here. The cloned flux-aio pod spec carries a required podAntiAffinity against app.kubernetes.io/name=flux (fluxcd.yaml), so shard pods can't co-locate with the flux-aio pod. On a single schedulable node every shard stays Pending and the tenant plane stalls (no ready shard → assignments deferred → HelmReleases stuck on key=tenants). The legacy flux-tenants had no anti-affinity. Clear podSpec.Affinity (or just PodAntiAffinity) alongside the other sanitisation.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 3646e0eBuildShardDeployment now clears podSpec.Affinity.PodAntiAffinity alongside the other sanitisation, so the inherited required anti-affinity on app.kubernetes.io/name=flux no longer keeps shards off the flux-aio node. A benign nodeAffinity (e.g. os=linux) is preserved, so single-node clusters can now schedule the shard pods.

},
Spec: appsv1.DeploymentSpec{
Template: corev1.PodTemplateSpec{
Spec: corev1.PodSpec{

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fluxAIODeployment() fixture omits Affinity, so the suite can't catch the inherited required podAntiAffinity from the real manifest. Add the anti-affinity here and assert BuildShardDeployment strips it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 3646e0e — the fluxAIODeployment() fixture now carries flux-aio's nodeAffinity + required podAntiAffinity, and TestBuildShardDeployment asserts the podAntiAffinity is stripped while the nodeAffinity survives. The assertion fails without the provisioner change (verified).

myasnikovdaniil and others added 3 commits June 22, 2026 10:02
…d pods

BuildShardDeployment clones the flux-aio pod spec, which carries a required podAntiAffinity on app.kubernetes.io/name=flux. Inherited onto a shard pod it forbids scheduling on the flux-aio node, so on a single schedulable node every shard stays Pending and the tenant plane stalls (no ready shard -> assignments deferred -> HelmReleases stuck on key=tenants).

Drop the podAntiAffinity during sanitisation while keeping any benign nodeAffinity (e.g. os=linux). The fixture now carries flux-aio's affinity and TestBuildShardDeployment asserts the strip.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Myasnikov Daniil <myasnikovdaniil2001@gmail.com>
The shardCount description and the shard.resources value overflowed their columns, pushing the internal separators out of line. Reflow every column to its widest cell so the raw table renders cleanly.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Myasnikov Daniil <myasnikovdaniil2001@gmail.com>
Resolve go.mod conflict: keep main's go 1.26.4 and golang.org/x/* bumps, and keep gomodules.xyz/jsonpatch/v2 v2.4.0 as a direct dependency (imported by the flux-shard-operator admission webhook) instead of main's indirect entry. go mod tidy leaves go.sum consistent.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Myasnikov Daniil <myasnikovdaniil2001@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/platform Issues or PRs related to platform infrastructure (bundle, flux, talos, installer) debug Debugging in progress kind/feature Categorizes issue or PR as related to a new feature size/XXL This PR changes 1000+ lines, ignoring generated files

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants