Skip to content

Add operator-level defaultImagePullSecrets across all controllers#5105

Merged
JAORMX merged 2 commits intomainfrom
feat/5102-default-image-pull-secrets-foundation
Apr 29, 2026
Merged

Add operator-level defaultImagePullSecrets across all controllers#5105
JAORMX merged 2 commits intomainfrom
feat/5102-default-image-pull-secrets-foundation

Conversation

@JAORMX
Copy link
Copy Markdown
Collaborator

@JAORMX JAORMX commented Apr 29, 2026

Summary

Why: Cluster operators frequently need a registry pull secret (for example, a private mirror of ghcr.io/stacklok/toolhive/proxyrunner) applied to every workload the operator spawns. Today the chart only exposes operator.imagePullSecrets for the operator's own pod (deploy/charts/operator/templates/deployment.yaml), forcing users to set the secret on every MCPServer/MCPRemoteProxy/VirtualMCPServer/MCPRegistry/EmbeddingServer CR or to mutate the namespace-default ServiceAccount. This is brittle and easy to miss.

What: Adds a chart value operator.defaultImagePullSecrets, a small shared imagepullsecrets package the operator's reconcilers consume, and wires the value through every workload-spawning controller in one shot:

  • MCPServer
  • MCPRemoteProxy
  • MCPRegistry (via registryapi.manager)
  • VirtualMCPServer
  • EmbeddingServer (chart defaults set on base PodSpec; user PodTemplateSpec strategic-merges additively, since corev1.PodSpec.ImagePullSecrets carries patchStrategy:"merge",patchMergeKey:"name")

The earlier "foundational PR + follow-ups" plan is superseded — there is no follow-up work needed for the other controllers.

Precedence rule

When both a chart-level default and a per-CR imagePullSecrets are set:

  • The CR-level list comes first in the resulting slice (preserving its order).
  • Each chart-level default is appended only if its Name does not already appear in the CR-level list.
  • Deduplication is by Name; the CR-level entry wins on name collision.
  • The CR's slice is never mutated; callers receive a fresh slice.

This rule is documented on the Defaults.Merge Go doc and in the chart values.yaml comment.

Drift-detection bug fixed in MCPServer

While extending scope, the existing deploymentNeedsUpdate in MCPServer was found to compare the live Spec.Template.Spec.ImagePullSecrets against the CR-only slice while the construction site applied the merged defaults+CR slice. With any chart default configured this would have looped forever. Both sites now route through imagePullSecretsForMCPServer. Regression test: TestDeploymentNeedsUpdate_DefaultImagePullSecrets.

Closes #5102

Type of change

  • New feature (non-breaking change which adds functionality)

Test plan

  • task test (operator unit and controller tests pass; operator integration tests fail only because envtest requires /usr/local/kubebuilder/bin/etcd which is not installed in this environment).
  • task lint-fix (zero issues).
  • helm template test deploy/charts/operator/ --set 'operator.defaultImagePullSecrets={regcred,otherscred}' renders TOOLHIVE_DEFAULT_IMAGE_PULL_SECRETS=regcred,otherscred on the operator container, with operator.env rendered before chart-managed env vars so chart-managed values win on duplicate names.
  • helm template test deploy/charts/operator/ (default values) does not render the env var at all.
  • helm lint deploy/charts/operator/ passes.
  • task helm-docs regenerates deploy/charts/operator/README.md.
  • New unit tests cover env-var parsing, the Merge precedence rule, and the wiring on each newly-integrated controller (*_default_imagepullsecrets_test.go).
  • New regression test TestDeploymentNeedsUpdate_DefaultImagePullSecrets confirms the MCPServer drift fix.

Changes

File Why
cmd/thv-operator/pkg/imagepullsecrets/defaults.go (new) Shared package: Defaults value type, LoadDefaultsFromEnv, NewDefaults, Merge, List.
cmd/thv-operator/pkg/imagepullsecrets/defaults_test.go (new) Unit tests for the package.
cmd/thv-operator/main.go Parse env var once at startup; thread Defaults through setupServerControllers, setupRegistryController, setupAggregationControllers. Log on typo'd env var.
cmd/thv-operator/controllers/mcpserver_controller.go Add ImagePullSecretsDefaults field; merge defaults at construction and drift-detection sites via imagePullSecretsForMCPServer.
cmd/thv-operator/controllers/mcpremoteproxy_controller.go + mcpremoteproxy_deployment.go Same wiring via imagePullSecretsForRemoteProxy.
cmd/thv-operator/controllers/virtualmcpserver_controller.go + virtualmcpserver_deployment.go Same wiring via imagePullSecretsForVMCP; hash annotation now over the merged list.
cmd/thv-operator/controllers/embeddingserver_controller.go Chart defaults placed on base PodSpec; strategic-merge with user PTS unions additively.
cmd/thv-operator/pkg/registryapi/manager.go + rbac.go + deployment.go NewManager takes Defaults; manager merges with mcpRegistry.Spec.ImagePullSecrets for both SA and Deployment construction.
cmd/thv-operator/controllers/{mcpserver,mcpremoteproxy,virtualmcpserver,embeddingserver}_default_imagepullsecrets_test.go Wiring tests + drift regression.
deploy/charts/operator/values.yaml New defaultImagePullSecrets value with documentation.
deploy/charts/operator/templates/deployment.yaml Render operator.env before chart-managed env vars (chart-managed wins on collision); render TOOLHIVE_DEFAULT_IMAGE_PULL_SECRETS env var on the operator container.
deploy/charts/operator/README.md Regenerated by helm-docs.

Does this introduce a user-facing change?

Yes. Cluster operators can now set operator.defaultImagePullSecrets in the chart values to apply pull secrets to every workload the operator spawns, instead of editing each CR. Existing per-CR imagePullSecrets continue to work unchanged and take precedence on name collision.

Special notes for reviewers

  • Replies to the prior round of inline comments are inline; the substantive surface (drift bug + scope expansion) is in a top-level comment so it's not buried.
  • All five workload-spawning reconcilers now consume Defaults. Pure-config reconcilers (MCPGroup, ToolConfig, MCPExternalAuthConfig, MCPOIDCConfig, MCPTelemetryConfig, MCPServerEntry) do not need it.
  • Defaults is a value type rather than a pointer to keep the zero value usable: a reconciler that omits the field still gets correct merge behavior (CR-level value passes through unchanged).

Generated with Claude Code

Large PR Justification

The diff exceeds the 1000-line threshold because the only useful end state for this feature is all five workload-spawning controllers consuming the chart-level defaults. A staged rollout was the original plan — the v1 of this PR landed only MCPServer as a "foundational" change — but reviewer feedback (and a perpetual-reconcile-loop bug found in MCPServer's drift detection while extending scope) made it clear the partial state is worse than the complete one:

  1. The chart values.yaml claim is only true if every controller is wired. Splitting meant shipping defaultImagePullSecrets: [...] documented as "applied to every workload" while four out of five controllers silently ignored it. Reviewers correctly flagged this on the v1 of the PR.
  2. Each controller's wiring is mechanically identical (ImagePullSecretsDefaults field on the reconciler + helper method that calls Defaults.Merge + setup-path threading). Splitting these into four follow-up PRs would have produced four near-identical reviews.
  3. The shared imagepullsecrets package and the main.go threading must land together with at least one consumer; otherwise the package is unused and setupControllersAndWebhooks has a parameter nothing reads.
  4. Drift detection must mirror construction in the same PR that introduces the merge. The MCPServer drift bug fixed here would not exist if both sites had been touched together — splitting that fix into its own PR would have been pure churn.

What's NOT in the diff that could have been: the integration tests for embedding-server, mcp-external-auth, mcp-registry, and virtualmcp packages all fail locally because envtest binaries aren't installed; those are pre-existing test-infrastructure issues unrelated to this change. The wiring tests for the four newly-integrated controllers are all unit tests that exercise the exact code path this PR touches, kept deliberately minimal (two cases per helper) with cross-references to defaults_test.go::TestDefaultsMerge for the exhaustive merge-precedence cases.

@JAORMX JAORMX self-assigned this Apr 29, 2026
@github-actions github-actions Bot added the size/L Large PR: 600-999 lines changed label Apr 29, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 29, 2026

Codecov Report

❌ Patch coverage is 57.00935% with 46 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.10%. Comparing base (4c23ade) to head (b4d4248).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
cmd/thv-operator/main.go 0.00% 42 Missing ⚠️
...thv-operator/controllers/mcpregistry_controller.go 0.00% 2 Missing ⚠️
cmd/thv-operator/pkg/imagepullsecrets/defaults.go 93.75% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5105      +/-   ##
==========================================
- Coverage   67.12%   67.10%   -0.03%     
==========================================
  Files         597      598       +1     
  Lines       60170    60222      +52     
==========================================
+ Hits        40390    40411      +21     
- Misses      16706    16736      +30     
- Partials     3074     3075       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@github-actions github-actions Bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels Apr 29, 2026
Copy link
Copy Markdown
Contributor

@jhrozek jhrozek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed with a few specialist agents. Design shape looks right (matches the cert-manager / OpenTelemetry operator pattern of operator-flag/env-var defaults), so most comments are details. The two I'd want a response on before merge are the values.yaml scope claim and the operator.env ordering question.

Comment thread deploy/charts/operator/values.yaml
Comment thread deploy/charts/operator/templates/deployment.yaml Outdated
Comment thread cmd/thv-operator/pkg/imagepullsecrets/defaults.go Outdated
Comment thread cmd/thv-operator/pkg/imagepullsecrets/defaults_test.go Outdated
Comment thread cmd/thv-operator/main.go Outdated
Comment thread cmd/thv-operator/controllers/mcpserver_default_imagepullsecrets_test.go Outdated
Comment thread cmd/thv-operator/main.go Outdated
Cluster operators frequently need a registry pull secret applied to every
workload the operator spawns (proxy runners, registry API, vMCP servers,
embedding servers). Today the chart only exposes imagePullSecrets for the
operator's own pod, forcing users to set the secret on every CR or to
mutate the namespace-default ServiceAccount.

This change introduces a chart value, operator.defaultImagePullSecrets,
that the operator picks up at startup via THV_DEFAULT_IMAGE_PULL_SECRETS
and applies as a default to every workload it spawns. All five
workload-spawning reconcilers consume the shared imagepullsecrets.Defaults
value and merge it with the per-CR list at workload-construction time:
MCPServer, MCPRemoteProxy, MCPRegistry (via registryapi.manager),
VirtualMCPServer, and EmbeddingServer.

Precedence rule: per-CR imagePullSecrets take priority on name collisions;
chart-level entries are appended additively and deduped by Name. The
CR-level slice is never mutated. EmbeddingServer places the chart
defaults on the base PodSpec and lets strategic-merge-patch additively
union the user's PodTemplateSpec entries (PodSpec.ImagePullSecrets is
declared with patchStrategy:"merge",patchMergeKey:"name").

Drift detection on every controller routes through the same merge helper
as the construction site so chart defaults do not flag perpetual
reconcile loops. The Helm template renders operator.env before
chart-managed env vars so a user-supplied entry cannot silently override
a reserved name like THV_DEFAULT_IMAGE_PULL_SECRETS — Kubernetes keeps
the last entry on a duplicate-named env. The startup parser logs a
diagnostic when the env var is set but parses to nothing (typos like
" , " or ",,,") so the misconfiguration is visible.

Part of #5102

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@JAORMX JAORMX force-pushed the feat/5102-default-image-pull-secrets-foundation branch from f05325c to b36ed41 Compare April 29, 2026 12:36
@github-actions github-actions Bot added size/XL Extra large PR: 1000+ lines changed and removed size/L Large PR: 600-999 lines changed labels Apr 29, 2026
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Large PR Detected

This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.

How to unblock this PR:

Add a section to your PR description with the following format:

## Large PR Justification

[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformation

Alternative:

Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.

See our Contributing Guidelines for more details.


This review will be automatically dismissed once you add the justification section.

@JAORMX
Copy link
Copy Markdown
Collaborator Author

JAORMX commented Apr 29, 2026

Force-pushed a substantially expanded revision (b36ed41). Two things worth flagging beyond the inline replies:

Scope expanded — this is no longer a "foundational" PR. All five workload-spawning reconcilers now consume imagepullsecrets.Defaults: MCPServer, MCPRemoteProxy, MCPRegistry (via registryapi.manager), VirtualMCPServer, and EmbeddingServer. The follow-up issues for the remaining controllers can be closed once this lands. Also added focused tests for each newly-wired controller (*_default_imagepullsecrets_test.go).

Drift-detection bug found and fixed in MCPServer. While extending the wiring I noticed deploymentNeedsUpdate compared the live Deployment.Spec.Template.Spec.ImagePullSecrets against mcpServer.Spec.ResourceOverrides.ProxyDeployment.ImagePullSecrets (CR-only), while the construction site at deploymentForMCPServer was setting the merged defaults+CR slice. With any chart default configured, the comparison was always unequal and deploymentNeedsUpdate returned true on every reconcile — perpetual loop. Both sites now route through imagePullSecretsForMCPServer, the same helper-method pattern the other reconcilers use. Regression test: TestDeploymentNeedsUpdate_DefaultImagePullSecrets in mcpserver_default_imagepullsecrets_test.go. The same risk exists in spirit for VirtualMCPServer's imagePullSecretsNeedsUpdate and MCPRemoteProxy's podSpecNeedsUpdate; both already route through their imagePullSecretsForX helpers, so they were correct from the start of this revision.

The PR description still calls this "foundational" — I'll fix that in a separate edit before merge.

@JAORMX JAORMX changed the title Add operator-level defaultImagePullSecrets plumbing Add operator-level defaultImagePullSecrets across all controllers Apr 29, 2026
@github-actions github-actions Bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 29, 2026
@github-actions github-actions Bot dismissed their stale review April 29, 2026 12:40

Large PR justification has been provided. Thank you!

@github-actions
Copy link
Copy Markdown
Contributor

✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review.

@github-actions github-actions Bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 29, 2026
Copy link
Copy Markdown
Contributor

@jhrozek jhrozek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two small things from a re-read of the chart wiring — both are pre-merge ergonomics, neither is a blocker.

Comment thread deploy/charts/operator/templates/deployment.yaml Outdated
Comment thread cmd/thv-operator/pkg/imagepullsecrets/defaults.go Outdated
Rename THV_DEFAULT_IMAGE_PULL_SECRETS to TOOLHIVE_DEFAULT_IMAGE_PULL_SECRETS
to match the operator container's TOOLHIVE_* convention before the env var
ships and a rename costs a deprecation cycle.

Accept both shapes in operator.defaultImagePullSecrets to mirror
operator.imagePullSecrets above:

  - plain strings:        ["regcred", "otherscred"]
  - object with `name`:   [{name: regcred}, {name: otherscred}]

Users who copy the [{name: foo}] pattern from operator.imagePullSecrets
no longer get a silent THV_DEFAULT_IMAGE_PULL_SECRETS=map[name:foo] env var
and ImagePullBackOff at runtime. Malformed entries (numbers, nested lists,
objects without a `name` field) now fail the helm template render with a
clear message instead of producing a bogus env var.
@github-actions github-actions Bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 29, 2026
Copy link
Copy Markdown
Contributor

@jhrozek jhrozek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified the b4d4248 follow-up locally — env var rename is complete (no THV_* references left, tests use the EnvVar const so they track automatically), and the Helm template now accepts both shapes correctly. Walked through all the cases: plain strings, [{name: foo}] objects, mixed, empty default (no env var rendered), and the bad-input paths (object without name, object with empty name, number, nested list) all fail the render with clear messages. helm lint passes, operator unit tests pass.

One nit not worth blocking on: an empty-string entry [""] in the strings path renders as value: ",regcred" (leading comma). The operator's parser filters empties after the split so it's functionally correct, just slightly ugly. A {{- if not $entry -}}{{- continue -}}{{- end -}} skip would symmetrize with the empty-name rejection if you want it.

@JAORMX JAORMX merged commit 1bbca89 into main Apr 29, 2026
52 of 54 checks passed
@JAORMX JAORMX deleted the feat/5102-default-image-pull-secrets-foundation branch April 29, 2026 14:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/XL Extra large PR: 1000+ lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Operator chart: cluster-wide default imagePullSecrets for all spawned workloads

2 participants