Add operator-level defaultImagePullSecrets across all controllers#5105
Add operator-level defaultImagePullSecrets across all controllers#5105
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #5105 +/- ##
==========================================
- Coverage 67.12% 67.10% -0.03%
==========================================
Files 597 598 +1
Lines 60170 60222 +52
==========================================
+ Hits 40390 40411 +21
- Misses 16706 16736 +30
- Partials 3074 3075 +1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
jhrozek
left a comment
There was a problem hiding this comment.
Reviewed with a few specialist agents. Design shape looks right (matches the cert-manager / OpenTelemetry operator pattern of operator-flag/env-var defaults), so most comments are details. The two I'd want a response on before merge are the values.yaml scope claim and the operator.env ordering question.
Cluster operators frequently need a registry pull secret applied to every workload the operator spawns (proxy runners, registry API, vMCP servers, embedding servers). Today the chart only exposes imagePullSecrets for the operator's own pod, forcing users to set the secret on every CR or to mutate the namespace-default ServiceAccount. This change introduces a chart value, operator.defaultImagePullSecrets, that the operator picks up at startup via THV_DEFAULT_IMAGE_PULL_SECRETS and applies as a default to every workload it spawns. All five workload-spawning reconcilers consume the shared imagepullsecrets.Defaults value and merge it with the per-CR list at workload-construction time: MCPServer, MCPRemoteProxy, MCPRegistry (via registryapi.manager), VirtualMCPServer, and EmbeddingServer. Precedence rule: per-CR imagePullSecrets take priority on name collisions; chart-level entries are appended additively and deduped by Name. The CR-level slice is never mutated. EmbeddingServer places the chart defaults on the base PodSpec and lets strategic-merge-patch additively union the user's PodTemplateSpec entries (PodSpec.ImagePullSecrets is declared with patchStrategy:"merge",patchMergeKey:"name"). Drift detection on every controller routes through the same merge helper as the construction site so chart defaults do not flag perpetual reconcile loops. The Helm template renders operator.env before chart-managed env vars so a user-supplied entry cannot silently override a reserved name like THV_DEFAULT_IMAGE_PULL_SECRETS — Kubernetes keeps the last entry on a duplicate-named env. The startup parser logs a diagnostic when the env var is set but parses to nothing (typos like " , " or ",,,") so the misconfiguration is visible. Part of #5102 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
f05325c to
b36ed41
Compare
There was a problem hiding this comment.
Large PR Detected
This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.
How to unblock this PR:
Add a section to your PR description with the following format:
## Large PR Justification
[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformationAlternative:
Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.
See our Contributing Guidelines for more details.
This review will be automatically dismissed once you add the justification section.
|
Force-pushed a substantially expanded revision (b36ed41). Two things worth flagging beyond the inline replies: Scope expanded — this is no longer a "foundational" PR. All five workload-spawning reconcilers now consume Drift-detection bug found and fixed in MCPServer. While extending the wiring I noticed The PR description still calls this "foundational" — I'll fix that in a separate edit before merge. |
Large PR justification has been provided. Thank you!
|
✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review. |
jhrozek
left a comment
There was a problem hiding this comment.
Two small things from a re-read of the chart wiring — both are pre-merge ergonomics, neither is a blocker.
Rename THV_DEFAULT_IMAGE_PULL_SECRETS to TOOLHIVE_DEFAULT_IMAGE_PULL_SECRETS
to match the operator container's TOOLHIVE_* convention before the env var
ships and a rename costs a deprecation cycle.
Accept both shapes in operator.defaultImagePullSecrets to mirror
operator.imagePullSecrets above:
- plain strings: ["regcred", "otherscred"]
- object with `name`: [{name: regcred}, {name: otherscred}]
Users who copy the [{name: foo}] pattern from operator.imagePullSecrets
no longer get a silent THV_DEFAULT_IMAGE_PULL_SECRETS=map[name:foo] env var
and ImagePullBackOff at runtime. Malformed entries (numbers, nested lists,
objects without a `name` field) now fail the helm template render with a
clear message instead of producing a bogus env var.
jhrozek
left a comment
There was a problem hiding this comment.
Verified the b4d4248 follow-up locally — env var rename is complete (no THV_* references left, tests use the EnvVar const so they track automatically), and the Helm template now accepts both shapes correctly. Walked through all the cases: plain strings, [{name: foo}] objects, mixed, empty default (no env var rendered), and the bad-input paths (object without name, object with empty name, number, nested list) all fail the render with clear messages. helm lint passes, operator unit tests pass.
One nit not worth blocking on: an empty-string entry [""] in the strings path renders as value: ",regcred" (leading comma). The operator's parser filters empties after the split so it's functionally correct, just slightly ugly. A {{- if not $entry -}}{{- continue -}}{{- end -}} skip would symmetrize with the empty-name rejection if you want it.
Summary
Why: Cluster operators frequently need a registry pull secret (for example, a private mirror of
ghcr.io/stacklok/toolhive/proxyrunner) applied to every workload the operator spawns. Today the chart only exposesoperator.imagePullSecretsfor the operator's own pod (deploy/charts/operator/templates/deployment.yaml), forcing users to set the secret on every MCPServer/MCPRemoteProxy/VirtualMCPServer/MCPRegistry/EmbeddingServer CR or to mutate the namespace-default ServiceAccount. This is brittle and easy to miss.What: Adds a chart value
operator.defaultImagePullSecrets, a small sharedimagepullsecretspackage the operator's reconcilers consume, and wires the value through every workload-spawning controller in one shot:registryapi.manager)corev1.PodSpec.ImagePullSecretscarriespatchStrategy:"merge",patchMergeKey:"name")The earlier "foundational PR + follow-ups" plan is superseded — there is no follow-up work needed for the other controllers.
Precedence rule
When both a chart-level default and a per-CR
imagePullSecretsare set:Namedoes not already appear in the CR-level list.Name; the CR-level entry wins on name collision.This rule is documented on the
Defaults.MergeGo doc and in the chartvalues.yamlcomment.Drift-detection bug fixed in MCPServer
While extending scope, the existing
deploymentNeedsUpdatein MCPServer was found to compare the liveSpec.Template.Spec.ImagePullSecretsagainst the CR-only slice while the construction site applied the mergeddefaults+CRslice. With any chart default configured this would have looped forever. Both sites now route throughimagePullSecretsForMCPServer. Regression test:TestDeploymentNeedsUpdate_DefaultImagePullSecrets.Closes #5102
Type of change
Test plan
task test(operator unit and controller tests pass; operator integration tests fail only becauseenvtestrequires/usr/local/kubebuilder/bin/etcdwhich is not installed in this environment).task lint-fix(zero issues).helm template test deploy/charts/operator/ --set 'operator.defaultImagePullSecrets={regcred,otherscred}'rendersTOOLHIVE_DEFAULT_IMAGE_PULL_SECRETS=regcred,otherscredon the operator container, withoperator.envrendered before chart-managed env vars so chart-managed values win on duplicate names.helm template test deploy/charts/operator/(default values) does not render the env var at all.helm lint deploy/charts/operator/passes.task helm-docsregeneratesdeploy/charts/operator/README.md.Mergeprecedence rule, and the wiring on each newly-integrated controller (*_default_imagepullsecrets_test.go).TestDeploymentNeedsUpdate_DefaultImagePullSecretsconfirms the MCPServer drift fix.Changes
cmd/thv-operator/pkg/imagepullsecrets/defaults.go(new)Defaultsvalue type,LoadDefaultsFromEnv,NewDefaults,Merge,List.cmd/thv-operator/pkg/imagepullsecrets/defaults_test.go(new)cmd/thv-operator/main.goDefaultsthroughsetupServerControllers,setupRegistryController,setupAggregationControllers. Log on typo'd env var.cmd/thv-operator/controllers/mcpserver_controller.goImagePullSecretsDefaultsfield; merge defaults at construction and drift-detection sites viaimagePullSecretsForMCPServer.cmd/thv-operator/controllers/mcpremoteproxy_controller.go+mcpremoteproxy_deployment.goimagePullSecretsForRemoteProxy.cmd/thv-operator/controllers/virtualmcpserver_controller.go+virtualmcpserver_deployment.goimagePullSecretsForVMCP; hash annotation now over the merged list.cmd/thv-operator/controllers/embeddingserver_controller.gocmd/thv-operator/pkg/registryapi/manager.go+rbac.go+deployment.goNewManagertakesDefaults;managermerges withmcpRegistry.Spec.ImagePullSecretsfor both SA and Deployment construction.cmd/thv-operator/controllers/{mcpserver,mcpremoteproxy,virtualmcpserver,embeddingserver}_default_imagepullsecrets_test.godeploy/charts/operator/values.yamldefaultImagePullSecretsvalue with documentation.deploy/charts/operator/templates/deployment.yamloperator.envbefore chart-managed env vars (chart-managed wins on collision); renderTOOLHIVE_DEFAULT_IMAGE_PULL_SECRETSenv var on the operator container.deploy/charts/operator/README.mdhelm-docs.Does this introduce a user-facing change?
Yes. Cluster operators can now set
operator.defaultImagePullSecretsin the chart values to apply pull secrets to every workload the operator spawns, instead of editing each CR. Existing per-CRimagePullSecretscontinue to work unchanged and take precedence on name collision.Special notes for reviewers
Defaults. Pure-config reconcilers (MCPGroup, ToolConfig, MCPExternalAuthConfig, MCPOIDCConfig, MCPTelemetryConfig, MCPServerEntry) do not need it.Defaultsis a value type rather than a pointer to keep the zero value usable: a reconciler that omits the field still gets correct merge behavior (CR-level value passes through unchanged).Generated with Claude Code
Large PR Justification
The diff exceeds the 1000-line threshold because the only useful end state for this feature is all five workload-spawning controllers consuming the chart-level defaults. A staged rollout was the original plan — the v1 of this PR landed only
MCPServeras a "foundational" change — but reviewer feedback (and a perpetual-reconcile-loop bug found in MCPServer's drift detection while extending scope) made it clear the partial state is worse than the complete one:values.yamlclaim is only true if every controller is wired. Splitting meant shippingdefaultImagePullSecrets: [...]documented as "applied to every workload" while four out of five controllers silently ignored it. Reviewers correctly flagged this on the v1 of the PR.ImagePullSecretsDefaultsfield on the reconciler + helper method that callsDefaults.Merge+ setup-path threading). Splitting these into four follow-up PRs would have produced four near-identical reviews.imagepullsecretspackage and themain.gothreading must land together with at least one consumer; otherwise the package is unused andsetupControllersAndWebhookshas a parameter nothing reads.What's NOT in the diff that could have been: the integration tests for embedding-server, mcp-external-auth, mcp-registry, and virtualmcp packages all fail locally because envtest binaries aren't installed; those are pre-existing test-infrastructure issues unrelated to this change. The wiring tests for the four newly-integrated controllers are all unit tests that exercise the exact code path this PR touches, kept deliberately minimal (two cases per helper) with cross-references to
defaults_test.go::TestDefaultsMergefor the exhaustive merge-precedence cases.