operator: Fix logging, teardown, and Dockerfile issues#1444
Open
RafalKorepta wants to merge 8 commits intomainfrom
Open
operator: Fix logging, teardown, and Dockerfile issues#1444RafalKorepta wants to merge 8 commits intomainfrom
RafalKorepta wants to merge 8 commits intomainfrom
Conversation
6a1947c to
77f370b
Compare
77f370b to
651f732
Compare
…zap instance The multicluster entrypoint created its own zap logger with a dedicated --log-level flag for the raft subsystem, bypassing the global logger configured in PersistentPreRun. This meant: - The raft logger lacked the ContextFree wrapper, so context.Context values leaked into log output (the bug fixed in the prior commit). - Log level had to be set separately via --log-level, diverging from the operator-wide --log-level flag in the root command. Replace ctrl.LoggerFrom (controller-runtime) with log.FromContext (otelutil) so both the setup and raft loggers inherit the global logger's configuration, level, and ContextFree filtering. Remove the now-redundant parseLogLevel function and --log-level flag.
The local cluster is represented internally as an empty string (mcmanager.LocalCluster). Log messages and object keys that included the raw cluster name printed it as an empty string, making it unclear which cluster a particular log line referred to. Propagate canonical cluster names (resolved via GetLocalClusterName) through MulticlusterStatefulSet, MulticlusterPod, ClusterNamespacedName, and the clusterObject interface so that logs always show the real cluster identity (e.g. "cluster-1") instead of "". Also refactor CanonicalClusterName to accept func() string instead of the full multicluster.Manager interface, and convert syncer log entries from Info to Debug/Trace level with structured key-value pairs instead of fmt.Sprintf messages.
CheckScale logged at Info level using context.Background(), which both spammed logs during normal scaling operations and lost the request- scoped logger (with its ContextFree filtering and trace correlation). Accept a context.Context parameter, lower the log level to Debug, and add poolName and clusterName fields so the log line identifies which pool is still mid-scale.
…tatefulset deletion When Stretchcluster and node pools are tearing down, the statefulset was never deleted. That was preventing node pool custom resource from being deleted.
When a StretchCluster was being torn down, the AllZero() early return at the top of reconcileDecommission exited before reaching the ToDelete() loop. StatefulSets with zero replicas were never deleted, which blocked their owning NodePool custom resources from being garbage collected. Move the AllZero check after the scale-down and StatefulSet deletion logic so that zero-replica sets are cleaned up before the function returns.
RaftConfiguration is a large struct (rest.Config, logr.Logger, scheme, peers list, etc.). Passing it by value copied all fields on every call. Change the parameter to a pointer to avoid the unnecessary copy.
The distroless base image has no $PATH, so the relative command "redpanda-operator" cannot be resolved. Normally the Helm chart overrides the entrypoint via the pod spec command field, but when the command is cleared (e.g. for dlv debugging) the container falls back to the Dockerfile ENTRYPOINT and fails with "executable file not found in $PATH". Use the absolute path "/redpanda-operator" to match where the binary is installed (line 11).
651f732 to
1feb26e
Compare
andrewstucki
approved these changes
Apr 13, 2026
Contributor
Integration test failures are Docker Hub rate limiting, not related to this PRBuild 12940 shows 20+ test failures, but they all stem from Docker Hub anonymous pull rate limit (429 Too Many Requests) — the same issue seen in #1441. Root cause from the logs: Images affected:
All failing tests are helm install timeouts from image pull failures:
Note: There is also one legit build failure — Tracked in DEVPROD-4069 (ECR pull-through cache). |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fix ContextFree wrapper stripped by WithName:
contextFreeSinkinotelutil/logdoes not overrideWithName(). When controller-runtime callsWithName()to create per-controller loggers, Go's method promotion returns the inner zapr sink without theContextFreewrapper. Subsequentlog.FromContext(ctx)calls append("ctx", ctx)expectingContextFreeto filter them out — without the wrapper, zapr type-assertscontext.Contextas a string key and fires DPanic. Added atemporaryWrapperthat re-appliesContextFreeafter everyWithNamecall. To be reverted when Fix contextFreeSink to implement WithName and add integration tests common-go#164 lands.Use global logger in multicluster command: The multicluster entrypoint created its own zap logger with a dedicated
--log-levelflag, bypassing the global logger configured inPersistentPreRun. The raft logger lacked theContextFreewrapper and log level had to be set separately. Replaced withlog.FromContextso all loggers inherit global configuration.Log canonical cluster names: The local cluster is represented internally as an empty string (
mcmanager.LocalCluster). Log messages and object keys printed blank values, making it impossible to identify which cluster an operation targeted. Propagated canonical names throughMulticlusterStatefulSet,MulticlusterPod,ClusterNamespacedName, and theclusterObjectinterface. Also reduced syncer log verbosity (Info→Debug/Trace) and switched to structured key-value pairs.Pass context to CheckScale:
CheckScalelogged at Info level usingcontext.Background(), losing request-scoped logger and spamming logs. Now acceptscontext.Context, logs at Debug level, and includespoolName/clusterNamefields.Fix StatefulSet deletion during teardown: The
AllZero()early return at the top ofreconcileDecommissionexited before theToDelete()loop, so zero-replica StatefulSets were never deleted. This blocked owning NodePool custom resources from being garbage collected. Moved the check after the deletion logic.Accept *RaftConfiguration by pointer: Avoids copying the large struct on every
NewRaftRuntimeManagercall.Use absolute path in Dockerfile ENTRYPOINT: The distroless base image has no
$PATH, so the relative"redpanda-operator"command failed when the Helm chart's command override was cleared (e.g. for dlv debugging).🤖 Generated with Claude Code