
operator: Fix logging, teardown, and Dockerfile issues #1444

Open
RafalKorepta wants to merge 8 commits into main from rk/fix-otelutil-logger

Conversation

@RafalKorepta
Contributor

Summary

  • Fix ContextFree wrapper stripped by WithName: contextFreeSink in otelutil/log does not override WithName(). When controller-runtime calls WithName() to create per-controller loggers, Go's method promotion returns the inner zapr sink without the ContextFree wrapper. Subsequent log.FromContext(ctx) calls append ("ctx", ctx) expecting ContextFree to filter them out — without the wrapper, zapr type-asserts context.Context as a string key and fires DPanic. Added a temporaryWrapper that re-applies ContextFree after every WithName call. To be reverted when common-go#164 ("Fix contextFreeSink to implement WithName and add integration tests") lands.

  • Use global logger in multicluster command: The multicluster entrypoint created its own zap logger with a dedicated --log-level flag, bypassing the global logger configured in PersistentPreRun. The raft logger lacked the ContextFree wrapper and log level had to be set separately. Replaced with log.FromContext so all loggers inherit global configuration.

  • Log canonical cluster names: The local cluster is represented internally as an empty string (mcmanager.LocalCluster). Log messages and object keys printed blank values, making it impossible to identify which cluster an operation targeted. Propagated canonical names through MulticlusterStatefulSet, MulticlusterPod, ClusterNamespacedName, and the clusterObject interface. Also reduced syncer log verbosity (Info→Debug/Trace) and switched to structured key-value pairs.

  • Pass context to CheckScale: CheckScale logged at Info level using context.Background(), losing request-scoped logger and spamming logs. Now accepts context.Context, logs at Debug level, and includes poolName/clusterName fields.

  • Fix StatefulSet deletion during teardown: The AllZero() early return at the top of reconcileDecommission exited before the ToDelete() loop, so zero-replica StatefulSets were never deleted. This blocked owning NodePool custom resources from being garbage collected. Moved the check after the deletion logic.

  • Accept *RaftConfiguration by pointer: Avoids copying the large struct on every NewRaftRuntimeManager call.

  • Use absolute path in Dockerfile ENTRYPOINT: The distroless base image has no $PATH, so the relative "redpanda-operator" command failed when the Helm chart's command override was cleared (e.g. for dlv debugging).

🤖 Generated with Claude Code

…zap instance

The multicluster entrypoint created its own zap logger with a dedicated
--log-level flag for the raft subsystem, bypassing the global logger
configured in PersistentPreRun. This meant:

- The raft logger lacked the ContextFree wrapper, so context.Context
  values leaked into log output (the bug fixed in the prior commit).
- Log level had to be set separately via --log-level, diverging from
  the operator-wide --log-level flag in the root command.

Replace ctrl.LoggerFrom (controller-runtime) with log.FromContext
(otelutil) so both the setup and raft loggers inherit the global
logger's configuration, level, and ContextFree filtering. Remove the
now-redundant parseLogLevel function and --log-level flag.
The local cluster is represented internally as an empty string
(mcmanager.LocalCluster). Log messages and object keys that included
the raw cluster name printed it as an empty string, making it unclear
which cluster a particular log line referred to.

Propagate canonical cluster names (resolved via GetLocalClusterName)
through MulticlusterStatefulSet, MulticlusterPod, ClusterNamespacedName,
and the clusterObject interface so that logs always show the real
cluster identity (e.g. "cluster-1") instead of "".

Also refactor CanonicalClusterName to accept func() string instead of
the full multicluster.Manager interface, and convert syncer log entries
from Info to Debug/Trace level with structured key-value pairs instead
of fmt.Sprintf messages.
CheckScale logged at Info level using context.Background(), which both
spammed logs during normal scaling operations and lost the request-
scoped logger (with its ContextFree filtering and trace correlation).

Accept a context.Context parameter, lower the log level to Debug, and
add poolName and clusterName fields so the log line identifies which
pool is still mid-scale.
…tatefulset deletion

When a StretchCluster and its node pools were tearing down, the StatefulSet was never deleted, which prevented the NodePool custom resource from being deleted.

When a StretchCluster was being torn down, the AllZero() early return
at the top of reconcileDecommission exited before reaching the
ToDelete() loop. StatefulSets with zero replicas were never deleted,
which blocked their owning NodePool custom resources from being
garbage collected.

Move the AllZero check after the scale-down and StatefulSet deletion
logic so that zero-replica sets are cleaned up before the function
returns.
RaftConfiguration is a large struct (rest.Config, logr.Logger, scheme,
peers list, etc.). Passing it by value copied all fields on every call.
Change the parameter to a pointer to avoid the unnecessary copy.
The distroless base image has no $PATH, so the relative command
"redpanda-operator" cannot be resolved. Normally the Helm chart
overrides the entrypoint via the pod spec command field, but when
the command is cleared (e.g. for dlv debugging) the container falls
back to the Dockerfile ENTRYPOINT and fails with "executable file
not found in $PATH".

Use the absolute path "/redpanda-operator" to match where the binary
is installed (line 11).
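A sketch of the fix (base image, stage, and COPY source are assumptions, not the actual Dockerfile):

```dockerfile
# Distroless images ship no shell and no $PATH, so ENTRYPOINT must use
# the binary's absolute path.
FROM gcr.io/distroless/static:nonroot
COPY redpanda-operator /redpanda-operator
# A relative "redpanda-operator" fails with "executable file not found
# in $PATH" once the Helm chart's command override is cleared.
ENTRYPOINT ["/redpanda-operator"]
```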
@RafalKorepta force-pushed the rk/fix-otelutil-logger branch from 651f732 to 1feb26e on April 13, 2026 at 16:28
@david-yu
Contributor

david-yu commented Apr 13, 2026

Integration test failures are Docker Hub rate limiting, not related to this PR

Build 12940 shows 20+ test failures, but they all stem from Docker Hub anonymous pull rate limit (429 Too Many Requests) — the same issue seen in #1441.

Root cause from the logs:

toomanyrequests: You have reached your unauthenticated pull rate limit.
https://www.docker.com/increase-rate-limit

Images affected:

  • docker.redpanda.com/redpandadata/redpanda:v26.1.1 — all pods in basic-{0..4} stuck in Back-off pulling image (count=12-16)
  • docker.io/redpandadata/redpanda:v24.3.5 — Back-off pulling image (count=17)

All failing tests are helm install timeouts from image pull failures:

  • TestIntegrationStatefulSetDecommissioner — context deadline exceeded
  • TestIntegrationPVCUnbinder
  • TestIntegrationFactoryOperatorV1 — ImagePullBackOff on v24.3.5
  • TestIntegrationClientFactory (no_TLS, TLS, TLS+SCRAM-512)
  • TestIntegrationClientFactoryTLSListeners
  • TestIntegrationChart (default, namespaced, basic_test, rbac, sidecar, console-integration, admin_api_auth_required, set-datadir-ownership, mtls variants)

Note: There is also one legit build failure — pkg/multicluster/raft_engage_test.go:127:50 needs to pass *RaftConfiguration (pointer) instead of value, matching the signature change in this PR.

Tracked in DEVPROD-4069 (ECR pull-through cache).

