
operator: Fix logging, teardown, and Dockerfile issues #1444

Open
RafalKorepta wants to merge 8 commits into main from rk/fix-otelutil-logger

Conversation

@RafalKorepta
Contributor

Summary

  • Fix ContextFree wrapper stripped by WithName: contextFreeSink in otelutil/log does not override WithName(). When controller-runtime calls WithName() to create per-controller loggers, Go's method promotion returns the inner zapr sink without the ContextFree wrapper. Subsequent log.FromContext(ctx) calls append ("ctx", ctx) expecting ContextFree to filter them out — without the wrapper, zapr type-asserts context.Context as a string key and fires DPanic. Added a temporaryWrapper that re-applies ContextFree after every WithName call. To be reverted when common-go#164 ("Fix contextFreeSink to implement WithName and add integration tests") lands.

  • Use global logger in multicluster command: The multicluster entrypoint created its own zap logger with a dedicated --log-level flag, bypassing the global logger configured in PersistentPreRun. The raft logger lacked the ContextFree wrapper and log level had to be set separately. Replaced with log.FromContext so all loggers inherit global configuration.

  • Log canonical cluster names: The local cluster is represented internally as an empty string (mcmanager.LocalCluster). Log messages and object keys printed blank values, making it impossible to identify which cluster an operation targeted. Propagated canonical names through MulticlusterStatefulSet, MulticlusterPod, ClusterNamespacedName, and the clusterObject interface. Also reduced syncer log verbosity (Info→Debug/Trace) and switched to structured key-value pairs.

  • Pass context to CheckScale: CheckScale logged at Info level using context.Background(), losing request-scoped logger and spamming logs. Now accepts context.Context, logs at Debug level, and includes poolName/clusterName fields.

  • Fix StatefulSet deletion during teardown: The AllZero() early return at the top of reconcileDecommission exited before the ToDelete() loop, so zero-replica StatefulSets were never deleted. This blocked owning NodePool custom resources from being garbage collected. Moved the check after the deletion logic.

  • Accept *RaftConfiguration by pointer: Avoids copying the large struct on every NewRaftRuntimeManager call.

  • Use absolute path in Dockerfile ENTRYPOINT: The distroless base image has no $PATH, so the relative "redpanda-operator" command failed when the Helm chart's command override was cleared (e.g. for dlv debugging).

🤖 Generated with Claude Code

…zap instance

The multicluster entrypoint created its own zap logger with a dedicated
--log-level flag for the raft subsystem, bypassing the global logger
configured in PersistentPreRun. This meant:

- The raft logger lacked the ContextFree wrapper, so context.Context
  values leaked into log output (the bug fixed in the prior commit).
- Log level had to be set separately via --log-level, diverging from
  the operator-wide --log-level flag in the root command.

Replace ctrl.LoggerFrom (controller-runtime) with log.FromContext
(otelutil) so both the setup and raft loggers inherit the global
logger's configuration, level, and ContextFree filtering. Remove the
now-redundant parseLogLevel function and --log-level flag.
The local cluster is represented internally as an empty string
(mcmanager.LocalCluster). Log messages and object keys that included
the raw cluster name printed it as an empty string, making it unclear
which cluster a particular log line referred to.

Propagate canonical cluster names (resolved via GetLocalClusterName)
through MulticlusterStatefulSet, MulticlusterPod, ClusterNamespacedName,
and the clusterObject interface so that logs always show the real
cluster identity (e.g. "cluster-1") instead of "".

Also refactor CanonicalClusterName to accept func() string instead of
the full multicluster.Manager interface, and convert syncer log entries
from Info to Debug/Trace level with structured key-value pairs instead
of fmt.Sprintf messages.
CheckScale logged at Info level using context.Background(), which both
spammed logs during normal scaling operations and lost the request-
scoped logger (with its ContextFree filtering and trace correlation).

Accept a context.Context parameter, lower the log level to Debug, and
add poolName and clusterName fields so the log line identifies which
pool is still mid-scale.
…tatefulset deletion

When a StretchCluster and its node pools were tearing down, the StatefulSet was never deleted, which prevented the NodePool custom resource from being deleted.

When a StretchCluster was being torn down, the AllZero() early return
at the top of reconcileDecommission exited before reaching the
ToDelete() loop. StatefulSets with zero replicas were never deleted,
which blocked their owning NodePool custom resources from being
garbage collected.

Move the AllZero check after the scale-down and StatefulSet deletion
logic so that zero-replica sets are cleaned up before the function
returns.
RaftConfiguration is a large struct (rest.Config, logr.Logger, scheme,
peers list, etc.). Passing it by value copied all fields on every call.
Change the parameter to a pointer to avoid the unnecessary copy.
The distroless base image has no $PATH, so the relative command
"redpanda-operator" cannot be resolved. Normally the Helm chart
overrides the entrypoint via the pod spec command field, but when
the command is cleared (e.g. for dlv debugging) the container falls
back to the Dockerfile ENTRYPOINT and fails with "executable file
not found in $PATH".

Use the absolute path "/redpanda-operator" to match where the binary
is installed (line 11).
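A sketch of the fix (base image, stage, and COPY source are assumptions, not the actual Dockerfile):

```dockerfile
# Distroless images ship no shell and no $PATH, so ENTRYPOINT must use
# the binary's absolute path.
FROM gcr.io/distroless/static:nonroot
COPY redpanda-operator /redpanda-operator
# A relative "redpanda-operator" fails with "executable file not found
# in $PATH" once the Helm chart's command override is cleared.
ENTRYPOINT ["/redpanda-operator"]
```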
@RafalKorepta force-pushed the rk/fix-otelutil-logger branch from 651f732 to 1feb26e on April 13, 2026 at 16:28
@david-yu
Contributor

david-yu commented Apr 13, 2026

Integration test failures are Docker Hub rate limiting, not related to this PR

Build 12940 shows 20+ test failures, but they all stem from Docker Hub anonymous pull rate limit (429 Too Many Requests) — the same issue seen in #1441.

Root cause from the logs:

toomanyrequests: You have reached your unauthenticated pull rate limit.
https://www.docker.com/increase-rate-limit

Images affected:

  • docker.redpanda.com/redpandadata/redpanda:v26.1.1 — all pods in basic-{0..4} stuck in Back-off pulling image (count=12-16)
  • docker.io/redpandadata/redpanda:v24.3.5 — Back-off pulling image (count=17)

All failing tests are helm install timeouts from image pull failures:

  • TestIntegrationStatefulSetDecommissioner — context deadline exceeded
  • TestIntegrationPVCUnbinder
  • TestIntegrationFactoryOperatorV1 — ImagePullBackOff on v24.3.5
  • TestIntegrationClientFactory (no_TLS, TLS, TLS+SCRAM-512)
  • TestIntegrationClientFactoryTLSListeners
  • TestIntegrationChart (default, namespaced, basic_test, rbac, sidecar, console-integration, admin_api_auth_required, set-datadir-ownership, mtls variants)

Note: There is also one legit build failure — pkg/multicluster/raft_engage_test.go:127:50 needs to pass *RaftConfiguration (pointer) instead of value, matching the signature change in this PR.

Tracked in DEVPROD-4069 (ECR pull-through cache).

