Skip to content

chore: requeue with specific interval for when self monitor is not ready#3475

Open
k15r wants to merge 52 commits into
kyma-project:mainfrom
k15r:check-k8s-1.35
Open

chore: requeue with specific interval for when self monitor is not ready#3475
k15r wants to merge 52 commits into
kyma-project:mainfrom
k15r:check-k8s-1.35

Conversation

@k15r
Copy link
Copy Markdown
Contributor

@k15r k15r commented May 5, 2026

What Changed

All four pipeline reconcilers (FluentBit log, OTel log, metric, and trace) now gate flow health probing on self-monitor Deployment readiness. Before probing flow health, each reconciler checks whether the self-monitor is ready; if not, the reconciler sets FlowHealthy to Unknown and either skips requeueing (when self-monitor is not deployed) or requeues after 5 seconds (when the Deployment exists but is not yet ready). This PR also fixes a potential negative TLS certificate requeue duration and corrects early-return logic in the Telemetry reconciler.

Affected Signal Types

Logs, Metrics, Traces

Key Changes
  • internal/reconciler/commonstatus: Adds the shared CheckSelfMonitorReadiness function and the RequeueDelayOnFlowHealthProbingFailure constant (5 seconds) used by all pipeline reconcilers to gate flow health probing on self-monitor availability.
  • controllers/telemetry: Wires a DeploymentProber as selfMonitorProber into all four pipeline reconcilers via the new WithSelfMonitorProber option.
  • internal/reconciler/logpipeline/fluentbit, internal/reconciler/logpipeline/otel, internal/reconciler/metricpipeline, internal/reconciler/tracepipeline (status): Adds a self-monitor readiness check in evaluateFlowHealthCondition; if the self-monitor Deployment is not found the reconciler sets FlowHealthy to Unknown without requeueing, and if the Deployment exists but is not ready the reconciler sets flowHealthProbingFailed = true to trigger a timed requeue.
  • internal/reconciler/logpipeline/fluentbit, internal/reconciler/logpipeline/otel, internal/reconciler/metricpipeline, internal/reconciler/tracepipeline (status): Changes setFlowHealthCondition errors from being joined into allErrors (which blocked Status().Update) to being logged directly, so a flow health probing failure no longer prevents the pipeline status from being written.
  • internal/reconciler/logpipeline/fluentbit, internal/reconciler/logpipeline/otel, internal/reconciler/metricpipeline, internal/reconciler/tracepipeline (reconciler): Returns ctrl.Result{RequeueAfter: 5s} when flowHealthProbingFailed is true, and clamps the TLS certificate requeue-after duration to a minimum of one second to prevent zero or negative durations for already-expired certificates.
  • internal/reconciler/telemetry: Fixes the Telemetry reconciler to return early on error before evaluating the requeue condition, and replaces Requeue: bool with RequeueAfter: 30s when the state is Warning.
  • Makefile: Adds a docker-build-local target that builds the manager image and imports it into the local k3d cluster.
  • test/e2e: Updates expected OAuth2 validation error message strings to match the format produced by Kubernetes 1.35.

Notes for Reviewers

The self-monitor readiness gating distinguishes two cases: ErrDeploymentNotFound (self-monitor not deployed — no error, no requeue, FlowHealthy stays Unknown) and any other error such as ErrDeploymentFetching, PodIsPendingError, or RolloutInProgressError (self-monitor exists but is not ready — error logged, reconciler requeues after 5 seconds). Verify that the errors.Is unwrapping correctly identifies ErrDeploymentNotFound through the fmt.Errorf("...: %w", err) wrapping in CheckSelfMonitorReadiness.

The flow health condition error-handling change is a behavioral shift: previously, a setFlowHealthCondition error joined allErrors and caused Status().Update to be skipped entirely. Now the error is only logged and the status update always proceeds. Confirm this is the intended behavior and that silencing the flow health error does not mask anything more serious.

The max(time.Until(...), time.Second) guard is defensive: time.Until returns a negative or zero duration if the certificate expiry is already in the past, which would cause the reconciler to requeue with an invalid duration.

Release Notes Input

None.

  • 🔄 Regenerate and Update Summary
PR Bot Information

Version: 1.21.0

  • LLM: anthropic--claude-4.6-sonnet
  • Summary Prompt: PR Prompt File
  • File Content Strategy: Full file content
  • Correlation ID: 40b73490-fab6-479d-8898-442852ca844f
  • Event Trigger: pull_request.edited
  • Output Template: PR Template File

@k15r k15r requested a review from a team as a code owner May 5, 2026 10:52
@github-actions github-actions Bot added this to the 1.64.0 milestone May 5, 2026
@hyperspace-insights hyperspace-insights Bot deleted a comment from k15r May 12, 2026
k15r added 3 commits May 13, 2026 10:23
k8s PR #132798 changed CEL validation errors to omit the generic type
label ("object") for complex types, so the expected error message no
longer includes `"object":`.
@k15r k15r added the area/dependency Dependency changes label May 15, 2026
k15r added 7 commits May 15, 2026 14:46
controller-runtime silently drops RequeueAfter when the duration is <= 0.
If the TLS cert expires between updateStatus and calculateRequeueAfterDuration,
time.Until returns a non-positive value and no requeue is scheduled, leaving
the pipeline stuck in TLSCertificateAboutToExpire indefinitely.

Clamp the requeue duration to a minimum of 1s to guarantee reconciliation
always triggers after an about-to-expire cert crosses the expiry boundary.
… reconciler

controller-runtime ignores RequeueAfter/Requeue when error is non-nil and
logs a warning. Return the error alone; the controller framework will requeue
on error automatically.
The self-monitor may not be reachable during early startup (e.g. while
its pod is still being scheduled on k8s 1.35, which is slower than 1.34).
When setFlowHealthCondition fails, the condition is already set to Unknown
-- propagating the error further blocks calculateRequeueAfterDuration from
running, so the cert-expiry requeue is never scheduled and the pipeline
stays stuck in TLSCertificateAboutToExpire.

Log the probe error instead of returning it, so the requeue path is
always reached when the cert is about to expire.
@hisarbalik hisarbalik added area/tests Writing/adding/Refactoring tests or checks and removed area/dependency Dependency changes labels May 27, 2026
@hisarbalik hisarbalik changed the title test: update kubernetes to 1.35 test: fix self-monitor e2e test flakiness May 27, 2026
),
).
Build(),
errorMsg: "Invalid value: \"object\": 'tokenURL' must be a valid URL",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there test changes must be transferred to the other PR


if err := r.setFlowHealthCondition(ctx, &pipeline); err != nil {
allErrors = errors.Join(allErrors, err)
logf.FromContext(ctx).Error(err, "Failed to set flow health condition")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so if we're not going to return an error anymore here, then we don't need allErrors.

but why are we not returning this error anymore? what happens when there is an issue in self monitor and deployment is never ready? do we just requeue reconciliation for all components every 5 seconds

@jeffreylimnardy jeffreylimnardy changed the title test: fix self-monitor e2e test flakiness chore: fix self-monitor e2e test flakiness May 27, 2026
@github-actions github-actions Bot added kind/chore Categorizes issue or PR as related to a chore. and removed kind/test labels May 27, 2026
@jeffreylimnardy jeffreylimnardy changed the title chore: fix self-monitor e2e test flakiness chore: requeue with specific interval for when self monitor is not ready May 27, 2026
// CheckSelfMonitorReadiness checks if the self-monitor deployment is ready before attempting to probe flow health.
// It returns an error if the self-monitor is not ready, including when it's not deployed.
// The caller should check if the error is workloadstatus.ErrDeploymentNotFound to decide whether to requeue.
func CheckSelfMonitorReadiness(ctx context.Context, prober Prober, targetNamespace string, failureReason string) error {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why pass in failureReason when it's never used?

@@ -35,6 +36,7 @@ type Reconciler struct {
agentApplierDeleter AgentApplierDeleter
agentProber AgentProber
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rename this to a generic Prober? since now both self monitor (deployment) is using the same prober as fluent bit (daemonset agent)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/tests Writing/adding/Refactoring tests or checks kind/chore Categorizes issue or PR as related to a chore.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants