chore: requeue with specific interval for when self monitor is not ready#3475
Open
k15r wants to merge 52 commits into
Open
chore: requeue with specific interval for when self monitor is not ready#3475k15r wants to merge 52 commits into
k15r wants to merge 52 commits into
Conversation
k8s PR #132798 changed CEL validation errors to omit the generic type
label ("object") for complex types, so the expected error message no
longer includes `"object":`.
controller-runtime silently drops RequeueAfter when the duration is <= 0. If the TLS cert expires between updateStatus and calculateRequeueAfterDuration, time.Until returns a non-positive value and no requeue is scheduled, leaving the pipeline stuck in TLSCertificateAboutToExpire indefinitely. Clamp the requeue duration to a minimum of 1s to guarantee reconciliation always triggers after an about-to-expire cert crosses the expiry boundary.
… reconciler controller-runtime ignores RequeueAfter/Requeue when error is non-nil and logs a warning. Return the error alone; the controller framework will requeue on error automatically.
The self-monitor may not be reachable during early startup (e.g. while its pod is still being scheduled on k8s 1.35, which is slower than 1.34). When setFlowHealthCondition fails, the condition is already set to Unknown -- propagating the error further blocks calculateRequeueAfterDuration from running, so the cert-expiry requeue is never scheduled and the pipeline stays stuck in TLSCertificateAboutToExpire. Log the probe error instead of returning it, so the requeue path is always reached when the cert is about to expire.
…to check-k8s-1.35
This reverts commit 3e9dcfe.
| ), | ||
| ). | ||
| Build(), | ||
| errorMsg: "Invalid value: \"object\": 'tokenURL' must be a valid URL", |
Contributor
There was a problem hiding this comment.
there test changes must be transferred to the other PR
|
|
||
| if err := r.setFlowHealthCondition(ctx, &pipeline); err != nil { | ||
| allErrors = errors.Join(allErrors, err) | ||
| logf.FromContext(ctx).Error(err, "Failed to set flow health condition") |
Contributor
There was a problem hiding this comment.
so if we're not going to return an error anymore here, then we don't need allErrors.
but why are we not returning this error anymore? what happens when there is an issue in self monitor and deployment is never ready? do we just requeue reconciliation for all components every 5 seconds
| // CheckSelfMonitorReadiness checks if the self-monitor deployment is ready before attempting to probe flow health. | ||
| // It returns an error if the self-monitor is not ready, including when it's not deployed. | ||
| // The caller should check if the error is workloadstatus.ErrDeploymentNotFound to decide whether to requeue. | ||
| func CheckSelfMonitorReadiness(ctx context.Context, prober Prober, targetNamespace string, failureReason string) error { |
Contributor
There was a problem hiding this comment.
why pass in failureReason when it's never used?
| @@ -35,6 +36,7 @@ type Reconciler struct { | |||
| agentApplierDeleter AgentApplierDeleter | |||
| agentProber AgentProber | |||
Contributor
There was a problem hiding this comment.
Can we rename this to a generic Prober? since now both self monitor (deployment) is using the same prober as fluent bit (daemonset agent)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What Changed
All four pipeline reconcilers (FluentBit log, OTel log, metric, and trace) now gate flow health probing on self-monitor Deployment readiness. Before probing flow health, each reconciler checks whether the self-monitor is ready; if not, the reconciler sets
FlowHealthytoUnknownand either skips requeueing (when self-monitor is not deployed) or requeues after 5 seconds (when the Deployment exists but is not yet ready). This PR also fixes a potential negative TLS certificate requeue duration and corrects early-return logic in the Telemetry reconciler.Affected Signal Types
Logs, Metrics, Traces
Key Changes
internal/reconciler/commonstatus: Adds the sharedCheckSelfMonitorReadinessfunction and theRequeueDelayOnFlowHealthProbingFailureconstant (5 seconds) used by all pipeline reconcilers to gate flow health probing on self-monitor availability.controllers/telemetry: Wires aDeploymentProberasselfMonitorProberinto all four pipeline reconcilers via the newWithSelfMonitorProberoption.internal/reconciler/logpipeline/fluentbit,internal/reconciler/logpipeline/otel,internal/reconciler/metricpipeline,internal/reconciler/tracepipeline(status): Adds a self-monitor readiness check inevaluateFlowHealthCondition; if the self-monitor Deployment is not found the reconciler setsFlowHealthytoUnknownwithout requeueing, and if the Deployment exists but is not ready the reconciler setsflowHealthProbingFailed = trueto trigger a timed requeue.internal/reconciler/logpipeline/fluentbit,internal/reconciler/logpipeline/otel,internal/reconciler/metricpipeline,internal/reconciler/tracepipeline(status): ChangessetFlowHealthConditionerrors from being joined intoallErrors(which blockedStatus().Update) to being logged directly, so a flow health probing failure no longer prevents the pipeline status from being written.internal/reconciler/logpipeline/fluentbit,internal/reconciler/logpipeline/otel,internal/reconciler/metricpipeline,internal/reconciler/tracepipeline(reconciler): Returnsctrl.Result{RequeueAfter: 5s}whenflowHealthProbingFailedis true, and clamps the TLS certificate requeue-after duration to a minimum of one second to prevent zero or negative durations for already-expired certificates.internal/reconciler/telemetry: Fixes the Telemetry reconciler to return early on error before evaluating the requeue condition, and replacesRequeue: boolwithRequeueAfter: 30swhen the state isWarning.Makefile: Adds adocker-build-localtarget that builds the manager image and imports it into the local k3d cluster.test/e2e: Updates expected OAuth2 validation error message strings to match the format produced by Kubernetes 1.35.Notes for Reviewers
The self-monitor readiness gating distinguishes two cases:
ErrDeploymentNotFound(self-monitor not deployed — no error, no requeue,FlowHealthystaysUnknown) and any other error such asErrDeploymentFetching,PodIsPendingError, orRolloutInProgressError(self-monitor exists but is not ready — error logged, reconciler requeues after 5 seconds). Verify that theerrors.Isunwrapping correctly identifiesErrDeploymentNotFoundthrough thefmt.Errorf("...: %w", err)wrapping inCheckSelfMonitorReadiness.The flow health condition error-handling change is a behavioral shift: previously, a
setFlowHealthConditionerror joinedallErrorsand causedStatus().Updateto be skipped entirely. Now the error is only logged and the status update always proceeds. Confirm this is the intended behavior and that silencing the flow health error does not mask anything more serious.The
max(time.Until(...), time.Second)guard is defensive:time.Untilreturns a negative or zero duration if the certificate expiry is already in the past, which would cause the reconciler to requeue with an invalid duration.Release Notes Input
None.
PR Bot Information
Version:
1.21.0anthropic--claude-4.6-sonnet40b73490-fab6-479d-8898-442852ca844fpull_request.edited