chore: requeue with specific interval for when self monitor is not ready by k15r · Pull Request #3475 · kyma-project/telemetry-manager

k15r · 2026-05-05T10:52:00Z

What Changed

All four pipeline reconcilers (FluentBit log, OTel log, metric, and trace) now gate flow health probing on self-monitor Deployment readiness. Before probing flow health, each reconciler checks whether the self-monitor is ready; if not, the reconciler sets FlowHealthy to Unknown and either skips requeueing (when self-monitor is not deployed) or requeues after 5 seconds (when the Deployment exists but is not yet ready). This PR also fixes a potential negative TLS certificate requeue duration and corrects early-return logic in the Telemetry reconciler.

Affected Signal Types

Logs, Metrics, Traces

Key Changes

internal/reconciler/commonstatus: Adds the shared CheckSelfMonitorReadiness function and the RequeueDelayOnFlowHealthProbingFailure constant (5 seconds) used by all pipeline reconcilers to gate flow health probing on self-monitor availability.
controllers/telemetry: Wires a DeploymentProber as selfMonitorProber into all four pipeline reconcilers via the new WithSelfMonitorProber option.
internal/reconciler/logpipeline/fluentbit, internal/reconciler/logpipeline/otel, internal/reconciler/metricpipeline, internal/reconciler/tracepipeline (status): Adds a self-monitor readiness check in evaluateFlowHealthCondition; if the self-monitor Deployment is not found the reconciler sets FlowHealthy to Unknown without requeueing, and if the Deployment exists but is not ready the reconciler sets flowHealthProbingFailed = true to trigger a timed requeue.
internal/reconciler/logpipeline/fluentbit, internal/reconciler/logpipeline/otel, internal/reconciler/metricpipeline, internal/reconciler/tracepipeline (status): Changes setFlowHealthCondition errors from being joined into allErrors (which blocked Status().Update) to being logged directly, so a flow health probing failure no longer prevents the pipeline status from being written.
internal/reconciler/logpipeline/fluentbit, internal/reconciler/logpipeline/otel, internal/reconciler/metricpipeline, internal/reconciler/tracepipeline (reconciler): Returns ctrl.Result{RequeueAfter: 5s} when flowHealthProbingFailed is true, and clamps the TLS certificate requeue-after duration to a minimum of one second to prevent zero or negative durations for already-expired certificates.
internal/reconciler/telemetry: Fixes the Telemetry reconciler to return early on error before evaluating the requeue condition, and replaces Requeue: bool with RequeueAfter: 30s when the state is Warning.
Makefile: Adds a docker-build-local target that builds the manager image and imports it into the local k3d cluster.
test/e2e: Updates expected OAuth2 validation error message strings to match the format produced by Kubernetes 1.35.

Notes for Reviewers

The self-monitor readiness gating distinguishes two cases: ErrDeploymentNotFound (self-monitor not deployed — no error, no requeue, FlowHealthy stays Unknown) and any other error such as ErrDeploymentFetching, PodIsPendingError, or RolloutInProgressError (self-monitor exists but is not ready — error logged, reconciler requeues after 5 seconds). Verify that the errors.Is unwrapping correctly identifies ErrDeploymentNotFound through the fmt.Errorf("...: %w", err) wrapping in CheckSelfMonitorReadiness.
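
The two cases can be sketched as follows. This is an illustration under assumptions, not the code in the PR: the sentinel errors are local stand-ins for the workloadstatus errors, the wrap message is a placeholder because the real one is not shown here, and checkReadiness and classify are hypothetical helper names.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Local stand-ins for the workloadstatus errors named in the PR description.
var (
	ErrDeploymentNotFound = errors.New("deployment not found")
	ErrRolloutInProgress  = errors.New("rollout in progress")
)

// requeueDelay mirrors the 5-second value of RequeueDelayOnFlowHealthProbingFailure.
const requeueDelay = 5 * time.Second

// checkReadiness wraps a prober error with %w so callers can still match
// the original sentinel. The message text is a placeholder.
func checkReadiness(probeErr error) error {
	if probeErr != nil {
		return fmt.Errorf("placeholder message: %w", probeErr)
	}

	return nil
}

// classify reports whether flow health may be probed and how long to wait
// before the next reconcile (0 means no timed requeue).
func classify(err error) (probe bool, requeueAfter time.Duration) {
	switch {
	case err == nil:
		return true, 0
	case errors.Is(err, ErrDeploymentNotFound):
		// Self-monitor is not deployed: FlowHealthy stays Unknown, no requeue.
		return false, 0
	default:
		// Deployment exists but is not ready: retry after the fixed delay.
		return false, requeueDelay
	}
}

func main() {
	for _, e := range []error{nil, ErrDeploymentNotFound, ErrRolloutInProgress} {
		probe, after := classify(checkReadiness(e))
		fmt.Println(probe, after)
	}
}
```

Because fmt.Errorf with %w keeps the wrapped error in the chain, errors.Is matches the sentinel even though the returned error carries an extra message prefix.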

The flow health condition error-handling change is a behavioral shift: previously, a setFlowHealthCondition error joined allErrors and caused Status().Update to be skipped entirely. Now the error is only logged and the status update always proceeds. Confirm this is the intended behavior and that silencing the flow health error does not mask anything more serious.
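
A simplified model of the shift described above, assuming the status write is a plain callback. The function names and errProbe are hypothetical, and log.Printf stands in for the logf.FromContext(ctx).Error call used in the diff.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// errProbe is a hypothetical flow health probing failure.
var errProbe = errors.New("self-monitor unreachable")

// updateStatusBefore models the old behavior: the flow health error is joined
// into allErrors, and a non-nil allErrors skips the status write.
func updateStatusBefore(flowHealthErr error, writeStatus func() error) error {
	var allErrors error
	if flowHealthErr != nil {
		allErrors = errors.Join(allErrors, flowHealthErr)
	}

	if allErrors != nil {
		return allErrors
	}

	return writeStatus()
}

// updateStatusAfter models the new behavior: the flow health error is only
// logged, so the status write always runs.
func updateStatusAfter(flowHealthErr error, writeStatus func() error) error {
	if flowHealthErr != nil {
		log.Printf("Failed to set flow health condition: %v", flowHealthErr)
	}

	return writeStatus()
}

func main() {
	writes := 0
	writeStatus := func() error { writes++; return nil }

	_ = updateStatusBefore(errProbe, writeStatus)
	fmt.Println("writes after old path:", writes)

	_ = updateStatusAfter(errProbe, writeStatus)
	fmt.Println("writes after new path:", writes)
}
```

The trade-off is visible in the return values: the new path reports success to its caller even when probing failed, so the log line is the only trace of the failure.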

The max(time.Until(...), time.Second) guard is defensive: time.Until returns a negative or zero duration if the certificate expiry is already in the past, which would cause the reconciler to requeue with an invalid duration.

Release Notes Input

None.

controller-runtime silently drops RequeueAfter when the duration is <= 0. If the TLS cert expires between updateStatus and calculateRequeueAfterDuration, time.Until returns a non-positive value and no requeue is scheduled, leaving the pipeline stuck in TLSCertificateAboutToExpire indefinitely. Clamp the requeue duration to a minimum of 1s to guarantee reconciliation always triggers after an about-to-expire cert crosses the expiry boundary.

The self-monitor may not be reachable during early startup (e.g. while its pod is still being scheduled on k8s 1.35, which is slower than 1.34). When setFlowHealthCondition fails, the condition is already set to Unknown -- propagating the error further blocks calculateRequeueAfterDuration from running, so the cert-expiry requeue is never scheduled and the pipeline stays stuck in TLSCertificateAboutToExpire. Log the probe error instead of returning it, so the requeue path is always reached when the cert is about to expire.

jeffreylimnardy · 2026-05-27T11:11:45Z

 					),
 				).
 				Build(),
-			errorMsg: "Invalid value: \"object\": 'tokenURL' must be a valid URL",


These test changes must be transferred to the other PR.

jeffreylimnardy · 2026-05-27T11:18:46Z


 	if err := r.setFlowHealthCondition(ctx, &pipeline); err != nil {
-		allErrors = errors.Join(allErrors, err)
+		logf.FromContext(ctx).Error(err, "Failed to set flow health condition")


So if we're not going to return an error here anymore, then we don't need allErrors.

But why are we not returning this error anymore? What happens when there is an issue in the self-monitor and the Deployment is never ready? Do we just requeue reconciliation for all components every 5 seconds?

jeffreylimnardy · 2026-05-27T12:20:12Z

+// CheckSelfMonitorReadiness checks if the self-monitor deployment is ready before attempting to probe flow health.
+// It returns an error if the self-monitor is not ready, including when it's not deployed.
+// The caller should check if the error is workloadstatus.ErrDeploymentNotFound to decide whether to requeue.
+func CheckSelfMonitorReadiness(ctx context.Context, prober Prober, targetNamespace string, failureReason string) error {


Why pass in failureReason when it's never used?

jeffreylimnardy · 2026-05-27T12:39:48Z

@@ -35,6 +36,7 @@ type Reconciler struct {
 	agentApplierDeleter AgentApplierDeleter
 	agentProber         AgentProber


Can we rename this to a generic Prober, since the self-monitor (Deployment) now uses the same prober as Fluent Bit (DaemonSet agent)?

update kubernetes to 1.35

732b0de

k15r requested a review from a team as a code owner May 5, 2026 10:52

github-actions Bot added this to the 1.64.0 milestone May 5, 2026

github-actions Bot added the kind/test label May 5, 2026

k15r added 3 commits May 5, 2026 14:15

update tools

c4ab771

chore: merge upstream/main into check-k8s-1.35

6dc3cff

Merge branch 'main' into check-k8s-1.35

dd2b22c

hyperspace-insights Bot deleted a comment from k15r May 12, 2026

k15r added 3 commits May 13, 2026 10:23

Merge branch 'main' into check-k8s-1.35

f337d64

Merge branch 'main' into check-k8s-1.35

19a7dbc

test: fix CEL validation error message format for k8s 1.35

72ec678

k8s PR #132798 changed CEL validation errors to omit the generic type label ("object") for complex types, so the expected error message no longer includes `"object":`.

k15r added the area/dependency Dependency changes label May 15, 2026

k15r added 7 commits May 15, 2026 14:46

fix: don't return Requeue=true alongside a non-nil error in telemetry…

08adf9a

… reconciler controller-runtime ignores RequeueAfter/Requeue when error is non-nil and logs a warning. Return the error alone; the controller framework will requeue on error automatically.

Merge remote-tracking branch 'upstream/main' into check-k8s-1.35

0ff9356

Merge branch 'main' into check-k8s-1.35

ca146d8

Merge branch 'main' into check-k8s-1.35

b2b4703

Merge branch 'main' into check-k8s-1.35

51666bf

rakesh-garimella assigned hisarbalik May 19, 2026

hisarbalik added 10 commits May 19, 2026 12:57

Merge branch 'main' into check-k8s-1.35

43318a9

Merge branch 'main' into check-k8s-1.35

0815444

adjust k3d resources

fe05bc8

Merge branch 'main' into check-k8s-1.35

47fb0b4

add full FQDNs service name for self monitor prober

14fcf9c

Merge branch 'check-k8s-1.35' of github.com:k15r/telemetry-manager in…

a3d48f9

…to check-k8s-1.35

roll back FQDN changes

3e9dcfe

Revert "roll back FQDN changes"

5cf3eb6

This reverts commit 3e9dcfe.

roll back FQDN changes

9e203ca

fix self monitor e2e test

ac9b182

hisarbalik added 2 commits May 27, 2026 11:36

roll-back k8s changes

55acd3e

Merge branch 'check-k8s-1.35' of github.com:k15r/telemetry-manager in…

1169c9c

…to check-k8s-1.35

hisarbalik requested a deployment to GitHub-Actions May 27, 2026 09:36 — with GitHub Actions Waiting

hisarbalik temporarily deployed to GitHub-Actions May 27, 2026 09:36 — with GitHub Actions Inactive

hisarbalik added area/tests Writing/adding/Refactoring tests or checks and removed area/dependency Dependency changes labels May 27, 2026

hisarbalik temporarily deployed to GitHub-Actions May 27, 2026 09:37 — with GitHub Actions Inactive

hisarbalik changed the title ~~test: update kubernetes to 1.35~~ test: fix self-monitor e2e test flakiness May 27, 2026

jeffreylimnardy reviewed May 27, 2026

jeffreylimnardy changed the title ~~test: fix self-monitor e2e test flakiness~~ chore: fix self-monitor e2e test flakiness May 27, 2026

github-actions Bot added kind/chore Categorizes issue or PR as related to a chore. and removed kind/test labels May 27, 2026

jeffreylimnardy changed the title ~~chore: fix self-monitor e2e test flakiness~~ chore: requeue with specific interval for when self monitor is not ready May 27, 2026

jeffreylimnardy reviewed May 27, 2026

hisarbalik added 2 commits May 27, 2026 14:42

revert e2e test changes

83502f7

remove not used code

14da4e0

hisarbalik temporarily deployed to GitHub-Actions May 27, 2026 13:38 — with GitHub Actions Inactive

Merge branch 'main' into check-k8s-1.35

4a6b2f0

hisarbalik temporarily deployed to GitHub-Actions May 28, 2026 07:30 — with GitHub Actions Inactive

hisarbalik added 2 commits May 28, 2026 10:30

fix linter issues

4b5056a

Merge branch 'check-k8s-1.35' of github.com:k15r/telemetry-manager in…

7f27c89

…to check-k8s-1.35

hisarbalik requested a deployment to GitHub-Actions May 28, 2026 08:31 — with GitHub Actions Waiting

Merge branch 'main' into check-k8s-1.35

156b7b1

hisarbalik requested a deployment to GitHub-Actions May 28, 2026 12:12 — with GitHub Actions Waiting

k15r wants to merge 52 commits into kyma-project:main from k15r:check-k8s-1.35
