Untaint controller #2753
Merged
10 commits
0b567c6 [draft] untaint controller (levan-m)
83421a8 Merge branch 'main' into levan-m/untaint-controller (levan-m)
434b936 Merge branch 'main' into levan-m/untaint-controller (levan-m)
359ab79 Fixes, improvements, doc (levan-m)
bc3b02d Update docs/untaint_controller.md (levan-m)
6641386 Apply suggestions from code review (levan-m)
302a746 doc and makefile fixes (levan-m)
1b7739e PR feedback (levan-m)
9174413 improve inline doc (levan-m)
9fa9fc5 Apply suggestion from @levan-m (levan-m)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,79 @@ | ||
| # Untaint Controller | ||
|
|
||
| This feature was introduced in Datadog Operator v1.28 and is currently in preview. | ||
|
|
||
| ## Overview | ||
|
|
||
| The Untaint controller watches Kubernetes Nodes carrying the taint | ||
| `agent.datadoghq.com/not-ready=presence:NoSchedule` and removes it once the | ||
| Datadog Agent pod on that node is `Ready`. It is intended to run alongside a | ||
| separate mechanism (cluster-autoscaler hook, CCM, admission webhook, etc.) | ||
| that adds the taint to new nodes. The use case is keeping workloads off a | ||
| node until the Datadog Agent is Ready, and recovering gracefully if the Agent never | ||
| becomes Ready. | ||
|
|
||
| Agent pods are matched by the label `agent.datadoghq.com/component=agent` in | ||
| the operator's watched namespaces (`WATCH_NAMESPACE` / | ||
| `DD_AGENT_WATCH_NAMESPACE`). | ||
|
|
||
| If the Agent pod never reaches Ready on a tainted node, a configurable timeout | ||
| policy decides what happens next; under the default `remove` policy the node | ||
| does not stay unschedulable indefinitely. Two clocks cover the two failure modes: | ||
|
|
||
| - **Readiness timeout** — the Agent pod is on the node but not Ready. Clock: | ||
| `pod.Status.StartTime`. Pod recreation restarts the window; container | ||
| restarts inside the same pod do not. | ||
| - **Scheduling timeout** — no Agent pod is on the node. Clock: | ||
| `node.metadata.creationTimestamp`. The expected path when a DaemonSet never | ||
| schedules a pod onto the node (taint not tolerated, missing labels, etc.). | ||
|
|
||
| A pod-recreation crash-loop faster than the readiness window can hold a node | ||
| tainted indefinitely, because each new pod restarts the clock; run with | ||
| `DD_UNTAINT_CONTROLLER_TIMEOUT_POLICY=keep` and alert on | ||
| `untaint_taint_timeouts_total{policy="keep"}` to catch this. | ||
|
|
||
| The controller removes only this fixed taint and does not add it; both | ||
| timeouts are global and cannot be tuned per node, node group, `DatadogAgent` (DDA), or | ||
| `DatadogAgentProfile` (DAP). | ||
|
|
||
| ## Prerequisites | ||
|
|
||
| - Operator v1.28+ | ||
| - Tested on Kubernetes 1.27.0+ | ||
|
|
||
| ## Enable the Untaint controller | ||
|
|
||
| The Untaint controller is disabled by default. Enable it on the operator | ||
| manager: | ||
|
|
||
| ```yaml | ||
| args: | ||
| - --untaintControllerEnabled=true | ||
| ``` | ||
|
|
||
| ## Configuration | ||
|
|
||
| All other tuning knobs are environment variables on the operator pod. Duration values | ||
| use Go's `time.ParseDuration` format (`90s`, `5m`, `1h`, etc.). An invalid value | ||
| prevents the Untaint controller from starting and is logged at ERROR level; other | ||
| controllers continue to start normally. | ||
|
|
||
|
|
||
| | Env var | Default | Description | | ||
| | ------------------------------------------ | -------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | ||
| | `DD_UNTAINT_CONTROLLER_TIMEOUT` | `10m` | Readiness timeout. Tune to the upper bound of legitimate agent startup on your nodes; 2–5m is often enough on clusters with cached images. | | ||
| | `DD_UNTAINT_CONTROLLER_SCHEDULING_TIMEOUT` | `5m` | Scheduling timeout. Set larger than your scheduler retry window; raise it on clusters with large pending queues or aggressive autoscaling. | | ||
| | `DD_UNTAINT_CONTROLLER_TIMEOUT_POLICY` | `remove` | Action when a timeout fires. `remove` untaints the node anyway (favors scheduling availability over telemetry; lowest operational risk). `keep` leaves the taint in place and emits a Warning event (favors telemetry; pair with an alert on the timeout counter to surface stuck nodes). | | ||
| | `DD_UNTAINT_CONTROLLER_EVENTS_ENABLED` | `false` | Emit Kubernetes Events on Nodes for taint removals and timeout decisions. | | ||
|
|
||
| ## Observability | ||
|
|
||
| Metrics, under the `untaint` Prometheus subsystem: | ||
|
|
||
| - `untaint_taint_removals_total` — counter, every taint removal regardless of cause. | ||
| - `untaint_taint_removal_latency_seconds` — histogram, time between pod Ready and taint removal. | ||
| - `untaint_taint_timeouts_total{reason, policy}` — counter, timeout decisions. `reason` in {`readiness`, `scheduling`}; `policy` in {`remove`, `keep`}. Alert on `policy="keep"` to investigate stuck nodes. | ||
| - `untaint_taint_removal_errors_total`: counter, hard errors while removing the taint (for example apiserver Patch failures). Conflict races that are resolved by requeueing are not counted. | ||
|
|
||
| Kubernetes Events (gated by `DD_UNTAINT_CONTROLLER_EVENTS_ENABLED=true`): | ||
|
|
||
| - `TaintRemoved` (Normal) — taint removed because the Agent pod became Ready. | ||
| - `UntaintTimeout` — a timeout fired. Normal under `remove`, Warning under `keep`. Message carries the reason, elapsed time, and policy. | ||
|
|
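
The two-clock timeout logic described in the doc can be sketched in plain Go. This is a minimal illustration, not the controller's actual code: `Decision`, `NodeState`, `decide`, and `demoNow` are hypothetical names, and treating a nil `PodStartTime` as "no Agent pod on the node" is a simplifying assumption. The defaults come from the configuration table.

```go
package main

import (
	"fmt"
	"time"
)

// Defaults taken from the configuration table in the doc.
const (
	defaultReadiness  = 10 * time.Minute // DD_UNTAINT_CONTROLLER_TIMEOUT
	defaultScheduling = 5 * time.Minute  // DD_UNTAINT_CONTROLLER_SCHEDULING_TIMEOUT
)

// demoNow is an arbitrary fixed "current time" so the example is deterministic.
var demoNow = time.Date(2025, 1, 1, 12, 0, 0, 0, time.UTC)

// Decision is a hypothetical result type; the real controller may model this differently.
type Decision string

const (
	Wait          Decision = "wait"           // no timeout yet, check again later
	RemoveReady   Decision = "remove-ready"   // Agent pod is Ready
	RemoveTimeout Decision = "remove-timeout" // timeout fired under policy "remove"
	KeepTimeout   Decision = "keep-timeout"   // timeout fired under policy "keep"
)

// NodeState is a hypothetical, flattened view of the inputs.
type NodeState struct {
	NodeCreated  time.Time  // node.metadata.creationTimestamp
	PodStartTime *time.Time // pod.Status.StartTime; nil means no Agent pod (assumption)
	PodReady     bool
}

// decide picks the clock that applies and returns the decision plus the
// timeout reason ("readiness", "scheduling", or "" when no timeout fired).
func decide(s NodeState, now time.Time, readiness, scheduling time.Duration, policy string) (Decision, string) {
	if s.PodStartTime != nil && s.PodReady {
		return RemoveReady, ""
	}
	// No pod: scheduling clock from node creation. Pod present: readiness clock from pod start.
	reason, start, limit := "scheduling", s.NodeCreated, scheduling
	if s.PodStartTime != nil {
		reason, start, limit = "readiness", *s.PodStartTime, readiness
	}
	if now.Sub(start) < limit {
		return Wait, ""
	}
	if policy == "keep" {
		return KeepTimeout, reason
	}
	return RemoveTimeout, reason
}

func main() {
	started := demoNow.Add(-11 * time.Minute)
	s := NodeState{NodeCreated: demoNow.Add(-12 * time.Minute), PodStartTime: &started}
	d, reason := decide(s, demoNow, defaultReadiness, defaultScheduling, "remove")
	fmt.Println(d, reason)
}
```

Because the readiness clock starts at the pod's start time, a recreated pod resets the window, which is the crash-loop caveat noted in the Overview.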
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,77 @@ | ||
| // Unless explicitly stated otherwise all files in this repository are licensed | ||
| // under the Apache License Version 2.0. | ||
| // This product includes software developed at Datadog (https://www.datadoghq.com/). | ||
| // Copyright 2016-present Datadog, Inc. | ||
|
|
||
| package metrics | ||
|
|
||
| import ( | ||
| "github.com/prometheus/client_golang/prometheus" | ||
| "sigs.k8s.io/controller-runtime/pkg/metrics" | ||
| ) | ||
|
|
||
| // Label values for TaintTimeoutsTotal. | ||
| const ( | ||
| // UntaintTimeoutReasonReadiness signals that a pod existed on the node but | ||
| // never became Ready within the readiness timeout (DD_UNTAINT_CONTROLLER_TIMEOUT). | ||
| UntaintTimeoutReasonReadiness = "readiness" | ||
| // UntaintTimeoutReasonScheduling signals that no agent pod was scheduled on | ||
| // the node within the scheduling timeout (DD_UNTAINT_CONTROLLER_SCHEDULING_TIMEOUT). | ||
| UntaintTimeoutReasonScheduling = "scheduling" | ||
|
|
||
| // UntaintTimeoutPolicyRemove untaints the node despite the agent not being ready. | ||
| UntaintTimeoutPolicyRemove = "remove" | ||
| // UntaintTimeoutPolicyKeep leaves the taint in place but emits observability signals. | ||
| UntaintTimeoutPolicyKeep = "keep" | ||
| ) | ||
|
|
||
| var ( | ||
| // TaintRemovalsTotal is the total number of taints removed from nodes. | ||
| TaintRemovalsTotal = prometheus.NewCounter( | ||
| prometheus.CounterOpts{ | ||
| Subsystem: untaintSubsystem, | ||
| Name: "taint_removals_total", | ||
| Help: "Total number of taints removed from nodes", | ||
| }, | ||
| ) | ||
|
|
||
| // TaintRemovalLatency is the time between agent pod becoming Ready and taint removal. | ||
| TaintRemovalLatency = prometheus.NewHistogram( | ||
| prometheus.HistogramOpts{ | ||
| Subsystem: untaintSubsystem, | ||
| Name: "taint_removal_latency_seconds", | ||
| Help: "Time between agent pod becoming Ready and taint removal from the node", | ||
| Buckets: prometheus.DefBuckets, | ||
| }, | ||
| ) | ||
|
|
||
| // TaintTimeoutsTotal counts timeout decisions broken down by reason and policy. | ||
| TaintTimeoutsTotal = prometheus.NewCounterVec( | ||
| prometheus.CounterOpts{ | ||
| Subsystem: untaintSubsystem, | ||
| Name: "taint_timeouts_total", | ||
| Help: "Total number of untaint-controller timeout decisions, by reason and policy", | ||
| }, | ||
| []string{"reason", "policy"}, | ||
| ) | ||
|
|
||
| // TaintRemovalErrorsTotal counts hard errors encountered while attempting to | ||
| // remove the taint (apiserver Patch failures, JSON marshal failures, …). | ||
| // Benign optimistic-concurrency races (IsConflict/IsInvalid) are NOT counted | ||
| // here — they're handled by requeueing. Inspect the operator's ERROR-level | ||
| // logs for the specific failure cause. | ||
| TaintRemovalErrorsTotal = prometheus.NewCounter( | ||
| prometheus.CounterOpts{ | ||
| Subsystem: untaintSubsystem, | ||
| Name: "taint_removal_errors_total", | ||
| Help: "Total number of errors encountered while attempting to remove the agent-not-ready taint from a node", | ||
| }, | ||
| ) | ||
| ) | ||
|
|
||
| func init() { | ||
| metrics.Registry.MustRegister(TaintRemovalsTotal) | ||
| metrics.Registry.MustRegister(TaintRemovalLatency) | ||
| metrics.Registry.MustRegister(TaintTimeoutsTotal) | ||
| metrics.Registry.MustRegister(TaintRemovalErrorsTotal) | ||
| } |
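
For reviewers, here is a rough sketch of how the policy constants above could serve as the set of valid values when the environment variables from the doc are parsed. It is an assumption about the wiring, not the PR's code: `Config`, `durationFromEnv`, and `loadConfig` are hypothetical names, and the lookup function is injected so the sketch can be exercised without touching the process environment.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// Policy values, mirroring the constants in the metrics package above.
const (
	UntaintTimeoutPolicyRemove = "remove"
	UntaintTimeoutPolicyKeep   = "keep"
)

// Config is a hypothetical container for the documented settings.
type Config struct {
	ReadinessTimeout  time.Duration
	SchedulingTimeout time.Duration
	Policy            string
}

// durationFromEnv parses name with time.ParseDuration and falls back to def
// when the variable is not set.
func durationFromEnv(lookup func(string) (string, bool), name string, def time.Duration) (time.Duration, error) {
	raw, ok := lookup(name)
	if !ok {
		return def, nil
	}
	d, err := time.ParseDuration(raw)
	if err != nil {
		return 0, fmt.Errorf("%s: %w", name, err)
	}
	return d, nil
}

// loadConfig applies the documented defaults (10m, 5m, remove) and rejects
// unparsable durations and unknown policies.
func loadConfig(lookup func(string) (string, bool)) (Config, error) {
	var cfg Config
	var err error
	if cfg.ReadinessTimeout, err = durationFromEnv(lookup, "DD_UNTAINT_CONTROLLER_TIMEOUT", 10*time.Minute); err != nil {
		return Config{}, err
	}
	if cfg.SchedulingTimeout, err = durationFromEnv(lookup, "DD_UNTAINT_CONTROLLER_SCHEDULING_TIMEOUT", 5*time.Minute); err != nil {
		return Config{}, err
	}
	cfg.Policy = UntaintTimeoutPolicyRemove
	if raw, ok := lookup("DD_UNTAINT_CONTROLLER_TIMEOUT_POLICY"); ok {
		cfg.Policy = raw
	}
	switch cfg.Policy {
	case UntaintTimeoutPolicyRemove, UntaintTimeoutPolicyKeep:
		return cfg, nil
	default:
		return Config{}, fmt.Errorf("DD_UNTAINT_CONTROLLER_TIMEOUT_POLICY: unknown policy %q", cfg.Policy)
	}
}

func main() {
	cfg, err := loadConfig(os.LookupEnv)
	if err != nil {
		// Per the doc, only the untaint controller fails to start in this case.
		fmt.Println("untaint controller not started:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```

Returning an error instead of silently falling back to a default matches the documented behavior that invalid values stop the controller from starting.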
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,44 @@ | ||
| // Unless explicitly stated otherwise all files in this repository are licensed | ||
| // under the Apache License Version 2.0. | ||
| // This product includes software developed at Datadog (https://www.datadoghq.com/). | ||
| // Copyright 2016-present Datadog, Inc. | ||
|
|
||
| package controller | ||
|
|
||
| import ( | ||
| "errors" | ||
| "testing" | ||
|
|
||
| "github.com/go-logr/logr" | ||
| "github.com/stretchr/testify/assert" | ||
| "sigs.k8s.io/controller-runtime/pkg/log" | ||
| "sigs.k8s.io/controller-runtime/pkg/manager" | ||
|
|
||
| "github.com/DataDog/datadog-operator/pkg/controller/utils/datadog" | ||
| "github.com/DataDog/datadog-operator/pkg/kubernetes" | ||
| ) | ||
|
|
||
| // TestSetupControllers_StarterErrorsAreBestEffort confirms that starter | ||
| // failures are logged at ERROR by SetupControllers but never propagated up: | ||
| // one controller's misconfiguration must not bring the whole operator down | ||
| // or prevent other controllers from starting. The untaint controller follows | ||
| // this same best-effort pattern. | ||
| func TestSetupControllers_StarterErrorsAreBestEffort(t *testing.T) { | ||
| originalStarters := controllerStarters | ||
| t.Cleanup(func() { controllerStarters = originalStarters }) | ||
|
|
||
| failing := func(logr.Logger, manager.Manager, kubernetes.PlatformInfo, SetupOptions, datadog.MetricsForwardersManager) error { | ||
| return errors.New("simulated starter failure") | ||
| } | ||
| controllerStarters = map[string]starterFunc{ | ||
| agentControllerName: failing, | ||
| untaintControllerName: failing, | ||
| } | ||
|
|
||
| assert.NoError(t, SetupControllers( | ||
| log.Log, | ||
| nil, | ||
| kubernetes.PlatformInfo{}, | ||
| SetupOptions{UntaintControllerEnabled: true}, | ||
| )) | ||
| } |
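
The best-effort pattern this test pins down can be illustrated with a stripped-down sketch. It is not the operator's `SetupControllers`: the `starterFunc` signature is simplified to take no arguments, and `setupControllers`, `logError`, and `errSimulated` are hypothetical names. The point is only that every starter runs, failures are reported, and none of them is returned to the caller.

```go
package main

import (
	"errors"
	"fmt"
)

// errSimulated stands in for a starter failure such as an invalid duration.
var errSimulated = errors.New("simulated starter failure")

// starterFunc is a simplified, hypothetical stand-in for the operator's
// starter signature, which takes a logger, manager, and options.
type starterFunc func() error

// setupControllers runs every starter. A failing starter is reported through
// logError and does not stop the remaining starters or surface as an error.
func setupControllers(starters map[string]starterFunc, logError func(name string, err error)) error {
	for name, start := range starters {
		if err := start(); err != nil {
			logError(name, err)
		}
	}
	return nil
}

func main() {
	starters := map[string]starterFunc{
		"agent":   func() error { return nil },
		"untaint": func() error { return errSimulated },
	}
	err := setupControllers(starters, func(name string, err error) {
		fmt.Printf("ERROR starting %s controller: %v\n", name, err)
	})
	fmt.Println("setup error:", err)
}
```

The trade-off is that a misconfigured controller is only visible in logs, so the ERROR line is the signal to watch for.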