[CONTP-1610][CONTP-1611] Wait for CSI driver node server pod readiness in untaint controller if csi feature is enabled (#3096)
* [CONTP-1610][CONTP-1611] Wait for CSI driver node server pod readiness in untaint controller if csi feature is enabled
* require explicit enablement of awaiting csi driver in the untaint controller
* increase unit test coverage
* simplify timeout reconciliation: merge agent and csi into one condition
* address final nits
"When true (requires --untaintControllerEnabled), the Untaint controller removes the startup taint only after both the node Agent and Datadog CSI node-server pods are Ready. Requires Pod watch coverage of CSI namespaces (DD_CSIDRIVER_WATCH_NAMESPACE).")
// DatadogAgentInternal
flag.BoolVar(&opts.createControllerRevisions, "createControllerRevisions", false, "Enable creation of ControllerRevision snapshots on each DDA spec change")
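The flag description above says the wait-for-CSI flag requires `--untaintControllerEnabled`. A minimal sketch of such a check follows; the `untaintOptions` type and its `validate` method are hypothetical names used for illustration and are not taken from the operator's source.

```go
package main

import (
	"errors"
	"fmt"
)

// untaintOptions is a hypothetical container for the two flags discussed above.
type untaintOptions struct {
	untaintControllerEnabled          bool
	untaintControllerWaitForCSIDriver bool
}

// validate rejects the one invalid combination: waiting for the CSI driver
// while the untaint controller itself is disabled.
func (o untaintOptions) validate() error {
	if o.untaintControllerWaitForCSIDriver && !o.untaintControllerEnabled {
		return errors.New("--untaintControllerWaitForCSIDriver requires --untaintControllerEnabled=true")
	}
	return nil
}

func main() {
	opts := untaintOptions{untaintControllerWaitForCSIDriver: true}
	if err := opts.validate(); err != nil {
		// A real binary would likely log the error and exit non-zero here.
		fmt.Println("invalid flags:", err)
	}
}
```

Running the check once, right after flag parsing, would keep the controller from starting in a half-configured state.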
| `true` | `true` | Wait for Agent **and** CSI node-server Ready; widened Pod cache (agent + `DD_CSIDRIVER_WATCH_NAMESPACE` namespaces); startup toleration on Agent and, when the DatadogCSIDriver controller is enabled, on the CSI node DaemonSet. |
+
+ `--untaintControllerWaitForCSIDriver` requires `--untaintControllerEnabled=true` (the operator exits on invalid combinations).
+
+ When `--untaintControllerEnabled` is enabled, the operator injects a toleration for
`agent.datadoghq.com/not-ready=presence:NoSchedule` into the node Agent
DaemonSet (or ExtendedDaemonSet) pod template, unless an equivalent toleration
- is already present. This avoids a deadlock where the node stays tainted because
- the Agent pod cannot schedule without the toleration, especially when admission
- webhook auto-injection is not in use.
+ is already present. When **`--untaintControllerWaitForCSIDriver`** is also true **and**
+ the DatadogCSIDriver controller is running (`--datadogCSIDriverEnabled=true`), the same
+ toleration is injected into the **Datadog CSI node-server** DaemonSet pod
+ template so the CSI workload can schedule on tainted nodes before the taint is
+ removed.
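The "inject unless an equivalent toleration is already present" rule described in this hunk can be sketched as an idempotent helper. This is an illustration only: the `toleration` struct is a simplified stand-in for the Kubernetes toleration type, and the notion of "equivalent" used here (standard Kubernetes matching, where an empty operator means `Equal` and an empty effect matches every effect) is an assumption, not the operator's confirmed logic.

```go
package main

import "fmt"

// toleration mirrors the fields of a Kubernetes pod toleration that matter here.
type toleration struct {
	Key      string
	Operator string // "Equal", "Exists", or "" (treated as "Equal")
	Value    string
	Effect   string // "" matches every effect
}

// startupToleration corresponds to agent.datadoghq.com/not-ready=presence:NoSchedule.
var startupToleration = toleration{
	Key: "agent.datadoghq.com/not-ready", Operator: "Equal", Value: "presence", Effect: "NoSchedule",
}

// coversStartupTaint reports whether t already tolerates the startup taint
// (the assumed meaning of "equivalent").
func coversStartupTaint(t toleration) bool {
	if t.Effect != "" && t.Effect != startupToleration.Effect {
		return false
	}
	if t.Operator == "Exists" {
		// An empty key with Exists tolerates every taint key.
		return t.Key == "" || t.Key == startupToleration.Key
	}
	return (t.Operator == "" || t.Operator == "Equal") &&
		t.Key == startupToleration.Key && t.Value == startupToleration.Value
}

// ensureStartupToleration appends the startup toleration only when no
// existing entry already covers the taint.
func ensureStartupToleration(existing []toleration) []toleration {
	for _, t := range existing {
		if coversStartupTaint(t) {
			return existing
		}
	}
	return append(existing, startupToleration)
}

func main() {
	podTolerations := ensureStartupToleration(nil)
	podTolerations = ensureStartupToleration(podTolerations) // second call is a no-op
	fmt.Println(len(podTolerations))                         // prints 1
}
```

Under these assumptions the same helper could be applied to both the Agent and the CSI node-server pod templates, since the toleration is identical in both cases.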
## Configuration
@@ -81,6 +121,8 @@ Metrics, under the `untaint` Prometheus subsystem:
Kubernetes Events (gated by `DD_UNTAINT_CONTROLLER_EVENTS_ENABLED=true`):

- - `TaintRemoved` (Normal) — taint removed because the Agent pod became Ready.
+ - `TaintRemoved` (Normal) — taint removed after the Agent became Ready, or (when
+ `--untaintControllerWaitForCSIDriver` is enabled) after both the Agent and
+ CSI node-server pods became Ready.
- `UntaintTimeout` — a timeout fired. Normal under `remove`, Warning under `keep`. Message carries the reason, elapsed time, and policy.
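The two events above imply a small decision rule, sketched below. The function names are hypothetical, and the fallback to `Warning` for an unrecognized policy is an assumption; the documentation only describes the `remove` and `keep` policies.

```go
package main

import "fmt"

// readyToUntaint reports whether the startup taint may be removed.
// The Agent must always be Ready; the CSI node-server pod only matters
// when waiting for the CSI driver is enabled.
func readyToUntaint(agentReady, csiReady, waitForCSIDriver bool) bool {
	if !agentReady {
		return false
	}
	return !waitForCSIDriver || csiReady
}

// timeoutEventType maps the timeout policy to the Kubernetes event type:
// Normal under "remove", Warning under "keep".
func timeoutEventType(policy string) string {
	if policy == "remove" {
		return "Normal"
	}
	// Assumption: any other policy value is treated like "keep".
	return "Warning"
}

func main() {
	fmt.Println(readyToUntaint(true, false, true)) // prints false: CSI pod not Ready yet
	fmt.Println(timeoutEventType("keep"))          // prints Warning
}
```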