[CONTP-1610][CONTP-1611] Wait for CSI driver node server pod readiness in untaint controller if csi feature is enabled#3096
Conversation
🛑 Gate Violations
ℹ️ Info🎯 Code Coverage (details) Useful? React with 👍 / 👎 This comment will be updated automatically if new data arrives.🔗 Commit SHA: 70ca5b1 | Docs | Datadog PR Page | Give us feedback! |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3096 +/- ##
==========================================
- Coverage 43.66% 43.65% -0.01%
==========================================
Files 352 354 +2
Lines 30115 30958 +843
==========================================
+ Hits 13149 13514 +365
- Misses 16094 16576 +482
+ Partials 872 868 -4
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 13 files with indirect coverage changes Continue to review full report in Codecov by Harness.
🚀 New features to boost your workflow:
|
86b8af2 to
0d149b2
Compare
…s in untaint controller if csi feature is enabled
0d149b2 to
6bec367
Compare
9fb794c to
e82e571
Compare
cb1d461 to
4c037ce
Compare
|
Bits Code status: ✅ Done Comment @DataDog to request changes |
rtrieu
left a comment
There was a problem hiding this comment.
some suggestions for clarity for your consideration
| `Ready` before removing the taint. The taint stays until both are | ||
| satisfied or a timeout fires. The operator's Pod informer then watches the | ||
| **union** of `DD_AGENT_WATCH_NAMESPACE` and `DD_CSIDRIVER_WATCH_NAMESPACE` (all | ||
| pods in those namespaces—keep namespaces tight). Ensure CSI namespaces are |
There was a problem hiding this comment.
| pods in those namespaces—keep namespaces tight). Ensure CSI namespaces are | |
| pods in those namespaces. Keep these namespaces tightly scoped to limit the pod informer's watch scope.). CSI namespaces must be |
| **`--datadogCSIDriverEnabled`** only controls whether the **DatadogCSIDriver** | ||
| controller runs; it does **not** by itself turn on dual-readiness untaint. | ||
| Enable `--untaintControllerWaitForCSIDriver` only when you actually deploy CSI | ||
| node-server pods on tainted nodes (for example via a `DatadogCSIDriver` CR with |
There was a problem hiding this comment.
| node-server pods on tainted nodes (for example via a `DatadogCSIDriver` CR with | |
| node-server pods on tainted nodes (for example, with a `DatadogCSIDriver` CR with |
| **With `--untaintControllerEnabled=true` and `--untaintControllerWaitForCSIDriver=true`:** | ||
| the controller waits until **both** the node Agent and **CSI | ||
| node-server** pod (`app=datadog-csi-driver-node-server`) on the node are | ||
| `Ready` before removing the taint. The taint stays until both are |
There was a problem hiding this comment.
| `Ready` before removing the taint. The taint stays until both are | |
| `Ready` before removing the taint. The taint stays until both criteria are |
| the controller waits until **both** the node Agent and **CSI | ||
| node-server** pod (`app=datadog-csi-driver-node-server`) on the node are | ||
| `Ready` before removing the taint. The taint stays until both are | ||
| satisfied or a timeout fires. The operator's Pod informer then watches the |
There was a problem hiding this comment.
| satisfied or a timeout fires. The operator's Pod informer then watches the | |
| met or a timeout fires. The operator's Pod informer then watches the |
| | `true` | `false` | Agent-only readiness; Agent DaemonSet startup toleration injected. | | ||
| | `true` | `true` | Wait for Agent **and** CSI node-server Ready; widened Pod cache (agent + `DD_CSIDRIVER_WATCH_NAMESPACE` namespaces); startup toleration on Agent and, when the DatadogCSIDriver controller is enabled, on the CSI node DaemonSet. | | ||
|
|
||
| `--untaintControllerWaitForCSIDriver` requires `--untaintControllerEnabled=true` (the operator exits on invalid combinations). |
There was a problem hiding this comment.
| `--untaintControllerWaitForCSIDriver` requires `--untaintControllerEnabled=true` (the operator exits on invalid combinations). | |
| **Note**: `--untaintControllerWaitForCSIDriver` requires `--untaintControllerEnabled=true`. The operator exits at startup if this combination is invalid. |
| // to agent pods. | ||
| func (r *UntaintReconciler) podWatchPredicate() predicate.Predicate { | ||
| if !r.waitForCSIDriver { | ||
| return agentPodPredicate() |
There was a problem hiding this comment.
this function can be merged in agentPodPredicate returning something like
return ok && (isAgentPod(p) || (r.waitForCSIDriver && isCSINodeServerPod(p)))| // but the CSI node-server requirement is not met (no pod on node yet, or pod | ||
| // not Ready). RequeueAfter matches reconcileTaintedNodeAgentTimeouts: full | ||
| // remaining scheduling or readiness window (watch-driven reconciles still apply). | ||
| func (r *UntaintReconciler) reconcileAgentReadyWaitForCSI(ctx context.Context, node *corev1.Node, csiPodList *corev1.PodList) (ctrl.Result, error) { |
There was a problem hiding this comment.
is this really necessary, meaning tracking two timeout windows? Can we updated existing logic to wait for StartTime from both pods and pick the latest for tracking the readiness timeout window, and still apply scheduling timeout for cases where one/both of them don't get StartTime?
There was a problem hiding this comment.
..merging two would simplify logic and reduce duplication.
|
|
||
| `--untaintControllerWaitForCSIDriver` requires `--untaintControllerEnabled=true` (the operator exits on invalid combinations). | ||
|
|
||
| When `--untaintControllerEnabled` is enabled, the operator injects a toleration for |
There was a problem hiding this comment.
misread.UntaintControllerWaitForCSIDriver this should be enabled too right, to inject tolerations?
| agentLatest, agentOK := latestPodStartTime(agents) | ||
| csiLatest, csiOK := latestPodStartTime(csis) | ||
| if !agentOK || !csiOK { | ||
| return r.schedulingTimeoutResult(ctx, node, log, now) |
There was a problem hiding this comment.
nit - want to confirm if this intentional. When CSI isn't enabled we just requeue if there is no startime on line 320, we only check scheduling timeout if no pod has been scheduled.
There was a problem hiding this comment.
Good catch 👍
No this is not intentional.
I was trying to avoid infinite requeue in case the kubelet is blocked and never setting startTime, but I realize it is an extreme edge case and actually we schedulingTimeout here is bad, because if the csi is enabled long time after the agent started on the node, then hitting this branch will anchor the timeout against node creation time, which will result in immediate timeout.
Changing to requeue behavior, more consistent with the agent-only path 👍
| result, err := r.removeTaint(ctx, node) | ||
| if err != nil { | ||
| return result, err | ||
| if r.waitForCSIDriver { |
There was a problem hiding this comment.
nit - there is a bit of duplication between the if/else block e.g. taint removal blocks at line 245 and 279 and maybe more.
levan-m
left a comment
There was a problem hiding this comment.
looks good, left some nit comments.
What does this PR do?
Adds
--untaintControllerWaitForCSIDriver(defaultfalse). When--untaintControllerEnabled=trueand this flag istrue, the Untaint controller removesagent.datadoghq.com/not-ready=presence:NoScheduleonly after both:agent.datadoghq.com/component=agent) isReady, andapp=datadog-csi-driver-node-server) on that node is Ready (strict: no untaint whilelen(csiPods)==0).--datadogCSIDriverEnabledis not used to decide dual-readiness; it only controls whether the DatadogCSIDriver controller runs. Operators who enable “wait for CSI” must actually run CSI node-server pods (or they will hit scheduling/readiness timeouts per existing env vars).Startup validation:
--untaintControllerWaitForCSIDriverrequires--untaintControllerEnabled=true(invalid combination fails fast at process start).This is a continuation of previous PR that added untaint controller
Additional Notes
--untaintControllerWaitForCSIDrivernext to--untaintControllerEnabledwhere appropriate.Minimum Agent Versions
(Fill if applicable.)
Describe your test plan
Create Kind cluster
Create a kind cluster with 3 nodes:
kind create cluster --config ./kind.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
Deploy datadog operator
Deploy the datadog operator with the following settings:
Apply
datadog.yamlwith csi enabledThis will run the agent, but not the csi driver, because csi controller is disabled.
Taint worker
Taint one of the worker nodes to block scheduling:
kubectl taint node kind-worker agent.datadoghq.com/not-ready=presence:NoScheduleCreate a sample deployment
Only one of the pods is scheduled on the untainted node, and the second pod is blocked at
Pendingbecause CSI pods are not running:Activate CSI controller
Activate CSI controller by setting
--datadogCSIDriverEnabled=trueon the operator deployment.After the operator rolls out, CSI pods should spin up, and nginx pod should transition from Pending to Running.
Checklist
bug,enhancement,refactoring,documentation,tooling, and/ordependenciesqa/skip-qalabel