Untaint controller#2753
Conversation
2dc7aa1 to
0b567c6
Compare
|
🎯 Code Coverage (details) 🔗 Commit SHA: 9fa9fc5 | Docs | Datadog PR Page | Give us feedback! |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #2753 +/- ##
==========================================
+ Coverage 42.24% 44.07% +1.83%
==========================================
Files 337 343 +6
Lines 28951 30931 +1980
==========================================
+ Hits 12230 13633 +1403
- Misses 15916 16428 +512
- Partials 805 870 +65
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 40 files with indirect coverage changes Continue to review full report in Codecov by Sentry.
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 359ab79dc6
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| .PHONY: docker-build-ci | ||
| docker-build-ci: | ||
| docker build . -t ${IMG} --build-arg FIPS_ENABLED="${FIPS_ENABLED}" --build-arg LDFLAGS="${LDFLAGS}" --build-arg GOARCH="${GOARCH}" | ||
| docker build . -t ${IMG} --build-arg FIPS_ENABLED="${FIPS_ENABLED}" --build-arg LDFLAGS="${LDFLAGS}" --build-arg GOARCH="${GOARCH}" --platform=linux/${GOARCH} |
There was a problem hiding this comment.
Avoid passing empty GOARCH to docker --platform
docker-build-ci now always appends --platform=linux/${GOARCH}, but GOARCH is declared as optional (GOARCH?=) and is often unset for local runs. In that common case the command becomes --platform=linux/, which makes docker build fail before the image build starts. This regresses the documented “For local use” target unless every caller exports GOARCH first.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
this was supposed to be local-only change.
janine-c
left a comment
There was a problem hiding this comment.
Some very boring style edits that you can take or leave 🙂
|
|
||
| - `untaint_taint_removals_total` — counter, every taint removal regardless of cause. | ||
| - `untaint_taint_removal_latency_seconds` — histogram, time between pod Ready and taint removal. | ||
| - `untaint_taint_timeouts_total{reason, policy}` — counter, timeout decisions. `reason` ∈ {`readiness`, `scheduling`}; `policy` ∈ {`remove`, `keep`}. Alert on `policy="keep"` to investigate stuck nodes. |
There was a problem hiding this comment.
We don't seem to use this ∈ notation very often. If you think users will understand it, leave it, but I like to err on the side of clarity just in case a user hasn't encountered it before.
There was a problem hiding this comment.
I'll replace it with in.
Co-authored-by: Janine Chan <64388808+janine-c@users.noreply.github.com>
Co-authored-by: Janine Chan <64388808+janine-c@users.noreply.github.com>
| // 1. Any pod exists on the node → readiness timeout | ||
| // (clock = max(pod.Status.StartTime), so a fresh restart resets it). | ||
| // If pods exist but no StartTime is set yet (very early lifecycle — | ||
| // kubelet hasn't transitioned through PodInitializing yet), requeue | ||
| // after the readiness window so we re-check once StartTime appears. | ||
| // We do NOT fall through to the scheduling clock here: a pod *is* | ||
| // scheduled, so the "no agent pod ever scheduled" condition does not | ||
| // hold and applying the scheduling timeout would be incorrect. |
There was a problem hiding this comment.
If I understand correctly, this is assuming that the agent is the only pod that is tolerating the no-schedule taint.
If this is the case, I see 2 potential issues:
- What if the user wanted to exclude specific pods from being blocked? I think this can happen easily if the user is running their own CSI driver (not referring to Datadog CSI Driver, can be any system critical component that should not depend on the existence of the agent itself).
- I think the current implementation will break (or at least cause friction) if we wanted in the future to support covering other components with this controller. The example I have in mind now is Datadog CSI Driver + Datadog Agent Daemonset.
Wouldn't it be more reliable to filter for target pods ? (here it is only the agent pods, but should be configurable to support multiple selectors I assume)
There was a problem hiding this comment.
Are you referring to timeout behavior specifically or in general?
- If user wants to exclude specific pods from being blocked they'll add tolerations. That would be case for CSI, cilium, some other infra daemonsets. Applications won't have that toleration and they'll be blocked until agent is ready.
- To support CSI driver, which we plan to, we can extend this controller. It will handle new taint, cache filter will change. Agent DS will not tolerate CSI driver taints to ensure CSI pods start first (if that's the intention).
There was a problem hiding this comment.
If user wants to exclude specific pods from being blocked they'll add tolerations. That would be case for CSI, cilium, some other infra daemonsets. Applications won't have that toleration and they'll be blocked until agent is ready.
But if I am not mistaken, the code doesn't look for agent pods to consider the node as ready to lose the taint.
It just accepts any pod (see here). So for example, if the user adds the toleration to some infra pod, and it gets scheduled first on the node, then the operator will remove the taint from the node, even if the agent pod is not yet scheduled and healthy.
To support CSI driver, which we plan to, we can extend this controller. It will handle new taint, cache filter will change. Agent DS will not tolerate CSI driver taints to ensure CSI pods start first (if that's the intention).
This is clear, but I still see the same concern as the previous point.
There was a problem hiding this comment.
podList here is list of agent pods.
podList := &corev1.PodList{}
labelSelector := labels.SelectorFromSet(map[string]string{
common.AgentDeploymentComponentLabelKey: constants.DefaultAgentResourceSuffix,
})
if err := r.client.List(ctx, podList,
client.MatchingLabelsSelector{Selector: labelSelector},
client.MatchingFields{untaintPodNodeIndex: req.Name},idea is that we check timeout against to most recent agent pod start. I'll fix the comment as it doesn't explicitly mention that we are looking at agent pods.
There was a problem hiding this comment.
you are right, I missed the label selector in the client.List.
But then now I see that the controller hardcodes the selector to match agent pods only. So currently there is no way to dynamically support other target pods. But this is non-blocking for this PR, can be refactored in the following PRs adding support for CSI driver 👍
adel121
left a comment
There was a problem hiding this comment.
Left minor comments/questions
adel121
left a comment
There was a problem hiding this comment.
Thanks for addressing all the comments and answering all questions 🙇
What does this PR do?
CONTP-1607, CONTP-1613
Untaint controller to prevent scheduling workloads before Agent when nodes join cluster.
When enabled, untaint controller will watch nodes and remove
agent.datadoghq.com/not-ready=presence:NoScheduletaints once Agent reaches ready state. Until then pods can't be scheduled on the node unless they tolerate this taint. This is effective during cluster scale out when new nodes join with the taint. Once removed Operator will not add taint back for any reason.Feature can be enabled via
--untaintControllerEnabled=flagarg, optionally one can enable taint removal events usingDD_UNTAINT_CONTROLLER_EVENTS_ENABLED=trueenv var.Commits
0b567c6 - POC
359ab79 - improvements:
default | keep) controlling whether to untaint anyway or just emit observability.
scheduling/readiness-timeout scenarios.
Motivation
#2052 and similar requests through other channels.
Additional Notes
Anything else we should know when reviewing?
Minimum Agent Versions
Are there minimum versions of the Datadog Agent and/or Cluster Agent required?
Describe your test plan
Can be tested locally on
kindcluster:Checklist
bug,enhancement,refactoring,documentation,tooling, and/ordependenciesqa/skip-qalabel