Untaint controller by levan-m · Pull Request #2753 · DataDog/datadog-operator

levan-m · 2026-03-12T21:41:50Z

What does this PR do?

CONTP-1607, CONTP-1613

Untaint controller to prevent scheduling workloads before Agent when nodes join cluster.

When enabled, untaint controller will watch nodes and remove agent.datadoghq.com/not-ready=presence:NoSchedule taints once Agent reaches ready state. Until then pods can't be scheduled on the node unless they tolerate this taint. This is effective during cluster scale out when new nodes join with the taint. Once removed Operator will not add taint back for any reason.

Feature can be enabled via --untaintControllerEnabled=flag arg, optionally one can enable taint removal events using DD_UNTAINT_CONTROLLER_EVENTS_ENABLED=true env var.

Commits

0b567c6 - POC

359ab79 - improvements:

Conflict-handling fix, requeues in 1s instead of returning an error (no exponential backoff).
Defensive JSON-patch marshaling. Guarantees taint slices serialize as [] rather than null in both test and replace ops.
Field-indexer collision-proofing. Renames the pod-by-nodeName cache index to a namespaced key untaint.spec.nodeName.
Two-tier timeout policy. Adds DD_UNTAINT_CONTROLLER_TIMEOUT (pod-exists-not-ready, default 10m) and DD_UNTAINT_CONTROLLER_SCHEDULING_TIMEOUT (no pod, default 5m), with DD_UNTAINT_CONTROLLER_TIMEOUT_POLICY (remove
default | keep) controlling whether to untaint anyway or just emit observability.
Node watch added. Filters to nodes carrying the target taint, so nodes that never get an agent pod still get reconciled.
Pod predicate widened. Enqueues on any agent-pod Create/Update/Delete (not just Ready transitions) so the readiness clock can't stall under policy=keep or when NotReady pods appear between reconciles.
Log error on config errors - similar to other controllers untaint-controller startup error is logged — no silent fallback to defaults and other controllers continue running.
New metric. untaint_taint_timeouts_total{reason, policy} for alerting on stuck nodes; existing taint_removals_total and taint_removal_latency_seconds kept.
Tests. Adds a unit suite (fake client + fake clock, full coverage of reconcile outcomes / predicates / conflict paths / env-var errors); rewrites the integration suite around fresh per-test nodes and adds
scheduling/readiness-timeout scenarios.
Docs. Adds docs/untaint_controller.md.

Motivation

#2052 and similar requests through other channels.

Additional Notes

Anything else we should know when reviewing?

Minimum Agent Versions

Are there minimum versions of the Datadog Agent and/or Cluster Agent required?

Agent: vX.Y.Z
Cluster Agent: vX.Y.Z

Describe your test plan

Can be tested locally on kind cluster:

Create kind cluster with multiple nodes (assume 1 control plane, 3 workers).
Deploy Operator with feature enabled.

      args:
      - --untaintControllerEnabled=true
      ...
      env:
      - name: DD_UNTAINT_CONTROLLER_EVENTS_ENABLED
        value: "true"

Taint 2 worker nodes

kubectl taint nodes tainted-worker agent.datadoghq.com/not-ready=presence:NoSchedule
kubectl taint nodes tainted-worker2 agent.datadoghq.com/not-ready=presence:NoSchedule

Create nginx deployment with 3 replicas and antiaffinity not to schedule two on same node

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-test
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx-test
  template:
    metadata:
      labels:
        app: nginx-test
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - nginx-test
            topologyKey: kubernetes.io/hostname
      containers:
      - name: nginx
        image: nginx

One replica should be schedule on tainted-worker3, other two will be pending.
Apply DatadogAgent manifest (add toleration to agent pods)

apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
  name: datadog-agent
spec:
  global:
    credentials:
      apiSecret:
        secretName: datadog-secret
        keyName: api-key
    kubelet:
      tlsVerify: false
  features:
    logCollection:
      enabled: true
      containerCollectUsingFiles: true
      containerCollectAll: true
  override:
    nodeAgent:
      tolerations:
        - key: agent.datadoghq.com/not-ready
          operator: Exists
          effect: NoSchedule

Agents will be schedule to all nodes, once Agents become ready on tainted-worker and tainted-worker2 nginx pods will be scheduled too.

# monitor events and taint removal
kubectl get event -w -n default | grep -i taint
kubectl get nodes -w -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'

Checklist

PR has at least one valid label: bug, enhancement, refactoring, documentation, tooling, and/or dependencies
PR has a milestone or the qa/skip-qa label
All commits are signed (see: signing commits)

datadog-official · 2026-05-22T21:44:56Z

🎯 Code Coverage (details)
• Patch Coverage: 82.70%
• Overall Coverage: 44.39% (+1.81%)

_{This comment will be updated automatically if new data arrives.

🔗 Commit SHA: 9fa9fc5 | Docs | Datadog PR Page | Give us feedback!}

codecov-commenter · 2026-05-22T22:22:53Z

Codecov Report

❌ Patch coverage is 83.59375% with 42 lines in your changes missing coverage. Please review.
✅ Project coverage is 44.07%. Comparing base (b2a60a9) to head (9fa9fc5).
⚠️ Report is 21 commits behind head on main.

Files with missing lines	Patch %	Lines
internal/controller/untaint_controller.go	89.86%	14 Missing and 9 partials ⚠️
pkg/config/config.go	33.33%	6 Missing ⚠️
internal/controller/metrics/untaint.go	0.00%	5 Missing ⚠️
internal/controller/setup.go	58.33%	3 Missing and 2 partials ⚠️
cmd/main.go	0.00%	3 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2753      +/-   ##
==========================================
+ Coverage   42.24%   44.07%   +1.83%     
==========================================
  Files         337      343       +6     
  Lines       28951    30931    +1980     
==========================================
+ Hits        12230    13633    +1403     
- Misses      15916    16428     +512     
- Partials      805      870      +65

Flag	Coverage Δ
unittests	`44.07% <83.59%> (+1.83%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
cmd/main.go	`6.80% <0.00%> (-0.07%)`	⬇️
internal/controller/metrics/untaint.go	`0.00% <0.00%> (ø)`
internal/controller/setup.go	`68.53% <58.33%> (+0.59%)`	⬆️
pkg/config/config.go	`69.30% <33.33%> (-2.57%)`	⬇️
internal/controller/untaint_controller.go	`89.86% <89.86%> (ø)`

... and 40 files with indirect coverage changes

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b2a60a9...9fa9fc5. Read the comment docs.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 359ab79dc6

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-05-22T22:25:40Z

 .PHONY: docker-build-ci
 docker-build-ci:
-	docker build . -t ${IMG} --build-arg FIPS_ENABLED="${FIPS_ENABLED}" --build-arg LDFLAGS="${LDFLAGS}" --build-arg GOARCH="${GOARCH}"
+	docker build . -t ${IMG} --build-arg FIPS_ENABLED="${FIPS_ENABLED}" --build-arg LDFLAGS="${LDFLAGS}" --build-arg GOARCH="${GOARCH}" --platform=linux/${GOARCH}


Avoid passing empty GOARCH to docker --platform

docker-build-ci now always appends --platform=linux/${GOARCH}, but GOARCH is declared as optional (GOARCH?=) and is often unset for local runs. In that common case the command becomes --platform=linux/, which makes docker build fail before the image build starts. This regresses the documented “For local use” target unless every caller exports GOARCH first.

Useful? React with 👍 / 👎.

this was supposed to be local-only change.

janine-c

Some very boring style edits that you can take or leave 🙂

janine-c · 2026-05-25T16:43:41Z

+
+- `untaint_taint_removals_total` — counter, every taint removal regardless of cause.
+- `untaint_taint_removal_latency_seconds` — histogram, time between pod Ready and taint removal.
+- `untaint_taint_timeouts_total{reason, policy}` — counter, timeout decisions. `reason` ∈ {`readiness`, `scheduling`}; `policy` ∈ {`remove`, `keep`}. Alert on `policy="keep"` to investigate stuck nodes.


We don't seem to use this ∈ notation very often. If you think users will understand it, leave it, but I like to err on the side of clarity just in case a user hasn't encountered it before.

I'll replace it with in.

Co-authored-by: Janine Chan <64388808+janine-c@users.noreply.github.com>

adel121 · 2026-05-29T10:59:21Z

+	//   1. Any pod exists on the node → readiness timeout
+	//      (clock = max(pod.Status.StartTime), so a fresh restart resets it).
+	//      If pods exist but no StartTime is set yet (very early lifecycle —
+	//      kubelet hasn't transitioned through PodInitializing yet), requeue
+	//      after the readiness window so we re-check once StartTime appears.
+	//      We do NOT fall through to the scheduling clock here: a pod *is*
+	//      scheduled, so the "no agent pod ever scheduled" condition does not
+	//      hold and applying the scheduling timeout would be incorrect.


If I understand correctly, this is assuming that the agent is the only pod that is tolerating the no-schedule taint.

If this is the case, I see 2 potential issues:

What if the user wanted to exclude specific pods from being blocked? I think this can happen easily if the user is running their own CSI driver (not referring to Datadog CSI Driver, can be any system critical component that should not depend on the existence of the agent itself).

I think the current implementation will break (or at least cause friction) if we wanted in the future to support covering other components with this controller. The example I have in mind now is Datadog CSI Driver + Datadog Agent Daemonset.

Wouldn't it be more reliable to filter for target pods ? (here it is only the agent pods, but should be configurable to support multiple selectors I assume)

Are you referring to timeout behavior specifically or in general?

If user wants to exclude specific pods from being blocked they'll add tolerations. That would be case for CSI, cilium, some other infra daemonsets. Applications won't have that toleration and they'll be blocked until agent is ready.

To support CSI driver, which we plan to, we can extend this controller. It will handle new taint, cache filter will change. Agent DS will not tolerate CSI driver taints to ensure CSI pods start first (if that's the intention).

If user wants to exclude specific pods from being blocked they'll add tolerations. That would be case for CSI, cilium, some other infra daemonsets. Applications won't have that toleration and they'll be blocked until agent is ready.

But if I am not mistaken, the code doesn't look for agent pods to consider the node as ready to lose the taint.
It just accepts any pod (see here). So for example, if the user adds the toleration to some infra pod, and it gets scheduled first on the node, then the operator will remove the taint from the node, even if the agent pod is not yet scheduled and healthy.

To support CSI driver, which we plan to, we can extend this controller. It will handle new taint, cache filter will change. Agent DS will not tolerate CSI driver taints to ensure CSI pods start first (if that's the intention).

This is clear, but I still see the same concern as the previous point.

podList here is list of agent pods.

podList := &corev1.PodList{} labelSelector := labels.SelectorFromSet(map[string]string{ common.AgentDeploymentComponentLabelKey: constants.DefaultAgentResourceSuffix, }) if err := r.client.List(ctx, podList, client.MatchingLabelsSelector{Selector: labelSelector}, client.MatchingFields{untaintPodNodeIndex: req.Name},

idea is that we check timeout against to most recent agent pod start. I'll fix the comment as it doesn't explicitly mention that we are looking at agent pods.

you are right, I missed the label selector in the client.List.

But then now I see that the controller hardcodes the selector to match agent pods only. So currently there is no way to dynamically support other target pods. But this is non-blocking for this PR, can be refactored in the following PRs adding support for CSI driver 👍

adel121

Left minor comments/questions

adel121

Thanks for addressing all the comments and answering all questions 🙇

levan-m added the do-not-merge label Mar 12, 2026

[draft] untaint controller

0b567c6

levan-m force-pushed the levan-m/untaint-controller branch from 2dc7aa1 to 0b567c6 Compare March 12, 2026 21:45

levan-m added the enhancement New feature or request label Mar 12, 2026

Merge branch 'main' into levan-m/untaint-controller

83421a8

github-actions Bot added the team/container-platform label May 15, 2026

Merge branch 'main' into levan-m/untaint-controller

434b936

Fixes, improvements, doc

359ab79

github-actions Bot added the team/documentation label May 22, 2026

levan-m changed the title ~~[draft] untaint controller~~ Untaint controller May 22, 2026

levan-m added this to the v1.28.0 milestone May 22, 2026

levan-m removed the do-not-merge label May 22, 2026

levan-m marked this pull request as ready for review May 22, 2026 22:22

levan-m requested a review from a team May 22, 2026 22:22

levan-m requested a review from a team as a code owner May 22, 2026 22:22

chatgpt-codex-connector Bot reviewed May 22, 2026

View reviewed changes

janine-c approved these changes May 25, 2026

View reviewed changes

levan-m and others added 3 commits May 26, 2026 11:51

Update docs/untaint_controller.md

bc3b02d

Co-authored-by: Janine Chan <64388808+janine-c@users.noreply.github.com>

Apply suggestions from code review

6641386

Co-authored-by: Janine Chan <64388808+janine-c@users.noreply.github.com>

doc and makefile fixes

302a746