Skip to content

Untaint controller#2753

Merged
levan-m merged 10 commits into
mainfrom
levan-m/untaint-controller
Jun 3, 2026
Merged

Untaint controller#2753
levan-m merged 10 commits into
mainfrom
levan-m/untaint-controller

Conversation

@levan-m

@levan-m levan-m commented Mar 12, 2026

Copy link
Copy Markdown
Collaborator

What does this PR do?

CONTP-1607, CONTP-1613

Untaint controller to prevent scheduling workloads before Agent when nodes join cluster.

When enabled, untaint controller will watch nodes and remove agent.datadoghq.com/not-ready=presence:NoSchedule taints once Agent reaches ready state. Until then pods can't be scheduled on the node unless they tolerate this taint. This is effective during cluster scale out when new nodes join with the taint. Once removed Operator will not add taint back for any reason.

Feature can be enabled via --untaintControllerEnabled=flag arg, optionally one can enable taint removal events using DD_UNTAINT_CONTROLLER_EVENTS_ENABLED=true env var.

Commits

0b567c6 - POC

359ab79 - improvements:

  • Conflict-handling fix, requeues in 1s instead of returning an error (no exponential backoff).
  • Defensive JSON-patch marshaling. Guarantees taint slices serialize as [] rather than null in both test and replace ops.
  • Field-indexer collision-proofing. Renames the pod-by-nodeName cache index to a namespaced key untaint.spec.nodeName.
  • Two-tier timeout policy. Adds DD_UNTAINT_CONTROLLER_TIMEOUT (pod-exists-not-ready, default 10m) and DD_UNTAINT_CONTROLLER_SCHEDULING_TIMEOUT (no pod, default 5m), with DD_UNTAINT_CONTROLLER_TIMEOUT_POLICY (remove
    default | keep) controlling whether to untaint anyway or just emit observability.
  • Node watch added. Filters to nodes carrying the target taint, so nodes that never get an agent pod still get reconciled.
  • Pod predicate widened. Enqueues on any agent-pod Create/Update/Delete (not just Ready transitions) so the readiness clock can't stall under policy=keep or when NotReady pods appear between reconciles.
  • Log error on config errors - similar to other controllers untaint-controller startup error is logged — no silent fallback to defaults and other controllers continue running.
  • New metric. untaint_taint_timeouts_total{reason, policy} for alerting on stuck nodes; existing taint_removals_total and taint_removal_latency_seconds kept.
  • Tests. Adds a unit suite (fake client + fake clock, full coverage of reconcile outcomes / predicates / conflict paths / env-var errors); rewrites the integration suite around fresh per-test nodes and adds
    scheduling/readiness-timeout scenarios.
  • Docs. Adds docs/untaint_controller.md.

Motivation

#2052 and similar requests through other channels.

Additional Notes

Anything else we should know when reviewing?

Minimum Agent Versions

Are there minimum versions of the Datadog Agent and/or Cluster Agent required?

  • Agent: vX.Y.Z
  • Cluster Agent: vX.Y.Z

Describe your test plan

Can be tested locally on kind cluster:

  1. Create kind cluster with multiple nodes (assume 1 control plane, 3 workers).
  2. Deploy Operator with feature enabled.
      args:
      - --untaintControllerEnabled=true
      ...
      env:
      - name: DD_UNTAINT_CONTROLLER_EVENTS_ENABLED
        value: "true"
  1. Taint 2 worker nodes
kubectl taint nodes tainted-worker agent.datadoghq.com/not-ready=presence:NoSchedule
kubectl taint nodes tainted-worker2 agent.datadoghq.com/not-ready=presence:NoSchedule
  1. Create nginx deployment with 3 replicas and antiaffinity not to schedule two on same node
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-test
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx-test
  template:
    metadata:
      labels:
        app: nginx-test
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - nginx-test
            topologyKey: kubernetes.io/hostname
      containers:
      - name: nginx
        image: nginx
  1. One replica should be schedule on tainted-worker3, other two will be pending.
  2. Apply DatadogAgent manifest (add toleration to agent pods)
apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
  name: datadog-agent
spec:
  global:
    credentials:
      apiSecret:
        secretName: datadog-secret
        keyName: api-key
    kubelet:
      tlsVerify: false
  features:
    logCollection:
      enabled: true
      containerCollectUsingFiles: true
      containerCollectAll: true
  override:
    nodeAgent:
      tolerations:
        - key: agent.datadoghq.com/not-ready
          operator: Exists
          effect: NoSchedule
  1. Agents will be schedule to all nodes, once Agents become ready on tainted-worker and tainted-worker2 nginx pods will be scheduled too.
# monitor events and taint removal
kubectl get event -w -n default | grep -i taint
kubectl get nodes -w -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'

Checklist

  • PR has at least one valid label: bug, enhancement, refactoring, documentation, tooling, and/or dependencies
  • PR has a milestone or the qa/skip-qa label
  • All commits are signed (see: signing commits)

@levan-m levan-m force-pushed the levan-m/untaint-controller branch from 2dc7aa1 to 0b567c6 Compare March 12, 2026 21:45
@levan-m levan-m added the enhancement New feature or request label Mar 12, 2026
@datadog-official

datadog-official Bot commented May 22, 2026

Copy link
Copy Markdown

Code Coverage

🎯 Code Coverage (details)
Patch Coverage: 82.70%
Overall Coverage: 44.39% (+1.81%)

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 9fa9fc5 | Docs | Datadog PR Page | Give us feedback!

@levan-m levan-m changed the title [draft] untaint controller Untaint controller May 22, 2026
@levan-m levan-m added this to the v1.28.0 milestone May 22, 2026
@levan-m levan-m marked this pull request as ready for review May 22, 2026 22:22
@levan-m levan-m requested a review from a team May 22, 2026 22:22
@levan-m levan-m requested a review from a team as a code owner May 22, 2026 22:22
@codecov-commenter

codecov-commenter commented May 22, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 83.59375% with 42 lines in your changes missing coverage. Please review.
✅ Project coverage is 44.07%. Comparing base (b2a60a9) to head (9fa9fc5).
⚠️ Report is 21 commits behind head on main.

Files with missing lines Patch % Lines
internal/controller/untaint_controller.go 89.86% 14 Missing and 9 partials ⚠️
pkg/config/config.go 33.33% 6 Missing ⚠️
internal/controller/metrics/untaint.go 0.00% 5 Missing ⚠️
internal/controller/setup.go 58.33% 3 Missing and 2 partials ⚠️
cmd/main.go 0.00% 3 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2753      +/-   ##
==========================================
+ Coverage   42.24%   44.07%   +1.83%     
==========================================
  Files         337      343       +6     
  Lines       28951    30931    +1980     
==========================================
+ Hits        12230    13633    +1403     
- Misses      15916    16428     +512     
- Partials      805      870      +65     
Flag Coverage Δ
unittests 44.07% <83.59%> (+1.83%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
cmd/main.go 6.80% <0.00%> (-0.07%) ⬇️
internal/controller/metrics/untaint.go 0.00% <0.00%> (ø)
internal/controller/setup.go 68.53% <58.33%> (+0.59%) ⬆️
pkg/config/config.go 69.30% <33.33%> (-2.57%) ⬇️
internal/controller/untaint_controller.go 89.86% <89.86%> (ø)

... and 40 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b2a60a9...9fa9fc5. Read the comment docs.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 359ab79dc6

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread Makefile Outdated
.PHONY: docker-build-ci
docker-build-ci:
docker build . -t ${IMG} --build-arg FIPS_ENABLED="${FIPS_ENABLED}" --build-arg LDFLAGS="${LDFLAGS}" --build-arg GOARCH="${GOARCH}"
docker build . -t ${IMG} --build-arg FIPS_ENABLED="${FIPS_ENABLED}" --build-arg LDFLAGS="${LDFLAGS}" --build-arg GOARCH="${GOARCH}" --platform=linux/${GOARCH}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid passing empty GOARCH to docker --platform

docker-build-ci now always appends --platform=linux/${GOARCH}, but GOARCH is declared as optional (GOARCH?=) and is often unset for local runs. In that common case the command becomes --platform=linux/, which makes docker build fail before the image build starts. This regresses the documented “For local use” target unless every caller exports GOARCH first.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was supposed to be local-only change.

@janine-c janine-c left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some very boring style edits that you can take or leave 🙂

Comment thread docs/untaint_controller.md Outdated
Comment thread docs/untaint_controller.md Outdated
Comment thread docs/untaint_controller.md Outdated
Comment thread docs/untaint_controller.md Outdated
Comment thread docs/untaint_controller.md Outdated
Comment thread docs/untaint_controller.md Outdated
Comment thread docs/untaint_controller.md Outdated
Comment thread docs/untaint_controller.md Outdated
Comment thread docs/untaint_controller.md Outdated

- `untaint_taint_removals_total` — counter, every taint removal regardless of cause.
- `untaint_taint_removal_latency_seconds` — histogram, time between pod Ready and taint removal.
- `untaint_taint_timeouts_total{reason, policy}` — counter, timeout decisions. `reason` ∈ {`readiness`, `scheduling`}; `policy` ∈ {`remove`, `keep`}. Alert on `policy="keep"` to investigate stuck nodes.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't seem to use this notation very often. If you think users will understand it, leave it, but I like to err on the side of clarity just in case a user hasn't encountered it before.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll replace it with in.

Comment thread docs/untaint_controller.md Outdated
levan-m and others added 3 commits May 26, 2026 11:51
Co-authored-by: Janine Chan <64388808+janine-c@users.noreply.github.com>
Co-authored-by: Janine Chan <64388808+janine-c@users.noreply.github.com>
Comment thread internal/controller/setup.go Outdated
Comment thread internal/controller/suite_v2_test.go
Comment thread internal/controller/suite_v2_test.go Outdated
Comment thread internal/controller/untaint_controller.go
Comment thread internal/controller/untaint_controller.go
Comment thread internal/controller/untaint_controller.go
Comment on lines +241 to +248
// 1. Any pod exists on the node → readiness timeout
// (clock = max(pod.Status.StartTime), so a fresh restart resets it).
// If pods exist but no StartTime is set yet (very early lifecycle —
// kubelet hasn't transitioned through PodInitializing yet), requeue
// after the readiness window so we re-check once StartTime appears.
// We do NOT fall through to the scheduling clock here: a pod *is*
// scheduled, so the "no agent pod ever scheduled" condition does not
// hold and applying the scheduling timeout would be incorrect.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, this is assuming that the agent is the only pod that is tolerating the no-schedule taint.

If this is the case, I see 2 potential issues:

  • What if the user wanted to exclude specific pods from being blocked? I think this can happen easily if the user is running their own CSI driver (not referring to Datadog CSI Driver, can be any system critical component that should not depend on the existence of the agent itself).
  • I think the current implementation will break (or at least cause friction) if we wanted in the future to support covering other components with this controller. The example I have in mind now is Datadog CSI Driver + Datadog Agent Daemonset.

Wouldn't it be more reliable to filter for target pods ? (here it is only the agent pods, but should be configurable to support multiple selectors I assume)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you referring to timeout behavior specifically or in general?

  • If user wants to exclude specific pods from being blocked they'll add tolerations. That would be case for CSI, cilium, some other infra daemonsets. Applications won't have that toleration and they'll be blocked until agent is ready.
  • To support CSI driver, which we plan to, we can extend this controller. It will handle new taint, cache filter will change. Agent DS will not tolerate CSI driver taints to ensure CSI pods start first (if that's the intention).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If user wants to exclude specific pods from being blocked they'll add tolerations. That would be case for CSI, cilium, some other infra daemonsets. Applications won't have that toleration and they'll be blocked until agent is ready.

But if I am not mistaken, the code doesn't look for agent pods to consider the node as ready to lose the taint.
It just accepts any pod (see here). So for example, if the user adds the toleration to some infra pod, and it gets scheduled first on the node, then the operator will remove the taint from the node, even if the agent pod is not yet scheduled and healthy.

To support CSI driver, which we plan to, we can extend this controller. It will handle new taint, cache filter will change. Agent DS will not tolerate CSI driver taints to ensure CSI pods start first (if that's the intention).

This is clear, but I still see the same concern as the previous point.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

podList here is list of agent pods.

	podList := &corev1.PodList{}
	labelSelector := labels.SelectorFromSet(map[string]string{
		common.AgentDeploymentComponentLabelKey: constants.DefaultAgentResourceSuffix,
	})
	if err := r.client.List(ctx, podList,
		client.MatchingLabelsSelector{Selector: labelSelector},
		client.MatchingFields{untaintPodNodeIndex: req.Name},

idea is that we check timeout against to most recent agent pod start. I'll fix the comment as it doesn't explicitly mention that we are looking at agent pods.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right, I missed the label selector in the client.List.

But then now I see that the controller hardcodes the selector to match agent pods only. So currently there is no way to dynamically support other target pods. But this is non-blocking for this PR, can be refactored in the following PRs adding support for CSI driver 👍

Comment thread internal/controller/untaint_controller.go Outdated
Comment thread internal/controller/untaint_controller.go Outdated

@adel121 adel121 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left minor comments/questions

@adel121 adel121 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing all the comments and answering all questions 🙇

Comment thread docs/untaint_controller.md Outdated
@levan-m levan-m merged commit d60ff16 into main Jun 3, 2026
30 of 38 checks passed
@levan-m levan-m deleted the levan-m/untaint-controller branch June 3, 2026 14:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants