[CONTP-1609] Auto-inject agent toleration when untaint controller is enabled #3086

Merged
adel121 merged 9 commits into main from adelhajhassan/auto-add-untaint-toleration-to-agent
Jun 5, 2026
Conversation

@adel121 adel121 commented Jun 4, 2026

Contributor

What does this PR do?

Updates the Datadog agent controller to auto-inject the agent toleration when the untaint controller is enabled.

Relates to #2753

Motivation

For the untaint controller to work, the agent DaemonSet must tolerate the startup taint (agent.datadoghq.com/not-ready=presence:NoSchedule) — otherwise the agent can never schedule on tainted nodes, the taint is never removed, and all workloads are blocked indefinitely. Without auto-injection this is a silent foot-gun: enabling the flag without manually adding the toleration produces no error but a permanently deadlocked cluster.

When --untaintControllerEnabled=true, the operator automatically injects the toleration into the node agent DaemonSet spec, following the same pattern used by other feature flags that influence agent pod spec assembly (IntrospectionEnabled, DatadogAgentProfileEnabled).

  • Pass UntaintControllerEnabled through datadogagent.ReconcilerOptions.
  • Inject the toleration in the node agent DaemonSet builder (idempotent: no duplicate if the user also sets it manually).
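
As a rough illustration of the idempotent injection described above, here is a minimal sketch. The `Toleration` and `Taint` types are simplified, hypothetical stand-ins for the `corev1` types, `tolerates` only approximates Kubernetes toleration matching, and `ensureToleration` is an illustrative name rather than the function added in this PR. The taint key, value, and effect come from the Motivation section.

```go
package main

import "fmt"

// Toleration and Taint are simplified, hypothetical stand-ins for the
// corev1 types; only the fields needed for matching are modeled.
type Toleration struct {
	Key      string
	Operator string // "Equal" or "Exists"; empty is treated as "Equal"
	Value    string
	Effect   string // empty matches any effect
}

type Taint struct {
	Key    string
	Value  string
	Effect string
}

// notReadyTaint mirrors the startup taint quoted in the PR description.
var notReadyTaint = Taint{
	Key:    "agent.datadoghq.com/not-ready",
	Value:  "presence",
	Effect: "NoSchedule",
}

// tolerates approximates Kubernetes toleration matching (assumption:
// only the Equal and Exists operators are considered).
func tolerates(t Toleration, taint Taint) bool {
	if t.Effect != "" && t.Effect != taint.Effect {
		return false
	}
	if t.Key != "" && t.Key != taint.Key {
		return false
	}
	switch t.Operator {
	case "Exists":
		return true
	case "", "Equal":
		return t.Value == taint.Value
	}
	return false
}

// ensureToleration appends the Equal toleration only when the feature is
// enabled and no existing toleration already matches the taint, so
// repeated calls do not add duplicates.
func ensureToleration(tols []Toleration, enabled bool) []Toleration {
	if !enabled {
		return tols
	}
	for _, t := range tols {
		if tolerates(t, notReadyTaint) {
			return tols
		}
	}
	return append(tols, Toleration{
		Key:      notReadyTaint.Key,
		Operator: "Equal",
		Value:    notReadyTaint.Value,
		Effect:   notReadyTaint.Effect,
	})
}

func main() {
	tols := ensureToleration(nil, true)
	tols = ensureToleration(tols, true) // second call is a no-op
	fmt.Println(len(tols))              // prints 1
}
```

Checking by toleration matching rather than by exact equality means a broader user-supplied toleration (for example, one with the `Exists` operator and no key) also suppresses the injection.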

Additional Notes

Anything else we should know when reviewing?

Minimum Agent Versions

Are there minimum versions of the Datadog Agent and/or Cluster Agent required?

  • Agent: vX.Y.Z
  • Cluster Agent: vX.Y.Z

Describe your test plan

Follow the same testing instructions as #2753, but instead of manually adding the toleration to the Datadog agent, verify that the agent receives the toleration automatically.

Checklist

  • PR has at least one valid label: bug, enhancement, refactoring, documentation, tooling, and/or dependencies
  • PR has a milestone or the qa/skip-qa label
  • All commits are signed (see: signing commits)

datadog-datadog-prod-us1-2 Bot commented Jun 4, 2026

Code Coverage

🎯 Code Coverage (details)
Patch Coverage: 86.21%
Overall Coverage: 44.24% (+0.28%)

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 87e8b9d | Docs | Datadog PR Page | Give us feedback!

@adel121 adel121 marked this pull request as ready for review June 4, 2026 13:30
@adel121 adel121 requested a review from a team June 4, 2026 13:30
@adel121 adel121 requested a review from a team as a code owner June 4, 2026 13:30
@adel121 adel121 force-pushed the adelhajhassan/auto-add-untaint-toleration-to-agent branch from 44c716a to 31dae4d Compare June 4, 2026 13:44
adel121 commented Jun 4, 2026

Contributor Author

@codex

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Breezy!

codecov-commenter commented Jun 4, 2026


Codecov Report

❌ Patch coverage is 88.23529% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 43.95%. Comparing base (5301a0c) to head (87e8b9d).
⚠️ Report is 9 commits behind head on main.

Files with missing lines | Patch % | Lines
...datadogagentinternal/controller_reconcile_agent.go | 0.00% | 4 Missing ⚠️
@@            Coverage Diff             @@
##             main    #3086      +/-   ##
==========================================
+ Coverage   43.64%   43.95%   +0.31%     
==========================================
  Files         350      352       +2     
  Lines       30075    30289     +214     
==========================================
+ Hits        13125    13313     +188     
- Misses      16079    16100      +21     
- Partials      871      876       +5     
Flag | Coverage Δ
unittests | 43.95% <88.23%> (+0.31%) ⬆️

Files with missing lines | Coverage Δ
...datadogagent/component/agent/untaint_toleration.go | 100.00% <100.00%> (ø)
internal/controller/datadogagent/controller.go | 48.00% <ø> (ø)
...rnal/controller/datadogagentinternal/controller.go | 0.00% <ø> (ø)
internal/controller/setup.go | 68.96% <100.00%> (+0.43%) ⬆️
internal/controller/untaint_controller.go | 89.73% <100.00%> (-0.14%) ⬇️
pkg/untaint/taint.go | 100.00% <100.00%> (ø)
...datadogagentinternal/controller_reconcile_agent.go | 4.03% <0.00%> (-0.14%) ⬇️

... and 4 files with indirect coverage changes


@adel121 adel121 requested a review from a team as a code owner June 4, 2026 14:06
@adel121 adel121 force-pushed the adelhajhassan/auto-add-untaint-toleration-to-agent branch from cd1e16f to 9303804 Compare June 4, 2026 14:09
@hmahmood hmahmood removed the request for review from a team June 4, 2026 14:27

@OliviaShoup OliviaShoup left a comment

Contributor

thanks for the PR! left a comment with a minor suggestion

Comment thread docs/untaint_controller.md
func podToleratesAgentNotReadyStartup(tolerations []corev1.Toleration) bool {
	taint := untaint.AgentNotReadyTaint()
	for i := range tolerations {
		if tolerations[i].ToleratesTaint(klog.Background(), &taint, false) {

Collaborator

no need to access klog directly; the logger can be passed from the controller.

experimental.ApplyExperimentalOverrides(objLogger, ddai, podManagers)

if r.options.UntaintControllerEnabled {
	componentagent.EnsureAgentNotReadyStartupToleration(&podManagers.PodTemplateSpec().Spec)

Collaborator

pass objLogger here, it will have the relevant attributes. Alternatively, pass the context (see the start of the function) and create the logger inside the function via ctrl.LoggerFrom(ctx).WithValues with additional attributes.

// EnsureAgentNotReadyStartupToleration appends the agent-not-ready Equal toleration
// when not already tolerated per Kubernetes toleration matching.
func EnsureAgentNotReadyStartupToleration(spec *corev1.PodSpec) {
	if spec == nil {

Collaborator

nit - redundant nil check, since PodSpec is a struct field of PodTemplateSpec and can't be nil here.
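
To illustrate the point about the nil check, here is a small sketch with hypothetical stand-in types (not the real `corev1` definitions). When a struct is embedded by value as a field, taking the address of that field on an addressable parent always yields a non-nil pointer.

```go
package main

import "fmt"

// PodSpec and PodTemplateSpec are simplified, hypothetical stand-ins.
// The key property mirrored here is that Spec is a value field, not a pointer.
type PodSpec struct{ Tolerations []string }
type PodTemplateSpec struct{ Spec PodSpec }

// appendToleration mutates the spec in place through the pointer.
func appendToleration(spec *PodSpec, tol string) {
	spec.Tolerations = append(spec.Tolerations, tol)
}

func main() {
	var tmpl PodTemplateSpec // zero value, nothing initialized
	spec := &tmpl.Spec       // address of a value field is never nil
	fmt.Println(spec != nil) // prints true
	appendToleration(spec, "agent.datadoghq.com/not-ready")
	fmt.Println(len(tmpl.Spec.Tolerations)) // prints 1
}
```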

Comment thread pkg/untaint/taint.go Outdated
)

// AgentNotReadyTaintEffect is the effect for the agent-not-ready startup taint.
const AgentNotReadyTaintEffect = corev1.TaintEffectNoSchedule

Collaborator

nit - doesn't have to be public and can be inlined.


wantTol := untaint.AgentNotReadyEqualToleration()
tt := testCase{
	name: "untaint controller enabled injects agent-not-ready toleration on node agent DS",

Collaborator

for completeness, it would be nice to have a case asserting that tolerations aren't added when the feature is disabled.
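
A table-driven sketch of the suggested coverage might look like the following. Everything here is hypothetical: `toleration`, `buildTolerations`, and the case table are illustrative stand-ins, not the test code in this PR, and presence is checked by exact equality, which is simpler than real Kubernetes toleration matching.

```go
package main

import "fmt"

// toleration is a simplified, hypothetical stand-in for corev1.Toleration.
type toleration struct{ Key, Value, Effect string }

// startupToleration matches the taint quoted in the PR description.
var startupToleration = toleration{"agent.datadoghq.com/not-ready", "presence", "NoSchedule"}

// buildTolerations is a hypothetical builder step: it appends the startup
// toleration only when the feature is enabled and it is not already present.
func buildTolerations(existing []toleration, untaintEnabled bool) []toleration {
	out := append([]toleration(nil), existing...) // copy to avoid aliasing the input
	if !untaintEnabled {
		return out
	}
	for _, t := range out {
		if t == startupToleration {
			return out
		}
	}
	return append(out, startupToleration)
}

type testCase struct {
	name     string
	enabled  bool
	existing []toleration
	want     int // expected number of tolerations after the builder runs
}

// cases covers the enabled, disabled, and already-present scenarios.
func cases() []testCase {
	return []testCase{
		{"enabled injects toleration", true, nil, 1},
		{"disabled leaves tolerations untouched", false, nil, 0},
		{"enabled does not duplicate a user-set toleration", true, []toleration{startupToleration}, 1},
	}
}

func main() {
	for _, tc := range cases() {
		got := len(buildTolerations(tc.existing, tc.enabled))
		fmt.Printf("%s: got %d, want %d\n", tc.name, got, tc.want)
	}
}
```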

Co-authored-by: Olivia Shoup <116908616+OliviaShoup@users.noreply.github.com>
adel121 commented Jun 5, 2026

Contributor Author

@levan-m thanks for the review 🙇

Addressed all your comments!

@adel121 adel121 requested a review from levan-m June 5, 2026 13:27
@adel121 adel121 merged commit 39d6bd3 into main Jun 5, 2026
30 of 38 checks passed
@adel121 adel121 deleted the adelhajhassan/auto-add-untaint-toleration-to-agent branch June 5, 2026 14:15
4 participants