Merged
4 changes: 4 additions & 0 deletions cmd/main.go
@@ -147,6 +147,7 @@ type options struct {
datadogDashboardEnabled bool
datadogGenericResourceEnabled bool
datadogCSIDriverEnabled bool
untaintControllerEnabled bool

// Secret Backend options
secretBackendCommand string
@@ -186,6 +187,7 @@ func (opts *options) Parse() {
flag.BoolVar(&opts.datadogDashboardEnabled, "datadogDashboardEnabled", false, "Enable the DatadogDashboard controller")
flag.BoolVar(&opts.datadogGenericResourceEnabled, "datadogGenericResourceEnabled", false, "Enable the DatadogGenericResource controller")
flag.BoolVar(&opts.datadogCSIDriverEnabled, "datadogCSIDriverEnabled", false, "Enable the DatadogCSIDriver controller")
flag.BoolVar(&opts.untaintControllerEnabled, "untaintControllerEnabled", false, "Enable the Untaint controller")

// DatadogAgentInternal
flag.BoolVar(&opts.createControllerRevisions, "createControllerRevisions", false, "Enable creation of ControllerRevision snapshots on each DDA spec change")
@@ -293,6 +295,7 @@ func run(opts *options) error {
DatadogDashboardEnabled: opts.datadogDashboardEnabled,
DatadogGenericResourceEnabled: opts.datadogGenericResourceEnabled,
DatadogCSIDriverEnabled: opts.datadogCSIDriverEnabled,
UntaintControllerEnabled: opts.untaintControllerEnabled,
}),
// UsePriorityQueue makes all controllers use the priority queue, which
// directly registers workqueue metrics into controller-runtime's metrics
@@ -376,6 +379,7 @@ func run(opts *options) error {
DatadogDashboardEnabled: opts.datadogDashboardEnabled,
DatadogGenericResourceEnabled: opts.datadogGenericResourceEnabled,
DatadogCSIDriverEnabled: opts.datadogCSIDriverEnabled,
UntaintControllerEnabled: opts.untaintControllerEnabled,
}

versionInfo, platformInfo, err := getVersionAndPlatformInfo(rest.CopyConfig(mgr.GetConfig()))
79 changes: 79 additions & 0 deletions docs/untaint_controller.md
@@ -0,0 +1,79 @@
# Untaint Controller

This feature was introduced in Datadog Operator v1.28 and is currently in preview.

## Overview

The Untaint controller watches Kubernetes Nodes carrying the taint
`agent.datadoghq.com/not-ready=presence:NoSchedule` and removes it once the
Datadog Agent pod on that node is `Ready`. It is intended to run alongside a
separate mechanism (cluster-autoscaler hook, CCM, admission webhook, etc.)
that adds the taint to new nodes. The use case is keeping workloads off a
node until the Datadog Agent is Ready, and recovering gracefully if the Agent never
becomes Ready.

Agent pods are matched by the label `agent.datadoghq.com/component=agent` in
the operator's watched namespaces (`WATCH_NAMESPACE` /
`DD_AGENT_WATCH_NAMESPACE`).

If the Agent pod never reaches Ready on a tainted node, a configurable timeout
policy ensures the node is never permanently unschedulable. Two clocks cover
the two failure modes:

- **Readiness timeout** — the Agent pod is on the node but not Ready. Clock:
`pod.Status.StartTime`. Pod recreation restarts the window; container
restarts inside the same pod do not.
- **Scheduling timeout** — no Agent pod is on the node. Clock:
`node.metadata.creationTimestamp`. The expected path when a DaemonSet never
schedules a pod onto the node (taint not tolerated, missing labels, etc.).

A pod-recreation crash-loop faster than the readiness window can hold a node
tainted indefinitely; run with `policy=keep` and alert on
`untaint_taint_timeouts_total{policy="keep"}` to catch this.

The controller removes only this fixed taint and does not add it; both
timeouts are global and cannot be tuned per Node (Group), DDA, or DAP.
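
As an illustration of "removes only this fixed taint", the sketch below filters one taint out of a node's taint list and leaves every other taint untouched. The `taint` struct and `withoutTaint` helper are hypothetical stand-ins for the Kubernetes types, and matching on the exact key, value, and effect is an assumption of this sketch rather than confirmed controller behavior.

```go
package main

import "fmt"

// taint mirrors the key/value/effect triple of a Kubernetes node taint
// (a simplified, hypothetical stand-in for corev1.Taint).
type taint struct {
	Key, Value, Effect string
}

// agentNotReady is the fixed taint named in this document.
var agentNotReady = taint{
	Key:    "agent.datadoghq.com/not-ready",
	Value:  "presence",
	Effect: "NoSchedule",
}

// withoutTaint returns the taints that do not equal target, and reports
// whether anything was dropped. The input slice is not modified.
func withoutTaint(taints []taint, target taint) ([]taint, bool) {
	kept := make([]taint, 0, len(taints))
	removed := false
	for _, t := range taints {
		if t == target {
			removed = true
			continue
		}
		kept = append(kept, t)
	}
	return kept, removed
}

func main() {
	nodeTaints := []taint{
		agentNotReady,
		{Key: "example.com/dedicated", Value: "gpu", Effect: "NoSchedule"},
	}
	kept, removed := withoutTaint(nodeTaints, agentNotReady)
	fmt.Println(len(kept), removed) // 1 true
}
```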

## Prerequisites

- Operator v1.28+
- Tested on Kubernetes 1.27.0+

## Enable the Untaint controller

The Untaint controller is disabled by default. Enable it on the operator
manager:

```yaml
args:
- --untaintControllerEnabled=true
```

## Configuration

All other tuning knobs are environment variables on the operator pod. Values
use Go's `time.ParseDuration` format (`90s`, `5m`, `1h`, etc.). Invalid values
fail the controller startup with an ERROR log; other controllers continue to
start normally.


| Env var | Default | Description |
| ------------------------------------------ | -------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `DD_UNTAINT_CONTROLLER_TIMEOUT` | `10m` | Readiness timeout. Tune to the upper bound of legitimate agent startup on your nodes; 2–5m is often enough on clusters with cached images. |
| `DD_UNTAINT_CONTROLLER_SCHEDULING_TIMEOUT` | `5m` | Scheduling timeout. Set larger than your scheduler retry window; raise it on clusters with large pending queues or aggressive autoscaling. |
| `DD_UNTAINT_CONTROLLER_TIMEOUT_POLICY` | `remove` | Action when a timeout fires. `remove` untaints the node anyway (favors scheduling availability over telemetry; lowest operational risk). `keep` leaves the taint in place and emits a Warning event (favors telemetry; pair with an alert on the timeout counter to surface stuck nodes). |
| `DD_UNTAINT_CONTROLLER_EVENTS_ENABLED` | `false` | Emit Kubernetes Events on Nodes for taint removals and timeout decisions. |

## Observability

Metrics, under the `untaint` Prometheus subsystem:

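- `untaint_taint_removals_total`: counter, every taint removal regardless of cause.
- `untaint_taint_removal_latency_seconds`: histogram, time between pod Ready and taint removal.
- `untaint_taint_timeouts_total{reason, policy}`: counter, timeout decisions. `reason` in {`readiness`, `scheduling`}; `policy` in {`remove`, `keep`}. Alert on `policy="keep"` to investigate stuck nodes.
- `untaint_taint_removal_errors_total`: counter, hard errors while removing the taint (for example apiserver Patch failures). Benign conflict races are requeued and not counted; check the operator's ERROR-level logs for the cause.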

Kubernetes Events (gated by `DD_UNTAINT_CONTROLLER_EVENTS_ENABLED=true`):

- `TaintRemoved` (Normal) — taint removed because the Agent pod became Ready.
- `UntaintTimeout` — a timeout fired. Normal under `remove`, Warning under `keep`. Message carries the reason, elapsed time, and policy.

1 change: 1 addition & 0 deletions go.work.sum
@@ -1081,6 +1081,7 @@ github.com/ncw/swift v1.0.47 h1:4DQRPj35Y41WogBxyhOXlrI37nzGlyEcsforeudyYPQ=
github.com/nelsam/hel/v2 v2.3.3 h1:Z3TAKd9JS3BoKi6fW+d1bKD2Mf0FzTqDUEAwLWzYPRQ=
github.com/nelsam/hel/v2 v2.3.3/go.mod h1:1ZTGfU2PFTOd5mx22i5O0Lc2GY933lQ2wb/ggy+rL3w=
github.com/niemeyer/pretty v0.0.0-20200227124842-a10e7caefd8e h1:fD57ERR4JtEqsWbfPhv4DMiApHyliiK5xCTNVSPiaAs=
github.com/nxadm/tail v1.4.4 h1:DQuhQpB1tVlglWS2hLQ5OV6B5r8aGxSrPc5Qo6uTN78=
github.com/nxadm/tail v1.4.8/go.mod h1:+ncqLTQzXmGhMZNUePPaPqPvBxHAIsmXswZKocGu+AU=
github.com/oklog/ulid v1.3.1 h1:EGfNDEx6MqHz8B3uNV6QAib1UR2Lm97sHi3ocA6ESJ4=
github.com/onsi/ginkgo v1.12.1 h1:mFwc4LvZ0xpSvDZ3E+k8Yte0hLOMxXUlP+yXtJqkYfQ=
1 change: 1 addition & 0 deletions internal/controller/metrics/const.go
@@ -8,6 +8,7 @@ package metrics
const (
datadogAgentSubsystem = "datadogagent"
datadogAgentProfileSubsystem = "datadogagentprofile"
untaintSubsystem = "untaint"

TrueValue = 1.0
FalseValue = 0.0
77 changes: 77 additions & 0 deletions internal/controller/metrics/untaint.go
@@ -0,0 +1,77 @@
// Unless explicitly stated otherwise all files in this repository are licensed
// under the Apache License Version 2.0.
// This product includes software developed at Datadog (https://www.datadoghq.com/).
// Copyright 2016-present Datadog, Inc.

package metrics

import (
"github.com/prometheus/client_golang/prometheus"
"sigs.k8s.io/controller-runtime/pkg/metrics"
)

// Label values for TaintTimeoutsTotal.
const (
// UntaintTimeoutReasonReadiness signals that a pod existed on the node but
// never became Ready within DD_UNTAINT_CONTROLLER_TIMEOUT.
UntaintTimeoutReasonReadiness = "readiness"
// UntaintTimeoutReasonScheduling signals that no agent pod was scheduled on
// the node within DD_UNTAINT_CONTROLLER_SCHEDULING_TIMEOUT.
UntaintTimeoutReasonScheduling = "scheduling"

// UntaintTimeoutPolicyRemove untaints the node despite the agent not being ready.
UntaintTimeoutPolicyRemove = "remove"
// UntaintTimeoutPolicyKeep leaves the taint in place but emits observability signals.
UntaintTimeoutPolicyKeep = "keep"
)

var (
// TaintRemovalsTotal is the total number of taints removed from nodes.
TaintRemovalsTotal = prometheus.NewCounter(
prometheus.CounterOpts{
Subsystem: untaintSubsystem,
Name: "taint_removals_total",
Help: "Total number of taints removed from nodes",
},
)

// TaintRemovalLatency is the time between agent pod becoming Ready and taint removal.
TaintRemovalLatency = prometheus.NewHistogram(
prometheus.HistogramOpts{
Subsystem: untaintSubsystem,
Name: "taint_removal_latency_seconds",
Help: "Time between agent pod becoming Ready and taint removal from the node",
Buckets: prometheus.DefBuckets,
},
)

// TaintTimeoutsTotal counts timeout decisions broken down by reason and policy.
TaintTimeoutsTotal = prometheus.NewCounterVec(
prometheus.CounterOpts{
Subsystem: untaintSubsystem,
Name: "taint_timeouts_total",
Help: "Total number of untaint-controller timeout decisions, by reason and policy",
},
[]string{"reason", "policy"},
)

// TaintRemovalErrorsTotal counts hard errors encountered while attempting to
// remove the taint (apiserver Patch failures, JSON marshal failures, …).
// Benign optimistic-concurrency races (IsConflict/IsInvalid) are NOT counted
// here — they're handled by requeueing. Inspect the operator's ERROR-level
// logs for the specific failure cause.
TaintRemovalErrorsTotal = prometheus.NewCounter(
prometheus.CounterOpts{
Subsystem: untaintSubsystem,
Name: "taint_removal_errors_total",
Help: "Total number of errors encountered while attempting to remove the agent-not-ready taint from a node",
},
)
)

func init() {
metrics.Registry.MustRegister(TaintRemovalsTotal)
metrics.Registry.MustRegister(TaintRemovalLatency)
metrics.Registry.MustRegister(TaintTimeoutsTotal)
metrics.Registry.MustRegister(TaintRemovalErrorsTotal)
}
19 changes: 19 additions & 0 deletions internal/controller/setup.go
@@ -6,6 +6,7 @@
package controller

import (
"fmt"
"time"

"github.com/go-logr/logr"
@@ -48,6 +49,7 @@ type SetupOptions struct {
DatadogGenericResourceEnabled bool
CreateControllerRevisions bool
DatadogCSIDriverEnabled bool
UntaintControllerEnabled bool
}

// ExtendedDaemonsetOptions defines ExtendedDaemonset options
@@ -77,6 +79,7 @@ var controllerStarters = map[string]starterFunc{
dashboardControllerName: startDatadogDashboard,
genericResourceControllerName: startDatadogGenericResource,
csiDriverControllerName: startDatadogCSIDriver,
untaintControllerName: startUntaint,
}

// SetupControllers starts all controllers (also used by e2e tests)
@@ -236,6 +239,22 @@ func startDatadogSLO(logger logr.Logger, mgr manager.Manager, pInfo kubernetes.P
return sloReconciler.SetupWithManager(mgr)
}

func startUntaint(logger logr.Logger, mgr manager.Manager, _ kubernetes.PlatformInfo, options SetupOptions, _ datadog.MetricsForwardersManager) error {
if !options.UntaintControllerEnabled {
logger.Info("Feature disabled, not starting the controller", "controller", untaintControllerName)
return nil
}
reconciler, err := NewUntaintReconciler(
mgr.GetClient(),
ctrl.Log.WithName("controllers").WithName(untaintControllerName),
mgr.GetEventRecorderFor(untaintControllerName),
)
if err != nil {
return fmt.Errorf("untaint controller setup: %w", err)
}
return reconciler.SetupWithManager(mgr)
}

func startDatadogAgentProfiles(logger logr.Logger, mgr manager.Manager, pInfo kubernetes.PlatformInfo, options SetupOptions, metricForwardersMgr datadog.MetricsForwardersManager) error {
if !options.DatadogAgentProfileEnabled {
logger.Info("Feature disabled, not starting the controller", "controller", profileControllerName)
44 changes: 44 additions & 0 deletions internal/controller/setup_test.go
@@ -0,0 +1,44 @@
// Unless explicitly stated otherwise all files in this repository are licensed
// under the Apache License Version 2.0.
// This product includes software developed at Datadog (https://www.datadoghq.com/).
// Copyright 2016-present Datadog, Inc.

package controller

import (
"errors"
"testing"

"github.com/go-logr/logr"
"github.com/stretchr/testify/assert"
"sigs.k8s.io/controller-runtime/pkg/log"
"sigs.k8s.io/controller-runtime/pkg/manager"

"github.com/DataDog/datadog-operator/pkg/controller/utils/datadog"
"github.com/DataDog/datadog-operator/pkg/kubernetes"
)

// TestSetupControllers_StarterErrorsAreBestEffort confirms that starter
// failures are logged at ERROR by SetupControllers but never propagated up:
// one controller's misconfiguration must not bring the whole operator down
// or prevent other controllers from starting. The untaint controller follows
// this same best-effort pattern.
func TestSetupControllers_StarterErrorsAreBestEffort(t *testing.T) {
originalStarters := controllerStarters
t.Cleanup(func() { controllerStarters = originalStarters })

failing := func(logr.Logger, manager.Manager, kubernetes.PlatformInfo, SetupOptions, datadog.MetricsForwardersManager) error {
return errors.New("simulated starter failure")
}
controllerStarters = map[string]starterFunc{
agentControllerName: failing,
untaintControllerName: failing,
}

assert.NoError(t, SetupControllers(
log.Log,
nil,
kubernetes.PlatformInfo{},
SetupOptions{UntaintControllerEnabled: true},
))
}
8 changes: 8 additions & 0 deletions internal/controller/suite_v2_test.go
@@ -88,6 +88,13 @@ var _ = BeforeSuite(func(ctx context.Context) {
node2 := testutils.NewNode("node2", nil)
Expect(k8sClient.Create(context.Background(), node2)).Should(Succeed())

// Configure the untaint controller via its env vars before
// SetupControllers is called (NewUntaintReconciler reads them on construction).
Expect(os.Setenv(EnvEventsEnabled, "true")).Should(Succeed())
Expect(os.Setenv(EnvReadinessTimeout, (4 * time.Second).String())).Should(Succeed())
Expect(os.Setenv(EnvSchedulingTimeout, (4 * time.Second).String())).Should(Succeed())
Expect(os.Setenv(EnvTimeoutPolicy, string(PolicyRemove))).Should(Succeed())

// Start controllers
mgr, err := ctrl.NewManager(cfg, ctrl.Options{
Scheme: scheme.Scheme,
@@ -107,6 +114,7 @@ var _ = BeforeSuite(func(ctx context.Context) {
DatadogMonitorEnabled: true,
DatadogAgentProfileEnabled: true,
V2APIEnabled: true,
UntaintControllerEnabled: true,
}

dummyPlatformInfo := kubernetes.PlatformInfo{}