[CASCL-570,CASCL-623] Autoscaling generate DPA based metrics to track configuration and status in dashboard and monitors#42547
[CASCL-570,CASCL-623] Autoscaling generate DPA based metrics to track configuration and status in dashboard and monitors#42547clamoriniere wants to merge 9 commits into
Conversation
| return nil, ErrNotCompiled | ||
| } | ||
|
|
||
| // GetNodeTypeCounts not implemented |
There was a problem hiding this comment.
it seems that this function was missing from the nocompile flavour
There was a problem hiding this comment.
Surprising that it the issue shows up in this PR?
There was a problem hiding this comment.
I saw the issue with claude code running directly go build and go test command without invoke
I suppose the issue was highlighted in the go test error due mess-up due to the selection of built tag that claude has decided to add.
Static quality checks✅ Please find below the results from static quality gates Successful checksInfo
30 successful checks with minimal change (< 2 KiB)
On-wire sizes (compressed)
|
Regression DetectorRegression Detector ResultsMetrics dashboard Baseline: ecede68 Optimization Goals: ✅ No significant changes detected
|
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | -0.48 | [-3.41, +2.45] | 1 | Logs |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | +0.34 | [+0.29, +0.40] | 1 | Logs |
| ➖ | otlp_ingest_metrics | memory utilization | +0.25 | [+0.10, +0.40] | 1 | Logs |
| ➖ | ddot_metrics | memory utilization | +0.23 | [+0.03, +0.43] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulativetodelta_exporter | memory utilization | +0.14 | [-0.09, +0.37] | 1 | Logs |
| ➖ | file_to_blackhole_1000ms_latency | egress throughput | +0.05 | [-0.35, +0.44] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulative | memory utilization | +0.04 | [-0.12, +0.20] | 1 | Logs |
| ➖ | file_to_blackhole_100ms_latency | egress throughput | +0.03 | [-0.02, +0.08] | 1 | Logs |
| ➖ | quality_gate_idle | memory utilization | +0.02 | [-0.03, +0.06] | 1 | Logs bounds checks dashboard |
| ➖ | file_to_blackhole_0ms_latency | egress throughput | +0.01 | [-0.51, +0.54] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api_v3 | ingress throughput | +0.01 | [-0.11, +0.14] | 1 | Logs |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | +0.01 | [-0.08, +0.10] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api | ingress throughput | -0.01 | [-0.13, +0.12] | 1 | Logs |
| ➖ | file_tree | memory utilization | -0.03 | [-0.08, +0.02] | 1 | Logs |
| ➖ | otlp_ingest_logs | memory utilization | -0.04 | [-0.13, +0.06] | 1 | Logs |
| ➖ | file_to_blackhole_500ms_latency | egress throughput | -0.04 | [-0.43, +0.35] | 1 | Logs |
| ➖ | ddot_metrics_sum_delta | memory utilization | -0.06 | [-0.28, +0.15] | 1 | Logs |
| ➖ | quality_gate_logs | % cpu utilization | -0.36 | [-1.83, +1.12] | 1 | Logs bounds checks dashboard |
| ➖ | ddot_logs | memory utilization | -0.41 | [-0.48, -0.34] | 1 | Logs |
| ➖ | quality_gate_idle_all_features | memory utilization | -0.43 | [-0.46, -0.39] | 1 | Logs bounds checks dashboard |
| ➖ | docker_containers_cpu | % cpu utilization | -0.48 | [-3.41, +2.45] | 1 | Logs |
| ➖ | docker_containers_memory | memory utilization | -0.60 | [-0.68, -0.53] | 1 | Logs |
| ➖ | quality_gate_metrics_logs | memory utilization | -0.63 | [-0.84, -0.41] | 1 | Logs bounds checks dashboard |
| ➖ | tcp_syslog_to_blackhole | ingress throughput | -3.42 | [-3.52, -3.31] | 1 | Logs |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | links |
|---|---|---|---|---|
| ✅ | docker_containers_cpu | simple_check_run | 10/10 | |
| ✅ | docker_containers_memory | memory_usage | 10/10 | |
| ✅ | docker_containers_memory | simple_check_run | 10/10 | |
| ✅ | file_to_blackhole_0ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_1000ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_100ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_500ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | |
| ✅ | quality_gate_idle | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_idle | memory_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | memory_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_logs | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_logs | lost_bytes | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_logs | memory_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | cpu_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | lost_bytes | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | memory_usage | 10/10 | bounds checks dashboard |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
-
Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
-
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
-
Its configuration does not mark it "erratic".
Replicate Execution Details
We run multiple replicates for each experiment/variant. However, we allow replicates to be automatically retried if there are any failures, up to 8 times, at which point the replicate is marked dead and we are unable to run analysis for the entire experiment. We call each of these attempts at running replicates a replicate execution. This section lists all replicate executions that failed due to the target crashing or being oom killed.
Note: In the below tables we bucket failures by experiment, variant, and failure type. For each of these buckets we list out the replicate indexes that failed with an annotation signifying how many times said replicate failed with the given failure mode. In the below example the baseline variant of the experiment named experiment_with_failures had two replicates that failed by oom kills. Replicate 0, which failed 8 executions, and replicate 1 which failed 6 executions, all with the same failure mode.
| Experiment | Variant | Replicates | Failure | Logs | Debug Dashboard |
|---|---|---|---|---|---|
| experiment_with_failures | baseline | 0 (x8) 1 (x6) | Oom killed | Debug Dashboard |
The debug dashboard links will take you to a debugging dashboard specifically designed to investigate replicate execution failures.
❌ Retried Profiling Replicate Execution Failures (target internal profiling)
Note: Profiling replicas may still be executing. See the debug dashboard for up to date status.
| Experiment | Variant | Replicates | Failure | Debug Dashboard |
|---|---|---|---|---|
| quality_gate_idle_all_features | baseline | 11 (x3) | Oom killed | Debug Dashboard |
| quality_gate_idle_all_features | comparison | 11 (x3) | Oom killed | Debug Dashboard |
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
|
I was able to test locally that the metrics are generated and cleanup when the DPA is update or deleted here is a sample |
| commonOpts, | ||
| ) | ||
|
|
||
| // telemetryVerticalScaleConstraintsContainerCPURequestMin tracks minimum CPU request constraint per container |
There was a problem hiding this comment.
I'm not sure it's very useful to track these but if we want to, it should follow the same pattern as the telemetryVerticalScaleReceivedRecommendationsRequests with tags instead of metric names
There was a problem hiding this comment.
this is a user request to be able to know if they are close to the min and max to update their configuration.
they want to compare the actual value with the min and max. and be alerted when they are close to the max
There was a problem hiding this comment.
the labels/tags on telemetryVerticalScaleReceivedRecommendationsRequests are problematique because they are conflicting with the cluster-agent pod tags for example the container_name tag.
that is why on the other metrics I used target_container_name instead I can update also the telemetryVerticalScaleReceivedRecommendationsRequests and other similar metrics to have target_container_name too
There was a problem hiding this comment.
for the labels, I added the source but I kept the resource (cpu, memory) as part of the metrics name to ensure units works well with the datadog metrics.
There was a problem hiding this comment.
The min/max for horizontal yes, these ones I'm not sure of the added value, but we can submit them anyway.
The container_name does not appear to conflict
If we one metric, we need to change the old ones as well to keep consistency
| ) | ||
|
|
||
| // telemetryStatusHorizontalCurrentReplicas tracks the current replicas value | ||
| telemetryStatusHorizontalCurrentReplicas = telemetry.NewGaugeWithOpts( |
There was a problem hiding this comment.
Not sure what we want to do with that we already have 2 other ways to report current number of replicas more accurately in the product
There was a problem hiding this comment.
the idea here is to represent the value that it is used by the autoscaling controller. so we can detect a gap between the reality and what they used.
This similar information has been useful in the past with the WPA to detect bugs on how we aggregate pods status info.
Anyway all these metrics are for now "experimental" and not officially expose to end users. Lets see if they are useful with our internal usage first.
There was a problem hiding this comment.
We only pull from /scale, so there's no difference with reality here. Actually we already have 3 different metrics for it kubernetes_state.deployment.replicas, all variations of kubernetes_state.pod*, kubernetes.pods.running), IMHO it should be enough.
And if we want to monitor some internal calculations, we should set metric names that suggest it's internal, but so far it has not be needed.
| subsystem, | ||
| "status_desired_replicas", | ||
| dpaLabels, | ||
| "Tracks the current replicas value from DatadogPodAutoscaler.status.horizontal.desiredReplicas", |
There was a problem hiding this comment.
I believe it duplicates with telemetryHorizontalScaleAppliedRecommendations and telemetryHorizontalScaleReceivedRecommendations
There was a problem hiding this comment.
the goal is to detect when we received a reco, but we can applied it due to some logic or bug in the cluster-agent
There was a problem hiding this comment.
That's already the difference between received and applied. desiredReplicas is exactly telemetryHorizontalScaleReceivedRecommendations
| ) | ||
|
|
||
| // telemetryStatusVerticalDesiredPodCPURequest tracks the sum of CPU requests for all containers | ||
| telemetryStatusVerticalDesiredPodCPURequest = telemetry.NewGaugeWithOpts( |
There was a problem hiding this comment.
It can already be done with the sum of telemetryVerticalScaleReceivedRecommendationsRequests and telemetryVerticalScaleReceivedRecommendationsLimits
There was a problem hiding this comment.
but you can but a different value if you want (event if it doesn't make sense)
There was a problem hiding this comment.
The code is doing scalingValues.Vertical.SumCPUMemoryRequests() which is exactly the sum of telemetryVerticalScaleReceivedRecommendationsRequests so I don't know how it can be a different value?
| ) | ||
|
|
||
| // telemetryStatusVerticalDesiredPodMemoryRequest tracks the sum of memory requests for all containers | ||
| telemetryStatusVerticalDesiredPodMemoryRequest = telemetry.NewGaugeWithOpts( |
There was a problem hiding this comment.
Also duplicates telemetryVerticalScaleReceivedRecommendationsLimits and telemetryVerticalScaleReceivedRecommendationsRequests
| } | ||
|
|
||
| // getAutoscalerLabels extracts common labels from a DatadogPodAutoscaler or PodAutoscalerInternal | ||
| func getAutoscalerLabels[T *datadoghq.DatadogPodAutoscaler | model.PodAutoscalerInternal](autoscaler T) []string { |
There was a problem hiding this comment.
This is misleading with actual object labels, it should be called getAutoscalerTelemetryLabels
| return nil, ErrNotCompiled | ||
| } | ||
|
|
||
| // GetNodeTypeCounts not implemented |
There was a problem hiding this comment.
Surprising that it the issue shows up in this PR?
| if spec.Constraints != nil && len(spec.Constraints.Containers) > 0 { | ||
| // Track per-container constraints | ||
| for _, containerConstraint := range spec.Constraints.Containers { | ||
| containerLabels := append(labels[:len(labels)-1], containerConstraint.Name, le.JoinLeaderValue) |
There was a problem hiding this comment.
The container name can be *, it might get some issue to actually retrieve this value in a metric tag. Maybe we can add a all_containers like value instead in this case?
There was a problem hiding this comment.
@HemeryJu
how can we have a container contraint without container name?
from the UI config point of view it seems impossible.
does it means that in that case, we apply the constraint to very containers?
| // Common label definitions for DPA metrics | ||
| dpaTelemetryLabels = []string{"namespace", "target_name", "autoscaler_name", le.JoinLeaderLabel} | ||
| dpaSpecContainerTelemetryLabels = []string{"namespace", "target_name", "autoscaler_name", "target_container_name", le.JoinLeaderLabel} | ||
| dpaStatusContainerTelemetryLabels = []string{"namespace", "target_name", "autoscaler_name", "source", "target_container_name", le.JoinLeaderLabel} |
There was a problem hiding this comment.
I do dpaStatusContainerTelemetryLabels and dpaStatusContainerTelemetryLegacyLabels to remove the container_name tag and use instead target_container_name
target_container_name is present in both label slices
| ) | ||
|
|
||
| // telemetryVerticalScaleConstraintsContainerCPURequestMin tracks minimum CPU request constraint per container | ||
| telemetryVerticalScaleConstraintsContainerCPURequestMin = telemetry.NewGaugeWithOpts( |
There was a problem hiding this comment.
I use different metrics for CPU and memory because the unit is different and with datadog metrics it is not a good practice to have the same metric but with different units.
wdhif
left a comment
There was a problem hiding this comment.
LGTM for @DataDog/container-platform files
| telemetryLocalFallbackEnabled.Set(value, labels...) | ||
| } | ||
|
|
||
| func trackDPATelemetry(podAutoscaler *datadoghq.DatadogPodAutoscaler) { |
There was a problem hiding this comment.
Still not based on the internal DPA
There was a problem hiding this comment.
This is on purpose, to reflect what user can see in the DPA CRD.
Can you explain why we should use the internal DPA instead?
There was a problem hiding this comment.
Because the internal DPA is the source of truth and the object manipulated by the controller, the CRD is displaying some of the information available in there, using the DPA Internal will allow to precisely select what we want and extend to any other information in the future.
frank-spano
left a comment
There was a problem hiding this comment.
lgtm for @DataDog/container-integrations
d6917ca to
8025938
Compare
Use delete-all-recreate pattern for container metrics to prevent stale metrics when containers are removed from spec or status
8025938 to
6fe0f41
Compare
… with dedicate metrics store (#46833) ## Scope of This PR This PR revisits the previous attempt: #42547 The focus here is strictly on **refactoring the existing metric generation logic**, without introducing new metrics. A follow-up PR will build on this foundation to introduce additional DPA metrics. ## Motivation Today, `DatadogPodAutoscaler` (DPA) resource metrics are exposed as OpenMetrics/Prometheus metrics and scraped by the `datadog_cluster_agent` check. While functional, this approach has several drawbacks: - All scraped metrics automatically inherit the **Cluster Agent pod context** (`pod_name`, `kube_namespace`, `kube_container_name`, etc.). - This creates confusion around DPA metric tags, as metrics appear tied to Cluster Agent pods rather than the DPA resources themselves. - It increases metric cardinality due to multiple Cluster Agent instances (leader, followers, rollouts) contributing additional metric contexts. ## Proposed Approach This PR changes how DPA metrics are generated. Instead of exposing them via OpenMetrics and relying on the `datadog_cluster_agent` check for collection, DPA metrics are now produced directly within the Cluster Agent autoscaling component — following the same pattern used for `kubernetes_state` metrics. This provides: - Better control over which tags are attached to DPA metrics - Cleaner and more accurate metric context - Reduced unnecessary metric cardinality ## Summary ### Core changes - Refactor `ObserverFunc` signature from `(string, string)` to `(string, interface{})` to pass the actual object to observers, enabling richer metric generation - Add new `pkg/clusteragent/autoscaling/workload/metrics` package with a `PodAutoscalerMetricsStore` that generates and periodically sends structured metrics (gauges/counts) for `DatadogPodAutoscaler` objects via `sender.Sender` - Replace old telemetry helpers (`telemetry.go` tag-based metrics) with leader-aware metric submission; metrics are only emitted by the leader ### Action metrics consolidation - Move horizontal/vertical scaling action metrics from event-driven `Submit*` functions in `counters.go` into the state-based `GeneratePodAutoscalerMetrics` generator - Delete `counters.go` entirely (`SubmitReceivedRecommendationsVersion` was dead code; remaining `Submit*` functions replaced by the generator) - Remove `sender` and `isLeader` from `horizontalController` and `verticalController` since they were only used for the now-removed `Submit*` calls ### Metrics emitted | Metric | Type | Notes | |--------|------|-------| | `received_recommendations_version` | Gauge | RC version of last received main scaling values; only emitted when > 0 | | `horizontal_scaling_received_replicas` | Gauge | Replicas recommended by the product recommender | | `vertical_scaling_received_requests` | Gauge | Per-container requested resources from recommender | | `vertical_scaling_received_limits` | Gauge | Per-container resource limits from recommender | | `horizontal_scaling_applied_replicas` | Gauge | Replicas from the last applied horizontal action | | `horizontal_scaling_actions` | Count | Cumulative count of horizontal scaling actions, tagged `status:ok` or `status:error` | | `vertical_rollout_triggered` | Count | Cumulative count of vertical rollout actions, tagged `status:ok` or `status:error` | | `autoscaler_conditions` | Gauge | 1.0/0.0 per condition type from CRD status | | `local_fallback_enabled` | Gauge | 1.0 when horizontal active source is local fallback | ### Model changes - Add `mainScalingValuesVersion uint64` to `PodAutoscalerInternal` to persist the remote config version of the last received main scaling values - Extend `UpdateFromMainValues` to accept and persist the RC version; `RemoveMainValues` resets it to 0 - Add `horizontalActionErrorCount`/`horizontalActionSuccessCount` and `verticalActionErrorCount`/`verticalActionSuccessCount` counter fields, incremented on each action outcome - Add public getters: `MainScalingValuesVersion()`, `HorizontalActionErrorCount()`, `HorizontalActionSuccessCount()`, `VerticalActionErrorCount()`, `VerticalActionSuccessCount()` ## Test plan - [x] Unit tests added for `metrics/store`, `metrics/generator`, and `metrics/writer` - [x] `generator_test.go` covers all 9 metrics including both `status:ok` and `status:error` count variants - [x] `config_retriever_values_test.go` updated with expected `MainScalingValuesVersion` values - [x] `controller_horizontal_test.go` updated to assert `HorizontalActionSuccessCount`/`HorizontalActionErrorCount` - [x] Existing `workload` controller tests updated to remove now-deleted `sender`/`isLeader` fixtures - [x] All tests in `pkg/clusteragent/autoscaling/...` pass locally 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: cedric.lamoriniere <cedric.lamoriniere@datadoghq.com>
What does this PR do?
This PR adds a set of telemetry metrics to the Cluster Agent’s
/metricsendpoint.These new metrics are related to the DatadogPodAutoscaler (DPA) Custom Resource.
For each DPA present in a cluster, the Cluster Agent now exposes metrics that provide:
This makes it easier to monitor and analyze DPA behavior directly through metrics.
Motivation
Improve visibility and monitoring of autoscaling behavior by exposing DPA-related data as metrics.
This enhancement simplifies debugging, validation, and correlation of autoscaling activity within the Cluster Agent.
Describe how you validated your changes
Additional Notes