CONTINT-4838 - increase test timeout for autoscalerController #49881
lavigne958 wants to merge 1 commit into main
Conversation
The test `pkg/util/kubernetes/apiserver/controllers.TestAutoscalerController` is flaky: on very few occasions it takes two tries to complete. The first try times out at the 10 s timeout; the second try succeeds in a few ms. Increase that timeout to a larger value; if it still fails now, it's a real problem. Signed-off-by: Alexandre Lavigne <alexandre.lavigne@datadoghq.com>
🎯 Code Coverage (Commit SHA: e861729)
Files inventory check summary: file check results against ancestor 8078bf48 for datadog-agent_7.80.0~devel.git.204.e861729.pipeline.109552594-1_amd64.deb: No change detected
Regression Detector Results (Baseline: 8078bf4). Optimization Goals: ✅ No significant changes detected
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | +2.58 | [-0.58, +5.75] | 1 | Logs |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | +2.58 | [-0.58, +5.75] | 1 | Logs |
| ➖ | otlp_ingest_metrics | memory utilization | +0.47 | [+0.32, +0.62] | 1 | Logs |
| ➖ | quality_gate_logs | % cpu utilization | +0.42 | [-1.22, +2.06] | 1 | Logs bounds checks dashboard |
| ➖ | file_tree | memory utilization | +0.33 | [+0.29, +0.38] | 1 | Logs |
| ➖ | uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | +0.19 | [+0.13, +0.24] | 1 | Logs |
| ➖ | ddot_metrics | memory utilization | +0.16 | [-0.04, +0.36] | 1 | Logs |
| ➖ | ddot_logs | memory utilization | +0.16 | [+0.09, +0.23] | 1 | Logs |
| ➖ | quality_gate_idle_all_features | memory utilization | +0.15 | [+0.11, +0.19] | 1 | Logs bounds checks dashboard |
| ➖ | ddot_metrics_sum_cumulative | memory utilization | +0.07 | [-0.09, +0.22] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulativetodelta_exporter | memory utilization | +0.06 | [-0.17, +0.30] | 1 | Logs |
| ➖ | otlp_ingest_logs | memory utilization | +0.04 | [-0.06, +0.15] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api_v3 | ingress throughput | +0.02 | [-0.18, +0.22] | 1 | Logs |
| ➖ | file_to_blackhole_1000ms_latency | egress throughput | +0.01 | [-0.44, +0.45] | 1 | Logs |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | +0.00 | [-0.11, +0.11] | 1 | Logs |
| ➖ | quality_gate_metrics_logs | memory utilization | +0.00 | [-0.26, +0.26] | 1 | Logs bounds checks dashboard |
| ➖ | uds_dogstatsd_to_api | ingress throughput | -0.01 | [-0.22, +0.20] | 1 | Logs |
| ➖ | file_to_blackhole_100ms_latency | egress throughput | -0.02 | [-0.12, +0.09] | 1 | Logs |
| ➖ | file_to_blackhole_0ms_latency | egress throughput | -0.05 | [-0.51, +0.42] | 1 | Logs |
| ➖ | file_to_blackhole_500ms_latency | egress throughput | -0.07 | [-0.46, +0.32] | 1 | Logs |
| ➖ | quality_gate_idle | memory utilization | -0.11 | [-0.16, -0.07] | 1 | Logs bounds checks dashboard |
| ➖ | ddot_metrics_sum_delta | memory utilization | -0.14 | [-0.33, +0.05] | 1 | Logs |
| ➖ | docker_containers_memory | memory utilization | -0.59 | [-0.69, -0.49] | 1 | Logs |
| ➖ | tcp_syslog_to_blackhole | ingress throughput | -0.88 | [-1.04, -0.71] | 1 | Logs |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | observed_value | links |
|---|---|---|---|---|---|
| ✅ | docker_containers_cpu | simple_check_run | 10/10 | 715 ≥ 26 | |
| ✅ | docker_containers_memory | memory_usage | 10/10 | 238.60MiB ≤ 370MiB | |
| ✅ | docker_containers_memory | simple_check_run | 10/10 | 681 ≥ 26 | |
| ✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | 0.16GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_0ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | 0.20GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_1000ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | 0.17GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_100ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | 0.18GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_500ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | quality_gate_idle | intake_connections | 10/10 | 3 ≤ 4 | bounds checks dashboard |
| ✅ | quality_gate_idle | memory_usage | 10/10 | 138.60MiB ≤ 147MiB | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | intake_connections | 10/10 | 3 ≤ 4 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | memory_usage | 10/10 | 463.35MiB ≤ 495MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | intake_connections | 10/10 | 4 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_logs | memory_usage | 10/10 | 171.83MiB ≤ 195MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | cpu_usage | 10/10 | 358.14 ≤ 2000 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | intake_connections | 10/10 | 3 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | memory_usage | 10/10 | 406.43MiB ≤ 430MiB | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
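As a worked reading of one row from the detailed table above: `docker_containers_cpu` reports Δ mean % = +2.58 with a 90.00% CI of [-0.58, +5.75]. The magnitude is below the 5.00% effect size tolerance and the interval contains zero, so neither criterion is met and the experiment is marked ➖ (no significant change).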
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
Hi @lavigne958, I dug into CONTINT-4838 and I think bumping the timeout from 10 s to 20 s won't actually fix the flake. Sharing the analysis in case it's useful.

Telemetry on `main` shows the distribution is bimodal: the test either passes well under a second, or sits at exactly 10 s. There is no "needed 12 s but only got 10 s" regime, so widening the wall clock just makes each CI failure cost 20 s instead of 10 s while the flake rate stays the same.

Root cause: a race in the fake client between the test's `Create()` call and the reflector's watch registration. When `Create()` wins, the ADDED event is broadcast to zero watchers and silently dropped, so the informer's `AddFunc` never fires. The sister WPA test avoids this by pre-loading its object into the fake client before the informer starts.

Structural fix: #49900 mirrors that pattern for the HPA test. Verified with 500/500 passes under `-race`.

If #49900 looks good, I'd suggest closing this one in favor of it. Happy to discuss if you see something I missed.
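For reference, a minimal sketch of that pre-load pattern, assuming client-go's fake clientset and the autoscaling/v2 informer; the real test's types, object names, and API version may differ:

```go
package main

import (
	"context"
	"fmt"
	"time"

	autoscalingv2 "k8s.io/api/autoscaling/v2"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes/fake"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// Stand-in object; the real test uses its own mocked HPA in namespace "nsfoo".
	hpa := &autoscalingv2.HorizontalPodAutoscaler{
		ObjectMeta: metav1.ObjectMeta{Name: "fake-hpa", Namespace: "nsfoo"},
	}

	// Pre-load the object into the fake clientset tracker instead of calling
	// Create() after the informer starts: the reflector's initial List()
	// already returns it, so AddFunc fires regardless of when the Watch is
	// registered.
	client := fake.NewSimpleClientset(hpa)

	factory := informers.NewSharedInformerFactory(client, 0)
	informer := factory.Autoscaling().V2().HorizontalPodAutoscalers().Informer()

	added := make(chan string, 1)
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			key, _ := cache.MetaNamespaceKeyFunc(obj)
			added <- key
		},
	})

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	factory.Start(ctx.Done())
	cache.WaitForCacheSync(ctx.Done(), informer.HasSynced)

	select {
	case key := <-added:
		fmt.Println("AddFunc fired for", key)
	case <-time.After(5 * time.Second):
		fmt.Println("timed out waiting for the add event")
	}
}
```

Because the object is already in the tracker when the informer performs its initial List, the add is delivered through the delta FIFO and does not depend on winning the watch-registration race.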
### What does this PR do?

Fix the sporadically failing `pkg/util/kubernetes/apiserver/controllers.TestAutoscalerController` test on `main` by removing the race condition between `Create()` and the fake client's reflector setup.

Pre-loads the first `mockedHPA` into the fake client tracker (mirroring the pattern already used by the WPA sister test), so the informer's initial `List()` surfaces it via a delta and `AddFunc` fires from there — independent of whether the Watch is registered yet.

Also fixes a latent bug where a single `time.NewTimer` was reused across three `select` blocks, and removes an `autoscalersListerSynced = true` override that bypassed the cache-sync barrier.

### Motivation

Tracked by [CONTINT-4838](https://datadoghq.atlassian.net/browse/CONTINT-4838). Datadog test telemetry on `main` over the last 30 days:

| Metric | Value |
|---|---|
| Pass | 7562 |
| Fail | 35 |
| Failure rate | ~0.46% |
| avg(duration \| pass) | 581 ms |
| p95(duration \| pass) | 636 ms |
| p99(duration \| pass) | 649 ms |
| max(duration \| pass) | 840 ms |
| avg(duration \| fail) | 10 086 ms (= timeout) |

The distribution is **bimodal** — the test either passes in well under a second or sits at exactly 10 s and times out. There is no "slow but eventually succeeds" regime. All recent failures (Linux x64, Windows x64, macOS arm64) hit the same line — `hpa_controller_test.go:346`, the first `select` on `hctrl.autoscalers`, `<-timeout.C` branch.

Root cause: the test's `c.HorizontalPodAutoscalers("nsfoo").Create(...)` runs in parallel with the reflector's `ListAndWatch` setup. If the test goroutine wins the race — i.e. `Create()` runs before the reflector calls `WatchWithContext` and registers a watcher on the fake client tracker — then the fake client's `tracker.add` broadcasts the ADDED event to **zero** watchers and silently drops it. The lister has the object (it lives in the tracker's `objects` map), but no `AddFunc` ever fires, so `hpaQueue` stays empty, the worker never writes to `hctrl.autoscalers`, and the `select` waits the full 10 s before failing.

### Why bumping the timeout (PR #49881) does not help

PR #49881 proposes raising the timeout from 10 s to 20 s. The bimodal distribution (max-pass 840 ms, fail 10 086 ms) makes that ineffective: there is no scenario where the test needs 12-15 s and is being denied; the failure mode is "event lost forever." A higher timeout simply makes each CI failure cost 20 s instead of 10 s, while the flake rate stays the same. Closing #49881 in favor of this PR is recommended.

### Describe how you validated your changes

- All 37 tests in the package still pass: `dda inv test --targets=./pkg/util/kubernetes/apiserver/controllers`.
- 500/500 iterations of `TestAutoscalerController` pass with `-race`, ~633 ms each: `dda inv test --targets=./pkg/util/kubernetes/apiserver/controllers -e '^TestAutoscalerController$' -r -x '-count=500 -timeout=15m'`. Given the pre-fix ~0.46% failure rate, ~2-3 failures would be expected across 500 runs; observed: 0.
- The fix mirrors the WPA sister test pattern (`wpa_controller_test.go:263`), which is already structurally race-free.

### Additional Notes

The latent timer bug (single `time.NewTimer(10s)` reused across three `select` blocks) is fixed at the same time: a `time.Timer` does not re-arm after firing, so the second/third selects could see an already-expired timer and fail in unrelated phases. Each select now uses an independent `time.After(timeoutDuration)` (5 s, matching the surrounding `EventuallyWithTf` calls).
The `autoscalersListerSynced = func() bool { return true }` override in `newFakeAutoscalerController` is removed because, with the HPA pre-loaded, the real `informer.HasSynced` flips to `true` shortly after `inf.Start` and the test now exercises the production cache-sync path in `runHPA`.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

[CONTINT-4838]: https://datadoghq.atlassian.net/browse/CONTINT-4838?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Co-authored-by: lenaic.huard <lenaic.huard@datadoghq.com>
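To make the timer detail from the Additional Notes concrete, here is a minimal, self-contained sketch of the pitfall and the per-select deadline pattern; all names are stand-ins, not the actual test code:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	events := make(chan string, 1) // stand-in for the channel the test waits on
	events <- "phase-1"

	// Buggy pattern: a single timer shared by every wait. If the deadline
	// elapses during an earlier phase, a later select takes <-timeout.C
	// immediately; and once the single value on timeout.C has been received,
	// later selects have no deadline at all, because a *time.Timer does not
	// re-arm after firing.
	timeout := time.NewTimer(10 * time.Second)
	defer timeout.Stop()
	select {
	case v := <-events:
		fmt.Println("phase 1 got:", v)
	case <-timeout.C:
		fmt.Println("phase 1 timed out")
	}

	// Fixed pattern: each wait takes its own fresh deadline via time.After.
	select {
	case v := <-events:
		fmt.Println("phase 2 got:", v)
	case <-time.After(5 * time.Second):
		fmt.Println("phase 2 timed out (expected here: no second event is sent)")
	}
}
```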