CONTINT-4838 - increase test timeout for autoscalerController #49881
lavigne958 wants to merge 1 commit into main
Conversation
The test `pkg/util/kubernetes/apiserver/controllers.TestAutoscalerController` is flaky: on very few occasions it takes two tries to complete. The first try times out at the 10 s timeout; the second try succeeds in a few ms. Increase that timeout to a larger value; if it still fails now, it's a real problem. Signed-off-by: Alexandre Lavigne <alexandre.lavigne@datadoghq.com>
🎯 Code Coverage (Commit SHA: e861729)
Files inventory check summary: file check results against ancestor 8078bf48 for datadog-agent_7.80.0~devel.git.204.e861729.pipeline.109552594-1_amd64.deb: No change detected
Regression Detector Results (Baseline: 8078bf4). Optimization Goals: ✅ No significant changes detected
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | +2.58 | [-0.58, +5.75] | 1 | Logs |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | +2.58 | [-0.58, +5.75] | 1 | Logs |
| ➖ | otlp_ingest_metrics | memory utilization | +0.47 | [+0.32, +0.62] | 1 | Logs |
| ➖ | quality_gate_logs | % cpu utilization | +0.42 | [-1.22, +2.06] | 1 | Logs bounds checks dashboard |
| ➖ | file_tree | memory utilization | +0.33 | [+0.29, +0.38] | 1 | Logs |
| ➖ | uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | +0.19 | [+0.13, +0.24] | 1 | Logs |
| ➖ | ddot_metrics | memory utilization | +0.16 | [-0.04, +0.36] | 1 | Logs |
| ➖ | ddot_logs | memory utilization | +0.16 | [+0.09, +0.23] | 1 | Logs |
| ➖ | quality_gate_idle_all_features | memory utilization | +0.15 | [+0.11, +0.19] | 1 | Logs bounds checks dashboard |
| ➖ | ddot_metrics_sum_cumulative | memory utilization | +0.07 | [-0.09, +0.22] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulativetodelta_exporter | memory utilization | +0.06 | [-0.17, +0.30] | 1 | Logs |
| ➖ | otlp_ingest_logs | memory utilization | +0.04 | [-0.06, +0.15] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api_v3 | ingress throughput | +0.02 | [-0.18, +0.22] | 1 | Logs |
| ➖ | file_to_blackhole_1000ms_latency | egress throughput | +0.01 | [-0.44, +0.45] | 1 | Logs |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | +0.00 | [-0.11, +0.11] | 1 | Logs |
| ➖ | quality_gate_metrics_logs | memory utilization | +0.00 | [-0.26, +0.26] | 1 | Logs bounds checks dashboard |
| ➖ | uds_dogstatsd_to_api | ingress throughput | -0.01 | [-0.22, +0.20] | 1 | Logs |
| ➖ | file_to_blackhole_100ms_latency | egress throughput | -0.02 | [-0.12, +0.09] | 1 | Logs |
| ➖ | file_to_blackhole_0ms_latency | egress throughput | -0.05 | [-0.51, +0.42] | 1 | Logs |
| ➖ | file_to_blackhole_500ms_latency | egress throughput | -0.07 | [-0.46, +0.32] | 1 | Logs |
| ➖ | quality_gate_idle | memory utilization | -0.11 | [-0.16, -0.07] | 1 | Logs bounds checks dashboard |
| ➖ | ddot_metrics_sum_delta | memory utilization | -0.14 | [-0.33, +0.05] | 1 | Logs |
| ➖ | docker_containers_memory | memory utilization | -0.59 | [-0.69, -0.49] | 1 | Logs |
| ➖ | tcp_syslog_to_blackhole | ingress throughput | -0.88 | [-1.04, -0.71] | 1 | Logs |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | observed_value | links |
|---|---|---|---|---|---|
| ✅ | docker_containers_cpu | simple_check_run | 10/10 | 715 ≥ 26 | |
| ✅ | docker_containers_memory | memory_usage | 10/10 | 238.60MiB ≤ 370MiB | |
| ✅ | docker_containers_memory | simple_check_run | 10/10 | 681 ≥ 26 | |
| ✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | 0.16GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_0ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | 0.20GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_1000ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | 0.17GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_100ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | 0.18GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_500ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | quality_gate_idle | intake_connections | 10/10 | 3 ≤ 4 | bounds checks dashboard |
| ✅ | quality_gate_idle | memory_usage | 10/10 | 138.60MiB ≤ 147MiB | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | intake_connections | 10/10 | 3 ≤ 4 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | memory_usage | 10/10 | 463.35MiB ≤ 495MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | intake_connections | 10/10 | 4 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_logs | memory_usage | 10/10 | 171.83MiB ≤ 195MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | cpu_usage | 10/10 | 358.14 ≤ 2000 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | intake_connections | 10/10 | 3 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | memory_usage | 10/10 | 406.43MiB ≤ 430MiB | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
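As a worked reading of one row from the detailed table above: `docker_containers_cpu` reports Δ mean % = +2.58 with a 90.00% CI of [-0.58, +5.75]. The magnitude is below the 5.00% effect size tolerance and the interval contains zero, so neither criterion is met and the experiment is marked ➖ (no significant change).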
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
Hi @lavigne958, I dug into CONTINT-4838 and I think bumping the timeout from 10 s to 20 s won't actually fix the flake. Sharing the analysis in case it's useful.

Telemetry on `main` shows the distribution is bimodal: the test either passes well under a second, or sits at exactly 10 s. There is no "needed 12 s but only got 10 s" regime, so widening the wall clock just makes each CI failure cost 20 s instead of 10 s while the flake rate stays the same.

Root cause: a race in the fake client between the test's `Create()` call and the reflector's watch registration. When `Create()` wins, the ADDED event is broadcast to zero watchers and silently dropped, so the informer's `AddFunc` never fires. The sister WPA test avoids this by pre-loading its object into the fake client before the informer starts.

Structural fix: #49900 mirrors that pattern for the HPA test. Verified with 500/500 passes under `-race`.

If #49900 looks good, I'd suggest closing this one in favor of it. Happy to discuss if you see something I missed.
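For reference, a minimal sketch of that pre-load pattern, assuming client-go's fake clientset and the autoscaling/v2 informer; the real test's types, object names, and API version may differ:

```go
package main

import (
	"context"
	"fmt"
	"time"

	autoscalingv2 "k8s.io/api/autoscaling/v2"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes/fake"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// Stand-in object; the real test uses its own mocked HPA in namespace "nsfoo".
	hpa := &autoscalingv2.HorizontalPodAutoscaler{
		ObjectMeta: metav1.ObjectMeta{Name: "fake-hpa", Namespace: "nsfoo"},
	}

	// Pre-load the object into the fake clientset tracker instead of calling
	// Create() after the informer starts: the reflector's initial List()
	// already returns it, so AddFunc fires regardless of when the Watch is
	// registered.
	client := fake.NewSimpleClientset(hpa)

	factory := informers.NewSharedInformerFactory(client, 0)
	informer := factory.Autoscaling().V2().HorizontalPodAutoscalers().Informer()

	added := make(chan string, 1)
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			key, _ := cache.MetaNamespaceKeyFunc(obj)
			added <- key
		},
	})

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	factory.Start(ctx.Done())
	cache.WaitForCacheSync(ctx.Done(), informer.HasSynced)

	select {
	case key := <-added:
		fmt.Println("AddFunc fired for", key)
	case <-time.After(5 * time.Second):
		fmt.Println("timed out waiting for the add event")
	}
}
```

Because the object is already in the tracker when the informer performs its initial List, the add is delivered through the delta FIFO and does not depend on winning the watch-registration race.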
### What does this PR do?

Fix the sporadically failing `pkg/util/kubernetes/apiserver/controllers.TestAutoscalerController` test on `main` by removing the race condition between `Create()` and the fake client's reflector setup.

Pre-loads the first `mockedHPA` into the fake client tracker (mirroring the pattern already used by the WPA sister test), so the informer's initial `List()` surfaces it via a delta and `AddFunc` fires from there — independent of whether the Watch is registered yet.

Also fixes a latent bug where a single `time.NewTimer` was reused across three `select` blocks, and removes an `autoscalersListerSynced = true` override that bypassed the cache-sync barrier.

### Motivation

Tracked by [CONTINT-4838](https://datadoghq.atlassian.net/browse/CONTINT-4838). Datadog test telemetry on `main` over the last 30 days:

| Metric | Value |
|---|---|
| Pass | 7562 |
| Fail | 35 |
| Failure rate | ~0.46% |
| avg(duration \| pass) | 581 ms |
| p95(duration \| pass) | 636 ms |
| p99(duration \| pass) | 649 ms |
| max(duration \| pass) | 840 ms |
| avg(duration \| fail) | 10 086 ms (= timeout) |

The distribution is **bimodal** — the test either passes in well under a second or sits at exactly 10 s and times out. There is no "slow but eventually succeeds" regime. All recent failures (Linux x64, Windows x64, macOS arm64) hit the same line — `hpa_controller_test.go:346`, the first `select` on `hctrl.autoscalers`, `<-timeout.C` branch.

Root cause: the test's `c.HorizontalPodAutoscalers("nsfoo").Create(...)` runs in parallel with the reflector's `ListAndWatch` setup. If the test goroutine wins the race — i.e. `Create()` runs before the reflector calls `WatchWithContext` and registers a watcher on the fake client tracker — then the fake client's `tracker.add` broadcasts the ADDED event to **zero** watchers and silently drops it. The lister has the object (it lives in the tracker's `objects` map), but no `AddFunc` ever fires, so `hpaQueue` stays empty, the worker never writes to `hctrl.autoscalers`, and the `select` waits the full 10 s before failing.

### Why bumping the timeout (PR #49881) does not help

PR #49881 proposes raising the timeout from 10 s to 20 s. The bimodal distribution (max-pass 840 ms, fail 10 086 ms) makes that ineffective: there is no scenario where the test needs 12-15 s and is being denied; the failure mode is "event lost forever." A higher timeout simply makes each CI failure cost 20 s instead of 10 s, while the flake rate stays the same. Closing #49881 in favor of this PR is recommended.

### Describe how you validated your changes

- All 37 tests in the package still pass: `dda inv test --targets=./pkg/util/kubernetes/apiserver/controllers`.
- 500/500 iterations of `TestAutoscalerController` pass with `-race`, ~633 ms each: `dda inv test --targets=./pkg/util/kubernetes/apiserver/controllers -e '^TestAutoscalerController$' -r -x '-count=500 -timeout=15m'`. Given the pre-fix ~0.46% failure rate, ~2-3 failures would be expected across 500 runs; observed: 0.
- The fix mirrors the WPA sister test pattern (`wpa_controller_test.go:263`), which is already structurally race-free.

### Additional Notes

The latent timer bug (single `time.NewTimer(10s)` reused across three `select` blocks) is fixed at the same time: a `time.Timer` does not re-arm after firing, so the second/third selects could see an already-expired timer and fail in unrelated phases. Each select now uses an independent `time.After(timeoutDuration)` (5 s, matching the surrounding `EventuallyWithTf` calls).
The `autoscalersListerSynced = func() bool { return true }` override in `newFakeAutoscalerController` is removed because, with the HPA pre-loaded, the real `informer.HasSynced` flips to `true` shortly after `inf.Start` and the test now exercises the production cache-sync path in `runHPA`.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

[CONTINT-4838]: https://datadoghq.atlassian.net/browse/CONTINT-4838?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Co-authored-by: lenaic.huard <lenaic.huard@datadoghq.com>
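To make the timer detail from the Additional Notes concrete, here is a minimal, self-contained sketch of the pitfall and the per-select deadline pattern; all names are stand-ins, not the actual test code:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	events := make(chan string, 1) // stand-in for the channel the test waits on
	events <- "phase-1"

	// Buggy pattern: a single timer shared by every wait. If the deadline
	// elapses during an earlier phase, a later select takes <-timeout.C
	// immediately; and once the single value on timeout.C has been received,
	// later selects have no deadline at all, because a *time.Timer does not
	// re-arm after firing.
	timeout := time.NewTimer(10 * time.Second)
	defer timeout.Stop()
	select {
	case v := <-events:
		fmt.Println("phase 1 got:", v)
	case <-timeout.C:
		fmt.Println("phase 1 timed out")
	}

	// Fixed pattern: each wait takes its own fresh deadline via time.After.
	select {
	case v := <-events:
		fmt.Println("phase 2 got:", v)
	case <-time.After(5 * time.Second):
		fmt.Println("phase 2 timed out (expected here: no second event is sent)")
	}
}
```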