Commit 43dc2e6
[CONTINT-4838] Fix race in TestAutoscalerController (#49900)
### What does this PR do?
Fix the sporadically failing `pkg/util/kubernetes/apiserver/controllers.TestAutoscalerController` test on `main` by removing the race condition between `Create()` and the fake client's reflector setup.
Pre-loads the first `mockedHPA` into the fake client tracker (mirroring the pattern already used by the WPA sister test), so the informer's initial `List()` surfaces it via a delta and `AddFunc` fires from there, independent of whether the Watch is registered yet. Also fixes a latent bug where a single `time.NewTimer` was reused across three `select` blocks, and removes an `autoscalersListerSynced` override (hard-coded to return `true`) that bypassed the cache-sync barrier.
### Motivation
Tracked by [CONTINT-4838](https://datadoghq.atlassian.net/browse/CONTINT-4838).
Datadog test telemetry on `main` over the last 30 days:
| Metric | Value |
|---|---|
| Pass | 7562 |
| Fail | 35 |
| Failure rate | ~0.46% |
| avg(duration \| pass) | 581 ms |
| p95(duration \| pass) | 636 ms |
| p99(duration \| pass) | 649 ms |
| max(duration \| pass) | 840 ms |
| avg(duration \| fail) | 10 086 ms (= timeout) |
The distribution is **bimodal**: the test either passes in well under a second or sits at exactly 10 s and times out. There is no "slow but eventually succeeds" regime. All recent failures (Linux x64, Windows x64, macOS arm64) hit the same line: `hpa_controller_test.go:346`, the first `select` on `hctrl.autoscalers`, `<-timeout.C` branch.
Root cause: the test's `c.HorizontalPodAutoscalers("nsfoo").Create(...)` runs in parallel with the reflector's `ListAndWatch` setup. If the test goroutine wins the race (i.e. `Create()` runs before the reflector calls `WatchWithContext` and registers a watcher on the fake client tracker), then the fake client's `tracker.add` broadcasts the ADDED event to **zero** watchers and silently drops it. The lister has the object (it lives in the tracker's `objects` map), but no `AddFunc` ever fires, so `hpaQueue` stays empty, the worker never writes to `hctrl.autoscalers`, and the `select` waits the full 10 s before failing.
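The lost-event mechanism can be illustrated with a toy model. This is a minimal sketch, not the client-go fake tracker: the `tracker` type, its methods, and `listThenWatch` are hypothetical names, and placing the create in the gap between the snapshot and the watch registration is an assumption about the exact window.

```go
package main

import (
	"fmt"
	"sync"
)

// tracker is a toy stand-in (hypothetical) for a fake client's object
// tracker: it stores objects and broadcasts ADDED events to current watchers.
type tracker struct {
	mu       sync.Mutex
	objects  []string
	watchers []chan string
}

// add stores the object and notifies only the watchers registered right now.
// With zero watchers the event is dropped, yet the object stays listable.
func (t *tracker) add(name string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.objects = append(t.objects, name)
	for _, w := range t.watchers {
		w <- name
	}
}

// watch registers a buffered channel that receives future events only.
func (t *tracker) watch() <-chan string {
	t.mu.Lock()
	defer t.mu.Unlock()
	w := make(chan string, 8)
	t.watchers = append(t.watchers, w)
	return w
}

// list returns a snapshot of every stored object.
func (t *tracker) list() []string {
	t.mu.Lock()
	defer t.mu.Unlock()
	return append([]string(nil), t.objects...)
}

// listThenWatch mimics an informer start: snapshot first, subscribe second.
// inGap, if non-nil, runs between the two steps to simulate a racing Create.
func listThenWatch(t *tracker, inGap func()) []string {
	seen := t.list()
	if inGap != nil {
		inGap()
	}
	events := t.watch()
	for {
		select {
		case e := <-events:
			seen = append(seen, e)
		default:
			return seen // nothing queued: stop without blocking
		}
	}
}

func main() {
	// Racy order: the create lands after the snapshot but before the watch.
	racy := &tracker{}
	lost := listThenWatch(racy, func() { racy.add("hpa") })
	fmt.Println(len(lost), len(racy.list())) // 0 1: stored, never observed

	// Pre-loaded order: the object exists before the snapshot is taken.
	preloaded := &tracker{}
	preloaded.add("hpa")
	fmt.Println(listThenWatch(preloaded, nil)) // [hpa]
}
```

In the racy case the object is stored but the consumer never hears about it, which matches the "event lost forever" behaviour described above. Pre-loading removes the dependency on watch timing because the snapshot itself carries the object.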
### Why bumping the timeout (PR #49881) does not help
PR #49881 proposes raising the timeout from 10 s to 20 s. The bimodal distribution (max-pass 840 ms, fail 10 086 ms) makes that ineffective: there is no scenario where the test needs 12-15 s and is being denied; the failure mode is "event lost forever." A higher timeout simply makes each CI failure cost 20 s instead of 10 s, while the flake rate stays the same. Closing #49881 in favor of this PR is recommended.
### Describe how you validated your changes
- All 37 tests in the package still pass: `dda inv test --targets=./pkg/util/kubernetes/apiserver/controllers`.
- 500/500 iterations of `TestAutoscalerController` pass with `-race`, ~633 ms each: `dda inv test --targets=./pkg/util/kubernetes/apiserver/controllers -e '^TestAutoscalerController$' -r -x '-count=500 -timeout=15m'`. Given the pre-fix ~0.46% failure rate, ~2-3 failures would be expected across 500 runs; observed: 0.
- The fix mirrors the WPA sister test pattern (`wpa_controller_test.go:263`), which is already structurally race-free.
### Additional Notes
The latent timer bug (single `time.NewTimer(10s)` reused across three `select` blocks) is fixed at the same time: a `time.Timer` does not re-arm after firing, so the second/third selects could see an already-expired timer and fail in unrelated phases. Each select now uses an independent `time.After(timeoutDuration)` (5 s, matching the surrounding `EventuallyWithTf` calls).
The `autoscalersListerSynced = func() bool { return true }` override in `newFakeAutoscalerController` is removed because, with the HPA pre-loaded, the real `informer.HasSynced` flips to `true` shortly after `inf.Start` and the test now exercises the production cache-sync path in `runHPA`.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
[CONTINT-4838]: https://datadoghq.atlassian.net/browse/CONTINT-4838?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
Co-authored-by: lenaic.huard <lenaic.huard@datadoghq.com>

1 parent f7bbaf9 commit 43dc2e6
1 file changed
Lines changed: 22 additions & 24 deletions