Skip to content

Commit 43dc2e6

Browse files
authored
[CONTINT-4838] Fix race in TestAutoscalerController (#49900)
### What does this PR do? Fix the sporadically failing `pkg/util/kubernetes/apiserver/controllers.TestAutoscalerController` test on `main` by removing the race condition between `Create()` and the fake client's reflector setup. Pre-loads the first `mockedHPA` into the fake client tracker (mirroring the pattern already used by the WPA sister test), so the informer's initial `List()` surfaces it via a delta and `AddFunc` fires from there — independent of whether the Watch is registered yet. Also fixes a latent bug where a single `time.NewTimer` was reused across three `select` blocks, and removes a `autoscalersListerSynced = true` override that bypassed the cache-sync barrier. ### Motivation Tracked by [CONTINT-4838](https://datadoghq.atlassian.net/browse/CONTINT-4838). Datadog test telemetry on `main` over the last 30 days: | Metric | Value | |---|---| | Pass | 7562 | | Fail | 35 | | Failure rate | ~0.46% | | avg(duration \| pass) | 581 ms | | p95(duration \| pass) | 636 ms | | p99(duration \| pass) | 649 ms | | max(duration \| pass) | 840 ms | | avg(duration \| fail) | 10 086 ms (= timeout) | The distribution is **bimodal** — the test either passes in well under a second or sits at exactly 10 s and times out. There is no \"slow but eventually succeeds\" regime. All recent failures (Linux x64, Windows x64, macOS arm64) hit the same line — `hpa_controller_test.go:346`, the first `select` on `hctrl.autoscalers`, `<-timeout.C` branch. Root cause: the test's `c.HorizontalPodAutoscalers(\"nsfoo\").Create(...)` runs in parallel with the reflector's `ListAndWatch` setup. If the test goroutine wins the race — i.e. `Create()` runs before the reflector calls `WatchWithContext` and registers a watcher on the fake client tracker — then the fake client's `tracker.add` broadcasts the ADDED event to **zero** watchers and silently drops it. The lister has the object (it lives in the tracker's `objects` map), but no `AddFunc` ever fires, so `hpaQueue` stays empty, the worker never writes to `hctrl.autoscalers`, and the `select` waits the full 10 s before failing. ### Why bumping the timeout (PR #49881) does not help PR #49881 proposes raising the timeout from 10 s to 20 s. The bimodal distribution (max-pass 840 ms, fail 10 086 ms) makes that ineffective: there is no scenario where the test needs 12-15 s and is being denied; the failure mode is \"event lost forever.\" A higher timeout simply makes each CI failure cost 20 s instead of 10 s, while the flake rate stays the same. Closing #49881 in favor of this PR is recommended. ### Describe how you validated your changes - All 37 tests in the package still pass: `dda inv test --targets=./pkg/util/kubernetes/apiserver/controllers`. - 500/500 iterations of `TestAutoscalerController` pass with `-race`, ~633 ms each: `dda inv test --targets=./pkg/util/kubernetes/apiserver/controllers -e '^TestAutoscalerController\$' -r -x '-count=500 -timeout=15m'`. Given the pre-fix ~0.46% failure rate, ~2-3 failures would be expected across 500 runs; observed: 0. - The fix mirrors the WPA sister test pattern (`wpa_controller_test.go:263`), which is already structurally race-free. ### Additional Notes The latent timer bug (single `time.NewTimer(10s)` reused across three `select` blocks) is fixed at the same time: a `time.Timer` does not re-arm after firing, so the second/third selects could see an already-expired timer and fail in unrelated phases. Each select now uses an independent `time.After(timeoutDuration)` (5 s, matching the surrounding `EventuallyWithTf` calls). The `autoscalersListerSynced = func() bool { return true }` override in `newFakeAutoscalerController` is removed because, with the HPA pre-loaded, the real `informer.HasSynced` flips to `true` shortly after `inf.Start` and the test now exercises the production cache-sync path in `runHPA`. 🤖 Generated with [Claude Code](https://claude.com/claude-code) [CONTINT-4838]: https://datadoghq.atlassian.net/browse/CONTINT-4838?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ Co-authored-by: lenaic.huard <lenaic.huard@datadoghq.com>
1 parent f7bbaf9 commit 43dc2e6

1 file changed

Lines changed: 22 additions & 24 deletions

File tree

pkg/util/kubernetes/apiserver/controllers/hpa_controller_test.go

Lines changed: 22 additions & 24 deletions
Original file line numberDiff line numberDiff line change
@@ -20,6 +20,7 @@ import (
2020
autoscalingv2 "k8s.io/api/autoscaling/v2beta1"
2121
corev1 "k8s.io/api/core/v1"
2222
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
23+
"k8s.io/apimachinery/pkg/runtime"
2324
"k8s.io/apimachinery/pkg/types"
2425
"k8s.io/client-go/informers"
2526
"k8s.io/client-go/kubernetes"
@@ -41,8 +42,8 @@ const (
4142
hpaResource = "horizontalpodautoscalers"
4243
)
4344

44-
func newClient() kubernetes.Interface {
45-
client := fake.NewSimpleClientset()
45+
func newClient(objects ...runtime.Object) kubernetes.Interface {
46+
client := fake.NewSimpleClientset(objects...)
4647
client.Resources = []*metav1.APIResourceList{
4748
{
4849
GroupVersion: autoscalingGroup + "/v2beta1",
@@ -58,8 +59,8 @@ func newClient() kubernetes.Interface {
5859
return client
5960
}
6061

61-
func newFakeConfigMapStore(t *testing.T, ns, name string, metrics map[string]custommetrics.ExternalMetricValue) (custommetrics.Store, kubernetes.Interface) {
62-
client := newClient()
62+
func newFakeConfigMapStore(t *testing.T, ns, name string, metrics map[string]custommetrics.ExternalMetricValue, objects ...runtime.Object) (custommetrics.Store, kubernetes.Interface) {
63+
client := newClient(objects...)
6364
store, err := custommetrics.NewConfigMapStore(client, ns, name)
6465
require.NoError(t, err)
6566
err = store.SetExternalMetricValues(metrics)
@@ -104,8 +105,6 @@ func newFakeAutoscalerController(t *testing.T, client kubernetes.Interface, isLe
104105
)
105106
autoscalerController.enableHPA(client, informerFactory)
106107

107-
autoscalerController.autoscalersListerSynced = func() bool { return true }
108-
109108
return autoscalerController, informerFactory
110109
}
111110

@@ -287,7 +286,20 @@ func TestAutoscalerController(t *testing.T) {
287286

288287
penTime := (int(time.Now().Unix()) - int(maxAge.Seconds()/2)) * 1000
289288
name := custommetrics.GetConfigmapName()
290-
store, client := newFakeConfigMapStore(t, "nsfoo", name, nil)
289+
290+
// Pre-load via the fake client tracker so the reflector's initial List
291+
// surfaces the HPA. Creating after Start races with Watch registration
292+
// and can drop the ADDED event.
293+
mockedHPA := newFakeHorizontalPodAutoscaler(
294+
"hpa_1",
295+
"nsfoo",
296+
"1",
297+
"foo",
298+
map[string]string{"foo": "bar"},
299+
)
300+
mockedHPA.Annotations = makeAnnotations("foo", map[string]string{"foo": "bar"})
301+
302+
store, client := newFakeConfigMapStore(t, "nsfoo", name, nil, mockedHPA)
291303
metricName := "foo"
292304
ddSeries := []datadog.Series{
293305
{
@@ -322,27 +334,13 @@ func TestAutoscalerController(t *testing.T) {
322334
c := client.AutoscalingV2beta1()
323335
require.NotNil(t, c)
324336

325-
mockedHPA := newFakeHorizontalPodAutoscaler(
326-
"hpa_1",
327-
"nsfoo",
328-
"1",
329-
"foo",
330-
map[string]string{"foo": "bar"},
331-
)
332-
mockedHPA.Annotations = makeAnnotations("foo", map[string]string{"foo": "bar"})
333-
334-
_, err := c.HorizontalPodAutoscalers("nsfoo").Create(context.TODO(), mockedHPA, metav1.CreateOptions{})
335-
require.NoError(t, err)
336-
337337
timeoutDuration := 5 * time.Second
338338
retryPeriod := 500 * time.Millisecond
339-
timeout := time.NewTimer(10 * time.Second)
340-
defer timeout.Stop()
341339

342340
select {
343341
case key := <-hctrl.autoscalers:
344342
t.Logf("hctrl process key:%s", key)
345-
case <-timeout.C:
343+
case <-time.After(timeoutDuration):
346344
require.FailNow(t, "Timeout waiting for HPAs to update")
347345
}
348346

@@ -403,7 +401,7 @@ func TestAutoscalerController(t *testing.T) {
403401
select {
404402
case key := <-hctrl.autoscalers:
405403
t.Logf("hctrl process key:%s", key)
406-
case <-timeout.C:
404+
case <-time.After(timeoutDuration):
407405
require.FailNow(t, "Timeout waiting for HPAs to update")
408406
}
409407
storedHPA, err = hctrl.autoscalersLister.ByNamespace(mockedHPA.Namespace).Get(mockedHPA.Name)
@@ -453,7 +451,7 @@ func TestAutoscalerController(t *testing.T) {
453451
select {
454452
case key := <-hctrl.autoscalers:
455453
t.Logf("hctrl process key:%s", key)
456-
case <-timeout.C:
454+
case <-time.After(timeoutDuration):
457455
require.FailNow(t, "Timeout waiting for HPAs to update")
458456
}
459457

0 commit comments

Comments
 (0)