Skip to content

Fix intermittently failing leader-related tests#862

Merged
brandur merged 1 commit into
masterfrom
brandur-fix-intermittent-leader-tests
Apr 28, 2025
Merged

Fix intermittently failing leader-related tests#862
brandur merged 1 commit into
masterfrom
brandur-fix-intermittent-leader-tests

Conversation

@brandur
Copy link
Copy Markdown
Contributor

@brandur brandur commented Apr 26, 2025

There's a bunch of clock-based testing happening in the leader tests for
drivers, resulting in intermittent failures like this one [1]:

--- FAIL: TestDriverRiverPgxV5 (23.18s)
    driver_test.go:73: Reusing idle schema "river_2025_04_26t17_52_50_schema_07" after cleaning in 15.347655ms [7 generated] [6 reused]
    --- FAIL: TestDriverRiverPgxV5/LeaderAttemptElect (0.00s)
        --- FAIL: TestDriverRiverPgxV5/LeaderAttemptElect/ElectsLeader (0.22s)
            riverdrivertest.go:2244: TestTx using schema: river_2025_04_26t17_52_50_schema_06
            riverdrivertest.go:2258:
                        Error Trace:	/home/runner/work/river/river/internal/riverinternaltest/riverdrivertest/riverdrivertest.go:2258
                        Error:      	Max difference between 2025-04-26 17:53:14.813635844 +0000 UTC m=+24.487676847 and 2025-04-26 17:53:14.59636 +0000 UTC allowed is 100ms, but difference was 217.275844ms
                        Test:       	TestDriverRiverPgxV5/LeaderAttemptElect/ElectsLeader
FAIL
FAIL	github.com/riverqueue/river	31.602s

As with all these other clock-based failures, the problem is that in
slower environments like GitHub Actions, goroutines may be paused for
extended periods, so by the time they wake up time.Now() might be
significantly different, and easily off by a few hundred milliseconds.

Here, make the clock for these functions injectable so we can inject our
own time and get to better test reliability. I'm not necessarily against
other approaches either, but I'm still getting reasonable frequent
failures on these, so I'm definitely in favor of getting some sort of
fix into place.

A side benefit is that it makes clock injection from the client-level a
little more thorough as well.

[1] https://github.com/riverqueue/river/actions/runs/14683687209/job/41209541920

There's a bunch of clock-based testing happening in the leader tests for
drivers, resulting in intermittent failures like this one [1]:

    --- FAIL: TestDriverRiverPgxV5 (23.18s)
        driver_test.go:73: Reusing idle schema "river_2025_04_26t17_52_50_schema_07" after cleaning in 15.347655ms [7 generated] [6 reused]
        --- FAIL: TestDriverRiverPgxV5/LeaderAttemptElect (0.00s)
            --- FAIL: TestDriverRiverPgxV5/LeaderAttemptElect/ElectsLeader (0.22s)
                riverdrivertest.go:2244: TestTx using schema: river_2025_04_26t17_52_50_schema_06
                riverdrivertest.go:2258:
                            Error Trace:	/home/runner/work/river/river/internal/riverinternaltest/riverdrivertest/riverdrivertest.go:2258
                            Error:      	Max difference between 2025-04-26 17:53:14.813635844 +0000 UTC m=+24.487676847 and 2025-04-26 17:53:14.59636 +0000 UTC allowed is 100ms, but difference was 217.275844ms
                            Test:       	TestDriverRiverPgxV5/LeaderAttemptElect/ElectsLeader
    FAIL
    FAIL	github.com/riverqueue/river	31.602s

As with all these other clock-based failures, the problem is that in
slower environments like GitHub Actions, goroutines may be paused for
extended periods, so by the time they wake up `time.Now()` might be
significantly different, and easily off by a few hundred milliseconds.

Here, make the clock for these functions injectable so we can inject our
own time and get to better test reliability. I'm not necessarily against
other approaches either, but I'm still getting reasonable frequent
failures on these, so I'm definitely in favor of getting some sort of
fix into place.

A side benefit is that it makes clock injection from the client-level a
little more thorough as well.

[1] https://github.com/riverqueue/river/actions/runs/14683687209/job/41209541920

const leaderTTL = 10 * time.Second

t.Run("LeaderDeleteExpired", func(t *testing.T) {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just moved this set of tests down so it's in alphabetical order and easier to find.

@brandur brandur requested a review from bgentry April 26, 2025 21:42
Copy link
Copy Markdown
Contributor

@bgentry bgentry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good find and fix, thanks 🚀

@brandur
Copy link
Copy Markdown
Contributor Author

brandur commented Apr 28, 2025

ty.

@brandur brandur merged commit f319fd6 into master Apr 28, 2025
19 of 20 checks passed
@brandur brandur deleted the brandur-fix-intermittent-leader-tests branch April 28, 2025 05:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants