Go: Fix flaky TestScriptKillWithRoute race condition #5950

Merged
xShinnRyuu merged 2 commits into main from fix/go-flaky-test-script-kill-with-route-5576 on May 21, 2026

Conversation

Collaborator

@xShinnRyuu xShinnRyuu commented May 13, 2026

Summary

Fix flaky TestGlideTestSuite/TestScriptKillWithRoute test that intermittently fails with "NotBusy: No scripts in execution right now" errors and timeouts.

Issue link

This Pull Request is linked to issue: [Go][Flaky Test] TestGlideTestSuite/TestScriptKillWithRoute
Closes #5576

Features / Behaviour Changes

No behaviour changes. This PR fixes test flakiness only.

Implementation

Root cause: The test launches a long-running Lua script in a goroutine, then immediately starts polling SCRIPT KILL. In slow CI environments, the script may not have started executing on the server yet, so every kill attempt returns "NotBusy". The 5-second timeout then expires before the script begins, causing the test to fail.

Fix:

  1. Increased script duration from 6s to 10s to ensure it is still running when the kill succeeds (prevents a race where the script finishes before we can kill it).
  2. Increased the kill polling timeout from 5s to 8s to accommodate slow script startup in CI environments.
  3. Replaced the fixed time.Sleep(1 * time.Second) at the end with a deterministic polling loop that waits for the server to confirm "notbusy" state, eliminating another potential timing issue.
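
The polling approach described above can be illustrated with a minimal, self-contained sketch. It does not use the real client API: `fakeServer`, `pollUntil`, and `killAfterSlowStart` are hypothetical names invented for this example, and the fake server only imitates the "NotBusy" reply. The sketch assumes the script starts late, retries the kill until a deadline, and then polls for the "NotBusy" state instead of sleeping for a fixed time.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// errNotBusy stands in for the server's "NotBusy" reply (assumption for this sketch).
var errNotBusy = errors.New("NotBusy: No scripts in execution right now")

// fakeServer is a hypothetical in-memory stand-in for the server.
type fakeServer struct {
	mu      sync.Mutex
	running bool
}

// start marks a script as running.
func (s *fakeServer) start() {
	s.mu.Lock()
	s.running = true
	s.mu.Unlock()
}

// kill imitates SCRIPT KILL: it succeeds only while a script is running.
func (s *fakeServer) kill() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if !s.running {
		return errNotBusy
	}
	s.running = false
	return nil
}

// pollUntil calls fn every interval until it returns true or timeout elapses.
func pollUntil(timeout, interval time.Duration, fn func() bool) bool {
	deadline := time.Now().Add(timeout)
	for {
		if fn() {
			return true
		}
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(interval)
	}
}

// killAfterSlowStart simulates a script that starts after startDelayMs and
// reports whether the kill succeeded and the server then reported NotBusy.
func killAfterSlowStart(startDelayMs, timeoutMs int) bool {
	srv := &fakeServer{}
	go func() {
		time.Sleep(time.Duration(startDelayMs) * time.Millisecond)
		srv.start()
	}()
	timeout := time.Duration(timeoutMs) * time.Millisecond

	// Retry the kill until it succeeds; NotBusy just means "not started yet".
	if !pollUntil(timeout, 5*time.Millisecond, func() bool { return srv.kill() == nil }) {
		return false
	}
	// Instead of a fixed sleep, wait until the server confirms NotBusy.
	return pollUntil(timeout, 5*time.Millisecond, func() bool {
		return errors.Is(srv.kill(), errNotBusy)
	})
}

func main() {
	fmt.Println("generous timeout:", killAfterSlowStart(50, 1000))
	fmt.Println("timeout shorter than startup:", killAfterSlowStart(300, 50))
}
```

The second call in `main` mirrors the original failure mode: when the polling timeout is shorter than the script's startup delay, every kill attempt sees "NotBusy" and the poll gives up.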

Limitations

None

Testing

  • Verified the fix compiles and passes gofmt checks.
  • The polling pattern matches the approach used in testFunctionKillNoWrite, which handles the same race condition reliably.
  • Tested locally 250 times without a flaky failure.

Checklist

  • This Pull Request is related to one issue.
  • Commit message has a detailed description of what changed and why.
  • Tests are added or updated.
  • CHANGELOG.md and documentation files are updated.
  • Linters have been run.
  • Destination branch is correct - main or release

@xShinnRyuu xShinnRyuu requested a review from a team as a code owner May 13, 2026 20:18
… sleep with polling

The test was flaky because:
1. The script invoked in a goroutine hadn't started executing on the server
   before SCRIPT KILL was attempted, causing 'NotBusy' errors.
2. The 5-second timeout was insufficient for slow CI environments.
3. The fixed time.Sleep(1s) at the end was unreliable for confirming
   the script was no longer running.

Changes:
- Increase script duration from 6s to 10s to ensure it's still running
  when the kill succeeds.
- Increase kill polling timeout from 5s to 8s to accommodate slow
  script startup in CI.
- Replace time.Sleep(1s) at the end with a polling loop that waits
  for the 'notbusy' state, making the test deterministic.

Signed-off-by: Thomas Zhou <thomas.zhou@improving.com>
Signed-off-by: Thomas Zhou <thomaszhou64@gmail.com>
Restructure the test to match the working testFunctionKillNoWrite pattern:
- Run InvokeScriptWithRoute in the main thread (blocking) to guarantee
  the script is executing on the server before kill is attempted
- Run ScriptKill polling in a goroutine
- Use a longer request timeout (12s) so the invoke call blocks until killed

This eliminates the race condition where ScriptKill was called before
the script started, causing 'NotBusy' errors.

Verified: 250/250 sequential runs with 0 failures.

Fixes #5576

Signed-off-by: Thomas Zhou <thomaszhou64@gmail.com>
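
The restructured shape from the second commit (blocking invoke in the main thread, kill polling in a goroutine) can be sketched as follows. This is an illustration under stated assumptions, not the actual test code: `fakeServer`, `invoke`, `kill`, and `runKillTest` are hypothetical names, and the fake server merely imitates a script that blocks until it finishes or is killed. The kill goroutine in this sketch still tolerates "NotBusy" replies by retrying.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

var (
	// Both errors are stand-ins invented for this sketch.
	errNotBusy = errors.New("NotBusy: No scripts in execution right now")
	errKilled  = errors.New("script killed")
)

// fakeServer is a hypothetical in-memory stand-in for the server.
type fakeServer struct {
	mu     sync.Mutex
	killCh chan struct{} // non-nil only while a script is running
}

// invoke blocks like a long-running script. It returns nil if the script
// runs for the full duration and errKilled if it is killed first.
func (s *fakeServer) invoke(d time.Duration) error {
	ch := make(chan struct{})
	s.mu.Lock()
	s.killCh = ch
	s.mu.Unlock()
	defer func() {
		s.mu.Lock()
		s.killCh = nil
		s.mu.Unlock()
	}()
	select {
	case <-ch:
		return errKilled
	case <-time.After(d):
		return nil
	}
}

// kill imitates SCRIPT KILL: it succeeds only while a script is running.
func (s *fakeServer) kill() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.killCh == nil {
		return errNotBusy
	}
	close(s.killCh)
	s.killCh = nil
	return nil
}

// runKillTest polls kill in a goroutine while the calling goroutine blocks
// in invoke. It reports whether invoke was interrupted and whether a kill
// attempt succeeded before the polling timeout.
func runKillTest(scriptMs, killTimeoutMs int) (bool, bool) {
	srv := &fakeServer{}
	killResult := make(chan bool, 1)
	go func() {
		deadline := time.Now().Add(time.Duration(killTimeoutMs) * time.Millisecond)
		for time.Now().Before(deadline) {
			if srv.kill() == nil {
				killResult <- true
				return
			}
			time.Sleep(5 * time.Millisecond) // NotBusy: retry shortly
		}
		killResult <- false
	}()
	// Blocking call in the main goroutine, as in the restructured test.
	err := srv.invoke(time.Duration(scriptMs) * time.Millisecond)
	return errors.Is(err, errKilled), <-killResult
}

func main() {
	interrupted, killed := runKillTest(2000, 1000)
	fmt.Println("invoke interrupted:", interrupted, "kill succeeded:", killed)
}
```

Because the invoke call only returns once the script is killed or finishes, the test body cannot move past it while the script is still pending, which is the property the commit message relies on.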
@xShinnRyuu xShinnRyuu force-pushed the fix/go-flaky-test-script-kill-with-route-5576 branch from 836f675 to b3f6416 Compare May 13, 2026 22:55
Collaborator

@jamesx-improving jamesx-improving left a comment

LGTM

@xShinnRyuu xShinnRyuu merged commit 79ccfa2 into main May 21, 2026
20 checks passed
@xShinnRyuu xShinnRyuu deleted the fix/go-flaky-test-script-kill-with-route-5576 branch May 21, 2026 16:14
Development

Successfully merging this pull request may close these issues.

[Go][Flaky Test] TestGlideTestSuite/TestScriptKillWithRoute

3 participants