Skip to content

fix: evict a crashed worker from the container so destroy/reset don't hang#48

Closed
rustatian wants to merge 2 commits into
masterfrom
fix/evict-crashed-worker-from-container
Closed

fix: evict a crashed worker from the container so destroy/reset don't hang#48
rustatian wants to merge 2 commits into
masterfrom
fix/evict-crashed-worker-from-container

Conversation

@rustatian

@rustatian rustatian commented Jun 4, 2026

Copy link
Copy Markdown
Member

When a worker crashes unexpectedly, worker_watcher.wait() removed it from the workers map but left it parked (in StateErrored) in the container channel, then allocated a replacement on top — so container.Len() stayed at numWorkers + 1. The stale entry was only evicted lazily by a later Take. With an idle pool (no Take between the crash and shutdown), Destroy/Reset — which spin until numWorkers == container.Len() — never settle and hang until the destroy/reset timeout fires, then force-kill.

Evict the dead worker via w.Callback() (container.Remove()) before allocating the replacement, mirroring the existing Reset/Destroy cleanup. After Wait() the worker is in StateErrored, so Remove() drops it.

Adds a regression test (TestWorkerWatcher_Wait_EvictsCrashedWorkerFromContainer): kill one worker, then assert the container returns to numWorkers (not numWorkers + 1).

Surfaced by the roadrunner-temporal plugin: an idle activity pool that loses a worker hung shutdown for the full destroy timeout (~60s).

Summary by CodeRabbit

  • Bug Fixes

    • Fixed handling of unexpectedly crashed worker processes so they’re evicted from the internal queue and automatically replaced, preventing cleanup/reset operations from hanging.
  • Tests

    • Added an integration-style test that verifies a crashed worker is evicted and a replacement is allocated, ensuring container counts return to the configured worker count after a crash.

…replacement

When a worker crashed unexpectedly, wait() removed it from the workers map but left it parked (in StateErrored) in the container channel and then allocated a replacement on top, leaving container.Len() at numWorkers+1. The stale entry was only evicted lazily by a later Take, so with an idle pool (no Take) Destroy/Reset — which wait for numWorkers == container.Len() — never settled and hung until the destroy/reset timeout fired.

Evict the dead worker via w.Callback() (container.Remove()) before allocating the replacement, mirroring the Reset/Destroy cleanup. Adds a regression test.
Copilot AI review requested due to automatic review settings June 4, 2026 20:04
@coderabbitai

coderabbitai Bot commented Jun 4, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3fb4b14e-7350-4d39-8038-b1f77f30ae3a

📥 Commits

Reviewing files that changed from the base of the PR and between cd51561 and 52c7aad.

📒 Files selected for processing (1)
  • worker_watcher/worker_watcher_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • worker_watcher/worker_watcher_test.go

📝 Walkthrough

Walkthrough

The watcher now calls a worker's Callback when the worker crashes (not manually destroyed), evicting the dead worker from the container so container.Len() can reach the configured worker count. A new test kills a worker process and asserts the container is replenished and consistent.

Changes

Crashed Worker Eviction

Layer / File(s) Summary
Callback eviction on worker crash
worker_watcher/worker_watcher.go
wait now invokes w.Callback() for workers that crashed unexpectedly (not manually destroyed) to evict them from the internal container so container length can return to the configured worker count.
Test crash handling and eviction
worker_watcher/worker_watcher_test.go
New test TestWorkerWatcher_Wait_EvictsCrashedWorkerFromContainer kills a watched worker process, waits for a replacement allocation, and asserts container.Len() equals the expected worker count after eviction and requeue.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

I watch the burrows, quiet and keen,
A sleeper falls—no crash to screen.
I call, I nudge, the empty spot mends,
A new pup hops in, and balance sends. 🐇✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Description check ❓ Inconclusive The PR description is detailed and complete, but the PR checklist boxes are not marked and the CHANGELOG.md requirement is not addressed. Check the PR checklist items as completed and verify that CHANGELOG.md has been updated with user-facing changes.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the primary fix: evicting crashed workers from the container to prevent destroy/reset hangs.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/evict-crashed-worker-from-container

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
worker_watcher/worker_watcher_test.go (1)

334-341: ⚡ Quick win

Replace fixed sleep with deterministic settle polling.

Line 340 uses a fixed time.Sleep(500 * time.Millisecond), which can make this regression test flaky on slow CI. Poll until ww.container.Len() == 2 (with timeout) instead of sleeping.

Proposed patch
 	select {
 	case <-allocated:
 	case <-time.After(10 * time.Second):
-		t.Error("watcher did not allocate a replacement worker")
+		t.Fatal("watcher did not allocate a replacement worker")
 	}
-	time.Sleep(500 * time.Millisecond)
+	deadline := time.After(2 * time.Second)
+	for ww.container.Len() != 2 {
+		select {
+		case <-deadline:
+			t.Fatalf("container did not settle to expected size, got=%d", ww.container.Len())
+		default:
+			time.Sleep(10 * time.Millisecond)
+		}
+	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@worker_watcher/worker_watcher_test.go` around lines 334 - 341, After the
allocation select, replace the fixed time.Sleep(500 * time.Millisecond) with a
deterministic polling loop that checks ww.container.Len() until it equals 2 (or
until a timeout elapses) to avoid flakiness; implement this by creating a
timeout (e.g., time.After with a few seconds) and a short ticker (e.g.,
50-100ms) and in a select loop return when ww.container.Len() == 2 or call
t.Error/fail when the timeout channel fires, removing the hardcoded sleep and
referencing the existing allocated signal and ww.container.Len() for the
condition.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@worker_watcher/worker_watcher_test.go`:
- Line 293: Update the test comments in worker_watcher_test.go to use US
spelling to satisfy misspell lint: change "signalling" to "signaling" and
"signalled" to "signaled" in the comment near the mkCmd usage inside the test
(search for mkCmd and the surrounding handshake explanation). Ensure both
occurrences (the one referenced around line 293 and the one around line 309) are
corrected and run golangci-lint to verify the misspell check passes.

---

Nitpick comments:
In `@worker_watcher/worker_watcher_test.go`:
- Around line 334-341: After the allocation select, replace the fixed
time.Sleep(500 * time.Millisecond) with a deterministic polling loop that checks
ww.container.Len() until it equals 2 (or until a timeout elapses) to avoid
flakiness; implement this by creating a timeout (e.g., time.After with a few
seconds) and a short ticker (e.g., 50-100ms) and in a select loop return when
ww.container.Len() == 2 or call t.Error/fail when the timeout channel fires,
removing the hardcoded sleep and referencing the existing allocated signal and
ww.container.Len() for the condition.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a6ce92f5-7e06-48de-b70b-ec02afd85291

📥 Commits

Reviewing files that changed from the base of the PR and between 96b38d2 and cd51561.

📒 Files selected for processing (2)
  • worker_watcher/worker_watcher.go
  • worker_watcher/worker_watcher_test.go

Comment thread worker_watcher/worker_watcher_test.go Outdated

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a shutdown hang in WorkerWatcher when a worker crashes while parked in the container: the crashed worker could be removed from the workers map but still remain in the container channel, so after a replacement is allocated container.Len() becomes numWorkers + 1, causing Destroy/Reset to spin until timeout.

Changes:

  • Evict a crashed worker from the container (via the worker’s callback) before allocating a replacement in wait().
  • Add a regression test that kills a worker process and asserts the container length returns to numWorkers after replacement.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
worker_watcher/worker_watcher.go Evicts crashed workers from the container before replacement allocation to prevent Destroy/Reset hangs.
worker_watcher/worker_watcher_test.go Adds a regression test covering the crash-and-replace path and asserting container length invariants.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

// parked in the container channel (now in StateErrored). Evict it so container.Len() stays
// consistent with numWorkers; otherwise Destroy/Reset, which wait for
// numWorkers == container.Len(), never settle and hang until their timeout fires.
w.Callback()
Comment on lines +301 to +303
if err = cmd.Start(); err != nil {
return nil, nil, err
}
Comment on lines +291 to +294
// mkCmd starts a real (sleep) worker process and returns the worker plus its command, so
// the caller can kill the OS process directly — worker.Pid() is 0 without the goridge
// handshake, so signalling it would target the wrong process. No require.* here: mkCmd is
// also called from the watcher goroutine (via the allocator).
Comment on lines +334 to +346
// Wait until the replacement has been allocated, then let the evict+push settle.
select {
case <-allocated:
case <-time.After(10 * time.Second):
t.Error("watcher did not allocate a replacement worker")
}
time.Sleep(500 * time.Millisecond)

// The container must hold exactly numWorkers — the dead worker evicted and the replacement
// pushed. Before the fix the dead worker lingered in the channel, leaving Len() == 3,
// which makes Destroy/Reset hang waiting for numWorkers == container.Len().
assert.Equal(t, 2, ww.container.Len(), "crashed worker was not evicted from the container")

rustatian added a commit to temporalio/roadrunner-temporal that referenced this pull request Jun 5, 2026
…wn fix

The test reliably exposed a pre-existing pool teardown hang (an idle pool that lost a worker hung Destroy for ~60s), fixed separately in roadrunner-server/pool#48. Re-add the (reliable, no-workflow) version once that fix is released and bumped here.
@rustatian rustatian closed this Jun 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants