fix: evict a crashed worker from the container so destroy/reset don't hang by rustatian · Pull Request #48 · roadrunner-server/pool

rustatian · 2026-06-04T20:04:03Z

When a worker crashes unexpectedly, worker_watcher.wait() removed it from the workers map but left it parked (in StateErrored) in the container channel, then allocated a replacement on top — so container.Len() stayed at numWorkers + 1. The stale entry was only evicted lazily by a later Take. With an idle pool (no Take between the crash and shutdown), Destroy/Reset — which spin until numWorkers == container.Len() — never settle and hang until the destroy/reset timeout fires, then force-kill.

Evict the dead worker via w.Callback() (container.Remove()) before allocating the replacement, mirroring the existing Reset/Destroy cleanup. After Wait() the worker is in StateErrored, so Remove() drops it.

Adds a regression test (TestWorkerWatcher_Wait_EvictsCrashedWorkerFromContainer): kill one worker, then assert the container returns to numWorkers (not numWorkers + 1).

Surfaced by the roadrunner-temporal plugin: an idle activity pool that loses a worker hung shutdown for the full destroy timeout (~60s).

Summary by CodeRabbit

Bug Fixes
- Fixed handling of unexpectedly crashed worker processes so they’re evicted from the internal queue and automatically replaced, preventing cleanup/reset operations from hanging.
Tests
- Added an integration-style test that verifies a crashed worker is evicted and a replacement is allocated, ensuring container counts return to the configured worker count after a crash.

…replacement When a worker crashed unexpectedly, wait() removed it from the workers map but left it parked (in StateErrored) in the container channel and then allocated a replacement on top, leaving container.Len() at numWorkers+1. The stale entry was only evicted lazily by a later Take, so with an idle pool (no Take) Destroy/Reset — which wait for numWorkers == container.Len() — never settled and hung until the destroy/reset timeout fired. Evict the dead worker via w.Callback() (container.Remove()) before allocating the replacement, mirroring the Reset/Destroy cleanup. Adds a regression test.

coderabbitai · 2026-06-04T20:04:15Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3fb4b14e-7350-4d39-8038-b1f77f30ae3a

📥 Commits

Reviewing files that changed from the base of the PR and between cd51561 and 52c7aad.

📒 Files selected for processing (1)

worker_watcher/worker_watcher_test.go

🚧 Files skipped from review as they are similar to previous changes (1)

worker_watcher/worker_watcher_test.go

📝 Walkthrough

Walkthrough

The watcher now calls a worker's Callback when the worker crashes (not manually destroyed), evicting the dead worker from the container so container.Len() can reach the configured worker count. A new test kills a worker process and asserts the container is replenished and consistent.

Changes

Crashed Worker Eviction

Layer / File(s)	Summary
Callback eviction on worker crash `worker_watcher/worker_watcher.go`	`wait` now invokes `w.Callback()` for workers that crashed unexpectedly (not manually destroyed) to evict them from the internal container so container length can return to the configured worker count.
Test crash handling and eviction `worker_watcher/worker_watcher_test.go`	New test `TestWorkerWatcher_Wait_EvictsCrashedWorkerFromContainer` kills a watched worker process, waits for a replacement allocation, and asserts `container.Len()` equals the expected worker count after eviction and requeue.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

I watch the burrows, quiet and keen,
A sleeper falls—no crash to screen.
I call, I nudge, the empty spot mends,
A new pup hops in, and balance sends. 🐇✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name	Status	Explanation	Resolution
Description check	❓ Inconclusive	The PR description is detailed and complete, but the PR checklist boxes are not marked and the CHANGELOG.md requirement is not addressed.	Check the PR checklist items as completed and verify that CHANGELOG.md has been updated with user-facing changes.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely summarizes the primary fix: evicting crashed workers from the container to prevent destroy/reset hangs.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/evict-crashed-worker-from-container

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

worker_watcher/worker_watcher_test.go (1)

334-341: ⚡ Quick win

Replace fixed sleep with deterministic settle polling.

Line 340 uses a fixed time.Sleep(500 * time.Millisecond), which can make this regression test flaky on slow CI. Poll until ww.container.Len() == 2 (with timeout) instead of sleeping.

Proposed patch

 	select {
 	case <-allocated:
 	case <-time.After(10 * time.Second):
-		t.Error("watcher did not allocate a replacement worker")
+		t.Fatal("watcher did not allocate a replacement worker")
 	}
-	time.Sleep(500 * time.Millisecond)
+	deadline := time.After(2 * time.Second)
+	for ww.container.Len() != 2 {
+		select {
+		case <-deadline:
+			t.Fatalf("container did not settle to expected size, got=%d", ww.container.Len())
+		default:
+			time.Sleep(10 * time.Millisecond)
+		}
+	}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@worker_watcher/worker_watcher_test.go` around lines 334 - 341, After the
allocation select, replace the fixed time.Sleep(500 * time.Millisecond) with a
deterministic polling loop that checks ww.container.Len() until it equals 2 (or
until a timeout elapses) to avoid flakiness; implement this by creating a
timeout (e.g., time.After with a few seconds) and a short ticker (e.g.,
50-100ms) and in a select loop return when ww.container.Len() == 2 or call
t.Error/fail when the timeout channel fires, removing the hardcoded sleep and
referencing the existing allocated signal and ww.container.Len() for the
condition.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@worker_watcher/worker_watcher_test.go`:
- Line 293: Update the test comments in worker_watcher_test.go to use US
spelling to satisfy misspell lint: change "signalling" to "signaling" and
"signalled" to "signaled" in the comment near the mkCmd usage inside the test
(search for mkCmd and the surrounding handshake explanation). Ensure both
occurrences (the one referenced around line 293 and the one around line 309) are
corrected and run golangci-lint to verify the misspell check passes.

---

Nitpick comments:
In `@worker_watcher/worker_watcher_test.go`:
- Around line 334-341: After the allocation select, replace the fixed
time.Sleep(500 * time.Millisecond) with a deterministic polling loop that checks
ww.container.Len() until it equals 2 (or until a timeout elapses) to avoid
flakiness; implement this by creating a timeout (e.g., time.After with a few
seconds) and a short ticker (e.g., 50-100ms) and in a select loop return when
ww.container.Len() == 2 or call t.Error/fail when the timeout channel fires,
removing the hardcoded sleep and referencing the existing allocated signal and
ww.container.Len() for the condition.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a6ce92f5-7e06-48de-b70b-ec02afd85291

📥 Commits

Reviewing files that changed from the base of the PR and between 96b38d2 and cd51561.

📒 Files selected for processing (2)

worker_watcher/worker_watcher.go
worker_watcher/worker_watcher_test.go

Copilot

Pull request overview

Fixes a shutdown hang in WorkerWatcher when a worker crashes while parked in the container: the crashed worker could be removed from the workers map but still remain in the container channel, so after a replacement is allocated container.Len() becomes numWorkers + 1, causing Destroy/Reset to spin until timeout.

Changes:

Evict a crashed worker from the container (via the worker’s callback) before allocating a replacement in wait().
Add a regression test that kills a worker process and asserts the container length returns to numWorkers after replacement.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File	Description
`worker_watcher/worker_watcher.go`	Evicts crashed workers from the container before replacement allocation to prevent `Destroy`/`Reset` hangs.
`worker_watcher/worker_watcher_test.go`	Adds a regression test covering the crash-and-replace path and asserting container length invariants.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+	// parked in the container channel (now in StateErrored). Evict it so container.Len() stays
+	// consistent with numWorkers; otherwise Destroy/Reset, which wait for
+	// numWorkers == container.Len(), never settle and hang until their timeout fires.
+	w.Callback()


+		if err = cmd.Start(); err != nil {
+			return nil, nil, err
+		}


+	// mkCmd starts a real (sleep) worker process and returns the worker plus its command, so
+	// the caller can kill the OS process directly — worker.Pid() is 0 without the goridge
+	// handshake, so signalling it would target the wrong process. No require.* here: mkCmd is
+	// also called from the watcher goroutine (via the allocator).


+	// Wait until the replacement has been allocated, then let the evict+push settle.
+	select {
+	case <-allocated:
+	case <-time.After(10 * time.Second):
+		t.Error("watcher did not allocate a replacement worker")
+	}
+	time.Sleep(500 * time.Millisecond)
+
+	// The container must hold exactly numWorkers — the dead worker evicted and the replacement
+	// pushed. Before the fix the dead worker lingered in the channel, leaving Len() == 3,
+	// which makes Destroy/Reset hang waiting for numWorkers == container.Len().
+	assert.Equal(t, 2, ww.container.Len(), "crashed worker was not evicted from the container")
+


…wn fix The test reliably exposed a pre-existing pool teardown hang (an idle pool that lost a worker hung Destroy for ~60s), fixed separately in roadrunner-server/pool#48. Re-add the (reliable, no-workflow) version once that fix is released and bumped here.

Copilot AI review requested due to automatic review settings June 4, 2026 20:04

Copilot started reviewing on behalf of rustatian June 4, 2026 20:04 View session

coderabbitai Bot reviewed Jun 4, 2026

View reviewed changes

Comment thread worker_watcher/worker_watcher_test.go Outdated

test: fix misspellings flagged by the misspell linter

52c7aad

Copilot AI reviewed Jun 4, 2026

View reviewed changes

rustatian closed this Jun 5, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: evict a crashed worker from the container so destroy/reset don't hang#48

fix: evict a crashed worker from the container so destroy/reset don't hang#48
rustatian wants to merge 2 commits into
masterfrom
fix/evict-crashed-worker-from-container

rustatian commented Jun 4, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 4, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 inconclusive)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

rustatian commented Jun 4, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 inconclusive)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

rustatian commented Jun 4, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 4, 2026 •

edited

Loading