Skip to content

Drain polled tasks on shutdown#2261

Open
yuandrew wants to merge 16 commits into
temporalio:mainfrom
yuandrew:shutdown-refactor
Open

Drain polled tasks on shutdown#2261
yuandrew wants to merge 16 commits into
temporalio:mainfrom
yuandrew:shutdown-refactor

Conversation

@yuandrew
Copy link
Copy Markdown
Contributor

@yuandrew yuandrew commented Mar 31, 2026

What was changed

Refactored worker shutdown to use a two-stage approach: pollers shut down first, then the task dispatcher drains any remaining tasks before exiting. This ensures tasks polled during shutdown are processed rather than silently dropped.

Key changes:

  • Added pollerWG to baseWorker to track poller goroutines separately from the global stopWG
  • A closer goroutine waits for all pollers to finish, then closes taskQueueCh
  • runTaskDispatcher now ranges over taskQueueCh
  • pollTask always sends to taskQueueCh
  • Removed the 5s timeout hack in doPoll from Fix test flakes #2253 (no longer needed)

note: Graceful shutdown drain may bypass worker dispatch rate limiting for already-polled tasks, while still respecting slot concurrency. I think this is okay because the tasks are already polled by the worker, thus "owned", and this speeds up draining of tasks, decreasing likelihood of hitting WorkerStopTimeout. I don't think users would have any issues with this.

Why?

PR #2199 changed shutdown to let the server complete in-flight polls, instead of cancelling them. This exposed a pre-existing race when a poller receives a task during shutdown, Go would silently dropping the task. The dispatcher had the same issue — it could exit on stopCh before reading pending tasks from the channel.

This aligns the Go SDK's shutdown with how Core SDK handles it:

  1. Set a flag so pollers stop polling after their current attempt
  2. Close channels from pollers → task processing
  3. Wait for all in-flight tasks to complete & channels to be empty

Checklist

  1. Closes Drain polled tasks on shutdown #1197

  2. How was this tested:

    • New unit test TestTaskNotDroppedDuringShutdown — verifies a task polled during shutdown is processed, not dropped
    • Existing TestDoPollGracefulShutdown — validates both graceful and legacy poll completion
  3. Any docs updates needed?
    No


Note

Medium Risk
Changes worker shutdown semantics to drain already-polled tasks (and avoid canceling in-flight polls) which can affect stop latency and task processing order; includes new logic around channel closing and dispatcher behavior that could introduce shutdown edge cases.

Overview
Implements a two-stage worker shutdown when the server supports worker_poll_complete_on_shutdown: pollers are allowed to finish their current long-poll, then the task dispatcher drains and processes any tasks already received instead of dropping them.

This refactors baseWorker to track poller lifecycles separately (pollerWG), close taskQueueCh only after all pollers exit, and update dispatch/poll paths to always enqueue polled tasks in drain mode. It also extends shutdown coordination to workflow, activity, and nexus task processors (allow processing after stop when draining), sends ShutdownWorker for session worker task queues, enables the cancel-worker-polls dynamic config in build harness, and adds unit/integration tests covering drain vs legacy shutdown behavior and stop-timeout bounding.

Reviewed by Cursor Bugbot for commit 3c03f1f. Bugbot is set up for automated code reviews on this repo. Configure here.

@yuandrew yuandrew changed the title Allow for task finishing on shutdown Drain polled tasks on shutdown Mar 31, 2026
@yuandrew yuandrew marked this pull request as ready for review March 31, 2026 23:36
@yuandrew yuandrew requested a review from a team as a code owner March 31, 2026 23:36
select {
case bw.taskQueueCh <- &polledTask{task: task, permit: slotPermit}:
didSendTask = true
case <-bw.stopCh:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checking the stop channel is still used elsewhere since we removed it in two spots

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep! still used in plenty of other places

@yuandrew
Copy link
Copy Markdown
Contributor Author

yuandrew commented Apr 1, 2026

recheck

@yuandrew yuandrew force-pushed the shutdown-refactor branch from b930672 to 29b355a Compare April 1, 2026 23:28
Comment thread internal/internal_worker_base.go Outdated
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Fix All in Cursor

Comment thread internal/internal_task_pollers.go Outdated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Drain polled tasks on shutdown

2 participants