
Commit b038837

deploy: b1d3494
1 parent b2ea07b commit b038837

274 files changed

Lines changed: 1571 additions & 1016 deletions

2.0/llms-full.txt

Lines changed: 205 additions & 1 deletion
@@ -9970,9 +9970,17 @@ admission. They are control-plane routes.
99709970
| `GET` | `/api/task-queues` | List task queues and admission status. |
99719971
| `GET` | `/api/task-queues/{taskQueue}` | Describe workflow/activity/query capacity for one queue. |
99729972
| `GET` | `/api/task-queues/{taskQueue}/build-ids` | List build ids observed for one queue. |
9973+
| `POST` | `/api/task-queues/{taskQueue}/build-ids/drain` | Mark a build-id cohort as draining so it stops claiming new tasks. |
9974+
| `POST` | `/api/task-queues/{taskQueue}/build-ids/resume` | Clear a previous drain so the cohort can claim new tasks again. |
99739975

99749976
Use task queue responses to distinguish no-worker conditions from saturated
99759977
worker slots, active lease caps, dispatch budgets, and query-task backpressure.
9978+
Drain and resume take a JSON body of `{"build_id": "..."}` (or
9979+
`{"build_id": null}` for the unversioned cohort), are idempotent, and persist
9980+
operator intent on the cohort so rollout state stays honest even after the
9981+
workers are removed. See
9982+
[Worker Build-Id Rollout](/docs/2.0/polyglot/worker-build-id-rollout) for the
9983+
full unversioned-to-versioned cutover, canary, drain, and rollback lifecycle.
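As a minimal sketch of the request shape described above, the following helper prepares (but does not send) a drain or resume call using only the Python standard library. The helper name `build_cohort_request` is hypothetical, and the header set (bearer token, `X-Namespace`, control-plane version `2`, JSON content type) is an assumption; check the required roles and protocol headers for these routes before relying on it.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional


def build_cohort_request(
    base_url: str,
    task_queue: str,
    action: str,
    build_id: Optional[str],
    token: str,
    namespace: str,
) -> urllib.request.Request:
    """Prepare a drain or resume request without sending it (hypothetical helper)."""
    if action not in ("drain", "resume"):
        raise ValueError("action must be 'drain' or 'resume'")
    queue = urllib.parse.quote(task_queue, safe="")
    url = f"{base_url}/api/task-queues/{queue}/build-ids/{action}"
    # build_id=None serializes to {"build_id": null}, the unversioned cohort.
    body = json.dumps({"build_id": build_id}).encode("utf-8")
    headers = {
        # Assumed header set; confirm against the server's auth requirements.
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}",
        "X-Namespace": namespace,
        "X-Durable-Workflow-Control-Plane-Version": "2",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


request = build_cohort_request(
    "https://durable-workflow.example",
    "orders-critical",
    "drain",
    None,  # target the unversioned cohort
    "example-token",
    "orders-prod",
)
```

Passing the prepared request to `urllib.request.urlopen` would send it. Because the routes are idempotent, retrying on a transport error should be safe.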
99769984

99779985
## Schedules And Search Attributes
99789986

@@ -10465,11 +10473,23 @@ Use `--paused` for deploy-time registration that should not start work yet.
1046510473
| `dw worker:deregister <worker-id>` | Deregister one worker. | `--json` |
1046610474
| `dw task-queue:list` | List active task queues and admission status. | global output options |
1046710475
| `dw task-queue:describe <queue>` | Describe worker capacity, leases, dispatch budgets, and pending query-task capacity. | `--json` |
10476+
| `dw task-queue:build-ids <queue>` | Inspect per-build-id cohort state and rollout status for one queue. | `--json` |
10477+
| `dw task-queue:drain <queue>` | Mark a build-id cohort as draining so it stops claiming new tasks. | `--build-id <value>`, `--unversioned`, `--json` |
10478+
| `dw task-queue:resume <queue>` | Clear a previous drain so the cohort can claim new tasks again. | `--build-id <value>`, `--unversioned`, `--json` |
1046810479

1046910480
The task queue commands are the preferred operator view for throttling,
1047010481
capacity, and no-worker diagnoses. See
1047110482
[Task Queue Admission](/docs/2.0/polyglot/task-queue-admission) for the
10472-
server-side policy behind those fields.
10483+
server-side policy behind those fields and
10484+
[Worker Build-Id Rollout](/docs/2.0/polyglot/worker-build-id-rollout) for the
10485+
full unversioned-to-versioned cutover, canary, drain, and rollback lifecycle.
10486+
10487+
`dw task-queue:drain` and `dw task-queue:resume` both require either
10488+
`--build-id <value>` to target a specific build cohort or `--unversioned` to
10489+
target the cohort of workers registered without a `build_id`. Combining the
10490+
two fails fast with an invalid-option error. Both commands are idempotent:
10491+
repeated drains do not shift the recorded `drained_at` timestamp, and
10492+
resuming an already-active cohort is a no-op.
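The flag rule above can be sketched in Python as a small validation step that maps the two options onto the JSON body the server expects. This is an illustration of the rule, not the CLI's actual implementation; the function name `cohort_request_body` and the error messages are hypothetical.

```python
from typing import Dict, Optional


def cohort_request_body(
    build_id: Optional[str] = None, unversioned: bool = False
) -> Dict[str, Optional[str]]:
    """Map --build-id / --unversioned onto a drain or resume body (sketch)."""
    if unversioned and build_id is not None:
        # Mirrors the fail-fast behavior when both options are combined.
        raise ValueError("use either --build-id or --unversioned, not both")
    if not unversioned and build_id is None:
        raise ValueError("one of --build-id or --unversioned is required")
    # The unversioned cohort is addressed with an explicit null build_id.
    return {"build_id": None if unversioned else build_id}
```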
1047310493

1047410494
## Worker Protocol Commands
1047510495

@@ -12765,6 +12785,190 @@ An admission payload has three sections:
1276512785
- [Python SDK](/docs/2.0/polyglot/python)
1276612786
- [Worker Protocol](/docs/2.0/polyglot/worker-protocol)
1276712787

12788+
<!-- Source: docs/polyglot/worker-build-id-rollout.md -->
12789+
12790+
# Worker Build-Id Rollout
12791+
12792+
Use this reference when you cut over from unversioned workers to build-tagged
12793+
workers, canary a new build onto a task queue, drain an older build before
12794+
decommissioning it, or roll a bad build back. The server records operator
12795+
intent alongside the live worker rows so the next poll, CLI describe, or
12796+
`list_task_queue_build_ids` call reflects the rollout state honestly even if
12797+
the old workers disappear before their backlog drains.
12798+
12799+
The Durable Workflow server expresses a rollout on one task queue as a set of
12800+
**build-id cohorts**. A cohort groups every worker registration that reported
12801+
the same `build_id` when it called `POST /api/worker/register`. Workers that
12802+
omit `build_id` form the **unversioned cohort**, which is the pre-rollout
12803+
default and the one you migrate away from on the first cutover.
12804+
12805+
## Rollout State The Server Records
12806+
12807+
Each `(namespace, task_queue, build_id)` cohort carries the aggregated worker
12808+
state (active, draining, stale, total counts) plus operator intent:
12809+
12810+
| Field | Purpose |
12811+
| --- | --- |
12812+
| `build_id` | The registered build identity. `null` identifies the unversioned cohort. |
12813+
| `rollout_status` | Aggregate view of what the cohort will do with new tasks: `active`, `active_with_draining`, `draining`, `stale_only`, or `no_workers`. |
12814+
| `drain_intent` | Operator intent for the cohort: `active` or `draining`. |
12815+
| `drained_at` | When the cohort was first marked draining. Absent while the cohort is active. Repeated drain calls do not shift this timestamp. |
12816+
| `active_worker_count` | Live workers currently accepting new tasks. |
12817+
| `draining_worker_count` | Live workers that still hold in-flight tasks but no longer claim new work. |
12818+
| `stale_worker_count` | Workers whose last heartbeat is older than the stale cutoff. |
12819+
| `total_worker_count` | Sum of `active_worker_count`, `draining_worker_count`, and `stale_worker_count`. |
12820+
| `runtimes`, `sdk_versions` | Distinct runtime and SDK version strings observed across the cohort. |
12821+
| `last_heartbeat_at`, `first_seen_at` | Cohort-wide heartbeat window, useful for confirming quiet cohorts before deleting them. |
12822+
12823+
`drain_intent` is persistent: re-registering workers, stopping every worker, or
12824+
letting the cohort go stale does not silently flip it back to `active`. Only
12825+
an explicit `POST .../build-ids/resume` clears `drain_intent` and `drained_at`.
12826+
This keeps `rollout_status` honest even after a cohort has no live workers.
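The persistence and idempotency rules can be modeled with a small in-memory sketch. This models the documented behavior only and is not the server's storage code; the class name `CohortIntent` is hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class CohortIntent:
    """Toy model of the operator intent recorded for one cohort."""

    drain_intent: str = "active"
    drained_at: Optional[datetime] = None

    def drain(self, now: datetime) -> None:
        # A repeated drain keeps the original drained_at timestamp.
        if self.drain_intent == "draining":
            return
        self.drain_intent = "draining"
        self.drained_at = now

    def resume(self) -> None:
        # Only an explicit resume clears the intent and the timestamp;
        # resuming an already-active cohort changes nothing.
        self.drain_intent = "active"
        self.drained_at = None


cohort = CohortIntent()
first = datetime(2026, 4, 22, 12, 0, tzinfo=timezone.utc)
cohort.drain(first)
cohort.drain(datetime(2026, 4, 22, 13, 0, tzinfo=timezone.utc))
```

After the second `drain` call, `cohort.drained_at` still equals `first`. Worker stops and stale heartbeats have no method in this model because they do not touch the recorded intent.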
12827+
12828+
## Inspect The Rollout
12829+
12830+
Before draining or deleting a build, confirm which cohorts are still
12831+
reachable on the queue:
12832+
12833+
```bash
12834+
curl -sS "$DURABLE_WORKFLOW_SERVER_URL/api/task-queues/orders-critical/build-ids" \
12834+
  -H "Authorization: Bearer $DW_OPERATOR_TOKEN" \
12835+
  -H "X-Namespace: orders-prod" \
12836+
  -H "X-Durable-Workflow-Control-Plane-Version: 2"
12838+
```
12839+
12840+
The same snapshot is available from the operator CLI and the Python SDK:
12841+
12842+
```bash
12843+
dw task-queue:build-ids orders-critical --json
12844+
```
12845+
12846+
```python
12847+
from durable_workflow import Client
12848+
12849+
async with Client("https://durable-workflow.example", token=operator_token) as client:
12850+
    rollout = await client.list_task_queue_build_ids("orders-critical")
12850+
    for cohort in rollout.build_ids:
12851+
        print(cohort.build_id, cohort.rollout_status, cohort.total_worker_count)
12853+
```
12854+
12855+
## First Cutover: Unversioned To Versioned
12856+
12857+
A queue that has always been served by unversioned workers reports a single
12858+
`build_id: null` cohort with `rollout_status: "active"`. The first cutover
12859+
introduces a new build-tagged cohort alongside it.
12860+
12861+
1. Deploy the new worker fleet with a stable `build_id` (for example,
12862+
`orders-worker-2026-04-22`) registered through
12863+
`POST /api/worker/register`.
12864+
2. Confirm both cohorts are active:
12865+
```bash
12866+
dw task-queue:build-ids orders-critical --json
12867+
```
12868+
You should see `null` and the new `build_id` each reporting
12869+
`rollout_status: "active"` and non-zero `active_worker_count`.
12870+
3. Start the drain on the unversioned cohort once the new workers are
12871+
handling work:
12872+
```bash
12873+
dw task-queue:drain orders-critical --unversioned
12874+
```
12875+
`drain_intent` flips to `draining` for the unversioned cohort. Workers
12876+
that are still running process their in-flight tasks but stop claiming
12877+
new ones. Future worker registrations or heartbeats that arrive without
12878+
a `build_id` land as draining too.
12879+
4. Wait until `active_worker_count` and `draining_worker_count` are both
12880+
zero for the unversioned cohort. The cohort stays listed with
12881+
`drain_intent: "draining"` so you can confirm the cutover is permanent.
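The check in step 2 can be automated before issuing the drain in step 3. The sketch below assumes each cohort from the build-ids snapshot has been converted to a plain dictionary keyed by the field names in the table above; the function name `both_cohorts_active` and that dictionary shape are assumptions.

```python
from typing import Any, Iterable, Mapping, Optional


def both_cohorts_active(
    cohorts: Iterable[Mapping[str, Any]], new_build_id: str
) -> bool:
    """Return True when the unversioned and new cohorts both serve work."""
    by_id: dict[Optional[str], Mapping[str, Any]] = {
        cohort["build_id"]: cohort for cohort in cohorts
    }
    unversioned = by_id.get(None)  # build_id null is the unversioned cohort
    new = by_id.get(new_build_id)
    if unversioned is None or new is None:
        return False
    return all(
        cohort["rollout_status"] == "active" and cohort["active_worker_count"] > 0
        for cohort in (unversioned, new)
    )


snapshot = [
    {"build_id": None, "rollout_status": "active", "active_worker_count": 4},
    {
        "build_id": "orders-worker-2026-04-22",
        "rollout_status": "active",
        "active_worker_count": 2,
    },
]
ready = both_cohorts_active(snapshot, "orders-worker-2026-04-22")
```

A deploy script could refuse to drain the unversioned cohort until this returns `True`, which guards against draining the only cohort that is still claiming work.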
12882+
12883+
## Canary A New Build
12884+
12885+
A canary is a second build that takes a small fraction of traffic while the
12886+
primary build keeps serving. Use a separate `build_id` for the canary so each
12887+
cohort's state is individually inspectable.
12888+
12889+
1. Deploy the canary workers with `build_id: orders-worker-2026-04-22-canary`.
12890+
2. Inspect `list_task_queue_build_ids` to confirm both cohorts report
12891+
`rollout_status: "active"` with the expected worker counts.
12892+
3. Promote by starting more workers on the new `build_id` and reducing the
12893+
primary's worker count, or demote the canary by draining it:
12894+
```bash
12895+
dw task-queue:drain orders-critical --build-id orders-worker-2026-04-22-canary
12896+
```
12897+
12898+
The server does not control the task split across cohorts. Operators size
12899+
the cohort populations and rely on polling distribution to weight traffic.
12900+
Build-id rollout state exists so operators can confirm which cohorts can still
12901+
claim work and trigger a clean handoff when one cohort is ready to stop.
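Because the split follows cohort size, a rough estimate of each cohort's share is its fraction of the active workers on the queue. The sketch below assumes every worker polls at the same rate and has the same capacity, which is a simplification; the function name `expected_task_share` is hypothetical and the result is an estimate, not a guarantee from the server.

```python
from typing import Dict


def expected_task_share(active_workers: Dict[str, int]) -> Dict[str, float]:
    """Estimate each cohort's share of new tasks from active worker counts."""
    total = sum(active_workers.values())
    if total == 0:
        # No active workers anywhere: nothing is claiming new tasks.
        return {build_id: 0.0 for build_id in active_workers}
    return {build_id: count / total for build_id, count in active_workers.items()}


share = expected_task_share(
    {
        "orders-worker-2026-04-22": 9,
        "orders-worker-2026-04-22-canary": 1,
    }
)
```

With nine primary workers and one canary worker, the estimate puts about a tenth of new tasks on the canary under the equal-polling assumption.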
12902+
12903+
## Drain An Older Build
12904+
12905+
Draining keeps already-leased tasks on the older build while new tasks go to
12906+
other active cohorts on the queue:
12907+
12908+
```bash
12909+
dw task-queue:drain orders-critical --build-id orders-worker-2026-04-21-z9
12910+
```
12911+
12912+
The server stamps `drain_intent: "draining"` on the cohort and marks every
12913+
worker registered under that `build_id` as draining on its next heartbeat.
12914+
The call is idempotent: rerunning it does not reset `drained_at`, so you can
12915+
safely retry it from automation.
12916+
12917+
Monitor the drain by polling `list_task_queue_build_ids` and watching
12918+
`active_worker_count` and `draining_worker_count` fall to zero. At that point
12919+
the cohort shows `rollout_status: "draining"` with zero worker counts, meaning
12920+
no live workers remain and operator intent still records `draining`. That is
12921+
the safe moment to stop the worker processes and delete the build artifact.
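The monitoring step can be sketched as a bounded polling loop. The `fetch_cohorts` callable is a stand-in for whatever returns the current build-ids snapshot as a list of dictionaries (for example, a thin wrapper over the SDK or the HTTP endpoint); that dictionary shape, the function names, and the default interval are assumptions.

```python
import time
from typing import Any, Callable, Mapping, Optional, Sequence


def cohort_is_quiet(cohort: Mapping[str, Any]) -> bool:
    """True when no worker in the cohort is active or still draining."""
    return cohort["active_worker_count"] == 0 and cohort["draining_worker_count"] == 0


def wait_until_drained(
    fetch_cohorts: Callable[[], Sequence[Mapping[str, Any]]],
    build_id: Optional[str],
    interval_seconds: float = 5.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the cohort is draining with zero live workers, or give up."""
    for attempt in range(max_polls):
        for cohort in fetch_cohorts():
            if cohort["build_id"] == build_id:
                if cohort["drain_intent"] == "draining" and cohort_is_quiet(cohort):
                    return True
                break
        if attempt + 1 < max_polls:
            sleep(interval_seconds)
    return False
```

Returning `False` after `max_polls` attempts lets automation alert an operator about a stuck drain instead of waiting forever.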
12922+
12923+
## Roll Back A Bad Build
12924+
12925+
Rollback is the reverse flow: resume a previously drained cohort, route new
12926+
traffic back to it, and drain the bad build.
12927+
12928+
1. Resume the known-good cohort:
12929+
```bash
12930+
dw task-queue:resume orders-critical --build-id orders-worker-2026-04-21-z9
12931+
```
12932+
The server clears `drain_intent`, wipes `drained_at`, and flips any
12933+
worker rows that are still heartbeating under that `build_id` back to
12934+
`active` immediately so the read endpoint stops reporting draining state.
12935+
2. Drain the bad cohort:
12936+
```bash
12937+
dw task-queue:drain orders-critical --build-id orders-worker-2026-04-22
12938+
```
12939+
3. Scale the known-good build back up or redeploy it if workers have
12940+
already been stopped. Workers registering under its `build_id` pick up
12941+
the cleared drain intent and land as `active`.
12942+
12943+
Resume is also idempotent. Rerunning it against an already-active cohort
12944+
is a no-op, so automated rollback flows can issue it safely.
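An automated rollback can issue the first two steps in order. The sketch below names the SDK methods `resume_task_queue_build_id` and `drain_task_queue_build_id`, but their exact signatures are not documented here: the positional `(task_queue, build_id)` arguments and the coroutine style are assumptions, captured in a hypothetical `RolloutClient` protocol so any object with matching methods can be passed in.

```python
import asyncio
from typing import Optional, Protocol


class RolloutClient(Protocol):
    """Assumed shape of the two SDK calls used by the rollback sketch."""

    async def resume_task_queue_build_id(
        self, task_queue: str, build_id: Optional[str]
    ) -> object: ...

    async def drain_task_queue_build_id(
        self, task_queue: str, build_id: Optional[str]
    ) -> object: ...


async def roll_back(
    client: RolloutClient,
    task_queue: str,
    good_build_id: Optional[str],
    bad_build_id: Optional[str],
) -> None:
    """Resume the known-good cohort first, then drain the bad one."""
    # Resuming first means an active cohort exists before the bad build
    # stops claiming new tasks.
    await client.resume_task_queue_build_id(task_queue, good_build_id)
    await client.drain_task_queue_build_id(task_queue, bad_build_id)
```

Because both calls are idempotent, rerunning `roll_back` after a partial failure should leave the cohorts in the same state as a single successful run.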
12945+
12946+
## Endpoints And Commands Reference
12947+
12948+
| Intent | HTTP endpoint | CLI | Python SDK method |
12949+
| --- | --- | --- | --- |
12950+
| Inspect cohort state | `GET /api/task-queues/{taskQueue}/build-ids` | `dw task-queue:build-ids` | `Client.list_task_queue_build_ids` |
12951+
| Mark a cohort as draining | `POST /api/task-queues/{taskQueue}/build-ids/drain` | `dw task-queue:drain` | `Client.drain_task_queue_build_id` |
12952+
| Resume a previously drained cohort | `POST /api/task-queues/{taskQueue}/build-ids/resume` | `dw task-queue:resume` | `Client.resume_task_queue_build_id` |
12953+
12954+
Drain and resume both take a JSON body of `{"build_id": "..."}`, or
12955+
`{"build_id": null}` for the unversioned cohort. The CLI expresses the
12956+
unversioned cohort with `--unversioned` and any other build with
12957+
`--build-id <value>`; combining the two fails fast.
12958+
12959+
## Related References
12960+
12961+
- [Namespace, Auth, And Worker Registration](/docs/2.0/polyglot/namespace-auth-workers)
12962+
for the `POST /api/worker/register` call that stamps `build_id` on every
12963+
worker.
12964+
- [Task Queue Admission](/docs/2.0/polyglot/task-queue-admission) for the
12965+
worker-slot and dispatch budgets that apply alongside rollout state.
12966+
- [Server API Reference](/docs/2.0/polyglot/server-api-reference) for the
12967+
full list of control-plane routes and their required roles and protocol
12968+
headers.
12969+
- [CLI Command Reference](/docs/2.0/polyglot/cli-reference) for the argument
12970+
and flag shape of every `dw task-queue:*` subcommand.
12971+
1276812972
<!-- Source: docs/compatibility.md -->
1276912973

1277012974
# Version Compatibility
