| 1 | +# Workflow V2 Worker Compatibility and Routing Contract |
| 2 | + |
| 3 | +This document freezes the v2 contract for worker build identity, |
| 4 | +compatibility markers, and how in-flight workflow and activity work is |
| 5 | +routed to compatible executors. It is the reference cited by the v2 |
| 6 | +docs, CLI reasoning, Waterline diagnostics, server deployment guidance, |
| 7 | +and test coverage so the whole fleet speaks one language about mixed |
| 8 | +builds, rollout, rollback, and the absence of a compatible worker. |
| 9 | + |
| 10 | +The guarantees below apply to the `durable-workflow/workflow` package at |
| 11 | +v2 and to every host that embeds it or talks to it over the worker |
| 12 | +protocol. A change to any named guarantee is a protocol-level change |
| 13 | +and must be reviewed as such, even if the class that implements it is |
| 14 | +`@internal`. |
| 15 | + |
| 16 | +This contract builds on the semantics frozen in |
| 17 | +`docs/architecture/execution-guarantees.md`. Duplicate execution, |
| 18 | +retries, and redelivery keep the language they have there; this document |
| 19 | +adds the language for which builds are allowed to execute which work. |
| 20 | + |
| 21 | +## Scope |
| 22 | + |
| 23 | +The contract covers: |
| 24 | + |
| 25 | +- **worker build identity** — what each workflow-task worker and |
| 26 | + activity-task worker process presents to the engine so that operators |
| 27 | + and routing logic can reason about the running fleet. |
| 28 | +- **compatibility markers** — the named string that a run is pinned to |
| 29 | + and that a worker advertises as supported. One marker is one |
| 30 | + compatibility family. |
| 31 | +- **task and run compatibility** — how compatibility is recorded on |
| 32 | + workflow runs, workflow tasks, and inherited through retries, |
| 33 | + continue-as-new, and child-workflow starts. |
| 34 | +- **routing of in-flight work** — how polling, claim, and dispatch |
| 35 | + interact with compatibility so that no task is silently executed by an |
| 36 | + incompatible worker. |
| 37 | +- **operator-visible compatibility state** — the fleet and queue |
| 38 | + surfaces that report which markers are live and where. |
| 39 | + |
| 40 | +It does not cover: |
| 41 | + |
| 42 | +- the dedicated task-matching service described by the Phase 3 roadmap |
| 43 | + (#581). The Phase 3 surface will replace broad database polling with |
| 44 | + explicit match/dispatch, but must preserve the compatibility |
| 45 | + guarantees below. |
| 46 | +- the control-plane/data-plane role split described by Phase 4 (#582). |
| 47 | + The split will move compatibility heartbeating onto the control plane |
| 48 | + but must preserve the observable state named here. |
| 49 | +- the scheduler independence work described by Phase 5 (#583). |
| 50 | +- host-level deployment orchestration such as container image selection |
| 51 | + or rolling-restart choreography. Those are deployment concerns that |
| 52 | + consume this contract; they do not define it. |
| 53 | + |
| 54 | +## Terminology |
| 55 | + |
| 56 | +- **Worker build identity** — the tuple `(worker_id, host, process_id, |
| 57 | + namespace, connection, queue, supported[])` recorded by a live worker |
| 58 | + heartbeat. `worker_id` is the stable identifier for one worker |
| 59 | + process; `supported[]` is the set of compatibility markers the worker |
| 60 | + will accept work for. |
| 61 | +- **Compatibility marker** — an opaque, operator-chosen string such as |
| 62 | + `build-2026-04-17` or `api-v3`. The engine does not interpret the |
| 63 | + string beyond equality. The special marker `*` means "accept any |
| 64 | + marker" and is reserved for single-build fleets and test harnesses. |
| 65 | +- **Compatibility family** — the set of builds that share one |
| 66 | + compatibility marker. Two workers that advertise the same marker are |
| 67 | + interchangeable for routing purposes; the engine guarantees nothing |
| 68 | + else about their code parity. |
| 69 | +- **Required marker** — the marker a given workflow task or activity |
| 70 | + task requires. Required markers are resolved from |
| 71 | + `workflow_tasks.compatibility` first, then from the parent run's |
| 72 | + `workflow_runs.compatibility`; a `null` result means "no marker required". |
| 73 | +- **Pinned run** — a workflow run whose `workflow_runs.compatibility` |
| 74 | + column is set to a non-null marker. A pinned run is routed to workers |
| 75 | + that advertise that marker until the run terminates or is explicitly |
| 76 | + continued-as-new onto a different marker. |
| 77 | +- **Fingerprint pinning** — the `workflow_definition_fingerprint` |
| 78 | + recorded on `WorkflowStarted` that pins one run to the class |
| 79 | + definition it started under, independent of the compatibility marker. |
| 80 | + See |
| 81 | + `Workflow\V2\Support\WorkflowDefinitionFingerprint::resolveClassForRun()`. |
| 82 | + |
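The resolution order for a required marker can be modeled in a few lines. This is a minimal Python sketch for illustration only, not the package's PHP implementation. The function name `resolve_required_marker` is hypothetical, and treating only `None` (not an empty string) as "unset" is an assumption.

```python
from typing import Optional


def resolve_required_marker(task_marker: Optional[str],
                            run_marker: Optional[str]) -> Optional[str]:
    """Return the marker a task requires, or None when nothing is required.

    task_marker models workflow_tasks.compatibility and run_marker models
    workflow_runs.compatibility. The task value wins when it is set.
    Assumption: only None counts as "unset".
    """
    if task_marker is not None:
        return task_marker
    return run_marker
```

A task row with no marker of its own falls back to its run, and a task on an unpinned run resolves to `None`, which means any worker may take it.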
| 83 | +## Worker build identity |
| 84 | + |
| 85 | +Every live worker maintains a heartbeat row under the |
| 86 | +`workflow_worker_compatibility_heartbeats` table (or the legacy fallback |
| 87 | +cache when the table is unavailable). The row is owned by one |
| 88 | +`worker_id` and carries: |
| 89 | + |
| 90 | +- **`worker_id`** — `hostname:pid:ulid`, generated on first heartbeat |
| 91 | + and stable for the life of the worker process. The ULID segment keeps |
| 92 | + the id unique across hostname/pid collisions. |
| 93 | +- **`host`** — the process's hostname as reported by `gethostname()`. |
| 94 | + May be `null` when the host cannot be determined. |
| 95 | +- **`process_id`** — the operating-system pid. May be `null` in |
| 96 | + environments where a pid is not meaningful. |
| 97 | +- **`namespace`** — the value of |
| 98 | + `workflows.v2.compatibility.namespace` (env |
| 99 | + `DW_V2_COMPATIBILITY_NAMESPACE`). Used to scope one workflow database |
| 100 | + across multiple cooperating apps. |
| 101 | +- **`connection`**, **`queue`** — the queue-connection and queue name |
| 102 | + the worker is draining. Either may be `null` when the worker is |
| 103 | + connection- or queue-agnostic. |
| 104 | +- **`supported`** — the JSON list of compatibility markers the worker |
| 105 | + will accept. Either the literal `*` (accept any) or a non-empty set |
| 106 | + of markers. |
| 107 | +- **`recorded_at`**, **`expires_at`** — the heartbeat timestamp and |
| 108 | + expiry computed from `workflows.v2.compatibility.heartbeat_ttl_seconds` |
| 109 | + (default 30 seconds, configured by |
| 110 | + `DW_V2_COMPATIBILITY_HEARTBEAT_TTL`). |
| 111 | + |
| 112 | +Worker identity is a runtime fact, not a configuration contract. The |
| 113 | +only configured inputs are the compatibility markers and namespace; the |
| 114 | +rest of the identity is discovered from the process. |
| 115 | + |
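The relationship between `recorded_at`, `expires_at`, and the heartbeat TTL can be sketched as follows. This is an illustrative Python model, not the package's code. The `Heartbeat` class, `record_heartbeat`, and `is_live` are hypothetical names, the `worker_id` in the usage note is a made-up value in the `hostname:pid:ulid` shape, and treating a heartbeat as live strictly before `expires_at` is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Tuple

DEFAULT_HEARTBEAT_TTL_SECONDS = 30  # default stated in this contract


@dataclass(frozen=True)
class Heartbeat:
    worker_id: str               # hostname:pid:ulid
    supported: Tuple[str, ...]   # advertised markers, or ("*",)
    recorded_at: datetime
    expires_at: datetime


def record_heartbeat(worker_id: str,
                     supported: Iterable[str],
                     now: datetime,
                     ttl_seconds: int = DEFAULT_HEARTBEAT_TTL_SECONDS) -> Heartbeat:
    """Build a heartbeat whose expiry is recorded_at plus the TTL."""
    return Heartbeat(worker_id, tuple(supported), now,
                     now + timedelta(seconds=ttl_seconds))


def is_live(heartbeat: Heartbeat, now: datetime) -> bool:
    """Assumption: a heartbeat is live strictly before expires_at."""
    return now < heartbeat.expires_at
```

For example, a heartbeat recorded for `host-a:4242:01ARZ3NDEKTSV4RRFFQ69G5FAV` with the default TTL is live 29 seconds later and expired at 30 seconds, unless the worker has refreshed it in the meantime.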
| 116 | +## Compatibility markers |
| 117 | + |
| 118 | +A worker's compatibility configuration is two keys: |
| 119 | + |
| 120 | +- **`workflows.v2.compatibility.current`** |
| 121 | + (`DW_V2_CURRENT_COMPATIBILITY`) — the marker this process advertises |
| 122 | + as its own build. When a workflow run is started from this process, |
| 123 | + its `workflow_runs.compatibility` is stamped with this value. |
| 124 | +- **`workflows.v2.compatibility.supported`** |
| 125 | + (`DW_V2_SUPPORTED_COMPATIBILITIES`) — the comma-separated list of |
| 126 | + markers this worker will accept when claiming tasks. `*` means |
| 127 | + "accept any marker". Empty/`null` defaults to the current marker. |
| 128 | + |
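How the two keys combine into a worker's advertised list can be sketched as below. This is an illustrative Python model, not the package's parser. `resolve_supported` is a hypothetical name, and three details are assumptions: surrounding whitespace is trimmed, a list containing `*` collapses to the wildcard alone, and a worker with neither key set advertises an empty list.

```python
from typing import List, Optional


def resolve_supported(current: Optional[str],
                      raw_supported: Optional[str]) -> List[str]:
    """Model the effective supported list from the two config values.

    current       models workflows.v2.compatibility.current
    raw_supported models the comma-separated
                  workflows.v2.compatibility.supported value
    """
    # Split on commas and drop empty entries (assumption: whitespace is trimmed).
    markers = [m.strip() for m in (raw_supported or "").split(",") if m.strip()]
    if "*" in markers:
        return ["*"]  # assumption: the wildcard subsumes any other entries
    if markers:
        return markers
    # Empty or null supported defaults to the current marker.
    return [current] if current is not None else []
```

With `current` set to `build-b`, a supported value of `build-a, build-b` yields both markers, while an empty or missing value yields `build-b` alone.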
| 129 | +Guarantees: |
| 130 | + |
| 131 | +- The marker is opaque. The engine performs only exact-string equality |
| 132 | + and the `*` wildcard. It does not order markers, does not interpret |
| 133 | + semver, and does not diff their contents. |
| 134 | +- A run stamped with marker `M` is routable only to workers whose |
| 135 | + `supported` list includes `M` or `*`. The engine refuses to dispatch |
| 136 | + or claim it on any other worker and reports the mismatch as an |
| 137 | + explicit operational state rather than running it silently. |
| 138 | +- A run stamped with `null` (no required marker) is routable to any |
| 139 | + worker. Pinning is opt-in — single-build fleets do not need to set |
| 140 | + any compatibility config. |
| 141 | +- The marker is recorded exactly once per run, at start, from |
| 142 | + `WorkerCompatibility::current()`. Subsequent workflow tasks, activity |
| 143 | + tasks, child runs, retry runs, and continue-as-new runs inherit the |
| 144 | + recorded value. Changing `DW_V2_CURRENT_COMPATIBILITY` on the starter |
| 145 | + process only affects newly-started runs; in-flight runs stay on the |
| 146 | + marker they were stamped with. |
| 147 | +- The wildcard marker `*` is an advertisement surface for workers only. |
| 148 | + Runs are never stamped with `*`; that would defeat the purpose. |
| 149 | + |
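The matching rule in the guarantees above reduces to one predicate. This is a minimal Python sketch for illustration; `worker_supports` is a hypothetical name and not the signature of `TaskCompatibility::supported()`.

```python
from typing import Iterable, Optional


def worker_supports(required: Optional[str], supported: Iterable[str]) -> bool:
    """Return True when a worker may run work that requires `required`.

    - None means no marker is required, so any worker qualifies.
    - "*" in the worker's list accepts any marker.
    - Otherwise only exact string equality counts: no ordering, no semver.
    """
    if required is None:
        return True
    advertised = set(supported)
    return "*" in advertised or required in advertised
```

Because markers are opaque, `build-2` and `build-10` are simply two different strings. Neither is treated as newer, and a worker that advertises one does not accept the other.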
| 150 | +## Compatibility inheritance |
| 151 | + |
| 152 | +Compatibility flows through the run lifecycle as follows: |
| 153 | + |
| 154 | +- **Start** — a new run is stamped with |
| 155 | + `WorkerCompatibility::current()` on the starter process and the |
| 156 | + value is written to `workflow_runs.compatibility` in the same |
| 157 | + transaction as `WorkflowStarted`. See `DefaultWorkflowControlPlane` |
| 158 | + for the dispatch site. |
| 159 | +- **Workflow tasks** — each `workflow_tasks` row carries a |
| 160 | + `compatibility` column. Existing tasks are synced to the owning run's |
| 161 | + compatibility on claim via `TaskCompatibility::sync()` so repair and |
| 162 | + re-enqueue keep the same marker the run was started under. |
| 163 | +- **Activity tasks** — activity tasks inherit their run's compatibility |
| 164 | + through the same mechanism. An activity task that cannot yet be |
| 165 | + matched to a compatible worker stays in the task table with its |
| 166 | + marker until one appears; it is never silently redirected to an |
| 167 | + incompatible worker. |
| 168 | +- **Retry runs** — when a failed run is retried, the retry run's |
| 169 | + `compatibility` is inherited from the source run. The retry |
| 170 | + continues on the same marker family unless an operator explicitly |
| 171 | + creates a new run on a different marker. |
| 172 | +- **Continue-as-new** — the continued run inherits the previous run's |
| 173 | + `compatibility` column. Continue-as-new is the explicit surface for |
| 174 | + moving long-running work onto a new marker; to do that, start a |
| 175 | + fresh workflow from a process that advertises the new marker, rather |
| 176 | + than relying on continue-as-new to translate between markers. |
| 177 | +- **Child workflows** — child runs inherit the parent run's |
| 178 | + `compatibility` column. A child started by a parent on marker `M` |
| 179 | + runs on marker `M` so a mixed-version deployment does not split a |
| 180 | + parent/child pair across incompatible workers. |
| 181 | +- **Fingerprint pinning** runs in parallel with compatibility pinning. |
| 182 | + Fingerprint pinning guarantees that a run executes against the same |
| 183 | + class *definition* snapshot it started with; compatibility pinning |
| 184 | + guarantees that the run runs on a compatible *worker build*. Both |
| 185 | + guarantees survive redeploy independently. |
| 186 | + |
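The inheritance rules above can be modeled as "stamp once at start, copy everywhere else". This is an illustrative Python sketch; the `Run` class, `start_run`, and `derive_run` are hypothetical names, not the engine's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Run:
    run_id: str
    compatibility: Optional[str]  # models workflow_runs.compatibility


def start_run(run_id: str, starter_current_marker: Optional[str]) -> Run:
    """A fresh run is stamped with the starter process's current marker."""
    return Run(run_id, starter_current_marker)


def derive_run(source: Run, run_id: str) -> Run:
    """Model a child, retry, or continue-as-new run: the marker is copied
    from the source run, whatever the starter's current marker is now."""
    return Run(run_id, source.compatibility)
```

If a parent was started under `build-a` and the starter is later reconfigured to `build-b`, `derive_run(parent, "run-2")` still carries `build-a`. Only `start_run` with the new marker produces a run on `build-b`.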
| 187 | +## Routing and claim enforcement |
| 188 | + |
| 189 | +Routing happens at two surfaces. Both enforce the same marker contract. |
| 190 | + |
| 191 | +### Poll-time filtering |
| 192 | + |
| 193 | +Workers that long-poll the task surfaces pass the |
| 194 | +`?compatibility=marker` query parameter to |
| 195 | +`GET /workflow-tasks/poll` and `GET /activity-tasks/poll`. The server |
| 196 | +filters the eligible task set to rows whose `compatibility` column |
| 197 | +matches the requested marker. A worker advertising `*` does not send |
| 198 | +the filter and sees the full eligible set. |
| 199 | + |
| 200 | +Poll-time filtering is a performance optimisation. It is not the |
| 201 | +correctness boundary — a task that leaks through the filter is still |
| 202 | +rejected at claim time by the enforcement below. |
| 203 | + |
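A worker-side view of poll-time filtering might look like the sketch below. This is an illustrative Python model; `poll_url` and the base URL are hypothetical. It takes a single marker because the contract describes one `compatibility` parameter. How a worker that supports several markers polls is not specified here and is not modeled.

```python
from urllib.parse import urlencode


def poll_url(base_url: str, task_kind: str, marker: str) -> str:
    """Build the long-poll URL for one marker.

    task_kind is "workflow" or "activity". A worker advertising "*"
    sends no filter and therefore sees the full eligible set.
    """
    if task_kind not in ("workflow", "activity"):
        raise ValueError(f"unknown task kind: {task_kind!r}")
    path = f"{base_url.rstrip('/')}/{task_kind}-tasks/poll"
    if marker == "*":
        return path
    return f"{path}?{urlencode({'compatibility': marker})}"
```

The filter only narrows what the worker sees. A task that reaches a worker anyway is still subject to claim-time enforcement.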
| 204 | +### Claim-time enforcement |
| 205 | + |
| 206 | +At claim time, both bridges call `TaskCompatibility::supported()` / |
| 207 | +`TaskCompatibility::sync()`: |
| 208 | + |
| 209 | +- `Workflow\V2\Support\DefaultWorkflowTaskBridge::claim()` rejects a |
| 210 | + workflow task with the reason code `compatibility_blocked` when the |
| 211 | + claiming worker's `supported` list does not include the task's |
| 212 | + required marker. |
| 213 | +- `Workflow\V2\Support\ActivityTaskClaimer::claimDetailed()` rejects |
| 214 | + an activity task with the reason code `compatibility_unsupported` |
| 215 | + and returns the human-readable mismatch string on the claim |
| 216 | + response. |
| 217 | + |
| 218 | +A rejected claim leaves the task on the queue with its original |
| 219 | +compatibility marker. The worker that attempted the claim does not |
| 220 | +retry; another worker whose `supported` list covers the marker may |
| 221 | +claim it. When no live worker advertises a compatible marker, the task |
| 222 | +remains eligible and the condition is observable through the fleet |
| 223 | +visibility surfaces below. |
| 224 | + |
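The claim-time decision and its two reason codes can be sketched as follows. This is an illustrative Python model, not the bridges' implementation. `ClaimResult` and `try_claim` are hypothetical names; only the reason-code strings come from this contract.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Reason codes named by this contract, keyed by task kind.
REJECTION_REASON = {
    "workflow": "compatibility_blocked",
    "activity": "compatibility_unsupported",
}


@dataclass(frozen=True)
class ClaimResult:
    claimed: bool
    reason: Optional[str] = None


def try_claim(task_kind: str,
              required: Optional[str],
              supported: Iterable[str]) -> ClaimResult:
    """Accept the claim only when the worker covers the required marker."""
    advertised = set(supported)
    if required is None or "*" in advertised or required in advertised:
        return ClaimResult(claimed=True)
    # The task is not modified: it stays queued with its original marker.
    return ClaimResult(claimed=False, reason=REJECTION_REASON[task_kind])
```

A rejection is a routing outcome rather than a task failure. The caller moves on, and a later claim by a worker that covers the marker succeeds.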
| 225 | +### Dispatch-time routing |
| 226 | + |
| 227 | +`Workflow\V2\Support\TaskDispatcher` routes tasks to the Laravel queue |
| 228 | +via `connection`/`queue` fields on the task row. Compatibility is not |
| 229 | +encoded into the queue name; instead, every worker on that queue |
| 230 | +applies claim-time enforcement and parks tasks it cannot run. Operators |
| 231 | +who want stronger isolation between compatibility families should use |
| 232 | +separate queues per family; the contract above keeps that policy |
| 233 | +choice out of the engine. |
| 234 | + |
| 235 | +## Operator-visible state |
| 236 | + |
| 237 | +The fleet and queue surfaces must make mixed-version state explicit to |
| 238 | +operators and automation: |
| 239 | + |
| 240 | +- `Workflow\V2\Support\WorkerCompatibilityFleet::summaryForNamespace()` |
| 241 | + returns `active_workers`, `active_worker_scopes`, the live queue |
| 242 | + list, the live `build_ids` list, and the per-worker roll-up. `build_ids` |
| 243 | + is the union of advertised markers across the namespace. |
| 244 | +- `Workflow\V2\Support\WorkerCompatibilityFleet::detailsForNamespace()` |
| 245 | + returns one row per `(worker_id, connection, queue)` scope with a |
| 246 | + `supports_required` flag when a required marker is passed. Automation |
| 247 | + should use this call to detect the absence of a compatible worker |
| 248 | + for a pinned run. |
| 249 | +- `WorkerCompatibility::mismatchReason()` and |
| 250 | + `WorkerCompatibilityFleet::mismatchReason()` return the canonical |
| 251 | + human-readable mismatch string. CLI, Waterline, and cloud |
| 252 | + diagnostics must surface this string verbatim rather than inventing |
| 253 | + their own language. |
| 254 | + |
| 255 | +Guarantees: |
| 256 | + |
| 257 | +- The absence of a compatible worker is an explicit operational |
| 258 | + state, not an error. It reports as `supports_required=false` on the |
| 259 | + fleet surface and as `compatibility_blocked` / |
| 260 | + `compatibility_unsupported` on the claim path. Product docs, CLI, |
| 261 | + and Waterline should describe it as "no compatible worker is |
| 262 | + registered yet" rather than as "the task failed". |
| 263 | +- The heartbeat TTL (`heartbeat_ttl_seconds`, default 30) is the |
| 264 | + upper bound on how stale the fleet view may be. Operators should |
| 265 | + size rollout windows so that the old fleet continues to heartbeat |
| 266 | + until all runs that need it have terminated or been continued onto |
| 267 | + the new marker. |
| 268 | + |
| 269 | +## Rollout and rollback guidance |
| 270 | + |
| 271 | +The contract above is designed to support operator-driven rollout and |
| 272 | +rollback without the engine guessing intent: |
| 273 | + |
| 274 | +- **Add a new marker** — deploy a new fleet with a new |
| 275 | + `DW_V2_CURRENT_COMPATIBILITY` value and set its `supported` list |
| 276 | + to advertise both the new marker and any markers still in use |
| 277 | + for in-flight runs. The new fleet will start accepting tasks for |
| 278 | + both old and new runs. Starter processes that point at the new |
| 279 | + fleet will stamp newly-started runs with the new marker. |
| 280 | +- **Drain an old marker** — stop stamping new runs with the old |
| 281 | + marker (change the starter process's current marker), let pinned |
| 282 | + runs either terminate or continue-as-new onto the new marker, and |
| 283 | + only then remove the old marker from any worker's `supported` |
| 284 | + list. |
| 285 | +- **Roll back** — the old fleet still advertises its old marker in |
| 286 | + `supported`; restart the starter processes with the old marker as their current |
| 287 | + marker. In-flight runs on the new marker will keep running on the |
| 288 | + new fleet until they finish; no run is quietly rerouted to an |
| 289 | + incompatible build. |
| 290 | +- **Observe safety** — automation watching |
| 291 | + `WorkerCompatibilityFleet::detailsForNamespace()` with the |
| 292 | + pinned-run marker should require `supports_required=true` on at |
| 293 | + least one live heartbeat before declaring the rollout healthy. The |
| 294 | + same signal identifies stuck rollbacks. |
| 295 | + |
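The two automation checks implied by the guidance above can be sketched in Python. This is an illustrative model under stated assumptions. The row shape (a mapping with a `supports_required` key) is a guess at what `detailsForNamespace()` exposes to automation, and both function names are hypothetical.

```python
from typing import Any, Iterable, Mapping, Optional


def has_compatible_worker(detail_rows: Iterable[Mapping[str, Any]]) -> bool:
    """True when at least one live scope reports supports_required=true.

    Assumption: each row is a mapping that carries a boolean
    `supports_required` for the marker that was queried.
    """
    return any(row.get("supports_required") is True for row in detail_rows)


def safe_to_remove_marker(marker: str,
                          in_flight_markers: Iterable[Optional[str]]) -> bool:
    """True when no in-flight run is still pinned to `marker`, so the
    marker can be dropped from workers' supported lists."""
    return marker not in set(in_flight_markers)
```

A rollout gate would call `has_compatible_worker` once per marker that still has pinned runs, and would treat a `False` result as "no compatible worker is registered yet" rather than as a failure.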
| 296 | +## What this contract does not yet guarantee |
| 297 | + |
| 298 | +The following are explicitly deferred to later roadmap phases and must |
| 299 | +not be assumed: |
| 300 | + |
| 301 | +- Per-task queue routing based on build identity is not provided by |
| 302 | + the engine. Deployments that need stronger isolation across |
| 303 | + compatibility families should use separate queue names. |
| 304 | +- Automatic detection of "no compatible worker" as a blocker that |
| 305 | + halts scheduling upstream commands is not provided. The absence is |
| 306 | + observable but operator automation owns the response. |
| 307 | +- Protocol-level compatibility negotiation between a worker and the |
| 308 | + engine is not part of this contract. The worker protocol version is |
| 309 | + frozen separately in `Workflow\V2\Support\WorkerProtocolVersion` and |
| 310 | + is independent of the compatibility marker. |
| 311 | +- Managed-mode or hosted-mode topology (control-plane / data-plane |
| 312 | + split) is outside this contract. See Phase 4 (#582). |
| 313 | + |
| 314 | +## Test strategy alignment |
| 315 | + |
| 316 | +- `tests/Feature/V2/V2CompatibilityWorkflowTest.php` exercises the |
| 317 | + pinning, mismatch, and fleet summary paths end-to-end against the |
| 318 | + workflow engine. |
| 319 | +- `tests/Feature/V2/V2OperatorQueueVisibilityTest.php` and |
| 320 | + `tests/Feature/V2/V2OperatorMetricsTest.php` cover the operator |
| 321 | + surfaces that expose `build_ids` and worker scopes. |
| 322 | +- This document is pinned by |
| 323 | + `tests/Unit/V2/WorkerCompatibilityDocumentationTest.php`. A change |
| 324 | + that renames, removes, or narrows any named guarantee (marker |
| 325 | + inheritance, claim-time enforcement, the `supports_required` flag, |
| 326 | + the heartbeat TTL contract, or the wildcard marker semantics) must |
| 327 | + update the pinning test and this document in the same change so |
| 328 | + the contract does not drift silently. |
| 329 | + |
| 330 | +## Changing this contract |
| 331 | + |
| 332 | +A change to any named guarantee in this document is a protocol-level |
| 333 | +change for the purposes of `docs/api-stability.md` and downstream |
| 334 | +SDKs. Reviewers should treat unmotivated changes to the language above |
| 335 | +as breaking changes and require explicit cross-SDK coordination before |
| 336 | +merge. The Phase 2 roadmap (#580) owns updates to this contract; |
| 337 | +Phases 3–5 must extend the contract rather than silently redefine it. |