Commit 9a1a7d3

Freeze v2 worker-compatibility and routing contract
Issue #580 opens Phase 2 of the v2 multi-node architecture roadmap, which replaces cluster-wide compatibility env discipline with a real worker build identity and routing contract for long-running workflows. Product docs, CLI reasoning, Waterline diagnostics, server deployment guidance, and test coverage need one reference that names which builds are allowed to execute which work, how compatibility flows through runs, and how the absence of a compatible worker is surfaced. This lands the contract doc and a pinning test.

docs/architecture/worker-compatibility.md:
- scopes the contract to worker build identity, compatibility markers, task/run inheritance, routing and claim enforcement, and the operator-visible fleet surface
- defines worker build identity (worker_id, host, process_id, namespace, connection, queue, supported, recorded_at, expires_at) and the hostname:pid:ulid worker id format
- freezes compatibility-marker semantics: opaque strings, exact-equality routing, the `*` wildcard for workers only, runs never stamped with `*`, and the full DW_V2_CURRENT_COMPATIBILITY / DW_V2_SUPPORTED_COMPATIBILITIES / DW_V2_COMPATIBILITY_NAMESPACE / DW_V2_COMPATIBILITY_HEARTBEAT_TTL env surface
- defines how compatibility is inherited through Start, workflow-task and activity-task claims, retry runs, continue-as-new, and child workflows, and how fingerprint pinning runs in parallel
- freezes poll-time filtering, claim-time enforcement with compatibility_blocked / compatibility_unsupported reason codes, and dispatch-time queue routing without encoding compatibility into queue names
- names WorkerCompatibilityFleet::summaryForNamespace() / detailsForNamespace() and supports_required=false as the explicit operational state for "no compatible worker is registered yet"
- documents the operator rollout / rollback posture that the contract enables and the 30s heartbeat TTL upper bound
- defers dedicated task matching (#581), control-plane/data-plane split (#582), and scheduler cache independence (#583) to later phases rather than silently absorbing their guarantees
- cites docs/architecture/execution-guarantees.md as the Phase 1 foundation it extends

tests/Unit/V2/WorkerCompatibilityDocumentationTest.php:
- pins required headings, required terminology, required identity fields, required config/env keys, required enforcement reason codes, required canonical implementation classes, and the lifecycle inheritance transitions so the contract doc cannot drift silently
- asserts the absence-of-compatible-worker language is explicit, the "not silently routed" guarantee is present, and that Phase 3 (#581) and Phase 4 (#582) are deferred by number
- asserts the 30s heartbeat TTL default and the wildcard marker semantics are named verbatim

Verified:
- bash scripts/check-public-boundary.sh (exit 0)
- vendor/bin/phpunit tests/Unit/V2/WorkerCompatibilityDocumentationTest.php (12 tests, 75 assertions, OK) against PHP 8.4
- vendor/bin/ecs check tests/Unit/V2/WorkerCompatibilityDocumentationTest.php (no errors)
1 parent 68132d3 commit 9a1a7d3

2 files changed

Lines changed: 609 additions & 0 deletions

Lines changed: 337 additions & 0 deletions
@@ -0,0 +1,337 @@
# Workflow V2 Worker Compatibility and Routing Contract

This document freezes the v2 contract for worker build identity,
compatibility markers, and how in-flight workflow and activity work is
routed to compatible executors. It is the reference cited by the v2
docs, CLI reasoning, Waterline diagnostics, server deployment guidance,
and test coverage so the whole fleet speaks one language about mixed
builds, rollout, rollback, and the absence of a compatible worker.

The guarantees below apply to the `durable-workflow/workflow` package at
v2 and to every host that embeds it or talks to it over the worker
protocol. A change to any named guarantee is a protocol-level change
and must be reviewed as such, even if the class that implements it is
`@internal`.

This contract builds on the semantics frozen in
`docs/architecture/execution-guarantees.md`. Duplicate execution,
retries, and redelivery keep the language they have there; this document
adds the language for which builds are allowed to execute which work.

## Scope

The contract covers:

- **worker build identity** — what each workflow-task worker and
  activity-task worker process presents to the engine so that operators
  and routing logic can reason about the running fleet.
- **compatibility markers** — the named string that a run is pinned to
  and that a worker advertises as supported. One marker is one
  compatibility family.
- **task and run compatibility** — how compatibility is recorded on
  workflow runs and workflow tasks, and how it is inherited through
  retries, continue-as-new, and child-workflow starts.
- **routing of in-flight work** — how polling, claim, and dispatch
  interact with compatibility so that no task is silently executed by an
  incompatible worker.
- **operator-visible compatibility state** — the fleet and queue
  surfaces that report which markers are live and where.

It does not cover:

- the dedicated task-matching service described by the Phase 3 roadmap
  (#581). The Phase 3 surface will replace broad database polling with
  explicit match/dispatch, but must preserve the compatibility
  guarantees below.
- the control-plane/data-plane role split described by Phase 4 (#582).
  The split will move compatibility heartbeating onto the control plane
  but must preserve the observable state named here.
- the scheduler independence work described by Phase 5 (#583).
- host-level deployment orchestration such as container image selection
  or rolling-restart choreography. Those are deployment concerns that
  consume this contract; they do not define it.

## Terminology

- **Worker build identity** — the tuple `(worker_id, host, process_id,
  namespace, connection, queue, supported[])` recorded by a live worker
  heartbeat. `worker_id` is the stable identifier for one worker
  process; `supported[]` is the set of compatibility markers the worker
  will accept work for.
- **Compatibility marker** — an opaque, operator-chosen string such as
  `build-2026-04-17` or `api-v3`. The engine does not interpret the
  string beyond equality. The special marker `*` means "accept any
  marker" and is reserved for single-build fleets and test harnesses.
- **Compatibility family** — the set of builds that share one
  compatibility marker. Two workers that advertise the same marker are
  interchangeable for routing purposes; the engine guarantees nothing
  else about their code parity.
- **Required marker** — the marker a given workflow task or activity
  task requires. Required markers are resolved from
  `workflow_tasks.compatibility` first, then from the owning run's
  `workflow_runs.compatibility`, and `null` means "no marker required".
- **Pinned run** — a workflow run whose `workflow_runs.compatibility`
  column is set to a non-null marker. A pinned run is routed to workers
  that advertise that marker until the run terminates or is explicitly
  continued-as-new onto a different marker.
- **Fingerprint pinning** — the `workflow_definition_fingerprint`
  recorded on `WorkflowStarted` that pins one run to the class
  definition it started under, independent of the compatibility marker.
  See `Workflow\V2\Support\WorkflowDefinitionFingerprint::resolveClassForRun()`.

83+
## Worker build identity
84+
85+
Every live worker maintains a heartbeat row under the
86+
`workflow_worker_compatibility_heartbeats` table (or the legacy fallback
87+
cache when the table is unavailable). The row is owned by one
88+
`worker_id` and carries:
89+
90+
- **`worker_id`**`hostname:pid:ulid`, generated on first heartbeat
91+
and stable for the life of the worker process. The ULID segment keeps
92+
the id unique across hostname/pid collisions.
93+
- **`host`** — the process's hostname as reported by `gethostname()`.
94+
May be `null` when the host cannot be determined.
95+
- **`process_id`** — the operating-system pid. May be `null` in
96+
environments where a pid is not meaningful.
97+
- **`namespace`** — the value of
98+
`workflows.v2.compatibility.namespace` (env
99+
`DW_V2_COMPATIBILITY_NAMESPACE`). Used to scope one workflow database
100+
across multiple cooperating apps.
101+
- **`connection`**, **`queue`** — the queue-connection and queue name
102+
the worker is draining. Either may be `null` when the worker is
103+
connection- or queue-agnostic.
104+
- **`supported`** — the JSON list of compatibility markers the worker
105+
will accept. Either the literal `*` (accept any) or a non-empty set
106+
of markers.
107+
- **`recorded_at`**, **`expires_at`** — the heartbeat timestamp and
108+
expiry computed from `workflows.v2.compatibility.heartbeat_ttl_seconds`
109+
(default 30 seconds, configured by
110+
`DW_V2_COMPATIBILITY_HEARTBEAT_TTL`).
111+
112+
Worker identity is a runtime fact, not a configuration contract. The
113+
only configured inputs are the compatibility markers and namespace; the
114+
rest of the identity is discovered from the process.
115+
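As a rough illustration of the identity fields above, the sketch below assembles a `hostname:pid:ulid` id and derives `expires_at` from `recorded_at` plus the TTL. All three function names are hypothetical. The ULID is passed in rather than generated, the `unknown` and `0` placeholders for a missing host or pid are assumptions, and so is treating the expiry instant itself as already expired.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: joins the three segments of the worker id.
// The placeholders used for a null host or pid are assumptions.
function buildWorkerId(?string $host, ?int $pid, string $ulid): string
{
    return sprintf('%s:%s:%s', $host ?? 'unknown', $pid ?? 0, $ulid);
}

// expires_at = recorded_at + TTL (30 seconds unless configured otherwise).
function heartbeatExpiry(DateTimeImmutable $recordedAt, int $ttlSeconds = 30): DateTimeImmutable
{
    return $recordedAt->add(new DateInterval('PT' . max(0, $ttlSeconds) . 'S'));
}

// A heartbeat counts as live strictly before its expiry (boundary assumed).
function isHeartbeatLive(DateTimeImmutable $expiresAt, DateTimeImmutable $now): bool
{
    return $now < $expiresAt;
}

$recordedAt = new DateTimeImmutable('2026-04-17T00:00:00+00:00');
$expiresAt  = heartbeatExpiry($recordedAt);
$workerId   = buildWorkerId('web-1', 4242, '01ARZ3NDEKTSV4RRFFQ69G5FAV');
```

With the default TTL, a worker that stops heartbeating drops out of the live view at most 30 seconds after its last recorded heartbeat.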
## Compatibility markers

A worker's compatibility configuration is two keys:

- **`workflows.v2.compatibility.current`**
  (`DW_V2_CURRENT_COMPATIBILITY`) — the marker this process advertises
  as its own build. When a workflow run is started from this process,
  its `workflow_runs.compatibility` is stamped with this value.
- **`workflows.v2.compatibility.supported`**
  (`DW_V2_SUPPORTED_COMPATIBILITIES`) — the comma-separated list of
  markers this worker will accept when claiming tasks. `*` means
  "accept any marker". Empty/`null` defaults to the current marker.

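The way the two keys combine can be sketched as a small parser. This is an illustration under stated assumptions rather than the package's implementation: `parseSupportedMarkers` is a hypothetical name, collapsing a list that contains `*` down to just `*` is an assumption, and so is returning an empty list when neither key is set.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical parser for the raw values of DW_V2_CURRENT_COMPATIBILITY
 * and DW_V2_SUPPORTED_COMPATIBILITIES.
 *
 * @return list<string>
 */
function parseSupportedMarkers(?string $current, ?string $supported): array
{
    // Split on commas, trim whitespace, and drop empty segments.
    $items = array_values(array_filter(
        array_map('trim', explode(',', $supported ?? '')),
        static fn (string $marker): bool => $marker !== ''
    ));

    if (in_array('*', $items, true)) {
        return ['*']; // wildcard: accept any marker (collapse is assumed)
    }

    if ($items === []) {
        // Empty/null supported list defaults to the current marker.
        return ($current === null || $current === '') ? [] : [$current];
    }

    return array_values(array_unique($items));
}
```

For example, `parseSupportedMarkers('build-b', 'build-a, build-b')` yields both markers, while `parseSupportedMarkers('build-b', null)` yields only `build-b`.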
Guarantees:

- The marker is opaque. The engine performs only exact-string equality
  and the `*` wildcard. It does not order markers, does not interpret
  semver, and does not diff their contents.
- A run stamped with marker `M` is routable only to workers whose
  `supported` list includes `M` or `*`. The engine refuses to dispatch
  or claim it on any other worker and reports the mismatch as an
  explicit operational state rather than running it silently.
- A run stamped with `null` (no required marker) is routable to any
  worker. Pinning is opt-in — single-build fleets do not need to set
  any compatibility config.
- The marker is recorded exactly once per run, at start, from
  `WorkerCompatibility::current()`. Subsequent workflow tasks, activity
  tasks, child runs, retry runs, and continue-as-new runs inherit the
  recorded value. Changing `DW_V2_CURRENT_COMPATIBILITY` on the starter
  process only affects newly-started runs; in-flight runs stay on the
  marker they were stamped with.
- The wildcard marker `*` is an advertisement surface for workers only.
  Runs are never stamped with `*`; that would defeat the purpose of
  pinning a run to one compatibility family.

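The routing rule in these guarantees reduces to a three-way check: no required marker, the worker wildcard, or exact string equality. A minimal sketch, with `workerSupports` as a hypothetical name that is not part of the package:

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical predicate for "may this worker run work pinned to $required?".
 *
 * @param list<string> $supported markers the worker advertises
 */
function workerSupports(array $supported, ?string $required): bool
{
    if ($required === null) {
        return true; // unpinned work is routable to any worker
    }

    // Strict in_array gives exact-string equality: no ordering, no
    // semver interpretation, no prefix or case-insensitive matching.
    return in_array('*', $supported, true)
        || in_array($required, $supported, true);
}
```

Because the comparison is strict, `build-1` does not match `build-10`, and `Build-A` does not match `build-a`.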
## Compatibility inheritance

Compatibility flows through the run lifecycle as follows:

- **Start** — a new run is stamped with
  `WorkerCompatibility::current()` on the starter process and the
  value is written to `workflow_runs.compatibility` in the same
  transaction as `WorkflowStarted`. See `DefaultWorkflowControlPlane`
  for the dispatch site.
- **Workflow tasks** — each `workflow_tasks` row carries a
  `compatibility` column. Existing tasks are synced to the owning run's
  compatibility on claim via `TaskCompatibility::sync()` so repair and
  re-enqueue keep the same marker the run was started under.
- **Activity tasks** — activity tasks inherit their run's compatibility
  through the same mechanism. An activity task that cannot yet be
  matched to a compatible worker stays in the task table with its
  marker until one appears; it is never silently redirected to an
  incompatible worker.
- **Retry runs** — when a failed run is retried, the retry run's
  `compatibility` is inherited from the source run. The retry
  continues on the same marker family unless an operator explicitly
  creates a new run on a different marker.
- **Continue-as-new** — the continued run inherits the previous run's
  `compatibility` column. Continue-as-new does not translate between
  markers; to move long-running work onto a new marker, start a fresh
  workflow from a process that advertises the new marker.
- **Child workflows** — child runs inherit the parent run's
  `compatibility` column. A child started by a parent on marker `M`
  runs on marker `M` so a mixed-version deployment does not split a
  parent/child pair across incompatible workers.
- **Fingerprint pinning** runs in parallel with compatibility pinning.
  Fingerprint pinning guarantees that a run executes against the same
  class *definition* snapshot it started with; compatibility pinning
  guarantees that the run runs on a compatible *worker build*. Both
  guarantees survive redeploy independently.

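The inheritance rules for runs can be condensed into one decision: only a fresh start reads the starter's current marker, and every derived run copies the marker of the run it came from. The sketch below is illustrative only; the `RunOrigin` enum and `markerForNewRun` function are hypothetical names introduced for the example.

```php
<?php
declare(strict_types=1);

// Hypothetical enumeration of how a run comes into existence.
enum RunOrigin
{
    case Start;
    case Retry;
    case ContinueAsNew;
    case Child;
}

/**
 * Hypothetical helper returning the marker to stamp on a new run.
 *
 * @param ?string $starterCurrent current marker of the starting process
 * @param ?string $sourceMarker   marker of the source or parent run, if any
 */
function markerForNewRun(RunOrigin $origin, ?string $starterCurrent, ?string $sourceMarker): ?string
{
    return match ($origin) {
        // A fresh start is stamped from the starter's current marker.
        RunOrigin::Start => $starterCurrent,
        // Every derived run stays on the family it came from.
        RunOrigin::Retry, RunOrigin::ContinueAsNew, RunOrigin::Child => $sourceMarker,
    };
}
```

Changing the starter's marker therefore affects only `RunOrigin::Start`; a child of a run on `build-a` is stamped `build-a` even when the process that schedules it now advertises `build-b`.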
## Routing and claim enforcement

Routing happens at two surfaces. Both enforce the same marker contract.

### Poll-time filtering

Workers that long-poll the task surfaces pass the
`?compatibility=marker` query parameter to
`GET /workflow-tasks/poll` and `GET /activity-tasks/poll`. The server
filters the eligible task set to rows whose `compatibility` column
matches the requested marker. A worker advertising `*` does not send
the filter and sees the full eligible set.

Poll-time filtering is a performance optimisation. It is not the
correctness boundary — a task that leaks through the filter is still
rejected at claim time by the enforcement below.

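On the worker side, the poll filter amounts to appending or omitting one query parameter. The following sketch assumes a hypothetical `pollPath` helper; the two endpoint paths and the `compatibility` parameter name are from the contract, while omitting the parameter for a worker with no marker at all is an assumption.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper that builds the poll path for one marker.
 * A wildcard worker sends no filter and sees the full eligible set.
 */
function pollPath(string $path, ?string $marker): string
{
    if ($marker === null || $marker === '*') {
        return $path;
    }

    // http_build_query URL-encodes the marker value.
    return $path . '?' . http_build_query(['compatibility' => $marker]);
}

$workflowPoll = pollPath('/workflow-tasks/poll', 'build-2026-04-17');
$activityPoll = pollPath('/activity-tasks/poll', '*');
```

Nothing here replaces claim-time enforcement: a worker that builds the wrong path only sees a different candidate set and is still rejected when it tries to claim work it does not support.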
### Claim-time enforcement

At claim time, both bridges call `TaskCompatibility::supported()` /
`TaskCompatibility::sync()`:

- `Workflow\V2\Support\DefaultWorkflowTaskBridge::claim()` rejects a
  workflow task with the reason code `compatibility_blocked` when the
  claiming worker's `supported` list does not include the task's
  required marker.
- `Workflow\V2\Support\ActivityTaskClaimer::claimDetailed()` rejects
  an activity task with the reason code `compatibility_unsupported`
  and returns the human-readable mismatch string on the claim
  response.

A rejected claim leaves the task on the queue with its original
compatibility marker. The worker that attempted the claim does not
retry; another worker whose `supported` list covers the marker may
claim it. When no live worker advertises a compatible marker, the task
remains eligible and the condition is observable through the fleet
visibility surfaces below.

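The claim decision can be modelled as a pure function that either accepts the claim or returns the reason code for the task kind. This is a sketch of the rule, not of `DefaultWorkflowTaskBridge` or `ActivityTaskClaimer`: the function name, the `'workflow'` / `'activity'` kind strings, and the returned array shape are all assumptions; only the two reason codes come from the contract.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical claim check.
 *
 * @param 'workflow'|'activity' $taskKind
 * @param list<string>          $supported markers the claiming worker advertises
 * @return array{claimed: bool, reason: ?string}
 */
function claimDecision(string $taskKind, array $supported, ?string $required): array
{
    $compatible = $required === null
        || in_array('*', $supported, true)
        || in_array($required, $supported, true);

    if ($compatible) {
        return ['claimed' => true, 'reason' => null];
    }

    // The task keeps its marker and stays eligible for another worker.
    $reason = $taskKind === 'workflow'
        ? 'compatibility_blocked'
        : 'compatibility_unsupported';

    return ['claimed' => false, 'reason' => $reason];
}
```

A rejection changes nothing on the task itself, so the same call made by a worker that advertises the required marker succeeds.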
### Dispatch-time routing

`Workflow\V2\Support\TaskDispatcher` routes tasks to the Laravel queue
via `connection`/`queue` fields on the task row. Compatibility is not
encoded into the queue name; instead, every worker on that queue
applies claim-time enforcement and parks tasks it cannot run. Operators
who want stronger isolation between compatibility families should use
separate queues per family; the contract above keeps that policy
choice out of the engine.

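Since queue-per-family isolation is left to operators, a deployment that wants it needs its own naming rule. The sketch below shows one possible convention and is entirely hypothetical: the engine defines no such mapping, and both the `queueForFamily` name and the `base-slug` scheme are assumptions made for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical operator-side convention: one queue per compatibility family.
 * Unpinned work stays on the base queue.
 */
function queueForFamily(string $baseQueue, ?string $marker): string
{
    if ($marker === null) {
        return $baseQueue;
    }

    // Reduce the opaque marker to a queue-name-safe slug.
    $slug = strtolower(trim(preg_replace('/[^A-Za-z0-9]+/', '-', $marker) ?? '', '-'));

    return $slug === '' ? $baseQueue : $baseQueue . '-' . $slug;
}
```

A scheme like this can map two distinct markers to the same slug (for example `api_v3` and `api-v3`), so claim-time enforcement remains the correctness boundary even with separate queues.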
## Operator-visible state

The fleet and queue surfaces must make mixed-version state explicit to
operators and automation:

- `Workflow\V2\Support\WorkerCompatibilityFleet::summaryForNamespace()`
  returns `active_workers`, `active_worker_scopes`, the live queue
  list, the live `build_ids` list, and the per-worker roll-up. `build_ids`
  is the union of advertised markers across the namespace.
- `Workflow\V2\Support\WorkerCompatibilityFleet::detailsForNamespace()`
  returns one row per `(worker_id, connection, queue)` scope with a
  `supports_required` flag when a required marker is passed. Automation
  should use this call to detect the absence of a compatible worker
  for a pinned run.
- `WorkerCompatibility::mismatchReason()` and
  `WorkerCompatibilityFleet::mismatchReason()` return the canonical
  human-readable mismatch string. CLI, Waterline, and cloud
  diagnostics must surface this string verbatim rather than inventing
  their own language.

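To make "the union of advertised markers" concrete, the sketch below computes a `build_ids`-style list from heartbeat rows. It is not the body of `summaryForNamespace()`: the function name, the row shape (`supported` as a list, `expires_at` as a `DateTimeImmutable`), the exclusion of expired rows, and the sorted output are assumptions for the example.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical roll-up of live markers across heartbeat rows.
 *
 * @param list<array{supported: list<string>, expires_at: DateTimeImmutable}> $heartbeats
 * @return list<string>
 */
function liveBuildIds(array $heartbeats, DateTimeImmutable $now): array
{
    $seen = [];
    foreach ($heartbeats as $row) {
        if ($row['expires_at'] <= $now) {
            continue; // expired heartbeat: the worker is not counted as live
        }
        foreach ($row['supported'] as $marker) {
            $seen[$marker] = true;
        }
    }

    // Numeric-looking markers become int keys, so cast back to strings.
    $ids = array_map('strval', array_keys($seen));
    sort($ids, SORT_STRING);

    return $ids;
}

$now = new DateTimeImmutable('2026-04-17T12:00:00+00:00');
$buildIds = liveBuildIds([
    ['supported' => ['build-a', 'build-b'], 'expires_at' => $now->modify('+20 seconds')],
    ['supported' => ['build-b'], 'expires_at' => $now->modify('+5 seconds')],
    ['supported' => ['build-old'], 'expires_at' => $now->modify('-1 second')],
], $now);
```

In this example `build-old` is absent from the result because its only heartbeat has expired, which is the situation a pinned run on `build-old` would surface as having no compatible worker.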
Guarantees:

- The absence of a compatible worker is an explicit operational
  state, not an error. It reports as `supports_required=false` on the
  fleet surface and as `compatibility_blocked` /
  `compatibility_unsupported` on the claim path. Product docs, CLI,
  and Waterline should describe it as "no compatible worker is
  registered yet" rather than as "the task failed".
- The heartbeat TTL (`heartbeat_ttl_seconds`, default 30) is the
  upper bound on how stale the fleet view may be. Operators should
  size rollout windows so that the old fleet continues to heartbeat
  until all runs that need it have terminated or been continued onto
  the new marker.

## Rollout and rollback guidance

The contract above is designed to support operator-driven rollout and
rollback without the engine guessing intent:

- **Add a new marker** — deploy a new fleet with a new
  `DW_V2_CURRENT_COMPATIBILITY` value and set its `supported` list to
  advertise both the new marker and any markers still in use for
  in-flight runs. The new fleet will start accepting tasks for both
  old and new runs. Starter processes that point at the new fleet will
  stamp newly-started runs with the new marker.
- **Drain an old marker** — stop stamping new runs with the old
  marker (change the starter process's current marker), let pinned
  runs either terminate or continue-as-new onto the new marker, and
  only then remove the old marker from any worker's `supported`
  list.
- **Roll back** — the old fleet still advertises its old marker in
  `supported`; restart the starter processes pointing back at the old
  marker. In-flight runs on the new marker will keep running on the
  new fleet until they finish; no run is quietly rerouted to an
  incompatible build.
- **Observe safety** — automation watching
  `WorkerCompatibilityFleet::detailsForNamespace()` with the
  pinned-run marker should require `supports_required=true` on at
  least one live heartbeat before declaring the rollout healthy. The
  same signal identifies stuck rollbacks.

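The "observe safety" check can be expressed as a gate over the fleet detail rows. The sketch below does not call `detailsForNamespace()`; it takes rows as plain arrays, and everything about that shape except the `supports_required` flag is an assumption, including the presence of `expires_at` on each row. The name `rolloutHealthy` is hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical rollout gate: healthy only if at least one live
 * heartbeat reports supports_required=true for the pinned-run marker.
 *
 * @param list<array{supports_required?: bool, expires_at: DateTimeImmutable}> $details
 */
function rolloutHealthy(array $details, DateTimeImmutable $now): bool
{
    foreach ($details as $row) {
        $supports = ($row['supports_required'] ?? false) === true;
        $live = $row['expires_at'] > $now;
        if ($supports && $live) {
            return true;
        }
    }

    return false; // "no compatible worker is registered yet"
}

$now = new DateTimeImmutable('2026-04-17T12:00:00+00:00');
$healthy = rolloutHealthy([
    ['supports_required' => false, 'expires_at' => $now->modify('+10 seconds')],
    ['supports_required' => true, 'expires_at' => $now->modify('+10 seconds')],
], $now);
$stuck = rolloutHealthy([
    ['supports_required' => true, 'expires_at' => $now->modify('-10 seconds')],
], $now);
```

A `false` result is a signal to hold the rollout or rollback, not a task failure; the pinned work stays eligible until a compatible worker heartbeats.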
## What this contract does not yet guarantee

The following are explicitly deferred to later roadmap phases and must
not be assumed:

- Per-task queue routing based on build identity is not provided by
  the engine. Deployments that need stronger isolation across
  compatibility families should use separate queue names.
- Automatic detection of "no compatible worker" as a blocker that
  halts scheduling upstream commands is not provided. The absence is
  observable but operator automation owns the response.
- Protocol-level compatibility negotiation between a worker and the
  engine is not part of this contract. The worker protocol version is
  frozen separately in `Workflow\V2\Support\WorkerProtocolVersion` and
  is independent of the compatibility marker.
- Managed-mode or hosted-mode topology (control-plane / data-plane
  split) is outside this contract. See Phase 4 (#582).

## Test strategy alignment

- `tests/Feature/V2/V2CompatibilityWorkflowTest.php` exercises the
  pinning, mismatch, and fleet summary paths end-to-end against the
  workflow engine.
- `tests/Feature/V2/V2OperatorQueueVisibilityTest.php` and
  `tests/Feature/V2/V2OperatorMetricsTest.php` cover the operator
  surfaces that expose `build_ids` and worker scopes.
- This document is pinned by
  `tests/Unit/V2/WorkerCompatibilityDocumentationTest.php`. A change
  that renames, removes, or narrows any named guarantee (marker
  inheritance, claim-time enforcement, the `supports_required` flag,
  the heartbeat TTL contract, or the wildcard marker semantics) must
  update the pinning test and this document in the same change so
  the contract does not drift silently.

## Changing this contract

A change to any named guarantee in this document is a protocol-level
change for the purposes of `docs/api-stability.md` and downstream
SDKs. Reviewers should treat unmotivated changes to the language above
as breaking changes and require explicit cross-SDK coordination before
merge. The Phase 2 roadmap (#580) owns updates to this contract;
Phases 3–5 must extend the contract rather than silently redefine it.
