Commit 68132d3

Freeze v2 execution-guarantees and idempotency contract
Issue #579 asked for one contract doc that defines what can execute more than once, what is replayed deterministically, what the durable state layer observes exactly once, and what application authors are required to make idempotent. Product docs, CLI reasoning, Waterline diagnostics, and test coverage now need to share one reference for duplicate execution, retries, lease expiry, and redelivery. This lands the contract doc and a pinning test.

docs/architecture/execution-guarantees.md:
- scopes the contract to workflow tasks, activity attempts, external commands, durable messages, side effects, and schedules
- defines at-least-once, at-most-once, deterministic replay, exactly-once at the durable state layer, redelivery, and replay as frozen terms
- states workflow-task decision batches are exactly-once at commit by workflow_command_id / activity_execution_id / activity_attempt_id / timer_id / signal_id / update_id / condition_wait_id / child workflow_link_id and child_call_id
- states activity attempts are at-least-once and that ActivityOutcomeRecorder records exactly one terminal typed event per activity_attempt_id, with recorded=false + reason as the normal redelivery signal
- separates activity-attempt retry, workflow-task repair, and child workflow retry as three distinct surfaces with distinct identifiers
- names the full framework idempotency surface set (workflow_instance_id, workflow_run_id, activity_execution_id, activity_attempt_id, workflow_command_id, stream_key cursor, sendReference idempotencyKey, SideEffectRecorded sequence, VersionMarkerRecorded change_id, schedule_id + occurrence_time)
- explicitly documents DuplicateStartPolicy::reject_duplicate / return_existing_active and CommandOutcome::RejectedDuplicate as the duplicate-start surface
- states duplicate execution is not a bug condition and directs operator, CLI, and Waterline reasoning to the appropriate idempotency surface instead

tests/Unit/V2/ExecutionGuaranteesDocumentationTest.php:
- pins required headings, required semantics terms, required idempotency surfaces, and the DuplicateStartPolicy / CommandOutcome case references so the contract doc cannot drift silently
- asserts the doc states duplicate execution is not a bug condition and cites ActivityOutcomeRecorder with the recorded=false redelivery signal

Verified:
- bash scripts/check-public-boundary.sh (exit 0)
- vendor/bin/phpunit tests/Unit/V2/ExecutionGuaranteesDocumentationTest.php (6 tests, 47 assertions, OK)
- vendor/bin/ecs check tests/Unit/V2/ExecutionGuaranteesDocumentationTest.php (no errors)
1 parent cc9cde2 commit 68132d3

2 files changed

Lines changed: 543 additions & 0 deletions


Lines changed: 381 additions & 0 deletions
@@ -0,0 +1,381 @@
1+
# Workflow V2 Execution Guarantees and Idempotency Contract
2+
3+
This document freezes the v2 contract for what can execute more than once,
4+
what is replayed deterministically, what the durable state layer observes
5+
exactly once, and what application authors are required to make
6+
idempotent. It is the reference cited by the v2 docs, CLI reasoning,
7+
Waterline diagnostics, and test coverage so the whole fleet speaks one
8+
language about duplicate execution, retries, lease expiry, and
9+
redelivery.
10+
11+
The guarantees below apply to the `durable-workflow/workflow` package at
12+
v2 and to every host that embeds it or talks to it over the worker
13+
protocol. A change to any named guarantee is a protocol-level change and
14+
must be reviewed as such, even if the class that implements it is
15+
`@internal`.
16+
17+
## Scope
18+
19+
The contract covers:
20+
21+
- **workflow tasks** — units of replay-driven workflow execution claimed
22+
by a workflow worker and acknowledged via a decision batch.
23+
- **activity attempts** — external side-effecting work claimed by an
24+
activity worker and acknowledged via `complete` / `fail` / `heartbeat`.
25+
- **external commands** — start, signal, update, cancel, terminate,
26+
archive, query, and schedule commands received from outside the engine
27+
over the CLI, server HTTP surface, workflow client, or cloud API.
28+
- **durable messages** — the inbox/outbox primitives backing signals,
29+
updates, workflow-to-workflow messages, and human-input flows.
30+
- **side effects**: `sideEffect(...)`, `uuid4`, `uuid7`, and
31+
`record_version_marker` calls recorded inline in history.
32+
- **schedules** — the ScheduleTriggered events that launch scheduled
33+
workflow runs.
34+
35+
It does not cover in-process Laravel queue semantics for the host
36+
application outside v2, application database transactions that the host
37+
runs independently of a workflow task, or the behaviour of third-party
38+
services that an activity calls.
39+
40+
## Terminology
41+
42+
- **At-least-once** means the framework may observe the same logical
43+
work more than once and the application author must tolerate it.
44+
- **At-most-once** is not promised for any side-effecting operation.
45+
Activities, signals, updates, starts, and schedule triggers are never
46+
at-most-once.
47+
- **Deterministic replay** means the engine re-runs the same authoring
48+
code to rebuild workflow state from history, producing the same
49+
decisions in the same order. Replay is not application re-execution
50+
and does not re-invoke external side effects.
51+
- **Exactly-once at the durable state layer** means that a specific
52+
typed history row for a given identifier is written at most once and
53+
is the authoritative record consumed by projections, replayers,
54+
exporters, and operator tools.
55+
- **Redelivery** means the transport requeued a task after lease expiry,
56+
worker crash, or repair, and a different worker may now claim the same
57+
task id or a newly repaired task for the same logical work.
58+
- **Replay** means the engine re-reads history events and re-invokes
59+
workflow authoring code to reconstruct the workflow state machine. It
60+
does not re-invoke activity code or re-emit external side effects.
61+
62+
## Workflow task execution semantics
63+
64+
Workflow tasks are event-sourced decisions produced by the workflow
65+
authoring layer and persisted as typed history events through the
66+
`WorkflowExecutor`.
67+
68+
Guarantees:
69+
70+
- A workflow task may be claimed, rejected, leased, expired, redelivered,
71+
repaired, and retried. Any individual workflow task id may be observed
72+
more than once by a worker.
73+
- The workflow authoring code inside a task is replay-driven. It may run
74+
many times across a run's lifetime. Authoring code must be
75+
deterministic with respect to history, must not depend on wall-clock
76+
time, must not perform IO, and must not mutate external state. See
77+
`constraints/workflow-constraints.md` for the full prohibited list.
78+
- A decision batch commits atomically per task. The typed history
79+
events named in `docs/api-stability.md` are the durable record of that
80+
decision. When the commit is observed, those history rows are
81+
exactly-once at the durable state layer for the decision ids they
82+
carry (`workflow_command_id`, `activity_execution_id`,
83+
`activity_attempt_id`, `timer_id`, `signal_id`, `update_id`,
84+
`condition_wait_id`, child `workflow_link_id`/`child_call_id`, etc.).
85+
- If a decision batch fails to commit, the workflow task is eligible
86+
for redelivery. A later successful commit produces exactly one typed
87+
history row per decision id; the duplicated attempt leaves no
88+
observable effect in history.
89+
- `RepairRequested` / repair redispatch is not a new task — it is the
90+
engine re-enqueueing work that a previous worker could not durably
91+
commit. Repair does not duplicate history. It routes to the same
92+
decision set and is covered by the same exactly-once-at-commit
93+
guarantee.
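
The exactly-once-at-commit behaviour above can be sketched with a small in-memory model. This is only an illustration: `HistoryStore`, `commit_batch`, and the event type strings are hypothetical names, not the package's API, and the real engine persists typed history through `WorkflowExecutor`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HistoryEvent:
    event_type: str   # illustrative label, not a real typed event name
    decision_id: str  # e.g. a timer_id or activity_execution_id


class HistoryStore:
    """Hypothetical in-memory stand-in for the durable history table."""

    def __init__(self) -> None:
        self.rows = {}  # decision_id -> HistoryEvent

    def commit_batch(self, batch: list) -> bool:
        # All-or-nothing: if any decision id is already recorded, the whole
        # batch is treated as a duplicate and nothing is written.
        if any(event.decision_id in self.rows for event in batch):
            return False
        for event in batch:
            self.rows[event.decision_id] = event
        return True


store = HistoryStore()
batch = [
    HistoryEvent("timer-decision", "timer-1"),
    HistoryEvent("activity-decision", "exec-1"),
]

first = store.commit_batch(batch)   # original worker commits
second = store.commit_batch(batch)  # redelivered task replays to the same decisions
```

In this model the redelivered task's batch is rejected as a whole, so history keeps one row per decision id no matter how many workers replayed the task.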
94+
95+
What application authors must assume:
96+
97+
- Anything the workflow body does outside durable primitives can run
98+
more than once and must be safe to re-run. Use `sideEffect(...)`,
99+
`uuid4`/`uuid7`, activity results, updates, queries, or search
100+
attributes/memos to cross the durable boundary.
101+
- Wall-clock reads, live cache reads, and mutable-shared-state reads
102+
inside the workflow body are prohibited for correctness, not style.
103+
104+
## Activity attempt execution semantics
105+
106+
Activities are the first-class vehicle for external side effects. They
107+
are explicitly at-least-once.
108+
109+
Guarantees:
110+
111+
- Each activity execution has a stable `activity_execution_id` that
112+
does not change across retries. It is the default idempotency surface
113+
exposed to the worker on claim (see
114+
`Workflow\V2\Contracts\ActivityTaskBridge::claim()` and the
115+
`idempotency_key` field it returns, which is set to the execution id).
116+
- Each activity attempt has a distinct `activity_attempt_id` and a
117+
sequential `attempt_number` starting at 1. Attempt ids are durable
118+
and are the correlation key for heartbeats, cancellation, and the
119+
typed attempt-scoped history events (`ActivityStarted`,
120+
`ActivityHeartbeatRecorded`, `ActivityRetryScheduled`,
121+
`ActivityCompleted`, `ActivityFailed`, `ActivityCancelled`,
122+
`ActivityTimedOut`).
123+
- An activity attempt can complete more than once from the worker's
124+
point of view: a worker may finish work, lose its lease to a
125+
redelivery after a heartbeat gap, and still attempt to report
126+
completion. `ActivityOutcomeRecorder` records at most one terminal
127+
typed attempt event per `activity_attempt_id`, and reports
128+
`recorded=false` with a reason code for the late caller. The caller
129+
MUST NOT treat `recorded=false` as failure — another worker has
130+
already recorded the outcome or the attempt was superseded.
131+
- The typed attempt-scoped history row for a given
132+
`activity_attempt_id` is exactly-once at the durable state layer.
133+
Retry scheduling emits a separate `ActivityRetryScheduled` for the
134+
new attempt and the next attempt carries the next
135+
`activity_attempt_id`.
136+
- Heartbeats mirror the latest progress onto the live activity
137+
execution and renew the task lease. They are not retry checkpoints;
138+
a heartbeat never splits one attempt into two.
139+
- Cancellation is observed by the activity via cooperative checks and
140+
does not guarantee termination of in-flight external work. Cancelled
141+
attempts may still produce external side effects up to the moment
142+
the worker honours the cancel.
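
A minimal sketch of the "at most one terminal event per attempt" rule, assuming a simplified recorder. `OutcomeRecorder`, `RecordResult`, and the reason string `outcome_already_recorded` are hypothetical; the real `ActivityOutcomeRecorder` and its reason codes may look different.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RecordResult:
    recorded: bool
    reason: Optional[str] = None


class OutcomeRecorder:
    """Hypothetical model: at most one terminal event per activity_attempt_id."""

    def __init__(self) -> None:
        self.terminal = {}  # activity_attempt_id -> terminal event type

    def record(self, attempt_id: str, event_type: str) -> RecordResult:
        if attempt_id in self.terminal:
            # Late caller: the outcome is already durable. This is the normal
            # redelivery signal, not a failure.
            return RecordResult(False, "outcome_already_recorded")
        self.terminal[attempt_id] = event_type
        return RecordResult(True)


recorder = OutcomeRecorder()
winner = recorder.record("attempt-1", "ActivityCompleted")
late = recorder.record("attempt-1", "ActivityCompleted")  # worker that lost its lease
```

A worker receiving `recorded=False` would typically log the reason and move on rather than retry or raise.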
143+
144+
What application authors must assume:
145+
146+
- The same activity execution may be observed more than once. Either
147+
the activity body or the external service it calls must be safe to
148+
repeat. The framework's default idempotency-key surface is
149+
`activity_execution_id` (same across retries) for remote services
150+
that accept an idempotency key per logical request; use
151+
`activity_attempt_id` only for systems that need to distinguish
152+
separate tries of the same logical activity.
153+
- Database writes that must be exactly-once should be wrapped with a
154+
dedupe key (typically `activity_execution_id`) or placed inside a
155+
transaction that is idempotent under retry.
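
As an illustration of keying external work on `activity_execution_id`, the sketch below uses a made-up `PaymentGateway` that dedupes on an idempotency key. The gateway, `charge_activity`, and the way the id reaches the activity body are all assumptions for the example.

```python
class PaymentGateway:
    """Hypothetical external service that dedupes on an idempotency key."""

    def __init__(self) -> None:
        self.charges = {}  # idempotency_key -> amount in cents

    def charge(self, idempotency_key: str, amount_cents: int) -> str:
        # A repeated key returns the original charge instead of charging again.
        self.charges.setdefault(idempotency_key, amount_cents)
        return f"charge-for-{idempotency_key}"


def charge_activity(gateway: PaymentGateway, activity_execution_id: str,
                    amount_cents: int) -> str:
    # activity_execution_id is stable across retries, so every attempt of
    # this activity maps to the same external request.
    return gateway.charge(activity_execution_id, amount_cents)


gateway = PaymentGateway()
first = charge_activity(gateway, "exec-42", 1999)   # attempt 1
second = charge_activity(gateway, "exec-42", 1999)  # attempt 2 after redelivery
```

Using `activity_attempt_id` as the key here would create one charge per attempt, which is why the execution id is the safer default.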
156+
157+
## Retry semantics
158+
159+
The v2 engine recognises three distinct retry surfaces. Each has its
160+
own identifier and its own durable row.
161+
162+
### Activity attempt retry
163+
164+
- Governed by the activity's retry policy (`retry_policy` on the
165+
execution, with defaults from host configuration).
166+
- On failure, the engine emits `ActivityFailed` for the current
167+
attempt, then — if retries remain — `ActivityRetryScheduled` with
168+
the new `retry_task_id`, `retry_of_task_id`, and backoff. The next
169+
attempt runs as a fresh attempt with a new `activity_attempt_id`.
170+
- Non-retryable failures (`non_retryable=true` on `ActivityFailed`,
171+
structural-limit failures, or policy exhaustion) terminate the
172+
execution without scheduling a retry.
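
The failure path above can be sketched as a pure function that returns the events a failed attempt would produce. The exponential backoff, the `max_attempts` parameter, and the payload field names are assumptions for illustration; the real retry policy shape is not specified here.

```python
def on_attempt_failed(attempt_number: int, max_attempts: int,
                      non_retryable: bool, base_backoff_s: float = 1.0) -> list:
    """Return (event_type, payload) pairs for one failed attempt in this model."""
    events = [("ActivityFailed", {"attempt_number": attempt_number,
                                  "non_retryable": non_retryable})]
    if not non_retryable and attempt_number < max_attempts:
        # Exponential backoff is an assumption, not the documented policy.
        backoff = base_backoff_s * 2 ** (attempt_number - 1)
        events.append(("ActivityRetryScheduled",
                       {"backoff_seconds": backoff,
                        "next_attempt_number": attempt_number + 1}))
    return events


retried = on_attempt_failed(attempt_number=1, max_attempts=3, non_retryable=False)
exhausted = on_attempt_failed(attempt_number=3, max_attempts=3, non_retryable=False)
fatal = on_attempt_failed(attempt_number=1, max_attempts=3, non_retryable=True)
```

Only `retried` contains an `ActivityRetryScheduled` entry; policy exhaustion and non-retryable failures end with `ActivityFailed` alone.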
173+
174+
### Workflow-task retry and repair
175+
176+
- Workflow tasks themselves are not retried against application
177+
logic; they are replayed. If the worker that holds the task lease
178+
crashes or loses the lease, `TaskRepair` redispatches the same
179+
task to another worker. Each repair increments `repair_count` on
180+
the task and is surfaced through operator tooling, not through
181+
history, because no application-visible state has changed.
182+
- A workflow-task-level failure that wants to surface as application
183+
behaviour writes typed `WorkflowFailed` / `WorkflowCancelled` /
184+
`WorkflowTerminated` history through the normal decision path. The
185+
engine does not silently retry a workflow decision against the
186+
application code.
187+
188+
### Child workflow retry
189+
190+
- Child workflows follow the child retry policy and produce
191+
`ChildRunStarted` events with `retry_attempt` and
192+
`retry_of_child_workflow_run_id` set when a child is a retry.
193+
- Retries of the child workflow share the same
194+
`child_workflow_instance_id`, `workflow_link_id`, and
195+
`child_call_id` as the originally scheduled child; each retried run
196+
gets a new `child_workflow_run_id`.
197+
198+
## Lease expiry and redelivery
199+
200+
- Every claimed task (workflow task, activity task) carries a
201+
`lease_owner`, `lease_expires_at`, and, for activities, an
202+
`activity_attempt_id`. Once a lease expires, the task is eligible
203+
for redelivery.
204+
- Redelivery is an at-least-once event. The replacement claim may land
205+
on a different worker, may land after the original worker has
206+
already reported completion, and may land before the original worker
207+
has finished. The engine mediates race conditions through:
208+
- typed outcome recording guarded by attempt/sequence checks
209+
(`ActivityOutcomeRecorder` / `TaskRepair`),
210+
- command normalisation that idempotently rejects a second decision
211+
batch for an already-settled `workflow_command_id`, and
212+
- the `MessageCursorAdvanced` monotonic cursor for message
213+
consumption.
214+
- Operators observing two `ActivityStarted` events for the same
215+
`activity_execution_id` with different `activity_attempt_id` values
216+
should treat that as a normal redelivery, not a bug, as long as the
217+
engine eventually records exactly one terminal outcome for each
218+
`activity_attempt_id`.
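
A simplified model of lease-based claiming shows why redelivery overlaps with the original worker. The `Task` shape, the `claim` helper, and the 30-second lease are hypothetical; only the `lease_owner` and `lease_expires_at` field names come from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    task_id: str
    lease_owner: Optional[str] = None
    lease_expires_at: float = 0.0


def claim(task: Task, worker: str, now: float, lease_seconds: float = 30.0) -> bool:
    # A task is claimable when it is unleased or its lease has expired.
    if task.lease_owner is not None and now < task.lease_expires_at:
        return False
    task.lease_owner = worker
    task.lease_expires_at = now + lease_seconds
    return True


task = Task("task-1")
claimed_a = claim(task, "worker-a", now=0.0)        # first claim succeeds
claimed_b_early = claim(task, "worker-b", now=10.0)  # lease still held
claimed_b_late = claim(task, "worker-b", now=31.0)   # redelivery after expiry
```

Nothing in this model stops `worker-a` from still running after `worker-b` claims the task, which is exactly the race the typed outcome recording is there to settle.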
219+
220+
## External commands and duplicate-start policy
221+
222+
- Start, signal, update, cancel, terminate, archive, and repair
223+
commands are recorded with a `workflow_command_id` that is the
224+
durable dedupe key for that command. External clients that can
225+
retry their HTTP request should send the same `workflow_command_id`;
226+
the engine rejects a second decision batch for the same id and
227+
preserves the original outcome.
228+
- Start commands additionally honour the duplicate-start policy named
229+
on `DuplicateStartPolicy`:
230+
- `reject_duplicate` — the second start with the same
231+
`workflow_instance_id` returns `CommandOutcome::RejectedDuplicate`
232+
and does not begin a second run.
233+
- `return_existing_active` — the second start returns the existing
234+
active run's identity instead of beginning a new run.
235+
- The public contract on `Workflow\V2\Contracts\WorkflowControlPlane`
236+
names these values explicitly; callers must pick one deliberately.
237+
- Signals and updates are idempotent at the durable state layer by
238+
`signal_id` / `update_id`. A `workflow_command_id` that already
239+
matches an applied signal/update is accepted as a no-op.
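
The two duplicate-start behaviours can be compared side by side in a small model. `ControlPlane` and the outcome strings are hypothetical stand-ins; they do not reproduce the real `WorkflowControlPlane` contract or the `CommandOutcome` enum.

```python
from enum import Enum
from typing import Optional, Tuple


class DuplicateStartPolicy(Enum):
    REJECT_DUPLICATE = "reject_duplicate"
    RETURN_EXISTING_ACTIVE = "return_existing_active"


class ControlPlane:
    """Hypothetical model of start handling keyed by workflow_instance_id."""

    def __init__(self) -> None:
        self.active_runs = {}  # workflow_instance_id -> workflow_run_id

    def start(self, instance_id: str,
              policy: DuplicateStartPolicy) -> Tuple[str, Optional[str]]:
        existing = self.active_runs.get(instance_id)
        if existing is None:
            run_id = f"run-{len(self.active_runs) + 1}"
            self.active_runs[instance_id] = run_id
            return ("started", run_id)
        if policy is DuplicateStartPolicy.REJECT_DUPLICATE:
            return ("rejected_duplicate", None)  # no second run begins
        return ("returned_existing_active", existing)


plane = ControlPlane()
first = plane.start("order-1", DuplicateStartPolicy.REJECT_DUPLICATE)
rejected = plane.start("order-1", DuplicateStartPolicy.REJECT_DUPLICATE)
existing = plane.start("order-1", DuplicateStartPolicy.RETURN_EXISTING_ACTIVE)
```

Under either policy the instance keeps a single active run; the policies differ only in what the second caller is told.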
240+
241+
## Durable message stream semantics
242+
243+
- `MessageService::sendMessage()` creates paired outbound/inbound rows
244+
under one reserved instance sequence. An outbound row is durable and
245+
the matching inbound row carries the same logical identity across
246+
runs (including across continue-as-new; see
247+
`MessageService::transferMessagesToContinuedRun()`).
248+
- `peekMessages()` and `receiveMessages()` are non-mutating reads.
249+
Only `consumeMessage()`/`consumeMessages()` advance the durable
250+
cursor, and each `MessageCursorAdvanced` event names exactly one
251+
`stream_key`. Cursor advance is monotonic; a consumed message cannot
252+
be un-consumed.
253+
- External senders that may retry the same logical message should
254+
populate the optional `idempotencyKey` on `sendReference()` so the
255+
durable message stream can recognise and drop duplicates at the
256+
ingress layer.
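
The read, consume, and ingress-dedupe rules above can be modelled with one stream and one cursor. `MessageStream` and its method names are hypothetical simplifications and do not mirror the `MessageService` signatures.

```python
from typing import Optional


class MessageStream:
    """Hypothetical single-stream model: append-only log plus a consume cursor."""

    def __init__(self) -> None:
        self.messages = []
        self.seen_keys = set()
        self.cursor = 0  # index of the next unconsumed message

    def send(self, payload: str, idempotency_key: Optional[str] = None) -> bool:
        if idempotency_key is not None:
            if idempotency_key in self.seen_keys:
                return False  # duplicate dropped at ingress
            self.seen_keys.add(idempotency_key)
        self.messages.append(payload)
        return True

    def peek(self) -> list:
        return self.messages[self.cursor:]  # non-mutating read

    def consume(self) -> Optional[str]:
        if self.cursor >= len(self.messages):
            return None
        message = self.messages[self.cursor]
        self.cursor += 1  # monotonic: the cursor only moves forward
        return message


stream = MessageStream()
accepted = stream.send("approve", idempotency_key="req-1")
duplicate = stream.send("approve", idempotency_key="req-1")  # sender retry
before = stream.peek()
consumed = stream.consume()
after = stream.peek()
```

Peeking any number of times leaves the cursor untouched; only `consume` moves it, and it never moves back.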
257+
258+
## Side effects and version markers
259+
260+
- `sideEffect(...)` records the provided value into history as
261+
`SideEffectRecorded` exactly once per call site per sequence. Replay
262+
reads the recorded value; it does not re-invoke the side-effect
263+
callable. Authors must treat `sideEffect` as the one-shot durable
264+
snapshot primitive for non-deterministic values.
265+
- `uuid4`/`uuid7` are one-shot value-recording operations. Calling
266+
them produces a fresh id on first invocation and replays the same id
267+
afterwards.
268+
- `record_version_marker` is a frozen two-phase primitive. See
269+
`docs/api-stability.md` for its wire format and the PHP/Python
270+
parity contract. Adding, renaming, removing, or retyping a field is
271+
a protocol break, not a minor change.
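
Record-then-replay for side effects can be illustrated with a toy context. `WorkflowContext` and `side_effect` are hypothetical names; the list of recorded values stands in for `SideEffectRecorded` rows ordered by sequence.

```python
import uuid


class WorkflowContext:
    """Hypothetical replay context: records on first run, replays afterwards."""

    def __init__(self, history=None) -> None:
        self.history = list(history or [])  # recorded values, indexed by sequence
        self.sequence = 0
        self.invocations = 0  # how many times a callable actually ran

    def side_effect(self, fn):
        if self.sequence < len(self.history):
            value = self.history[self.sequence]  # replay: fn is not called
        else:
            value = fn()                          # first execution: call once
            self.invocations += 1
            self.history.append(value)            # stands in for SideEffectRecorded
        self.sequence += 1
        return value


def workflow_body(ctx: WorkflowContext) -> str:
    return ctx.side_effect(lambda: str(uuid.uuid4()))


first_run = WorkflowContext()
original = workflow_body(first_run)

replay = WorkflowContext(history=first_run.history)
replayed = workflow_body(replay)
```

The replayed run returns the recorded value without invoking the callable, which is what keeps a non-deterministic value stable across replays.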
272+
273+
## Schedule triggers
274+
275+
- `ScheduleTriggered` records an attempted trigger for an individual
276+
schedule occurrence with an `outcome` and
277+
`effective_overlap_policy`. Trigger records are exactly-once at the
278+
durable state layer for a given `schedule_id` and
279+
`occurrence_time`.
280+
- Overlap policy decides whether a triggered occurrence is skipped,
281+
queued, or allowed to run concurrently. The application author
282+
selects the policy deliberately — it is a declared choice, not a
283+
best-effort behaviour.
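
A rough sketch of per-occurrence trigger recording combined with an overlap decision. The policy names (`skip`, `queue`, `allow_concurrent`) and outcome strings are placeholders chosen for the example; the real values of `effective_overlap_policy` and `outcome` are not listed in this document.

```python
def trigger(log: dict, schedule_id: str, occurrence_time: str,
            run_active: bool, overlap_policy: str) -> str:
    """Record at most one outcome per (schedule_id, occurrence_time)."""
    key = (schedule_id, occurrence_time)
    if key in log:
        return log[key]  # already recorded: a repeated trigger changes nothing
    if not run_active or overlap_policy == "allow_concurrent":
        outcome = "started"
    elif overlap_policy == "queue":
        outcome = "queued"
    else:
        outcome = "skipped"
    log[key] = outcome
    return outcome


log = {}
first = trigger(log, "nightly", "2024-01-01T00:00:00Z", run_active=True,
                overlap_policy="skip")
repeat = trigger(log, "nightly", "2024-01-01T00:00:00Z", run_active=False,
                 overlap_policy="skip")
next_day = trigger(log, "nightly", "2024-01-02T00:00:00Z", run_active=True,
                   overlap_policy="queue")
```

The repeated trigger for the first occurrence returns the recorded outcome even though conditions changed, because the occurrence was already settled.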
284+
285+
## Framework-provided idempotency surfaces
286+
287+
The framework exposes the following stable idempotency keys to
288+
application code and external workers:
289+
290+
| surface | identifier | lifetime |
| --- | --- | --- |
| workflow instance | `workflow_instance_id` | stable across continue-as-new |
| workflow run | `workflow_run_id` | bound to one execution generation |
| activity execution | `activity_execution_id` | stable across activity retries |
| activity attempt | `activity_attempt_id` | one attempt only |
| external command | `workflow_command_id` | one command only |
| message stream | `stream_key` + monotonic cursor position | durable |
| message send | caller-provided `idempotencyKey` on `sendReference()` | caller-controlled |
| side effect | `sequence` on `SideEffectRecorded` | frozen per call site |
| version marker | `change_id` on `VersionMarkerRecorded` | frozen per change id |
| schedule trigger | `schedule_id` + `occurrence_time` | durable per occurrence |
302+
303+
Application authors should prefer `activity_execution_id` as the
304+
idempotency key against external services that accept one, because it
305+
is stable across retries. Use `activity_attempt_id` only when the
306+
external system must distinguish distinct tries of the same durable
307+
activity.
308+
309+
## What developers must make idempotent
310+
311+
- **Activity bodies** that write to a database, call an external API,
312+
publish a message, or mutate any state outside the workflow engine
313+
must be safe to repeat. Prefer conditional writes, upserts keyed by
314+
`activity_execution_id`, or external idempotency keys.
315+
- **Workflow bodies** must be deterministic under replay but are
316+
otherwise effect-free; the workflow body does not itself need to be
317+
idempotent because the engine does not re-invoke side effects for
318+
it.
319+
- **External command senders** that retry must send the same
320+
`workflow_command_id` across retries so the engine can recognise and
321+
dedupe.
322+
- **Durable-message senders** that may emit the same logical message
323+
twice must populate `idempotencyKey` on `sendReference()`.
324+
- **Signal and update handlers** should treat a re-delivery of the
325+
same command id as a no-op.
326+
- **Query handlers** perform a non-durable read at query time. They must
327+
not produce external side effects; they are pure reads against the
328+
currently-resolved run state.
329+
- **Compensation handlers** registered via `addCompensation()` run as
330+
normal activities on failure/cancel and inherit the same activity
331+
at-least-once contract.
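
For external command senders, the key point is that the command id is generated once and reused on every retry. The sketch below assumes a hypothetical `Engine` stub that loses its first response; neither the stub nor `send_signal_with_retry` reflects the real client API.

```python
import uuid


class Engine:
    """Hypothetical engine stub that dedupes on workflow_command_id."""

    def __init__(self) -> None:
        self.outcomes = {}           # workflow_command_id -> outcome
        self.signals_applied = 0
        self.lose_next_response = True

    def signal(self, workflow_command_id: str, name: str) -> str:
        if workflow_command_id not in self.outcomes:
            self.signals_applied += 1
            self.outcomes[workflow_command_id] = "accepted"
        if self.lose_next_response:
            # Simulate a response lost after the command was applied.
            self.lose_next_response = False
            raise TimeoutError("response lost")
        return self.outcomes[workflow_command_id]


def send_signal_with_retry(engine: Engine, name: str, attempts: int = 3) -> str:
    command_id = str(uuid.uuid4())  # generated once, reused on every retry
    for _ in range(attempts):
        try:
            return engine.signal(command_id, name)
        except TimeoutError:
            continue
    raise RuntimeError("signal not acknowledged")


engine = Engine()
result = send_signal_with_retry(engine, "approve")
```

Generating a fresh id inside the loop would defeat the dedupe and apply the signal once per retry.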
332+
333+
## Operator and diagnostic guidance
334+
335+
- Duplicate execution of an activity attempt or workflow task is a
336+
normal distributed-system event, not a bug condition. Product docs,
337+
CLI reasoning, and Waterline incident messaging should describe it
338+
as an expected outcome of retries, lease expiry, and redelivery, and
339+
steer the reader toward the appropriate idempotency surface.
340+
- A single `activity_execution_id` with multiple
341+
`activity_attempt_id` rows is by design. A single
342+
`activity_attempt_id` with multiple terminal events would be a
343+
bug — enforce that with the typed outcome recorders.
344+
- Repair requests (`RepairRequested`) and repair redispatches are
345+
engine-level recovery steps. They should not read as application
346+
failures in operator UIs; they are the mechanism that keeps the
347+
engine live when transport or workers fall behind.
348+
- Waterline selected-run detail rebuilds attempt status from typed
349+
history first (`ActivityStarted` / `ActivityHeartbeatRecorded` /
350+
`ActivityRetryScheduled` / `ActivityCompleted` / `ActivityFailed` /
351+
`ActivityCancelled`), with mutable attempt rows kept as
352+
enrichment. That layering is deliberate and must be preserved when
353+
adding new diagnostic surfaces.
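
The "many attempts per execution is fine, many terminal events per attempt is a bug" rule lends itself to a simple diagnostic check. This sketch assumes history is available as `(event_type, activity_execution_id, activity_attempt_id)` tuples and that the four event types in `TERMINAL` are the terminal ones; both are assumptions made for the example.

```python
from collections import Counter

# Assumed set of terminal attempt-scoped event types.
TERMINAL = {"ActivityCompleted", "ActivityFailed",
            "ActivityCancelled", "ActivityTimedOut"}


def attempts_with_duplicate_terminals(history) -> list:
    """Return attempt ids that carry more than one terminal event."""
    counts = Counter(
        attempt_id
        for event_type, _execution_id, attempt_id in history
        if event_type in TERMINAL
    )
    return sorted(a for a, n in counts.items() if n > 1)


healthy = [
    ("ActivityStarted", "exec-1", "attempt-1"),
    ("ActivityFailed", "exec-1", "attempt-1"),
    ("ActivityStarted", "exec-1", "attempt-2"),   # normal retry or redelivery
    ("ActivityCompleted", "exec-1", "attempt-2"),
]
broken = healthy + [("ActivityCompleted", "exec-1", "attempt-2")]

healthy_report = attempts_with_duplicate_terminals(healthy)
broken_report = attempts_with_duplicate_terminals(broken)
```

An empty report for `healthy` reflects that two attempts under one execution are by design; only the doubled terminal event in `broken` is flagged.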
354+
355+
## Test strategy alignment
356+
357+
- Replay correctness is covered by the PHP `WorkflowReplayer` tests
358+
and the cross-SDK fixtures referenced from `docs/api-stability.md`.
359+
- Activity at-least-once and outcome-exactly-once behaviour is
360+
covered by the `ActivityOutcomeRecorder` tests and the recorded
361+
reason codes (`recorded=false` with a reason is the normal
362+
redelivery path, not a test failure).
363+
- Command dedupe behaviour is covered by normalisation tests on
364+
`WorkflowCommandNormalizer` and the `DuplicateStartPolicy` enum
365+
cases.
366+
- Message cursor monotonicity is covered by the
367+
`MessageCursorAdvanced` sequencing tests.
368+
- This document is pinned by
369+
`tests/Unit/V2/ExecutionGuaranteesDocumentationTest.php`. A future
370+
change that renames, removes, or narrows any named guarantee must
371+
update the test and this document in the same change so the
372+
contract does not drift silently.
373+
374+
## Changing this contract
375+
376+
A change to any named guarantee (at-least-once, replay, durable
377+
exactly-once, dedupe key surface, retry identity) is a protocol-level
378+
change for the purposes of `docs/api-stability.md` and downstream
379+
SDKs. Reviewers should treat unmotivated changes to the language above
380+
as breaking changes and require explicit cross-SDK coordination
381+
before merge.
