| 1 | +# Workflow V2 Execution Guarantees and Idempotency Contract
| 2 | + |
| 3 | +This document freezes the v2 contract for what can execute more than once, |
| 4 | +what is replayed deterministically, what the durable state layer observes |
| 5 | +exactly once, and what application authors are required to make |
| 6 | +idempotent. It is the reference cited by the v2 docs, CLI reasoning, |
| 7 | +Waterline diagnostics, and test coverage so the whole fleet speaks one |
| 8 | +language about duplicate execution, retries, lease expiry, and |
| 9 | +redelivery. |
| 10 | + |
| 11 | +The guarantees below apply to the `durable-workflow/workflow` package at |
| 12 | +v2 and to every host that embeds it or talks to it over the worker |
| 13 | +protocol. A change to any named guarantee is a protocol-level change and |
| 14 | +must be reviewed as such, even if the class that implements it is |
| 15 | +`@internal`. |
| 16 | + |
| 17 | +## Scope |
| 18 | + |
| 19 | +The contract covers: |
| 20 | + |
| 21 | +- **workflow tasks** — units of replay-driven workflow execution claimed |
| 22 | + by a workflow worker and acknowledged via a decision batch. |
| 23 | +- **activity attempts** — external side-effecting work claimed by an |
| 24 | + activity worker and acknowledged via `complete` / `fail` / `heartbeat`. |
| 25 | +- **external commands** — start, signal, update, cancel, terminate, |
| 26 | + archive, query, and schedule commands received from outside the engine |
| 27 | + over the CLI, server HTTP surface, workflow client, or cloud API. |
| 28 | +- **durable messages** — the inbox/outbox primitives backing signals, |
| 29 | + updates, workflow-to-workflow messages, and human-input flows. |
| 30 | +- **side effects** — `sideEffect(...)`, `uuid4`, `uuid7`, and |
| 31 | + `record_version_marker` calls recorded inline in history. |
| 32 | +- **schedules** — the ScheduleTriggered events that launch scheduled |
| 33 | + workflow runs. |
| 34 | + |
| 35 | +It does not cover in-process Laravel queue semantics for the host |
| 36 | +application outside v2, application database transactions that the host |
| 37 | +runs independently of a workflow task, or the behaviour of third-party |
| 38 | +services that an activity calls. |
| 39 | + |
| 40 | +## Terminology |
| 41 | + |
| 42 | +- **At-least-once** means the framework may observe the same logical |
| 43 | + work more than once and the application author must tolerate it. |
| 44 | +- **At-most-once** is not promised for any side-effecting operation. |
| 45 | + Activities, signals, updates, starts, and schedule triggers are never |
| 46 | + at-most-once. |
| 47 | +- **Deterministic replay** means the engine re-runs the same authoring |
| 48 | + code to rebuild workflow state from history, producing the same |
| 49 | + decisions in the same order. Replay is not application re-execution |
| 50 | + and does not re-invoke external side effects. |
| 51 | +- **Exactly-once at the durable state layer** means that a specific |
| 52 | + typed history row for a given identifier is written at most once and |
| 53 | + is the authoritative record consumed by projections, replayers, |
| 54 | + exporters, and operator tools. |
| 55 | +- **Redelivery** means the transport requeued a task after lease expiry, |
| 56 | + worker crash, or repair, and a different worker may now claim the same |
| 57 | + task id or a newly repaired task for the same logical work. |
| 58 | +- **Replay** means the engine re-reads history events and re-invokes |
| 59 | + workflow authoring code to reconstruct the workflow state machine. It |
| 60 | + does not re-invoke activity code or re-emit external side effects. |
| 61 | + |
| 62 | +## Workflow task execution semantics |
| 63 | + |
| 64 | +Workflow tasks are event-sourced decisions produced by the workflow |
| 65 | +authoring layer and persisted as typed history events through the |
| 66 | +`WorkflowExecutor`. |
| 67 | + |
| 68 | +Guarantees: |
| 69 | + |
| 70 | +- A workflow task may be claimed, rejected, leased, expired, redelivered, |
| 71 | + repaired, and retried. Any individual workflow task id may be observed |
| 72 | + more than once by a worker. |
| 73 | +- The workflow authoring code inside a task is replay-driven. It may run |
| 74 | + many times across a run's lifetime. Authoring code must be |
| 75 | + deterministic with respect to history, must not depend on wall-clock |
| 76 | + time, must not perform IO, and must not mutate external state. See |
| 77 | + `constraints/workflow-constraints.md` for the full prohibited list. |
| 78 | +- A decision batch commits atomically per task. The typed history |
| 79 | + events named in `docs/api-stability.md` are the durable record of that |
| 80 | + decision. When the commit is observed, those history rows are |
| 81 | + exactly-once at the durable state layer for the decision ids they |
| 82 | + carry (`workflow_command_id`, `activity_execution_id`, |
| 83 | + `activity_attempt_id`, `timer_id`, `signal_id`, `update_id`, |
| 84 | + `condition_wait_id`, child `workflow_link_id`/`child_call_id`, etc.). |
| 85 | +- If a decision batch fails to commit, the workflow task is eligible |
| 86 | + for redelivery. A later successful commit produces exactly one typed |
| 87 | + history row per decision id; the duplicated attempt leaves no |
| 88 | + observable effect in history. |
| 89 | +- `RepairRequested` / repair redispatch is not a new task — it is the |
| 90 | + engine re-enqueueing work that a previous worker could not durably |
| 91 | + commit. Repair does not duplicate history. It routes to the same |
| 92 | + decision set and is covered by the same exactly-once-at-commit |
| 93 | + guarantee. |
| 94 | + |
| 95 | +What application authors must assume: |
| 96 | + |
| 97 | +- Anything the workflow body does outside durable primitives can run |
| 98 | + more than once and must be safe to re-run. Use `sideEffect(...)`, |
| 99 | + `uuid4`/`uuid7`, activity results, updates, queries, or search |
| 100 | + attributes/memos to cross the durable boundary. |
| 101 | +- Wall-clock reads, live cache reads, and mutable-shared-state reads |
| 102 | + inside the workflow body are prohibited for correctness, not style. |
| 103 | + |
| 104 | +## Activity attempt execution semantics |
| 105 | + |
| 106 | +Activities are the first-class vehicle for external side effects. They |
| 107 | +are explicitly at-least-once. |
| 108 | + |
| 109 | +Guarantees: |
| 110 | + |
| 111 | +- Each activity execution has a stable `activity_execution_id` that |
| 112 | + does not change across retries. It is the default idempotency surface |
| 113 | + exposed to the worker on claim (see |
| 114 | + `Workflow\V2\Contracts\ActivityTaskBridge::claim()` and the |
| 115 | + `idempotency_key` field it returns, which is set to the execution id). |
| 116 | +- Each activity attempt has a distinct `activity_attempt_id` and a |
| 117 | + sequential `attempt_number` starting at 1. Attempt ids are durable |
| 118 | + and are the correlation key for heartbeats, cancellation, and the |
| 119 | + typed attempt-scoped history events (`ActivityStarted`, |
| 120 | + `ActivityHeartbeatRecorded`, `ActivityRetryScheduled`, |
| 121 | + `ActivityCompleted`, `ActivityFailed`, `ActivityCancelled`, |
| 122 | + `ActivityTimedOut`). |
| 123 | +- An activity attempt can complete more than once from the worker's |
| 124 | + point of view: a worker may finish work, lose its lease to a |
| 125 | + redelivery after a heartbeat gap, and still attempt to report |
| 126 | + completion. `ActivityOutcomeRecorder` records at most one terminal |
| 127 | + typed attempt event per `activity_attempt_id`, and reports |
| 128 | + `recorded=false` with a reason code for the late caller. The caller |
| 129 | + MUST NOT treat `recorded=false` as failure — another worker has |
| 130 | + already recorded the outcome or the attempt was superseded. |
| 131 | +- The typed attempt-scoped history row for a given |
| 132 | + `activity_attempt_id` is exactly-once at the durable state layer. |
| 133 | + Retry scheduling emits a separate `ActivityRetryScheduled` for the |
| 134 | + new attempt and the next attempt carries the next |
| 135 | + `activity_attempt_id`. |
| 136 | +- Heartbeats mirror the latest progress onto the live activity |
| 137 | + execution and renew the task lease. They are not retry checkpoints; |
| 138 | + a heartbeat never splits one attempt into two. |
| 139 | +- Cancellation is observed by the activity via cooperative checks and |
| 140 | + does not guarantee termination of in-flight external work. Cancelled |
| 141 | + attempts may still produce external side effects up to the moment |
| 142 | + the worker honours the cancel. |
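The late-completion rule above can be illustrated with a small in-memory sketch. `InMemoryOutcomeRecorder`, its `record()` method, and the `already_recorded` reason code are hypothetical names chosen for illustration; they are not the package's `ActivityOutcomeRecorder` API, whose real signature and reason codes are not shown in this document.

```php
<?php
declare(strict_types=1);

// Hypothetical in-memory stand-in; NOT the package's ActivityOutcomeRecorder.
final class InMemoryOutcomeRecorder
{
    /** @var array<string, string> terminal event type keyed by activity_attempt_id */
    private array $terminal = [];

    /**
     * Record a terminal outcome for an attempt at most once.
     *
     * @return array{recorded: bool, reason: ?string}
     */
    public function record(string $activityAttemptId, string $eventType): array
    {
        if (isset($this->terminal[$activityAttemptId])) {
            // A terminal event already exists: report that, never overwrite it.
            return ['recorded' => false, 'reason' => 'already_recorded']; // assumed reason code
        }
        $this->terminal[$activityAttemptId] = $eventType;
        return ['recorded' => true, 'reason' => null];
    }

    public function terminalEventFor(string $activityAttemptId): ?string
    {
        return $this->terminal[$activityAttemptId] ?? null;
    }
}

$recorder = new InMemoryOutcomeRecorder();
$first = $recorder->record('attempt-1', 'ActivityCompleted');
// A worker that lost its lease reports late for the same attempt.
$late = $recorder->record('attempt-1', 'ActivityFailed');
```

In this sketch the late caller receives `recorded=false` and the first terminal event stays authoritative, which is why the late caller should stop quietly and not escalate.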
| 143 | + |
| 144 | +What application authors must assume: |
| 145 | + |
| 146 | +- The same activity execution may be observed more than once. Either |
| 147 | + the activity body or the external service it calls must be safe to |
| 148 | + repeat. The framework's default idempotency-key surface is |
| 149 | + `activity_execution_id` (same across retries) for remote services |
| 150 | + that accept an idempotency key per logical request; use |
| 151 | + `activity_attempt_id` only for systems that need to distinguish |
| 152 | + separate tries of the same logical activity. |
| 153 | +- Database writes that must be exactly-once should be wrapped with a |
| 154 | + dedupe key (typically `activity_execution_id`) or placed inside a |
| 155 | + transaction that is idempotent under retry. |
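As a minimal sketch of the dedupe-key advice above, the following uses a PHP array in place of a database table. `PaymentLedger`, `chargeOnce()`, and `chargeCustomer()` are hypothetical names, and the `$claim` array only mimics the field names used in this document; a real host would more likely rely on a unique constraint on the dedupe column.

```php
<?php
declare(strict_types=1);

// Hypothetical ledger; a real system would use a table with a unique
// constraint on the dedupe key, not a PHP array.
final class PaymentLedger
{
    /** @var array<string, int> amount in cents keyed by dedupe key */
    private array $rows = [];

    /** Write the row only if the dedupe key is unseen; true means a row was written. */
    public function chargeOnce(string $dedupeKey, int $amountCents): bool
    {
        if (array_key_exists($dedupeKey, $this->rows)) {
            return false; // repeat of the same logical activity: no second charge
        }
        $this->rows[$dedupeKey] = $amountCents;
        return true;
    }

    public function total(): int
    {
        return array_sum($this->rows);
    }
}

// Hypothetical activity body keyed by the execution id, not the attempt id.
function chargeCustomer(PaymentLedger $ledger, array $claim, int $amountCents): bool
{
    // activity_execution_id is stable across retries, so every attempt shares it.
    return $ledger->chargeOnce($claim['activity_execution_id'], $amountCents);
}

$ledger = new PaymentLedger();
$attempt1 = ['activity_execution_id' => 'exec-42', 'activity_attempt_id' => 'att-1', 'attempt_number' => 1];
$attempt2 = ['activity_execution_id' => 'exec-42', 'activity_attempt_id' => 'att-2', 'attempt_number' => 2];

$wroteFirst = chargeCustomer($ledger, $attempt1, 1500);
$wroteRetry = chargeCustomer($ledger, $attempt2, 1500); // retry of the same execution
```

Keying on `activity_attempt_id` here would charge twice, because each retry carries a new attempt id.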
| 156 | + |
| 157 | +## Retry semantics |
| 158 | + |
| 159 | +The v2 engine recognises three distinct retry surfaces. Each has its |
| 160 | +own identifier and its own durable row. |
| 161 | + |
| 162 | +### Activity attempt retry |
| 163 | + |
| 164 | +- Governed by the activity's retry policy (`retry_policy` on the |
| 165 | + execution, with defaults from host configuration). |
| 166 | +- On failure, the engine emits `ActivityFailed` for the current |
| 167 | + attempt, then — if retries remain — `ActivityRetryScheduled` with |
| 168 | + the new `retry_task_id`, `retry_of_task_id`, and backoff. The next |
| 169 | + attempt runs as a fresh attempt with a new `activity_attempt_id`. |
| 170 | +- Non-retryable failures (`non_retryable=true` on `ActivityFailed`, |
| 171 | + structural-limit failures, or policy exhaustion) terminate the |
| 172 | + execution without scheduling a retry. |
| 173 | + |
| 174 | +### Workflow-task retry and repair |
| 175 | + |
| 176 | +- Workflow tasks themselves are not retried against application |
| 177 | + logic; they are replayed. If the worker that holds the task lease |
| 178 | + crashes or loses the lease, `TaskRepair` redispatches the same |
| 179 | + task to another worker. Each repair increments `repair_count` on |
| 180 | + the task and is surfaced through operator tooling, not through |
| 181 | + history, because no application-visible state has changed. |
| 182 | +- A workflow-task-level failure that wants to surface as application |
| 183 | + behaviour writes typed `WorkflowFailed` / `WorkflowCancelled` / |
| 184 | + `WorkflowTerminated` history through the normal decision path. The |
| 185 | + engine does not silently retry a workflow decision against the |
| 186 | + application code. |
| 187 | + |
| 188 | +### Child workflow retry |
| 189 | + |
| 190 | +- Child workflows follow the child retry policy and produce |
| 191 | + `ChildRunStarted` events with `retry_attempt` and |
| 192 | + `retry_of_child_workflow_run_id` set when a child is a retry. |
| 193 | +- Retries of the child workflow share the same |
| 194 | + `child_workflow_instance_id`, `workflow_link_id`, and |
| 195 | + `child_call_id` as the originally scheduled child; each retried run |
| 196 | + gets a new `child_workflow_run_id`. |
| 197 | + |
| 198 | +## Lease expiry and redelivery |
| 199 | + |
| 200 | +- Every claimed task (workflow task, activity task) carries a |
| 201 | + `lease_owner`, `lease_expires_at`, and, for activities, an |
| 202 | + `activity_attempt_id`. Once a lease expires, the task is eligible |
| 203 | + for redelivery. |
| 204 | +- Redelivery is an at-least-once event. The replacement claim may land |
| 205 | + on a different worker, may land after the original worker has |
| 206 | + already reported completion, and may land before the original worker |
| 207 | + has finished. The engine mediates race conditions through: |
| 208 | + - typed outcome recording guarded by attempt/sequence checks |
| 209 | + (`ActivityOutcomeRecorder` / `TaskRepair`), |
| 210 | + - command normalisation that idempotently rejects a second decision |
| 211 | + batch for an already-settled `workflow_command_id`, and |
| 212 | + - the `MessageCursorAdvanced` monotonic cursor for message |
| 213 | + consumption. |
| 214 | +- Operators observing two `ActivityStarted` events for the same |
| 215 | + `activity_execution_id` with different `activity_attempt_id` values |
| 216 | + should treat that as a normal redelivery, not a bug, as long as the |
| 217 | + engine eventually records exactly one terminal outcome for each |
| 218 | + `activity_attempt_id`. |
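The lease rule above can be sketched as a pure function. `tryClaim()` is a hypothetical helper, timestamps are plain Unix seconds passed in by the caller, and treating a lease as expired once `lease_expires_at` is no longer in the future is an assumption about the boundary, not a statement of the engine's behaviour.

```php
<?php
declare(strict_types=1);

/**
 * Try to claim a task. Returns the updated task, or null while another
 * worker's lease is still live. Hypothetical helper for illustration.
 */
function tryClaim(array $task, string $worker, int $now, int $leaseSeconds): ?array
{
    $held = $task['lease_owner'] !== null && $task['lease_expires_at'] > $now;
    if ($held) {
        return null;
    }
    // Lease is free or expired: the task is eligible for (re)delivery.
    $task['lease_owner'] = $worker;
    $task['lease_expires_at'] = $now + $leaseSeconds;
    return $task;
}

$task = ['id' => 'task-1', 'lease_owner' => null, 'lease_expires_at' => 0];

$task = tryClaim($task, 'worker-a', 1000, 30);        // worker-a holds the lease until t=1030
$blocked = tryClaim($task, 'worker-b', 1010, 30);     // lease still live: no redelivery yet
$redelivered = tryClaim($task, 'worker-b', 1031, 30); // lease expired: worker-b may claim
```

Nothing in this sketch stops `worker-a` from still running after `t=1030`, which is why outcome recording has to be guarded separately.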
| 219 | + |
| 220 | +## External commands and duplicate-start policy |
| 221 | + |
| 222 | +- Start, signal, update, cancel, terminate, archive, and repair |
| 223 | + commands are recorded with a `workflow_command_id` that is the |
| 224 | + durable dedupe key for that command. External clients that can |
| 225 | + retry their HTTP request should send the same `workflow_command_id`; |
| 226 | + the engine rejects a second decision batch for the same id and |
| 227 | + preserves the original outcome. |
| 228 | +- Start commands additionally honour the duplicate-start policy named |
| 229 | + on `DuplicateStartPolicy`: |
| 230 | + - `reject_duplicate` — the second start with the same |
| 231 | + `workflow_instance_id` returns `CommandOutcome::RejectedDuplicate` |
| 232 | + and does not begin a second run. |
| 233 | + - `return_existing_active` — the second start returns the existing |
| 234 | + active run's identity instead of beginning a new run. |
| 235 | + - The public contract on `Workflow\V2\Contracts\WorkflowControlPlane` |
| 236 | + names these values explicitly; callers must pick one deliberately. |
| 237 | +- Signals and updates are idempotent at the durable state layer by |
| 238 | + `signal_id` / `update_id`. A `workflow_command_id` that already |
| 239 | + matches an applied signal/update is accepted as a no-op. |
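The interaction between command dedupe and the duplicate-start policy can be sketched as follows. `InMemoryControlPlane` and the `Started` / `ReturnedExisting` outcome labels are hypothetical; only `reject_duplicate`, `return_existing_active`, and `RejectedDuplicate` are names taken from this document, and the real `WorkflowControlPlane` contract may differ in shape.

```php
<?php
declare(strict_types=1);

// Hypothetical in-memory control plane; NOT the WorkflowControlPlane contract.
final class InMemoryControlPlane
{
    /** @var array<string, array> outcome keyed by workflow_command_id */
    private array $commands = [];
    /** @var array<string, string> active run id keyed by workflow_instance_id */
    private array $activeRuns = [];
    private int $nextRun = 1;

    public function start(string $commandId, string $instanceId, string $policy): array
    {
        // A retried request with the same command id gets the original outcome back.
        if (isset($this->commands[$commandId])) {
            return $this->commands[$commandId];
        }

        if (isset($this->activeRuns[$instanceId])) {
            $outcome = match ($policy) {
                'reject_duplicate' => ['outcome' => 'RejectedDuplicate', 'run_id' => null],
                'return_existing_active' => ['outcome' => 'ReturnedExisting', 'run_id' => $this->activeRuns[$instanceId]],
            };
        } else {
            $runId = 'run-' . $this->nextRun++;
            $this->activeRuns[$instanceId] = $runId;
            $outcome = ['outcome' => 'Started', 'run_id' => $runId];
        }

        return $this->commands[$commandId] = $outcome;
    }
}

$plane = new InMemoryControlPlane();
$a = $plane->start('cmd-1', 'order-7', 'reject_duplicate');       // first start
$b = $plane->start('cmd-1', 'order-7', 'reject_duplicate');       // HTTP retry, same command id
$c = $plane->start('cmd-2', 'order-7', 'reject_duplicate');       // new command, same instance
$d = $plane->start('cmd-3', 'order-7', 'return_existing_active'); // new command, other policy
```

The sketch separates the two layers: `workflow_command_id` protects a single command against transport retries, while the policy decides what a genuinely new start command does when the instance already has an active run.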
| 240 | + |
| 241 | +## Durable message stream semantics |
| 242 | + |
| 243 | +- `MessageService::sendMessage()` creates paired outbound/inbound rows |
| 244 | + under one reserved instance sequence. An outbound row is durable and |
| 245 | + the matching inbound row carries the same logical identity across |
| 246 | + runs (including across continue-as-new; see |
| 247 | + `MessageService::transferMessagesToContinuedRun()`). |
| 248 | +- `peekMessages()` and `receiveMessages()` are non-mutating reads. |
| 249 | + Only `consumeMessage()`/`consumeMessages()` advance the durable |
| 250 | + cursor, and each `MessageCursorAdvanced` event names exactly one |
| 251 | + `stream_key`. Cursor advance is monotonic; a consumed message cannot |
| 252 | + be un-consumed. |
| 253 | +- External senders that may retry the same logical message should |
| 254 | + populate the optional `idempotencyKey` on `sendReference()` so the |
| 255 | + durable message stream can recognise and drop duplicates at the |
| 256 | + ingress layer. |
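A single-stream sketch of the cursor and ingress-dedupe rules above follows. `StreamInbox` and its `send()`, `peek()`, `consume()`, and `cursor()` methods are hypothetical and far simpler than `MessageService`; the sketch only shows that reads leave the cursor alone, that consuming moves it forward by one, and that a repeated idempotency key is dropped.

```php
<?php
declare(strict_types=1);

// Hypothetical single-stream inbox; NOT the package's MessageService.
final class StreamInbox
{
    /** @var array<int, string> payloads keyed by 1-based sequence */
    private array $messages = [];
    /** @var array<string, true> idempotency keys already accepted */
    private array $seenKeys = [];
    private int $cursor = 0; // sequence of the last consumed message

    /** Append a message unless its idempotency key was already accepted. */
    public function send(string $payload, ?string $idempotencyKey = null): bool
    {
        if ($idempotencyKey !== null) {
            if (isset($this->seenKeys[$idempotencyKey])) {
                return false; // duplicate dropped at ingress
            }
            $this->seenKeys[$idempotencyKey] = true;
        }
        $this->messages[count($this->messages) + 1] = $payload;
        return true;
    }

    /** Non-mutating read of everything after the cursor. */
    public function peek(): array
    {
        return array_values(array_filter(
            $this->messages,
            fn (int $seq): bool => $seq > $this->cursor,
            ARRAY_FILTER_USE_KEY
        ));
    }

    /** Advance the cursor by one message; it never moves backwards. */
    public function consume(): ?string
    {
        $next = $this->cursor + 1;
        if (!isset($this->messages[$next])) {
            return null;
        }
        $this->cursor = $next;
        return $this->messages[$next];
    }

    public function cursor(): int
    {
        return $this->cursor;
    }
}

$inbox = new StreamInbox();
$inbox->send('approve', 'msg-key-1');
$dup = $inbox->send('approve', 'msg-key-1'); // sender retried the same logical message
$inbox->send('ship');
$peeked = $inbox->peek();      // read only, cursor unchanged
$consumed = $inbox->consume(); // cursor moves from 0 to 1
```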
| 257 | + |
| 258 | +## Side effects and version markers |
| 259 | + |
| 260 | +- `sideEffect(...)` records the provided value into history as |
| 261 | + `SideEffectRecorded` exactly once per call site per sequence. Replay |
| 262 | + reads the recorded value; it does not re-invoke the side-effect |
| 263 | + callable. Authors must treat `sideEffect` as the one-shot durable |
| 264 | + snapshot primitive for non-deterministic values. |
| 265 | +- `uuid4`/`uuid7` are one-shot value-recording operations. Calling |
| 266 | + them produces a fresh id on first invocation and replays the same id |
| 267 | + afterwards. |
| 268 | +- `record_version_marker` is a frozen two-phase primitive. See |
| 269 | + `docs/api-stability.md` for its wire format and the PHP/Python |
| 270 | + parity contract. Adding, renaming, removing, or retyping a field is |
| 271 | + a protocol break, not a minor change. |
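The record-once, replay-afterwards behaviour described for `sideEffect(...)` can be sketched with a toy replay context. `ReplayContext` and its history layout are hypothetical and are not the package's workflow runtime; the sketch only shows that the callable runs on first execution and that replay returns the recorded value.

```php
<?php
declare(strict_types=1);

// Hypothetical replay context; NOT the package's workflow runtime.
final class ReplayContext
{
    private int $sequence = 0;

    /** @param array<int, mixed> $history recorded values keyed by sequence */
    public function __construct(private array $history = [])
    {
    }

    /** Run $fn only if nothing is recorded at this sequence; otherwise return the recorded value. */
    public function sideEffect(callable $fn): mixed
    {
        $seq = ++$this->sequence;
        if (!array_key_exists($seq, $this->history)) {
            $this->history[$seq] = $fn(); // first execution: record once
        }
        return $this->history[$seq];
    }

    public function history(): array
    {
        return $this->history;
    }
}

$calls = 0;
$workflow = function (ReplayContext $ctx) use (&$calls): array {
    $token = $ctx->sideEffect(function () use (&$calls): string {
        $calls++; // counts real invocations of the non-deterministic callable
        return bin2hex(random_bytes(4));
    });
    return ['token' => $token];
};

$firstRun = new ReplayContext();
$original = $workflow($firstRun);

// Replay: a fresh context seeded with the recorded history.
$replayed = $workflow(new ReplayContext($firstRun->history()));
```

Both runs return the same token even though `random_bytes()` is non-deterministic, because the second run never calls it.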
| 272 | + |
| 273 | +## Schedule triggers |
| 274 | + |
| 275 | +- `ScheduleTriggered` records an attempted trigger for an individual |
| 276 | + schedule occurrence with an `outcome` and |
| 277 | + `effective_overlap_policy`. Trigger records are exactly-once at the |
| 278 | + durable state layer for a given `schedule_id` and |
| 279 | + `occurrence_time`. |
| 280 | +- Overlap policy decides whether a triggered occurrence is skipped, |
| 281 | + queued, or allowed to run concurrently. The application author |
| 282 | + selects the policy deliberately — it is a declared choice, not a |
| 283 | + best-effort behaviour. |
| 284 | + |
| 285 | +## Framework-provided idempotency surfaces |
| 286 | + |
| 287 | +The framework exposes the following stable idempotency keys to |
| 288 | +application code and external workers: |
| 289 | + |
| 290 | +| surface | identifier | lifetime | |
| 291 | +| --- | --- | --- | |
| 292 | +| workflow instance | `workflow_instance_id` | stable across continue-as-new | |
| 293 | +| workflow run | `workflow_run_id` | bound to one execution generation | |
| 294 | +| activity execution | `activity_execution_id` | stable across activity retries | |
| 295 | +| activity attempt | `activity_attempt_id` | one attempt only | |
| 296 | +| external command | `workflow_command_id` | one command only | |
| 297 | +| message stream | `stream_key` + monotonic cursor position | durable | |
| 298 | +| message send | caller-provided `idempotencyKey` on `sendReference()` | caller-controlled | |
| 299 | +| side effect | `sequence` on `SideEffectRecorded` | frozen per call site | |
| 300 | +| version marker | `change_id` on `VersionMarkerRecorded` | frozen per change id | |
| 301 | +| schedule trigger | `schedule_id` + `occurrence_time` | durable per occurrence | |
| 302 | + |
| 303 | +Application authors should prefer `activity_execution_id` as the |
| 304 | +idempotency key against external services that accept one, because it |
| 305 | +is stable across retries. Use `activity_attempt_id` only when the |
| 306 | +external system must distinguish distinct tries of the same durable |
| 307 | +activity. |
| 308 | + |
| 309 | +## What developers must make idempotent |
| 310 | + |
| 311 | +- **Activity bodies** that write to a database, call an external API, |
| 312 | + publish a message, or mutate any state outside the workflow engine |
| 313 | + must be safe to repeat. Prefer conditional writes, upserts keyed by |
| 314 | + `activity_execution_id`, or external idempotency keys. |
| 315 | +- **Workflow bodies** must be deterministic under replay and
| 316 | + otherwise effect-free. A workflow body does not itself need to be
| 317 | + idempotent, because replay reads recorded results instead of
| 318 | + re-invoking side effects.
| 319 | +- **External command senders** that retry must send the same |
| 320 | + `workflow_command_id` across retries so the engine can recognise and |
| 321 | + dedupe. |
| 322 | +- **Durable-message senders** that may emit the same logical message |
| 323 | + twice must populate `idempotencyKey` on `sendReference()`. |
| 324 | +- **Signal and update handlers** should treat a re-delivery of the |
| 325 | + same command id as a no-op. |
| 326 | +- **Query handlers** see a non-durable read at query time. They must |
| 327 | + not produce external side effects; they are pure reads against the |
| 328 | + currently-resolved run state. |
| 329 | +- **Compensation handlers** registered via `addCompensation()` run as |
| 330 | + normal activities on failure/cancel and inherit the same activity |
| 331 | + at-least-once contract. |
| 332 | + |
| 333 | +## Operator and diagnostic guidance |
| 334 | + |
| 335 | +- Duplicate execution of an activity attempt or workflow task is a |
| 336 | + normal distributed-system event, not a bug condition. Product docs, |
| 337 | + CLI reasoning, and Waterline incident messaging should describe it |
| 338 | + as an expected outcome of retries, lease expiry, and redelivery, and |
| 339 | + steer the reader toward the appropriate idempotency surface. |
| 340 | +- A single `activity_execution_id` with multiple |
| 341 | + `activity_attempt_id` rows is by design. A single |
| 342 | + `activity_attempt_id` with multiple terminal events would be a |
| 343 | + bug — enforce that with the typed outcome recorders. |
| 344 | +- Repair requests (`RepairRequested`) and repair redispatches are |
| 345 | + engine-level recovery steps. They should not read as application |
| 346 | + failures in operator UIs; they are the mechanism that keeps the |
| 347 | + engine live when transport or workers fall behind. |
| 348 | +- Waterline selected-run detail rebuilds attempt status from typed |
| 349 | + history first (`ActivityStarted` / `ActivityHeartbeatRecorded` / |
| 350 | + `ActivityRetryScheduled` / `ActivityCompleted` / `ActivityFailed` / |
| 351 | + `ActivityCancelled`), with mutable attempt rows kept as |
| 352 | + enrichment. That layering is deliberate and must be preserved when |
| 353 | + adding new diagnostic surfaces. |
| 354 | + |
| 355 | +## Test strategy alignment |
| 356 | + |
| 357 | +- Replay correctness is covered by the PHP `WorkflowReplayer` tests |
| 358 | + and the cross-SDK fixtures referenced from `docs/api-stability.md`. |
| 359 | +- Activity at-least-once and outcome-exactly-once behaviour is |
| 360 | + covered by the `ActivityOutcomeRecorder` tests and the recorded |
| 361 | + reason codes (`recorded=false` with a reason is the normal |
| 362 | + redelivery path, not a test failure). |
| 363 | +- Command dedupe behaviour is covered by normalisation tests on |
| 364 | + `WorkflowCommandNormalizer` and the `DuplicateStartPolicy` enum |
| 365 | + cases. |
| 366 | +- Message cursor monotonicity is covered by the |
| 367 | + `MessageCursorAdvanced` sequencing tests. |
| 368 | +- This document is pinned by |
| 369 | + `tests/Unit/V2/ExecutionGuaranteesDocumentationTest.php`. A future |
| 370 | + change that renames, removes, or narrows any named guarantee must |
| 371 | + update the test and this document in the same change so the |
| 372 | + contract does not drift silently. |
| 373 | + |
| 374 | +## Changing this contract |
| 375 | + |
| 376 | +A change to any named guarantee (at-least-once, replay, durable |
| 377 | +exactly-once, dedupe key surface, retry identity) is a protocol-level |
| 378 | +change for the purposes of `docs/api-stability.md` and downstream |
| 379 | +SDKs. Reviewers should treat unmotivated changes to the language above |
| 380 | +as breaking changes and require explicit cross-SDK coordination |
| 381 | +before merge. |