scrape surface. Use worker SDK metrics for runtime latency and custom
application telemetry.

The [Operator Operating Envelope](./operator-operating-envelope.md) defines how
to interpret those diagnostics during rollouts and incident response. In
particular:

| Field family | Meaning |
| --- | --- |
| `operator_metrics.backlog.*` | Durable runnable, delayed, leased, unhealthy, repair-needed, claim-failed, and compatibility-blocked work counts. |
| `operator_metrics.repair.*` | Repair-loop sweep footprint, including selected candidates, candidate age, and scan pressure. |
| `operator_metrics.projections.*` | Projection-drift counts for run summaries, waits, timelines, timers, and lineage. |
| `operator_metrics.command_contracts.*` | Legacy WorkflowStarted contract snapshots that still need backfill. |
| `operator_metrics.history.*` | History-size and event-count pressure plus continue-as-new recommendations. |
| `engine_source`, `readiness_contract` | Whether Waterline is actively using the v2 operator bridge and which readiness contract governs that state. |

`GET /waterline/api/v2/health` uses the same distinction: `error` is blocking,
`warning` is advisory, and `ok` means the current v2 operator bridge is ready.

## List Views

List views are bucketed by durable status:
@@ -8462,6 +8487,229 @@ not parse toast text or button labels as the contract.
- [Cancel and Terminate](./features/cancel-and-terminate.md)

This guide defines the operator-facing contract for Durable Workflow v2.
Use it to decide which diagnostics block rollouts, which ones are advisory,
which queue facts belong to Waterline versus worker telemetry, how to verify
rebuild and export workflows, and which deployment shapes are part of the
documented operating envelope.

## Source-of-truth surfaces

Use these surfaces together:

| Surface | Use it for | Contract class |
| --- | --- | --- |
| `php artisan workflow:v2:doctor --strict` | Backend capability gating before v2 traffic or upgrades | Blocking |
| `GET /waterline/api/v2/health` | Current engine-source readiness plus blocking vs advisory v2 health checks | Blocking when `status = error`, advisory when `status = warning` |

The durable-state operator contract lives in Waterline and the workflow package.
Worker telemetry remains the source of truth for latency and process-level
behavior inside your workers.

## Supported topologies

Durable Workflow v2 supports these operator shapes:

| Topology | Supported operator contract |
| --- | --- |
| Embedded Laravel, single node | Waterline, control-plane routes, health, rebuild, export, and archive all run from one app process against one durable database and one cache store. |
| Embedded Laravel, small same-region cluster | Use one shared database, one shared cache backend for wake-signal coordination, and identical workflow compatibility/config across nodes. Keep active nodes in the same datacenter or region so queue wake-up and timer wake-up latency stay bounded. |
| Standalone server distribution | Use the [Self-Hosting Deployments](./deployment.md) guide for the server-specific deployment matrix, then apply the same health, stats, export, archive, and queue-health distinctions described here. |

Publish the restore order, backup cadence, expected failover lag, and any
region-pinned behavior in the runbook for the topology you operate. The product
contract tells you which facts to measure; your deployment contract records the
recovery timing and failure domains you accept.

## Blocking and advisory diagnostics

Durable Workflow v2 separates blocking diagnostics from advisory diagnostics.

| Severity | Meaning | Typical operator action |
| --- | --- | --- |
| Blocking | The current configuration or readiness state is not safe to trust for v2 traffic | Stop rollout, fix the prerequisite, rerun verification |
| Advisory | The surface remains readable, but some derived facts need rebuild, backfill, or manual review before you rely on them | Keep serving traffic when appropriate, then repair the named surface |
| Healthy | No current issue was found in that surface | Continue normal operation |

Apply that rule to the shipped surfaces:

- `workflow:v2:doctor --strict` blocks when backend capability issues have
  `error` severity. Examples include an unsupported queue driver in queue mode
  or a cache store without locks. Informational queue diagnostics in poll mode
  remain advisory.
- `GET /waterline/api/v2/health` returns:
  - `status = ok` when the v2 operator surface is ready and the current checks
    are aligned.
  - `status = warning` when the surface remains readable but specific facts
    need rebuild, backfill, or repair before you trust them fully.
  - `status = error` with HTTP `503` when the engine-source bridge is not ready
    or a blocking capability problem makes the v2 surface unavailable.
- `GET /waterline/api/stats` publishes durable operator facts. Treat those JSON
  fields as operator diagnostics for dashboards and scripts, not as a metrics
  scrape endpoint.
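
The severity rule above can be sketched as a small rollout gate. This is an illustration only, not code shipped with the package: the function name `rollout_decision` is hypothetical, and it assumes the health response exposes a top-level `status` field in its JSON body. Anything it does not recognise is treated as blocking, which is a conservative choice made for this sketch.

```python
from typing import Any, Mapping


def rollout_decision(http_status: int, body: Mapping[str, Any]) -> str:
    """Map a health response to "proceed", "advise", or "block".

    Assumption: the JSON body carries a top-level "status" field whose value
    is "ok", "warning", or "error". The field location is not confirmed here.
    """
    status = body.get("status")
    # HTTP 503 or status "error" is the blocking case described in the guide.
    if http_status == 503 or status == "error":
        return "block"
    # "warning" is advisory: traffic may continue while the surface is repaired.
    if status == "warning":
        return "advise"
    if status == "ok":
        return "proceed"
    # Unknown or missing status: fail closed rather than guess.
    return "block"
```

A deploy script could call this with the status code and decoded body it received, halt on `"block"`, and record an advisory note on `"advise"`.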

## Queue-health semantics

Queue health is split between durable queue state and worker/runtime telemetry.

### Durable queue facts

Use Waterline dashboard stats and queue views for durable task state:

| Fact | Meaning |
| --- | --- |
| `operator_metrics.backlog.runnable_tasks` | Durable tasks that are ready to be claimed now. |
| `operator_metrics.backlog.delayed_tasks` | Durable tasks that exist but are still waiting for `available_at`. |
| `operator_metrics.backlog.leased_tasks` | Durable tasks currently claimed by a worker. |
| `operator_metrics.backlog.unhealthy_tasks` | Durable tasks with dispatch failure, claim failure, overdue dispatch, or expired lease state. |
| `operator_metrics.backlog.repair_needed_runs` | Open runs that do not currently have a trusted durable resume path. |
| Queue backlog age / oldest ready task | The durable ready-to-dispatch lag for the oldest ready work. |
| Active vs stale pollers | Whether registered workers are still heartbeating for a queue. |
| Current leases | Which workflow or activity tasks are leased right now and whether the lease is expired. |

These facts describe durable workflow-task and activity-task traffic only.
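
As a rough sketch of how a dashboard script might read these facts, the helper below pulls the five named backlog counters out of an already-decoded stats payload. It assumes the dotted names in the table map onto nested JSON objects (`operator_metrics` then `backlog`), which this guide does not confirm. The function names and the sample numbers are made up for illustration.

```python
from typing import Any, Dict, List, Mapping, Optional

# Counter names taken from the table above.
BACKLOG_FACTS = (
    "runnable_tasks",
    "delayed_tasks",
    "leased_tasks",
    "unhealthy_tasks",
    "repair_needed_runs",
)


def backlog_summary(stats: Mapping[str, Any]) -> Dict[str, Optional[int]]:
    """Extract backlog counters; a missing counter is reported as None."""
    # Assumption: dotted field names correspond to nested JSON objects.
    metrics = stats.get("operator_metrics") or {}
    backlog = metrics.get("backlog") or {}
    return {name: backlog.get(name) for name in BACKLOG_FACTS}


def needs_attention(summary: Mapping[str, Optional[int]]) -> List[str]:
    """Return the counters that signal unhealthy or unrepaired durable work."""
    watched = ("unhealthy_tasks", "repair_needed_runs")
    return [name for name in watched if (summary.get(name) or 0) > 0]


# Illustrative payload only; real values come from GET /waterline/api/stats.
sample = {
    "operator_metrics": {
        "backlog": {
            "runnable_tasks": 4,
            "delayed_tasks": 10,
            "leased_tasks": 2,
            "unhealthy_tasks": 1,
            "repair_needed_runs": 0,
        }
    }
}
summary = backlog_summary(sample)
flagged = needs_attention(summary)
```

Reporting a missing counter as `None` instead of `0` keeps an absent field from being mistaken for a healthy one.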

### Worker and SDK telemetry

Use worker metrics, traces, and logs for:

- Workflow and activity `schedule_to_start` latency
- Poll success rate and sync/eager-dispatch behavior
- Sticky-cache size and eviction behavior
- Worker CPU, memory, thread, and event-loop pressure
- Custom application metrics emitted from activities or worker code

Synchronous queries, live-debug tooling, and other non-durable control-plane
calls should be labeled separately in your dashboards. They do not count as
durable task backlog and they do not change Waterline repair counters.

## Rebuild, repair, and restore expectations

Use these checks in order when the operator surface reports drift:

1. Check `GET /waterline/api/v2/health`.
   - `run_summary_projection` and `selected_run_projections` warnings mean
     Waterline can still answer, but some list or detail facts need rebuild.
   - `command_contract_snapshots` warnings mean some legacy runs still need
     WorkflowStarted contract backfill before operators can trust declared
     signal, update, or query forms.
   - `durable_resume_paths` warnings mean open runs need repair before you rely
2. Verify the bundle includes the expected run id, schema version, and any
   configured redaction metadata.
3. Archive the closed run only after the export artifact is stored where your
   runbook expects it.
4. Keep archived-but-not-pruned runs available for incident review.
5. Prune durable rows through your retention job, then rebuild/prune projections
   with `workflow:v2:rebuild-projections --prune-stale`.

For Waterline users, the matching history-export and archive routes are listed
in the [Waterline Operator API Reference](./waterline-operator-api.md).

## Benchmark envelope

Durable Workflow v2 publishes the dimensions you should benchmark for your own
environment. Record these baselines in staging or canary before production
traffic depends on them:

| Dimension | What to baseline | Source |
| --- | --- | --- |
| Projection health | Steady-state `needs_rebuild = 0`, rebuild duration after intentional drift, and stale/orphan cleanup time | `/waterline/api/v2/health`, `/waterline/api/stats`, `workflow:v2:rebuild-projections` |
| Queue pressure | Backlog age, oldest ready task age, runnable vs delayed task counts, stale poller count | Waterline dashboard stats and queue views |
| Schedule-to-start latency | Workflow and activity queue wait from enqueue to start | Worker SDK metrics |
| Timer fan-out wake-up behavior | Wake-signal propagation time and the lag between scheduled fire time and ready-task visibility during burst timers | Worker telemetry plus same-region wake coordination checks |
| Repair-loop sweep cost | Candidate counts, selected counts, max candidate age, max missing-run age, and scan-pressure behavior | `operator_metrics.repair.*` |
| History pressure | Event count, history size, and continue-as-new recommendation thresholds | `operator_metrics.history.*` |

These are benchmark dimensions rather than universal latency promises. Publish
your own acceptable ranges for the topology you operate.
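
One way to make published ranges checkable is to keep them as data and compare measured baselines against them. The sketch below is a hypothetical helper, not part of Durable Workflow or Waterline. The dimension keys and the numeric limits are placeholders chosen for illustration; only the steady-state `needs_rebuild = 0` expectation comes from the table above.

```python
from dataclasses import dataclass
from typing import List, Mapping


@dataclass(frozen=True)
class Range:
    """Inclusive acceptable range for one benchmark dimension."""
    low: float
    high: float

    def contains(self, value: float) -> bool:
        return self.low <= value <= self.high


def out_of_envelope(
    envelope: Mapping[str, Range], measured: Mapping[str, float]
) -> List[str]:
    """List dimensions that are unmeasured or outside their accepted range."""
    findings = []
    for name, allowed in envelope.items():
        value = measured.get(name)
        # A dimension with no measurement is reported, not silently passed.
        if value is None or not allowed.contains(value):
            findings.append(name)
    return findings


# Placeholder envelope: key names and upper bounds are assumptions.
envelope = {
    "needs_rebuild": Range(0, 0),
    "oldest_ready_task_age_seconds": Range(0, 30),
}
healthy = out_of_envelope(
    envelope, {"needs_rebuild": 0, "oldest_ready_task_age_seconds": 12.0}
)
drifted = out_of_envelope(
    envelope, {"needs_rebuild": 3, "oldest_ready_task_age_seconds": 45.0}
)
```

Keeping the envelope as plain data lets the same ranges appear in the runbook and in an automated canary check.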

## End-to-end operator checklist

Use this checklist after upgrades and before trusting a new environment:

1. Run `php artisan workflow:v2:doctor --strict`.
2. Check `GET /waterline/api/v2/health` and confirm whether the state is
   `ok`, `warning`, or `error`.
3. Read `GET /waterline/api/stats` for backlog, repair, history, command
   contract, worker compatibility, and projection drift facts.
4. Run projection rebuild or command-contract backfill previews when health
   reports drift.
5. Export one representative run and verify the archive/replay artifact path.
6. Confirm archived runs leave active fleet views while durable rows remain
   available until retention cleanup.
7. Rehearse the restore or failover sequence recorded in your deployment
   runbook and verify the measured lag matches the published expectation for
   your topology.

## Related Guides

- [Monitoring](./monitoring.md)
- [Waterline Operator API Reference](./waterline-operator-api.md)