Commit d3b0858

deploy: b978f75
1 parent b038837 commit d3b0858

273 files changed

Lines changed: 1610 additions & 1035 deletions

2.0/llms-full.txt

Lines changed: 248 additions & 0 deletions
@@ -7368,6 +7368,12 @@ The dashboard shows running totals, recent-run counters, and fleet-wide
73687368
metrics so you can tell at a glance whether work is flowing, stalling, or
73697369
failing.
73707370

7371+
Use the [Operator Operating Envelope](./operator-operating-envelope.md) when
7372+
you need the rollout and runbook contract for those facts: which diagnostics
7373+
block traffic, which are advisory, how queue-health facts split between
7374+
Waterline and worker telemetry, and how to verify rebuild, export, and archive
7375+
paths.
7376+
73717377
### Workflow View
73727378

73737379
The workflow detail view shows the durable timeline for a single run: the
@@ -7438,6 +7444,9 @@ not valid for the run's current state.
74387444

74397445
## Related Guides
74407446

7447+
- [Operator Operating Envelope](./operator-operating-envelope.md) ties health,
7448+
queue state, rebuild, export, archive, and topology expectations into one
7449+
operator contract.
74417450
- [Failures and Recovery](./failures-and-recovery.md) explains retry exhaustion,
74427451
non-retryable failures, timeouts, and repair behavior behind the dashboard
74437452
facts.
@@ -8257,6 +8266,22 @@ curl -sS "$APP_URL/waterline/api/instances/order-1001" \
82578266
scrape surface. Use worker SDK metrics for runtime latency and custom
82588267
application telemetry.
82598268

8269+
The [Operator Operating Envelope](./operator-operating-envelope.md) defines how
8270+
to interpret those diagnostics during rollouts and incident response. In
8271+
particular:
8272+
8273+
| Field family | Meaning |
| --- | --- |
| `operator_metrics.backlog.*` | Durable runnable, delayed, leased, unhealthy, repair-needed, claim-failed, and compatibility-blocked work counts. |
| `operator_metrics.repair.*` | Repair-loop sweep footprint, including selected candidates, candidate age, and scan pressure. |
| `operator_metrics.projections.*` | Projection-drift counts for run summaries, waits, timelines, timers, and lineage. |
| `operator_metrics.command_contracts.*` | Legacy WorkflowStarted contract snapshots that still need backfill. |
| `operator_metrics.history.*` | History-size and event-count pressure plus continue-as-new recommendations. |
| `engine_source`, `readiness_contract` | Whether Waterline is actively using the v2 operator bridge and which readiness contract governs that state. |
8281+
8282+
`GET /waterline/api/v2/health` uses the same distinction: `error` is blocking,
8283+
`warning` is advisory, and `ok` means the current v2 operator bridge is ready.
8284+
82608285
## List Views
82618286

82628287
List views are bucketed by durable status:
@@ -8462,6 +8487,229 @@ not parse toast text or button labels as the contract.
84628487
- [Cancel and Terminate](./features/cancel-and-terminate.md)
84638488
- [Agent Tooling Contract](./agent-tooling-contract.md)
84648489

8490+
<!-- Source: docs/operator-operating-envelope.md -->
8491+
8492+
# Operator Operating Envelope
8493+
8494+
This guide defines the operator-facing contract for Durable Workflow v2.
8495+
Use it to decide which diagnostics block rollouts, which ones are advisory,
8496+
which queue facts belong to Waterline versus worker telemetry, how to verify
8497+
rebuild and export workflows, and which deployment shapes are part of the
8498+
documented operating envelope.
8499+
8500+
## Source-of-truth surfaces
8501+
8502+
Use these surfaces together:
8503+
8504+
| Surface | Use it for | Contract class |
| --- | --- | --- |
| `php artisan workflow:v2:doctor --strict` | Backend capability gating before v2 traffic or upgrades | Blocking |
| `GET /waterline/api/v2/health` | Current engine-source readiness plus blocking vs advisory v2 health checks | Blocking when `status = error`, advisory when `status = warning` |
| `GET /waterline/api/stats` | Durable fleet totals, backlog counters, repair-loop facts, projection drift counts, worker compatibility summaries | Advisory and benchmarking |
| `php artisan workflow:v2:rebuild-projections ...` | Previewing and repairing projection drift | Maintenance |
| `php artisan workflow:v2:backfill-command-contracts ...` | Previewing and backfilling legacy command-contract snapshots | Maintenance |
| `php artisan workflow:v2:history-export ...` and Waterline history-export routes | Replay, archive handoff, and incident artifacts | Verification |
| Waterline archive actions and control-plane `archive()` | Lifecycle state transitions for closed runs | Lifecycle |
| Worker SDK metrics, traces, and logs | Schedule-to-start latency, poll success, sticky-cache behavior, and custom application telemetry | Runtime telemetry |
8514+
8515+
The durable-state operator contract lives in Waterline and the workflow package.
8516+
Worker telemetry remains the source of truth for latency and process-level
8517+
behavior inside your workers.
8518+
8519+
## Supported topologies
8520+
8521+
Durable Workflow v2 supports these operator shapes:
8522+
8523+
| Topology | Supported operator contract |
| --- | --- |
| Embedded Laravel, single node | Waterline, control-plane routes, health, rebuild, export, and archive all run from one app process against one durable database and one cache store. |
| Embedded Laravel, small same-region cluster | Use one shared database, one shared cache backend for wake-signal coordination, identical workflow compatibility/config across nodes, and keep active nodes in the same datacenter or region so queue wake-up and timer wake-up latency stay bounded. |
| Standalone server distribution | Use the [Self-Hosting Deployments](./deployment.md) guide for the server-specific deployment matrix, then apply the same health, stats, export, archive, and queue-health distinctions described here. |
8528+
8529+
Publish the restore order, backup cadence, expected failover lag, and any
8530+
region-pinned behavior in the runbook for the topology you operate. The product
8531+
contract tells you which facts to measure; your deployment contract records the
8532+
recovery timing and failure domains you accept.
8533+
8534+
## Blocking and advisory diagnostics
8535+
8536+
Durable Workflow v2 separates blocking diagnostics from advisory diagnostics.
8537+
8538+
| Severity | Meaning | Typical operator action |
| --- | --- | --- |
| Blocking | The current configuration or readiness state is not safe to trust for v2 traffic | Stop rollout, fix the prerequisite, rerun verification |
| Advisory | The surface remains readable, but some derived facts need rebuild, backfill, or manual review before you rely on them | Keep serving traffic when appropriate, then repair the named surface |
| Healthy | No current issue was found in that surface | Continue normal operation |
8543+
8544+
Apply that rule to the shipped surfaces:
8545+
8546+
- `workflow:v2:doctor --strict` blocks when backend capability issues have
8547+
`error` severity. Examples include an unsupported queue driver in queue mode
8548+
or a cache store without locks. Informational queue diagnostics in poll mode
8549+
remain advisory.
8550+
- `GET /waterline/api/v2/health` returns:
8551+
- `status = ok` when the v2 operator surface is ready and the current checks
8552+
are aligned.
8553+
- `status = warning` when the surface remains readable but specific facts
8554+
need rebuild, backfill, or repair before you trust them fully.
8555+
- `status = error` with HTTP `503` when the engine-source bridge is not ready
8556+
or a blocking capability problem makes the v2 surface unavailable.
8557+
- `GET /waterline/api/stats` publishes durable operator facts. Treat those JSON
8558+
fields as operator diagnostics for dashboards and scripts, not as a metrics
8559+
scrape endpoint.
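
The three health states above can be turned into a rollout gate. The following is a minimal sketch, not a shipped tool: `health_gate` is a hypothetical helper name, and it assumes you have already extracted the HTTP status code and the `status` value from the health response with whatever tooling your pipeline uses. Treating an unrecognized status as blocking is a design choice of this sketch, not documented product behavior.

```bash
# Hypothetical helper (not part of Durable Workflow or Waterline).
# Usage: health_gate <http_code> <status>
# Prints a decision; returns 0 to continue or 1 to stop the rollout.
health_gate() {
  local http_code="$1" status="$2"

  # HTTP 503 or an explicit error status is blocking.
  if [ "$http_code" = "503" ] || [ "$status" = "error" ]; then
    echo "block"
    return 1
  fi

  case "$status" in
    ok)      echo "proceed" ;;
    warning) echo "proceed-then-repair" ;;      # advisory: keep serving, repair later
    *)       echo "block"; return 1 ;;          # assumption: fail closed on unknown states
  esac
}
```

In a deploy script, a nonzero return from `health_gate` would stop the rollout, while `proceed-then-repair` would let traffic continue and queue the rebuild or backfill named by the warning.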
8560+
8561+
## Queue-health semantics
8562+
8563+
Queue health is split between durable queue state and worker/runtime telemetry.
8564+
8565+
### Durable queue facts
8566+
8567+
Use Waterline dashboard stats and queue views for durable task state:
8568+
8569+
| Fact | Meaning |
| --- | --- |
| `operator_metrics.backlog.runnable_tasks` | Durable tasks that are ready to be claimed now. |
| `operator_metrics.backlog.delayed_tasks` | Durable tasks that exist but are still waiting for `available_at`. |
| `operator_metrics.backlog.leased_tasks` | Durable tasks currently claimed by a worker. |
| `operator_metrics.backlog.unhealthy_tasks` | Durable tasks with dispatch failure, claim failure, overdue dispatch, or expired lease state. |
| `operator_metrics.backlog.repair_needed_runs` | Open runs that do not currently have a trusted durable resume path. |
| Queue backlog age / oldest ready task | The durable ready-to-dispatch lag for the oldest ready work. |
| Active vs stale pollers | Whether registered workers are still heartbeating for a queue. |
| Current leases | Which workflow or activity tasks are leased right now and whether the lease is expired. |
8579+
8580+
These facts describe durable workflow-task and activity-task traffic only.
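
As an illustration of how a script might triage these counters, the sketch below classifies three values that have already been read from the stats response. `backlog_state` is a hypothetical helper, and both the "any nonzero count" threshold and the precedence order (repair-needed first, then unhealthy, then ready work) are assumptions of this example rather than documented behavior. Substitute the thresholds from your own baseline.

```bash
# Hypothetical helper (not part of Durable Workflow or Waterline).
# Usage: backlog_state <runnable_tasks> <unhealthy_tasks> <repair_needed_runs>
# Arguments are non-negative integers taken from operator_metrics.backlog.*.
backlog_state() {
  local runnable="$1" unhealthy="$2" repair_needed="$3"

  if [ "$repair_needed" -gt 0 ]; then
    echo "repair-needed"      # open runs without a trusted resume path
  elif [ "$unhealthy" -gt 0 ]; then
    echo "unhealthy"          # dispatch/claim failures or expired leases
  elif [ "$runnable" -gt 0 ]; then
    echo "ready-work"         # tasks waiting to be claimed
  else
    echo "clear"
  fi
}
```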
8581+
8582+
### Worker and SDK telemetry
8583+
8584+
Use worker metrics, traces, and logs for:
8585+
8586+
- Workflow and activity `schedule_to_start` latency
8587+
- Poll success rate and sync/eager-dispatch behavior
8588+
- Sticky-cache size and eviction behavior
8589+
- Worker CPU, memory, thread, and event-loop pressure
8590+
- Custom application metrics emitted from activities or worker code
8591+
8592+
Synchronous queries, live-debug tooling, and other non-durable control-plane
8593+
calls should be labeled separately in your dashboards. They do not count as
8594+
durable task backlog and they do not change Waterline repair counters.
8595+
8596+
## Rebuild, repair, and restore expectations
8597+
8598+
Use these checks in order when the operator surface reports drift:
8599+
8600+
1. Check `GET /waterline/api/v2/health`.
8601+
- `run_summary_projection` and `selected_run_projections` warnings mean
8602+
Waterline can still answer, but some list or detail facts need rebuild.
8603+
- `command_contract_snapshots` warnings mean some legacy runs still need
8604+
WorkflowStarted contract backfill before operators can trust declared
8605+
signal, update, or query forms.
8606+
- `durable_resume_paths` warnings mean open runs need repair before you rely
8607+
on their projected next resume source.
8608+
2. Preview projection work with:
8609+
8610+
```bash
php artisan workflow:v2:rebuild-projections --needs-rebuild --dry-run
```
8613+
8614+
3. Rebuild the affected projections:
8615+
8616+
```bash
php artisan workflow:v2:rebuild-projections --needs-rebuild
```
8619+
8620+
4. Preview command-contract backfill work with:
8621+
8622+
```bash
php artisan workflow:v2:backfill-command-contracts --dry-run
```
8625+
8626+
5. Backfill command contracts when the current workflow class is still
8627+
available:
8628+
8629+
```bash
php artisan workflow:v2:backfill-command-contracts
```
8632+
8633+
6. Use `--prune-stale` only after your retention workflow has intentionally
8634+
removed durable rows and you want to delete projection rows whose durable run
8635+
or history row no longer exists.
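
The mapping from health warnings to preview commands in the steps above can be captured in a small lookup. This is a sketch only: `preview_command_for` is a hypothetical helper, and it assumes the check names appear in the health response exactly as written in step 1. It prints the command instead of running it, so an operator can review the preview first.

```bash
# Hypothetical helper (not part of Durable Workflow or Waterline).
# Usage: preview_command_for <health-check-name>
# Prints the dry-run command from this guide that matches the warning.
preview_command_for() {
  case "$1" in
    run_summary_projection|selected_run_projections)
      echo "php artisan workflow:v2:rebuild-projections --needs-rebuild --dry-run" ;;
    command_contract_snapshots)
      echo "php artisan workflow:v2:backfill-command-contracts --dry-run" ;;
    durable_resume_paths)
      # This guide names no preview command for resume-path repair.
      echo "manual: repair the affected open runs" ;;
    *)
      echo "unknown check: $1" >&2
      return 1 ;;
  esac
}
```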
8636+
8637+
`operator_metrics.repair.*` publishes the repair-loop sweep footprint. Use the
8638+
candidate counts, selected counts, maximum candidate age, and scan-limit
8639+
pressure to decide whether repair work is comfortably within your baseline or
8640+
needs capacity investigation.
8641+
8642+
## Export and archive verification
8643+
8644+
History export and archive serve different purposes:
8645+
8646+
- **History export** creates a replay/debug/archive artifact.
8647+
- **Archive** marks a closed run as archived so it leaves active fleet views.
8648+
- **Prune** removes projection or durable rows after retention has definitely
8649+
expired.
8650+
8651+
Use this verification sequence:
8652+
8653+
1. Export the selected run:
8654+
8655+
```bash
php artisan workflow:v2:history-export <workflow-instance-id> --run-id=<workflow-run-id> --output=storage/app/workflow-history/run.json --pretty
```
8658+
8659+
2. Verify the bundle includes the expected run id, schema version, and any
8660+
configured redaction metadata.
8661+
3. Archive the closed run only after the export artifact is stored where your
8662+
runbook expects it.
8663+
4. Keep archived-but-not-pruned runs available for incident review.
8664+
5. Prune durable rows through your retention job, then rebuild/prune projections
8665+
with `workflow:v2:rebuild-projections --prune-stale`.
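
Step 2 of the sequence above can be partly automated before a human reviews the bundle. The sketch below is a coarse check under stated assumptions: `check_export` is a hypothetical helper, and because this guide does not spell out the bundle's field names, it only confirms that the file is non-empty and that the run id appears somewhere in it. It does not validate the schema version or redaction metadata; check those against the actual export format.

```bash
# Hypothetical helper (not part of Durable Workflow or Waterline).
# Usage: check_export <path-to-bundle> <workflow-run-id>
check_export() {
  local path="$1" run_id="$2"

  # The artifact must exist and be non-empty.
  if [ ! -s "$path" ]; then
    echo "missing or empty: $path" >&2
    return 1
  fi

  # The run id must appear in the bundle (fixed-string match, not a regex).
  if ! grep -qF -- "$run_id" "$path"; then
    echo "run id not found in $path" >&2
    return 1
  fi

  echo "ok"
}
```

Running this check before step 3 keeps a failed or truncated export from being followed by an archive action.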
8666+
8667+
For Waterline users, the matching history-export and archive routes are listed
8668+
in the [Waterline Operator API Reference](./waterline-operator-api.md).
8669+
8670+
## Benchmark envelope
8671+
8672+
Durable Workflow v2 publishes the dimensions you should benchmark for your own
8673+
environment. Record these baselines in staging or canary before production
8674+
traffic depends on them:
8675+
8676+
| Dimension | What to baseline | Source |
| --- | --- | --- |
| Projection health | Steady-state `needs_rebuild = 0`, rebuild duration after intentional drift, and stale/orphan cleanup time | `/waterline/api/v2/health`, `/waterline/api/stats`, `workflow:v2:rebuild-projections` |
| Queue pressure | Backlog age, oldest ready task age, runnable vs delayed task counts, stale poller count | Waterline dashboard stats and queue views |
| Schedule-to-start latency | Workflow and activity queue wait from enqueue to start | Worker SDK metrics |
| Timer fan-out wake-up behavior | Wake-signal propagation time and the lag between scheduled fire time and ready-task visibility during burst timers | Worker telemetry plus same-region wake coordination checks |
| Repair-loop sweep cost | Candidate counts, selected counts, max candidate age, max missing-run age, and scan-pressure behavior | `operator_metrics.repair.*` |
| History pressure | Event count, history size, and continue-as-new recommendation thresholds | `operator_metrics.history.*` |
8684+
8685+
These are benchmark dimensions rather than universal latency promises. Publish
8686+
your own acceptable ranges for the topology you operate.
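
Once you have published a range for a dimension, comparing a fresh measurement against it is simple integer arithmetic. The sketch below assumes whole-number measurements (for example, seconds of backlog age) and uses a hypothetical helper name, `within_baseline`; the range values come from your own runbook, not from the product.

```bash
# Hypothetical helper (not part of Durable Workflow or Waterline).
# Usage: within_baseline <value> <min> <max>   (all integers, bounds inclusive)
within_baseline() {
  local value="$1" min="$2" max="$3"

  if [ "$value" -ge "$min" ] && [ "$value" -le "$max" ]; then
    echo "within"
  else
    echo "outside"
    return 1
  fi
}
```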
8687+
8688+
## End-to-end operator checklist
8689+
8690+
Use this checklist after upgrades and before trusting a new environment:
8691+
8692+
1. Run `php artisan workflow:v2:doctor --strict`.
8693+
2. Check `GET /waterline/api/v2/health` and confirm whether the state is
8694+
`ok`, `warning`, or `error`.
8695+
3. Read `GET /waterline/api/stats` for backlog, repair, history, command
8696+
contract, worker compatibility, and projection drift facts.
8697+
4. Run projection rebuild or command-contract backfill previews when health
8698+
reports drift.
8699+
5. Export one representative run and verify the archive/replay artifact path.
8700+
6. Confirm archived runs leave active fleet views while durable rows remain
8701+
available until retention cleanup.
8702+
7. Rehearse the restore or failover sequence recorded in your deployment
8703+
runbook and verify the measured lag matches the published expectation for
8704+
your topology.
8705+
8706+
## Related Guides
8707+
8708+
- [Monitoring](./monitoring.md)
8709+
- [Waterline Operator API Reference](./waterline-operator-api.md)
8710+
- [Pruning Workflows](./configuration/pruning-workflows.md)
8711+
- [Self-Hosting Deployments](./deployment.md)
8712+
84658713
<!-- Source: docs/polyglot/embedded-to-server.md -->
84668714

84678715
# Embedded to Server Migration
