region-pinned behavior in the runbook for the topology you operate. The product
contract tells you which facts to measure; your deployment contract records the
recovery timing, manual steps, and failure domains you accept.

### Verify live topology identity before trusting the baseline

For standalone-server and split-role deployments, confirm the node identity
that the product itself reports before you interpret queue, scheduler, or role
failure signals. `GET /api/cluster/info` is the source of truth for that
identity:

| Field | Use it for |
| --- | --- |
| `topology.current_shape` | Confirms whether the node is currently advertising `embedded`, `standalone_server`, or `split_control_execution`. |
| `topology.current_process_class` | Confirms which node class the process believes it is serving, such as `server_http_node`, `scheduler_node`, `worker_node`, `matching_node`, or `execution_node`. |
| `topology.current_roles` | Confirms the logical roles actually hosted by this node. |
| `topology.role_catalog` | Confirms whether the queried node owns `api_ingress`, `control_plane`, `matching`, `history_projection`, `scheduler`, or `execution_plane`. |

Use those fields as the first topology-drift check during rollouts:

- In the self-serve standalone-server shape, API nodes should continue to
  report `server_http_node` with `api_ingress`, `control_plane`, `matching`,
  and `history_projection`; scheduler nodes should report `scheduler_node`;
  worker nodes should report `worker_node` and `execution_plane`.
- In the split-role shape, verify that each dedicated process class reports the
  role it is supposed to own before you interpret backlog or scheduler lag as a
  worker problem.
- If `current_process_class` or `current_roles` drift from the deployment plan,
  treat queue and failover baselines as suspect until the node identity is
  corrected.

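The drift check above can be sketched as a small comparison against the deployment plan. This is a minimal sketch that operates on an already-parsed `GET /api/cluster/info` response shaped like the `topology.*` fields documented above; the `topology_drift` helper and the expected-plan arguments are illustrative, not part of the product API.

```python
# Sketch: flag topology-identity drift from a parsed /api/cluster/info
# response. Only the topology.* field names come from the documentation;
# the helper itself is a hypothetical operator-side check.

def topology_drift(info: dict, expected_class: str, expected_roles: set[str]) -> list[str]:
    """Return human-readable drift findings; an empty list means identity matches."""
    topo = info["topology"]
    findings = []
    if topo["current_process_class"] != expected_class:
        findings.append(
            f"process class drifted: expected {expected_class}, "
            f"node reports {topo['current_process_class']}"
        )
    missing = expected_roles - set(topo["current_roles"])
    if missing:
        findings.append(f"roles missing from node: {sorted(missing)}")
    return findings

# Example: a standalone-server API node per the checklist above.
info = {
    "topology": {
        "current_shape": "standalone_server",
        "current_process_class": "server_http_node",
        "current_roles": ["api_ingress", "control_plane", "matching", "history_projection"],
    }
}
print(topology_drift(info, "server_http_node",
                     {"api_ingress", "control_plane", "matching", "history_projection"}))
# → []
```

Any non-empty result would mark queue and failover baselines as suspect per the last bullet above.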
## Blocking and advisory diagnostics
Durable Workflow v2 separates blocking diagnostics from advisory diagnostics.

Use `operator_metrics.starts.*` when new workflow starts appear stuck even
though steady-state queue lag looks normal. Those facts separate control-plane
start admission and first-task creation debt from downstream worker pickup.

### Poller pressure and admission budgets

Use task-queue detail routes or `dw task-queue:describe` when queue flow is
degrading and you need to separate "not enough worker capacity" from
"intentional server throttling" or "no live poller at all":

| Queue status | Meaning | Treat it as |
| --- | --- | --- |
| `accepting` | Workers still have available slots and no server cap is full. | Healthy baseline. |
| `saturated` | All registered worker slots are currently leased. | Worker-capacity pressure. |
| `throttled` | A server-side active-lease or dispatch-rate cap is intentionally holding the queue back. | Advisory unless the cap is unexpected or the backlog keeps growing beyond the published baseline. |
| `no_slots` | Active workers are registered, but none advertise slots for that task kind. | Blocking for that queue. |
| `no_active_workers` | No healthy poller is currently serving the queue. | Blocking for that queue. |
| `unavailable` | The queue cannot acquire the lock needed for its configured admission path. | Blocking until the admission dependency recovers. |

Combine these statuses with the queue-flow facts:

- `tasks_added_last_minute > tasks_dispatched_last_minute` plus `saturated`
  means durable inflow is outrunning worker capacity.
- The same rate imbalance plus `throttled` means the queue is being held back
  by an explicit server cap and should be judged against that cap's intended
  contract, not against unrestricted throughput.
- A rising oldest-ready age plus `no_active_workers` or stale pollers means the
  queue has lost healthy claimers and should be treated as a routing outage for
  that scope.

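The triage rules above reduce to a small decision function. This is a sketch, not a product API: the status strings match the table, but `triage_queue`, its snapshot fields, and the verdict strings are illustrative.

```python
# Sketch: combine queue admission status with queue-flow facts, following
# the triage bullets above. All names here are hypothetical operator-side
# helpers; only the status values come from the documentation.

def triage_queue(status: str, added_last_min: int, dispatched_last_min: int,
                 oldest_ready_age_s: float, baseline_age_s: float) -> str:
    inflow_outruns = added_last_min > dispatched_last_min
    if status in ("no_active_workers", "no_slots", "unavailable"):
        # Blocking rows of the table: no healthy claim path for this queue.
        return "blocking: queue has no healthy claim path for this scope"
    if status == "saturated" and inflow_outruns:
        return "worker-capacity pressure: durable inflow is outrunning workers"
    if status == "throttled" and inflow_outruns:
        return "advisory: judge backlog against the cap's intended contract"
    if oldest_ready_age_s > baseline_age_s:
        return "advisory: oldest-ready age above published baseline"
    return "healthy baseline"

print(triage_queue("saturated", 120, 80, 5.0, 30.0))
# → worker-capacity pressure: durable inflow is outrunning workers
```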
### Matching-role deployment shape
Use `operator_metrics.matching_role.*` when you need to confirm which

| Check | Signals | Severity | Trigger | Response |
| --- | --- | --- | --- | --- |
| Blocking readiness | `workflow:v2:doctor --strict`, `GET /waterline/api/v2/health` | Blocking | `doctor --strict` returns an error or the health endpoint returns `status = error` / HTTP `503` | Stop rollout or traffic shift, fix the blocking prerequisite, then rerun readiness and compatibility checks. |
| Compatible-worker coverage | `operator_metrics.workers.*`, `worker_compatibility` health check, run diagnostic `no_compatible_worker_for_task` | Blocking | `active_workers_supporting_required = 0` for a namespace or required `(connection, queue)` scope | Drain incompatible workers, register compatible workers, and confirm the `correctness` rollup clears before trusting new claims. |
| Durable queue lag | Waterline queue views, `operator_metrics.backlog.*`, worker `schedule_to_start` telemetry | Blocking when sustained; advisory when brief | The oldest ready-task age or schedule-to-start latency stays above the published topology baseline while compatible workers are available | Add worker capacity, inspect task-queue admission limits, and verify the scheduler or matching path is still making forward progress. |
| Poller pressure and admission saturation | Task-queue detail routes, `dw task-queue:describe`, queue `status`, stale pollers, and queue-local add/dispatch rates | Blocking for `no_active_workers`, `no_slots`, or `unavailable`; advisory for intentional `throttled` states | One queue stays `saturated` while its oldest-ready age and add-vs-dispatch gap keep growing, or any queue flips to `no_active_workers`, `no_slots`, or `unavailable` outside a planned maintenance window | Add worker slots, restore the missing poller cohort, or confirm the server-side cap and lock dependency are behaving as designed before you scale blindly. |
| Workflow-start backlog | `operator_metrics.starts.*`, control-plane start telemetry, worker `schedule_to_start` telemetry for first workflow tasks | Blocking when sustained; advisory when brief | `pending_commands`, `ready_tasks`, or `max_pending_ms` stay above the published topology baseline while compatible workers and queue capacity are available | Inspect the start boundary end to end: confirm start commands are turning into durable tasks, verify matching or dispatch is creating the first task promptly, and separate start-path debt from general worker lag before scaling. |
| Projection drift and repair debt | `run_summary_projection` / `selected_run_projections` health checks, `operator_metrics.repair.*` | Advisory | Drift warnings persist past one planned rebuild window or the max candidate age keeps climbing | Run the rebuild or repair previews, execute the repair, then verify the warning clears and stale ages return to baseline. |
| Retry or failure storm | `operator_metrics.backlog.unhealthy_tasks`, durable run diagnostics, worker error telemetry | Advisory, escalating to blocking if it prevents durable progress | Dispatch-failed, claim-failed, expired-lease, or retry-exhaustion facts climb above the topology baseline and stay elevated | Inspect the failing task family, compare worker telemetry with durable error facts, and decide whether to drain traffic or isolate the affected queue. |
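
The blocking-readiness row above can be expressed as a simple rollout gate. This is a sketch that applies the documented trigger (`status = error` or HTTP `503`) to an already-fetched health response; the `readiness_gate` helper and its verdict strings are illustrative, not part of the product.

```python
# Sketch: apply the blocking-readiness rule from the table above to a parsed
# /waterline/api/v2/health response. Only the status/HTTP semantics come from
# the table; the function itself is a hypothetical operator-side gate.

def readiness_gate(http_status: int, body: dict) -> str:
    # `status = error` or HTTP 503 is blocking per the readiness row.
    if http_status == 503 or body.get("status") == "error":
        return "stop: fix the blocking prerequisite, then rerun readiness checks"
    return "proceed"

print(readiness_gate(200, {"status": "ok"}))     # → proceed
print(readiness_gate(503, {"status": "error"}))  # blocking → stop
```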

traffic depends on them:

| Dimension | What to baseline | Source |
| --- | --- | --- |
| Projection health | Steady-state `needs_rebuild = 0`, rebuild duration after intentional drift, and stale/orphan cleanup time | `/waterline/api/v2/health`, `/waterline/api/stats`, `workflow:v2:rebuild-projections` |
| Queue pressure | Backlog age, oldest ready task age, runnable vs delayed task counts, task add vs dispatch rate, dispatch-overdue age, stale poller count, and queue admission status (`accepting`, `saturated`, `throttled`, `no_slots`, `no_active_workers`) | Waterline dashboard stats and queue views plus `operator_metrics.backlog.*` / `operator_metrics.tasks.*` |
| Workflow-start latency | Accepted start commands waiting for first-task creation, oldest pending-start age, and first-task pickup after admission | `operator_metrics.starts.*` plus worker `schedule_to_start` telemetry |
| Schedule-to-start latency | Workflow and activity queue wait from enqueue to start | Worker SDK metrics |
| Timer fan-out wake-up behavior | Wake-signal propagation time and the lag between scheduled fire time and ready-task visibility during burst timers | Worker telemetry plus same-region wake coordination checks |
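
The queue-pressure row above lists several facts worth recording together. As a sketch, the facts can be derived from one metrics snapshot so the baseline is captured atomically; the `queue_pressure_facts` helper and its snapshot field names only loosely mirror the `operator_metrics.*` families and are illustrative.

```python
# Sketch: collect the queue-pressure facts from the baseline table into one
# record. Field names are hypothetical stand-ins for the documented
# operator_metrics.backlog.* / operator_metrics.tasks.* facts.

def queue_pressure_facts(snapshot: dict) -> dict:
    added = snapshot["tasks_added_last_minute"]
    dispatched = snapshot["tasks_dispatched_last_minute"]
    return {
        "add_vs_dispatch_gap": added - dispatched,  # sustained > 0 means inflow debt
        "oldest_ready_age_s": snapshot["oldest_ready_age_s"],
        "stale_pollers": snapshot["stale_pollers"],
        "admission_status": snapshot["status"],
    }

baseline = queue_pressure_facts({
    "tasks_added_last_minute": 90,
    "tasks_dispatched_last_minute": 90,
    "oldest_ready_age_s": 2.5,
    "stale_pollers": 0,
    "status": "accepting",
})
print(baseline["add_vs_dispatch_gap"])  # → 0
```

Recording the admission status alongside the rate gap is what lets a later reading distinguish intentional `throttled` backpressure from genuine capacity debt.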

| Variable | Default | Description | Legacy alias |
| --- | --- | --- | --- |
| `DW_MODE` | `service` | Server mode: `service` makes external workers poll; `embedded` dispatches locally through the Laravel queue. | `WORKFLOW_SERVER_MODE` |
| `DW_SERVER_ID` | `gethostname()` | Unique server instance identifier used in lease ownership and worker registration. | `WORKFLOW_SERVER_ID` |
| `DW_SERVER_TOPOLOGY_SHAPE` | `standalone_server` | Deployment shape advertised from `GET /api/cluster/info` under `topology.current_shape`. Use it to distinguish `embedded`, `standalone_server`, and `split_control_execution` nodes. | `WORKFLOW_SERVER_TOPOLOGY_SHAPE` |
| `DW_SERVER_PROCESS_CLASS` | `server_http_node` | Process class advertised from `GET /api/cluster/info` under `topology.current_process_class`. Use it to label API, scheduler, matching, and execution nodes correctly during split-role rollouts. | `WORKFLOW_SERVER_PROCESS_CLASS` |
| `DW_SERVER_KEY` | generated at container boot | Optional server-internal runtime key. Docker images generate one automatically when unset. | - |
| `DW_DEFAULT_NAMESPACE` | `default` | Namespace used when a request omits the namespace header. | `WORKFLOW_SERVER_DEFAULT_NAMESPACE` |
| `DW_TASK_DISPATCH_MODE` | unset | Overrides `workflows.v2.task_dispatch_mode`; in service mode the server defaults to `poll` unless you set a different value. | `WORKFLOW_V2_TASK_DISPATCH_MODE` |
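
As one possible combination of the variables above, a dedicated scheduler node in the split-role shape might be configured like this hypothetical `.env` fragment; the server ID value is a made-up example, while the shape and process-class values come from the table:

```shell
# Hypothetical .env fragment for a dedicated scheduler node in the
# split-role shape. Values for shape and process class are the documented
# enums; the server ID is illustrative.
DW_MODE=service
DW_SERVER_ID=scheduler-01
DW_SERVER_TOPOLOGY_SHAPE=split_control_execution
DW_SERVER_PROCESS_CLASS=scheduler_node
```

After boot, `GET /api/cluster/info` should report the matching `topology.current_shape` and `topology.current_process_class` for this node.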