Commit e77e885

deploy: 6caf692
1 parent 6365b5f commit e77e885

273 files changed

Lines changed: 672 additions & 545 deletions

2.0/llms-full.txt

Lines changed: 57 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -9436,6 +9436,33 @@ region-pinned behavior in the runbook for the topology you operate. The product
94369436
contract tells you which facts to measure; your deployment contract records the
94379437
recovery timing, manual steps, and failure domains you accept.
94389438

9439+
### Verify live topology identity before trusting the baseline
9440+
9441+
For standalone-server and split-role deployments, confirm the node identity
9442+
that the product itself reports before you interpret queue, scheduler, or role
9443+
failure signals. `GET /api/cluster/info` is the source of truth for that
9444+
identity:
9445+
9446+
| Field | Use it for |
9447+
| --- | --- |
9448+
| `topology.current_shape` | Confirms whether the node is currently advertising `embedded`, `standalone_server`, or `split_control_execution`. |
9449+
| `topology.current_process_class` | Confirms which node class the process believes it is serving, such as `server_http_node`, `scheduler_node`, `worker_node`, `matching_node`, or `execution_node`. |
9450+
| `topology.current_roles` | Confirms the logical roles actually hosted by this node. |
9451+
| `topology.role_catalog` | Confirms whether the queried node owns `api_ingress`, `control_plane`, `matching`, `history_projection`, `scheduler`, or `execution_plane`. |
9452+
9453+
Use those fields as the first topology-drift check during rollouts:
9454+
9455+
- In the self-serve standalone-server shape, API nodes should continue to
9456+
report `server_http_node` with `api_ingress`, `control_plane`, `matching`,
9457+
and `history_projection`; scheduler nodes should report `scheduler_node`;
9458+
worker nodes should report `worker_node` and `execution_plane`.
9459+
- In the split-role shape, verify that each dedicated process class reports the
9460+
role it is supposed to own before you interpret backlog or scheduler lag as a
9461+
worker problem.
9462+
- If `current_process_class` or `current_roles` drift from the deployment plan,
9463+
treat queue and failover baselines as suspect until the node identity is
9464+
corrected.
9465+
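The drift check described in the added section can be sketched as a small comparison between what a node reports and what the deployment plan expects. This is a minimal illustration, not the product's tooling: it assumes the `GET /api/cluster/info` response nests the fields under a `topology` object and that `current_roles` is a list of role names. The `topology_drift` helper, the plan dictionary, and the sample payload are all hypothetical.

```python
from typing import Any


def topology_drift(info: dict[str, Any], plan: dict[str, Any]) -> list[str]:
    """Compare a cluster-info payload with the deployment plan.

    Returns one finding per drifted field; an empty list means no drift.
    """
    # Assumption: the identity fields live under a top-level "topology" key.
    topology = info.get("topology") or {}
    findings: list[str] = []

    for field in ("current_shape", "current_process_class"):
        actual = topology.get(field)
        expected = plan.get(field)
        if actual != expected:
            findings.append(f"{field}: expected {expected!r}, got {actual!r}")

    # Assumption: current_roles is a list of role-name strings.
    actual_roles = set(topology.get("current_roles") or [])
    expected_roles = set(plan.get("current_roles") or [])
    if actual_roles != expected_roles:
        missing = sorted(expected_roles - actual_roles)
        unexpected = sorted(actual_roles - expected_roles)
        findings.append(f"current_roles: missing {missing}, unexpected {unexpected}")

    return findings


# Hypothetical plan for an API node in the standalone-server shape.
API_NODE_PLAN = {
    "current_shape": "standalone_server",
    "current_process_class": "server_http_node",
    "current_roles": ["api_ingress", "control_plane", "matching", "history_projection"],
}

# Hypothetical response body from a node that was expected to be an API node.
reported = {
    "topology": {
        "current_shape": "standalone_server",
        "current_process_class": "scheduler_node",
        "current_roles": ["scheduler"],
    }
}

for finding in topology_drift(reported, API_NODE_PLAN):
    print(finding)
```

Any non-empty result is a reason to treat queue and failover baselines as suspect until the node identity is corrected.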
94399466
## Blocking and advisory diagnostics
94409467

94419468
Durable Workflow v2 separates blocking diagnostics from advisory diagnostics.
@@ -9544,6 +9571,32 @@ Use `operator_metrics.starts.*` when new workflow starts appear stuck even
95449571
though steady-state queue lag looks normal. Those facts separate control-plane
95459572
start admission and first-task creation debt from downstream worker pickup.
95469573

9574+
### Poller pressure and admission budgets
9575+
9576+
Use task-queue detail routes or `dw task-queue:describe` when queue flow is
9577+
degrading and you need to separate "not enough worker capacity" from
9578+
"intentional server throttling" or "no live poller at all":
9579+
9580+
| Queue status | Meaning | Treat it as |
9581+
| --- | --- | --- |
9582+
| `accepting` | Workers still have available slots and no server cap is full. | Healthy baseline. |
9583+
| `saturated` | All registered worker slots are currently leased. | Worker-capacity pressure. |
9584+
| `throttled` | A server-side active-lease or dispatch-rate cap is intentionally holding the queue back. | Advisory unless the cap is unexpected or the backlog keeps growing beyond the published baseline. |
9585+
| `no_slots` | Active workers are registered, but none advertise slots for that task kind. | Blocking for that queue. |
9586+
| `no_active_workers` | No healthy poller is currently serving the queue. | Blocking for that queue. |
9587+
| `unavailable` | The queue cannot acquire the lock needed for its configured admission path. | Blocking until the admission dependency recovers. |
9588+
9589+
Use these statuses together with the queue-flow facts:
9590+
9591+
- `tasks_added_last_minute > tasks_dispatched_last_minute` plus `saturated`
9592+
means durable inflow is outrunning worker capacity.
9593+
- The same rate imbalance plus `throttled` means the queue is being held back
9594+
by an explicit server cap and should be judged against that cap's intended
9595+
contract, not against unrestricted throughput.
9596+
- A rising oldest-ready age plus `no_active_workers` or stale pollers means the
9597+
queue has lost healthy claimers and should be treated as a routing outage for
9598+
that scope.
9599+
95479600
### Matching-role deployment shape
95489601

95499602
Use `operator_metrics.matching_role.*` when you need to confirm which
@@ -9623,7 +9676,7 @@ window for the topology you operate.
96239676
| Blocking readiness | `workflow:v2:doctor --strict`, `GET /waterline/api/v2/health` | Blocking | `doctor --strict` returns an error or the health endpoint returns `status = error` / HTTP `503` | Stop rollout or traffic shift, fix the blocking prerequisite, then rerun readiness and compatibility checks. |
96249677
| Compatible-worker coverage | `operator_metrics.workers.*`, `worker_compatibility` health check, run diagnostic `no_compatible_worker_for_task` | Blocking | `active_workers_supporting_required = 0` for a namespace or required `(connection, queue)` scope | Drain incompatible workers, register compatible workers, and confirm the `correctness` rollup clears before trusting new claims. |
96259678
| Durable queue lag | Waterline queue views, `operator_metrics.backlog.*`, worker `schedule_to_start` telemetry | Blocking when sustained; advisory when brief | The oldest ready-task age or schedule-to-start latency stays above the published topology baseline while compatible workers are available | Add worker capacity, inspect task-queue admission limits, and verify the scheduler or matching path is still making forward progress. |
9626-
| Worker-slot or poller pressure | Server task-queue visibility routes, `dw task-queue:describe`, worker registrations | Advisory, escalating to blocking if durable lag grows | A hot queue stays `saturated`, `no_slots`, `no_active_workers`, or `unavailable`, or keeps flipping between those states while queue age climbs | Distinguish intentional `throttled` backpressure from accidental starvation, then either add worker slots, restore healthy pollers, or repair the cache-lock path before scaling other components. |
9679+
| Poller pressure and admission saturation | Task-queue detail routes, `dw task-queue:describe`, queue `status`, stale pollers, and queue-local add/dispatch rates | Blocking for `no_active_workers`, `no_slots`, or `unavailable`; advisory for intentional `throttled` states | One queue stays `saturated` while its oldest-ready age and add-vs-dispatch gap keep growing, or any queue flips to `no_active_workers`, `no_slots`, or `unavailable` outside a planned maintenance window | Add worker slots, restore the missing poller cohort, or confirm the server-side cap and lock dependency are behaving as designed before you scale blindly. |
96279680
| Workflow-start backlog | `operator_metrics.starts.*`, control-plane start telemetry, worker `schedule_to_start` telemetry for first workflow tasks | Blocking when sustained; advisory when brief | `pending_commands`, `ready_tasks`, or `max_pending_ms` stay above the published topology baseline while compatible workers and queue capacity are available | Inspect the start boundary end to end: confirm start commands are turning into durable tasks, verify matching or dispatch is creating the first task promptly, and separate start-path debt from general worker lag before scaling. |
96289681
| Projection drift and repair debt | `run_summary_projection` / `selected_run_projections` health checks, `operator_metrics.repair.*` | Advisory | Drift warnings persist past one planned rebuild window or the max candidate age keeps climbing | Run the rebuild or repair previews, execute the repair, then verify the warning clears and stale ages return to baseline. |
96299682
| Retry or failure storm | `operator_metrics.backlog.unhealthy_tasks`, durable run diagnostics, worker error telemetry | Advisory, escalating to blocking if it prevents durable progress | Dispatch-failed, claim-failed, expired-lease, or retry-exhaustion facts climb above the topology baseline and stay elevated | Inspect the failing task family, compare worker telemetry with durable error facts, and decide whether to drain traffic or isolate the affected queue. |
@@ -9754,7 +9807,7 @@ traffic depends on them:
97549807
| Dimension | What to baseline | Source |
97559808
| --- | --- | --- |
97569809
| Projection health | Steady-state `needs_rebuild = 0`, rebuild duration after intentional drift, and stale/orphan cleanup time | `/waterline/api/v2/health`, `/waterline/api/stats`, `workflow:v2:rebuild-projections` |
9757-
| Queue pressure | Backlog age, oldest ready task age, runnable vs delayed task counts, task add vs dispatch rate, dispatch-overdue age, stale poller count | Waterline dashboard stats and queue views plus `operator_metrics.backlog.*` / `operator_metrics.tasks.*` |
9810+
| Queue pressure | Backlog age, oldest ready task age, runnable vs delayed task counts, task add vs dispatch rate, dispatch-overdue age, stale poller count, and queue admission status (`accepting`, `saturated`, `throttled`, `no_slots`, `no_active_workers`, `unavailable`) | Waterline dashboard stats and queue views plus `operator_metrics.backlog.*` / `operator_metrics.tasks.*` |
97589811
| Workflow-start latency | Accepted start commands waiting for first-task creation, oldest pending-start age, and first-task pickup after admission | `operator_metrics.starts.*` plus worker `schedule_to_start` telemetry |
97599812
| Schedule-to-start latency | Workflow and activity queue wait from enqueue to start | Worker SDK metrics |
97609813
| Timer fan-out wake-up behavior | Wake-signal propagation time and the lag between scheduled fire time and ready-task visibility during burst timers | Worker telemetry plus same-region wake coordination checks |
@@ -12182,6 +12235,8 @@ still honors the value.
1218212235
| --- | --- | --- | --- |
1218312236
| `DW_MODE` | `service` | Server mode: `service` makes external workers poll; `embedded` dispatches locally through the Laravel queue. | `WORKFLOW_SERVER_MODE` |
1218412237
| `DW_SERVER_ID` | `gethostname()` | Unique server instance identifier used in lease ownership and worker registration. | `WORKFLOW_SERVER_ID` |
12238+
| `DW_SERVER_TOPOLOGY_SHAPE` | `standalone_server` | Deployment shape advertised from `GET /api/cluster/info` under `topology.current_shape`. Use it to distinguish `embedded`, `standalone_server`, and `split_control_execution` nodes. | `WORKFLOW_SERVER_TOPOLOGY_SHAPE` |
12239+
| `DW_SERVER_PROCESS_CLASS` | `server_http_node` | Process class advertised from `GET /api/cluster/info` under `topology.current_process_class`. Use it to label API, scheduler, matching, and execution nodes correctly during split-role rollouts. | `WORKFLOW_SERVER_PROCESS_CLASS` |
1218512240
| `DW_SERVER_KEY` | generated at container boot | Optional server-internal runtime key. Docker images generate one automatically when unset. | - |
1218612241
| `DW_DEFAULT_NAMESPACE` | `default` | Namespace used when a request omits the namespace header. | `WORKFLOW_SERVER_DEFAULT_NAMESPACE` |
1218712242
| `DW_TASK_DISPATCH_MODE` | unset | Overrides `workflows.v2.task_dispatch_mode`; in service mode the server defaults to `poll` unless you set a different value. | `WORKFLOW_V2_TASK_DISPATCH_MODE` |
