Surface dispatch-failed age on operator metrics and task_transport health
Freezes `operator_metrics.tasks.oldest_dispatch_failed_at` (ISO-8601
or null) and `operator_metrics.tasks.max_dispatch_failed_age_ms`
(integer ms) on `OperatorMetrics::snapshot()`. The pair mirrors the
existing `oldest_claim_failed_at` / `max_claim_failed_age_ms` shape
on the claim path but for the dispatch path, so operators can read
"how long has the worst-case task been sitting with an uncleared
dispatch error?" — the primary transport-failure age indicator on
the dispatch path — from the metric alone without walking
`workflow_tasks`. Forwards the pair plus `dispatch_failed_tasks` on
`HealthCheck::taskTransportCheck()` data so the dispatch-failed
shape sits next to claim-failed, dispatch-overdue, ready-due, and
lease-expired on the same task_transport check.
The dispatch-failed predicate matches `applyDispatchFailed()`
exactly: Ready tasks whose most recent `last_dispatch_attempt_at`
recorded a non-empty `last_dispatch_error` that has not been
superseded by a later successful `last_dispatched_at`. Tasks whose
dispatch error has been cleared, whose last dispatch attempt has
been superseded by a successful dispatch, or whose status has
moved past Ready (Leased, etc.) are excluded so the signal isolates
the active dispatch-failure cohort from healthy and progressing
work.
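
As an illustration only, the predicate and the derived pair can be sketched in Python. This is not the project's implementation. The `Task` dataclass, the `"ready"` status string, both function names, the treatment of an equal timestamp as "not superseded", and the `0` age for an empty cohort are all assumptions made for this sketch. Only the field and metric names come from the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional


@dataclass
class Task:
    # Hypothetical in-memory stand-in for a `workflow_tasks` row.
    status: str
    last_dispatch_attempt_at: Optional[datetime] = None
    last_dispatch_error: Optional[str] = None
    last_dispatched_at: Optional[datetime] = None


def is_dispatch_failed(task: Task) -> bool:
    """Ready task with an uncleared dispatch error that no later
    successful dispatch has superseded."""
    if task.status != "ready":  # Leased etc. are excluded
        return False
    if task.last_dispatch_attempt_at is None or not task.last_dispatch_error:
        return False  # never attempted, or the error has been cleared
    dispatched = task.last_dispatched_at
    # Assumption: only a strictly later success supersedes the failure.
    return dispatched is None or dispatched <= task.last_dispatch_attempt_at


def dispatch_failed_metrics(tasks: Iterable[Task], now: datetime) -> dict:
    """Count the dispatch-failed cohort and derive the age pair."""
    failed = [t for t in tasks if is_dispatch_failed(t)]
    if not failed:
        # Assumption: an empty cohort reports null / 0.
        return {
            "dispatch_failed_tasks": 0,
            "oldest_dispatch_failed_at": None,
            "max_dispatch_failed_age_ms": 0,
        }
    oldest = min(t.last_dispatch_attempt_at for t in failed)
    # Integer milliseconds, clamped so clock skew never yields a negative age.
    age_ms = max(0, (now - oldest) // timedelta(milliseconds=1))
    return {
        "dispatch_failed_tasks": len(failed),
        "oldest_dispatch_failed_at": oldest.isoformat(),
        "max_dispatch_failed_age_ms": age_ms,
    }
```

The largest age is computed from the earliest `last_dispatch_attempt_at`, so the two values always describe the same worst-case task.
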
Pins the row on `docs/architecture/rollout-safety.md` and adds a
dispatch-failed-age bullet, guarded by
`RolloutSafetyDocumentationTest::testContractDocumentFreezesDispatchFailedAgeRow`.
docs/architecture/rollout-safety.md (1 addition, 0 deletions)
@@ -413,6 +413,7 @@ change.
|`tasks`|`oldest_ready_due_at`, `max_ready_due_age_ms`| earliest "ready since" timestamp among ready-due tasks (the effective `COALESCE(available_at, created_at)` — `available_at` when the task was delayed, otherwise the creation time that made it immediately actionable) and the largest ready-age in milliseconds, mirroring the `oldest_lease_expired_at` / `max_lease_expired_age_ms` shape so operators can read queue latency ("how long has the oldest actionable task been waiting to dispatch?") from the metric alone without walking `workflow_tasks`|
|`tasks`|`oldest_dispatch_overdue_since`, `max_dispatch_overdue_age_ms`| earliest `COALESCE(last_dispatched_at, created_at)` among dispatch-overdue tasks — the timestamp the worst-case ready-but-unclaimed task has been waiting for a successful dispatch wake since (either its last attempted dispatch that didn't stick or its creation time if it was never dispatched) — and the largest age in milliseconds, mirroring the `oldest_ready_due_at` / `max_ready_due_age_ms` shape so operators can read wake-latency ("how long has the oldest ready-but-unclaimed task been waiting for a working dispatch wake?") from the metric alone without walking `workflow_tasks`|
|`tasks`|`oldest_claim_failed_at`, `max_claim_failed_age_ms`| earliest `last_claim_failed_at` among claim-failed tasks (Ready tasks whose most recent claim attempt recorded an uncleared `last_claim_error`) and the largest claim-failed age in milliseconds, mirroring the `oldest_dispatch_overdue_since` / `max_dispatch_overdue_age_ms` shape for the dispatch path so operators can read "how long has the worst-case task been sitting with an uncleared claim error?" — the primary lease-conflict and duplicate-risk age indicator for the claim path — from the metric alone without walking `workflow_tasks`|
+|`tasks`|`oldest_dispatch_failed_at`, `max_dispatch_failed_age_ms`| earliest `last_dispatch_attempt_at` among dispatch-failed tasks (Ready tasks whose most recent dispatch attempt recorded an uncleared `last_dispatch_error` that has not been superseded by a later successful dispatch) and the largest dispatch-failed age in milliseconds, mirroring the `oldest_claim_failed_at` / `max_claim_failed_age_ms` shape for the claim path so operators can read "how long has the worst-case task been sitting with an uncleared dispatch error?" — the primary transport-failure age indicator on the dispatch path — from the metric alone without walking `workflow_tasks`|
|`tasks`|`unhealthy`| sum of transport failure and lease expiry counts (the primary duplicate-risk indicator) |
|`activities`|`retrying`, `oldest_retrying_started_at`, `max_retrying_age_ms`| activity executions currently in the retry window (Pending status with `attempt_count > 0`), the earliest `started_at` among them, and the largest retrying age in milliseconds, mirroring the `tasks.oldest_lease_expired_at` / `max_lease_expired_age_ms` shape on the task path so operators can answer "how long has the worst-case activity been chewing retries?" — the primary retry-rate age indicator on the activity path — from the metric alone without walking `activity_executions`|
'Rollout safety contract must pin the tasks dispatch-failed age row so operators can read "how long has the worst-case task been sitting with an uncleared dispatch error?" — the primary transport-failure age indicator on the dispatch path — from OperatorMetrics::snapshot() without walking workflow_tasks.',