Surface claim-failed age on operator metrics and task_transport health
Freezes `operator_metrics.tasks.oldest_claim_failed_at` (ISO-8601 or
null) and `operator_metrics.tasks.max_claim_failed_age_ms` (integer
ms) on `OperatorMetrics::snapshot()`. The pair applies the existing
`oldest_dispatch_overdue_since` / `max_dispatch_overdue_age_ms` shape
from the dispatch path to the claim path, so operators can read
"how long has the worst-case task been sitting with an uncleared
claim error?" (the primary lease-conflict and duplicate-risk age
indicator on the claim path) from the metric alone, without walking
`workflow_tasks`. Also forwards the pair, plus `claim_failed_tasks`,
in the `HealthCheck::taskTransportCheck()` data.
Pins the row on `docs/architecture/rollout-safety.md` and adds a
stuck-detectors bullet for claim-failed, guarded by
`RolloutSafetyDocumentationTest::testContractDocumentFreezesClaimFailedAgeRow`.
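
As an illustration of the contract described above, here is a minimal Python sketch of how such a pair could be derived from task rows. It is not the project's `OperatorMetrics::snapshot()` code. The `Task` shape, the `claim_failed_age` helper, the `"ready"` status string, and the choice to report `0` for the age when no task qualifies are all assumptions made for this example. Only the field names `last_claim_failed_at` and `last_claim_error` and the three output keys come from the commit.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Task:
    # Hypothetical row shape: only the columns the contract row names.
    status: str
    last_claim_failed_at: Optional[datetime]
    last_claim_error: Optional[str]


def claim_failed_age(tasks: list[Task], now: datetime) -> dict:
    """Summarise claim-failed tasks as (count, oldest timestamp, max age in ms)."""
    # A task counts as claim-failed when it is ready and still carries
    # an uncleared claim error together with a failure timestamp.
    failed_at = [
        t.last_claim_failed_at
        for t in tasks
        if t.status == "ready"  # assumed status value
        and t.last_claim_error is not None
        and t.last_claim_failed_at is not None
    ]
    if not failed_at:
        # Assumption: null timestamp and zero age when nothing qualifies.
        return {
            "claim_failed_tasks": 0,
            "oldest_claim_failed_at": None,
            "max_claim_failed_age_ms": 0,
        }
    oldest = min(failed_at)
    # Integer milliseconds, clamped so clock skew never yields a negative age.
    age_ms = max(0, (now - oldest) // timedelta(milliseconds=1))
    return {
        "claim_failed_tasks": len(failed_at),
        "oldest_claim_failed_at": oldest.isoformat(),
        "max_claim_failed_age_ms": age_ms,
    }
```

Because the oldest failure timestamp and the largest age describe the same worst-case task, the sketch derives the age directly from the minimum timestamp rather than scanning the rows twice.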
docs/architecture/rollout-safety.md
13 additions & 0 deletions
@@ -412,6 +412,7 @@ change.
|`tasks`|`oldest_lease_expired_at`, `max_lease_expired_age_ms`| earliest `lease_expires_at` among leased tasks whose lease has expired at snapshot time and the largest expired-lease age in milliseconds, mirroring the `backlog.oldest_compatibility_blocked_started_at` / `max_compatibility_blocked_age_ms` shape so operators can answer "how long has the worst leased task been expired without redelivery?" (the primary stuck-lease duplicate-risk age indicator) from the metric alone |
|`tasks`|`oldest_ready_due_at`, `max_ready_due_age_ms`| earliest "ready since" timestamp among ready-due tasks (the effective `COALESCE(available_at, created_at)` — `available_at` when the task was delayed, otherwise the creation time that made it immediately actionable) and the largest ready-age in milliseconds, mirroring the `oldest_lease_expired_at` / `max_lease_expired_age_ms` shape so operators can read queue latency ("how long has the oldest actionable task been waiting to dispatch?") from the metric alone without walking `workflow_tasks`|
|`tasks`|`oldest_dispatch_overdue_since`, `max_dispatch_overdue_age_ms`| earliest `COALESCE(last_dispatched_at, created_at)` among dispatch-overdue tasks — the timestamp the worst-case ready-but-unclaimed task has been waiting for a successful dispatch wake since (either its last attempted dispatch that didn't stick or its creation time if it was never dispatched) — and the largest age in milliseconds, mirroring the `oldest_ready_due_at` / `max_ready_due_age_ms` shape so operators can read wake-latency ("how long has the oldest ready-but-unclaimed task been waiting for a working dispatch wake?") from the metric alone without walking `workflow_tasks`|
+|`tasks`|`oldest_claim_failed_at`, `max_claim_failed_age_ms`| earliest `last_claim_failed_at` among claim-failed tasks (Ready tasks whose most recent claim attempt recorded an uncleared `last_claim_error`) and the largest claim-failed age in milliseconds, mirroring the `oldest_dispatch_overdue_since` / `max_dispatch_overdue_age_ms` shape for the dispatch path so operators can read "how long has the worst-case task been sitting with an uncleared claim error?" — the primary lease-conflict and duplicate-risk age indicator for the claim path — from the metric alone without walking `workflow_tasks`|
|`tasks`|`unhealthy`| sum of transport failure and lease expiry counts (the primary duplicate-risk indicator) |
'Rollout safety contract must pin the tasks claim-failed age row so operators can read "how long has the worst-case task been sitting with an uncleared claim error?" — a lease-conflict and duplicate-risk indicator on the claim path — from OperatorMetrics::snapshot() without walking workflow_tasks.',