Surface dispatch-overdue wake-latency age on operator metrics and task_transport health
Adds tasks.oldest_dispatch_overdue_since and tasks.max_dispatch_overdue_age_ms
to OperatorMetrics::snapshot() so operators can answer "how long has the
oldest ready-but-unclaimed task been waiting for a working dispatch wake?"
from the metric alone, completing the coordination-health age set alongside
tasks.oldest_lease_expired_at, tasks.oldest_ready_due_at, runs.oldest_wait_started_at,
and backlog.oldest_compatibility_blocked_started_at.
The age is measured from the effective COALESCE(last_dispatched_at,
created_at): the moment since which the task has been waiting for a
successful dispatch (its last attempted dispatch that didn't stick, or its
creation time if it was never dispatched). It is computed across the
dispatch_overdue subset only. Refactors the existing filter into a private
dispatchOverdueQuery() helper so the count and the age use one authoritative
query.
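As an illustration only (Python with SQLite, not the project's code), the
count and the age can be read from one filtered query roughly as below. The
table layout, the epoch-millisecond columns, and the is_overdue flag are
assumptions for the sketch; the real dispatch_overdue filter is not shown in
this commit message.

```python
import sqlite3

# Hypothetical stand-in for the real dispatch-overdue filter.
OVERDUE_FILTER = "is_overdue = 1"


def build_fixture():
    """Create an in-memory table with mixed overdue and healthy tasks."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE workflow_tasks (
               id INTEGER PRIMARY KEY,
               created_at INTEGER NOT NULL,   -- epoch milliseconds
               last_dispatched_at INTEGER,    -- NULL if never dispatched
               is_overdue INTEGER NOT NULL    -- assumed flag, see above
           )"""
    )
    conn.executemany(
        "INSERT INTO workflow_tasks VALUES (?, ?, ?, ?)",
        [
            (1, 1000, 5000, 1),  # overdue, waiting since last dispatch (5000)
            (2, 3000, None, 1),  # overdue, never dispatched (waiting since 3000)
            (3, 500, None, 0),   # healthy, must not influence the age
        ],
    )
    return conn


def dispatch_overdue_metrics(conn, now_ms):
    """Return the count and the oldest effective wait start from one query."""
    count, oldest = conn.execute(
        "SELECT COUNT(*), MIN(COALESCE(last_dispatched_at, created_at)) "
        f"FROM workflow_tasks WHERE {OVERDUE_FILTER}"
    ).fetchone()
    # MIN over an empty set is NULL, which maps to a null timestamp and age 0.
    age_ms = 0 if oldest is None else max(0, now_ms - oldest)
    return {
        "dispatch_overdue": count,
        "oldest_dispatch_overdue_since": oldest,
        "max_dispatch_overdue_age_ms": age_ms,
    }
```

Because both values come from the same WHERE clause, the count and the age
cannot drift apart, which is the point of sharing one query helper.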
Forwards the count and both age fields on HealthCheck::taskTransportCheck()
data as dispatch_overdue_tasks / oldest_dispatch_overdue_since /
max_dispatch_overdue_age_ms so wake-latency is legible from /healthz without
re-reading the metrics snapshot. The check's escalation predicate stays
unchanged (it still escalates only on tasks.unhealthy, which already counts
dispatch_overdue); the age data is observability-only, so operators can tell
"dispatch wake is sporadically slow" apart from "dispatch wake has stalled
on this task for minutes".
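A rough sketch of that split between escalation and observability, again in
Python for illustration. The function name, the "ok"/"warning" status
values, and the dict layout are hypothetical; only the three forwarded key
names and the tasks.unhealthy predicate come from the description above.

```python
def task_transport_check(tasks):
    """Build health-check output from the `tasks` section of a snapshot."""
    # Escalation looks at tasks.unhealthy only; the age never changes status.
    status = "warning" if tasks.get("unhealthy", 0) > 0 else "ok"
    return {
        "status": status,
        "data": {
            "dispatch_overdue_tasks": tasks.get("dispatch_overdue", 0),
            "oldest_dispatch_overdue_since": tasks.get(
                "oldest_dispatch_overdue_since"
            ),
            "max_dispatch_overdue_age_ms": tasks.get(
                "max_dispatch_overdue_age_ms", 0
            ),
        },
    }
```

Two snapshots with the same unhealthy count but very different ages yield
the same status here; the difference is visible only in the forwarded data.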
Pins the new keys in the docs/architecture/rollout-safety.md frozen metric
table and "Ready but unclaimed" stuck-detector entry, adds a
dispatch-overdue-age row assertion in RolloutSafetyDocumentationTest, and
extends the task_transport HealthCheckTest assertions to pin the new fields
in the health check data. Extends V2OperatorMetricsTest with two focused
tests proving the metric selects the earliest effective dispatch moment
across mixed dispatched/never-dispatched/failed/healthy task fixtures and
returns null/0 when no tasks are overdue, plus broad-snapshot and
task_transport forward assertions on the existing mixed-state fixture.
docs/architecture/rollout-safety.md: 11 additions & 2 deletions
@@ -411,6 +411,7 @@ change.
|`tasks`|`dispatch_overdue`, `lease_expired`| lease and dispatch timing |
|`tasks`|`oldest_lease_expired_at`, `max_lease_expired_age_ms`| earliest `lease_expires_at` among leased tasks whose lease has expired at snapshot time and the largest expired-lease age in milliseconds, mirroring the `backlog.oldest_compatibility_blocked_started_at` / `max_compatibility_blocked_age_ms` shape so operators can answer "how long has the worst leased task been expired without redelivery?" (the primary stuck-lease duplicate-risk age indicator) from the metric alone |
|`tasks`|`oldest_ready_due_at`, `max_ready_due_age_ms`| earliest "ready since" timestamp among ready-due tasks (the effective `COALESCE(available_at, created_at)` — `available_at` when the task was delayed, otherwise the creation time that made it immediately actionable) and the largest ready-age in milliseconds, mirroring the `oldest_lease_expired_at` / `max_lease_expired_age_ms` shape so operators can read queue latency ("how long has the oldest actionable task been waiting to dispatch?") from the metric alone without walking `workflow_tasks`|
+
|`tasks`|`oldest_dispatch_overdue_since`, `max_dispatch_overdue_age_ms`| earliest `COALESCE(last_dispatched_at, created_at)` among dispatch-overdue tasks — the timestamp the worst-case ready-but-unclaimed task has been waiting for a successful dispatch wake since (either its last attempted dispatch that didn't stick or its creation time if it was never dispatched) — and the largest age in milliseconds, mirroring the `oldest_ready_due_at` / `max_ready_due_age_ms` shape so operators can read wake-latency ("how long has the oldest ready-but-unclaimed task been waiting for a working dispatch wake?") from the metric alone without walking `workflow_tasks`|
|`tasks`|`unhealthy`| sum of transport failure and lease expiry counts (the primary duplicate-risk indicator) |
'Rollout safety contract must pin the tasks dispatch-overdue age row so operators can read wake-latency ("how long has the oldest ready-but-unclaimed task been waiting for a working dispatch wake?") from OperatorMetrics::snapshot() without walking workflow_tasks.',