Surface run wait age on operator metrics and durable_resume_paths health

Adds runs.waiting, runs.oldest_wait_started_at, and runs.max_wait_age_ms
to OperatorMetrics::snapshot() so operators can answer "how long has the
worst-case run been waiting at a durable resume point?" from the metric
alone, mirroring the existing tasks.oldest_lease_expired_at,
tasks.oldest_ready_due_at, and backlog.oldest_compatibility_blocked_started_at
shapes. The count covers every kind of wait (signal, update, timer, and
compatibility-blocked) because each is a durable resume point the system
is parked on; consumers that want to isolate the non-compatibility share
can subtract runs.compatibility_blocked.
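
As a rough illustration of the trio described above, here is a minimal Python sketch. It is not the project's implementation: `RunSummary` and `wait_metrics` are hypothetical names, the filter (`status_bucket = 'running'` and a non-null `wait_started_at`) is taken from the documented metric row, and reading "null/0" as a null timestamp with a zero count and zero age is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional


@dataclass
class RunSummary:
    # Hypothetical stand-in for one run-summary row.
    status_bucket: str
    wait_started_at: Optional[datetime]


def wait_metrics(runs: Iterable[RunSummary], now: datetime) -> dict:
    """Compute waiting / oldest_wait_started_at / max_wait_age_ms."""
    # Only running runs with a recorded wait start count as waiting.
    starts = [
        r.wait_started_at
        for r in runs
        if r.status_bucket == "running" and r.wait_started_at is not None
    ]
    if not starts:
        # Assumed empty shape: zero count, null timestamp, zero age.
        return {"waiting": 0, "oldest_wait_started_at": None, "max_wait_age_ms": 0}

    oldest = min(starts)
    # The earliest wait start is by definition the largest wait age.
    age_ms = max(0, (now - oldest) // timedelta(milliseconds=1))
    return {
        "waiting": len(starts),
        "oldest_wait_started_at": oldest,
        "max_wait_age_ms": age_ms,
    }
```

Under these assumptions, a consumer interested only in non-compatibility waits would subtract the compatibility-blocked count from `waiting`, as the paragraph above suggests.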

Forwards the same trio in the data of HealthCheck::durableResumePathCheck() as
waiting_runs / oldest_wait_started_at / max_wait_age_ms so the wait age
is legible from /healthz without re-reading the metrics snapshot. The
check's escalation predicate is unchanged (it still escalates only on
repair_needed_runs); the wait-age data is observability only, since a
long-parked wait is not by itself a stuck condition the system can repair
without application-level action.
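
The split between escalation and observability can be sketched as follows. This is a hedged Python illustration, not the real `HealthCheck` code: the function name, the metrics dictionary layout, and the `"ok"` / `"warning"` status strings are assumptions; only the field names and the rule that the check escalates solely on `repair_needed_runs` come from the message above.

```python
def durable_resume_path_check(metrics: dict) -> dict:
    """Build a health-check result that forwards wait-age data unchanged."""
    runs = metrics.get("runs", {})
    repair_needed = runs.get("repair_needed", 0)

    return {
        # Escalation depends on repair-needed runs only (assumed status names).
        "status": "warning" if repair_needed > 0 else "ok",
        "data": {
            "repair_needed_runs": repair_needed,
            # Wait-age fields are informational and never change the status.
            "waiting_runs": runs.get("waiting", 0),
            "oldest_wait_started_at": runs.get("oldest_wait_started_at"),
            "max_wait_age_ms": runs.get("max_wait_age_ms", 0),
        },
    }
```

In this sketch a run that has been parked for days still yields an `"ok"` status as long as no run needs repair, which matches the stated intent that a long wait alone is not a repairable condition.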

Pins the new keys in the frozen metric table of
docs/architecture/rollout-safety.md and adds a wait-row regression
assertion in RolloutSafetyDocumentationTest. Extends V2OperatorMetricsTest
with two focused tests proving that the metric counts non-compatibility
waits and returns null/0 when no runs are waiting, plus an assertion on the
existing fixture in the broad metrics test. Extends the
testSnapshotWarnsWhenRunSummaryProjectionSchemaIsOutdated assertion in
HealthCheckTest to pin the new health-check fields.

docs/architecture/rollout-safety.md (13 additions, 0 deletions)
@@ -405,6 +405,7 @@ change.
 | `runs` | `repair_needed` | open runs with `liveness_state = repair_needed` |
 | `runs` | `claim_failed` | runs whose most recent task claim failed |
 | `runs` | `compatibility_blocked` | runs blocked by compatibility mismatch |
+| `runs` | `waiting`, `oldest_wait_started_at`, `max_wait_age_ms` | running runs currently parked at a durable resume point (`status_bucket = 'running'` and `wait_started_at IS NOT NULL`), the earliest `wait_started_at` among them, and the largest wait age in milliseconds. Mirrors the `backlog.oldest_compatibility_blocked_started_at` / `max_compatibility_blocked_age_ms` and `tasks.oldest_lease_expired_at` / `max_lease_expired_age_ms` shapes so operators can answer "how long has the worst-case run been waiting at a signal, update, timer, or compatible-worker arrival?" from the metric alone. The signal is unconditional and includes compatibility-blocked waits; consumers that want the non-compatibility share can subtract `runs.compatibility_blocked` and `backlog.oldest_compatibility_blocked_started_at`. |
 | `tasks` | `ready`, `ready_due`, `delayed`, `leased` | queue depth by phase |
 | `tasks` | `dispatch_failed`, `claim_failed` | transport failure counts |
 | `tasks` | `dispatch_overdue`, `lease_expired` | lease and dispatch timing |
@@ -494,6 +495,18 @@ are authoritative and how they surface.
 - **Stale projection.** A projection behind the authoritative
   history surfaces through the `run_summary_projection` and
   `selected_run_projections` checks on `HealthCheck::snapshot()`.
+- **Long-parked wait.** A running run whose projector has recorded
+  a `wait_started_at` is counted under `runs.waiting`, and its
+  worst-case wait age is surfaced through `runs.oldest_wait_started_at`
+  and `runs.max_wait_age_ms`, both forwarded on the
+  `durable_resume_paths` health check (`waiting_runs`,
+  `oldest_wait_started_at`, `max_wait_age_ms`). The signal includes
+  every kind of wait — signal, update, timer, and compatibility-blocked
+  wait — because each is a durable resume point the system is
+  parked on. The check itself escalates only on `repair_needed_runs`;
+  the wait-age data is observability so operators can decide whether
+  the worst-case wait reflects healthy long-running work or a lost
'Rollout safety contract must pin the runs wait-age row so operators can read "how long has the worst-case run been waiting at a durable resume point?" from OperatorMetrics::snapshot() without scanning workflow_run_summaries.',