Commit 402470f

Surface repair-needed age on operator metrics
Adds `runs.oldest_repair_needed_at` and `runs.max_repair_needed_age_ms` to `OperatorMetrics::snapshot()` so operators can read "how long has the worst-case run been stuck without progress?" from the metric alone, without walking `workflow_run_summaries`. This is the canonical stuck-workflow duplicate-risk age indicator, paired with the `durable_resume_paths` health check.

The summary's `updated_at` is sourced by `RunSummaryProjector` from `WorkflowRun::last_progress_at`, so it advances when the run makes forward progress and stalls when the run stops progressing. For runs already pinned at `liveness_state = repair_needed` it is therefore the closest available proxy for "when this run last made progress before being marked broken." The metric falls back to the run's `started_at` when the projection has not recorded a progress boundary (a fresh run that was projected straight into `repair_needed` without a prior progress write).

The `repair_needed` predicate matches `liveness_state = 'repair_needed'` exactly, mirroring the existing count under `runs.repair_needed`. It deliberately excludes the routing-blocked variant `workflow_task_waiting_for_compatible_worker` (a wait state, not a broken state); compatibility-blocked age is already surfaced under `backlog.oldest_compatibility_blocked_started_at` / `max_compatibility_blocked_age_ms`. The new signal pairs with that routing-block age the same way `tasks.oldest_unhealthy_at` / `max_unhealthy_age_ms` rolls up the four contributing per-path ages.

Pinned in the Frozen metric keys table of `docs/architecture/rollout-safety.md` and asserted by `RolloutSafetyDocumentationTest` (frozen-keys list plus a dedicated row regex).

Covered end-to-end by two new `V2OperatorMetricsTest` cases. They verify that the predicate matches `liveness_state` exactly (compatibility-blocked rows do not contribute their older `updated_at`), that the oldest stuck-since timestamp wins across multiple `repair_needed` runs, and that the keys read as `null` / 0 when no run is in `repair_needed`.
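
As a rough illustration of reading the two new keys "from the metric alone", the sketch below turns the `runs` section of a snapshot into an optional alert message. The helper name `repairNeededAgeAlert`, the threshold parameter, and the message format are hypothetical and not part of this commit; only the three key names and the `null` / 0 "nothing stuck" convention come from the commit message.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical consumer-side helper (not part of this commit).
 *
 * @param array<string, mixed> $runs        The `runs` section of a snapshot.
 * @param int                  $thresholdMs Assumed alerting threshold in milliseconds.
 */
function repairNeededAgeAlert(array $runs, int $thresholdMs): ?string
{
    $oldestAt = $runs['oldest_repair_needed_at'] ?? null;
    $ageMs = (int) ($runs['max_repair_needed_age_ms'] ?? 0);

    // `null` / 0 means no run is currently in repair_needed.
    if ($oldestAt === null || $ageMs < $thresholdMs) {
        return null;
    }

    return sprintf(
        '%d run(s) in repair_needed; oldest stuck since %s (%d ms)',
        (int) ($runs['repair_needed'] ?? 0),
        $oldestAt,
        $ageMs,
    );
}
```

A caller would pass something like `$snapshot['runs']` together with a threshold chosen for its own deployment; a `null` return means there is nothing to report.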
1 parent b43d992 commit 402470f

4 files changed

Lines changed: 155 additions & 0 deletions

docs/architecture/rollout-safety.md

Lines changed: 1 addition & 0 deletions
@@ -403,6 +403,7 @@ change.
 | Section | Key | Meaning |
 | ------- | --- | ------- |
 | `runs` | `repair_needed` | open runs with `liveness_state = repair_needed` |
+| `runs` | `oldest_repair_needed_at`, `max_repair_needed_age_ms` | earliest "stuck since" timestamp across runs whose `liveness_state` is exactly `repair_needed` and the largest stuck age in milliseconds, mirroring the `backlog.oldest_compatibility_blocked_started_at` / `max_compatibility_blocked_age_ms` shape on the routing path so operators can read "how long has the worst-case run been stuck without progress?" — the canonical stuck-workflow duplicate-risk age indicator paired with the `durable_resume_paths` health check — from the metric alone without walking `workflow_run_summaries`. The summary's `updated_at` is sourced by `RunSummaryProjector` from `WorkflowRun::last_progress_at`, so it advances with forward progress and stalls when the run stops progressing; for runs already at `repair_needed` it is the closest available proxy for "when this run last made progress before being marked broken." Falls back to the run's `started_at` when the projection has not recorded a progress boundary |
 | `runs` | `claim_failed` | runs whose most recent task claim failed |
 | `runs` | `compatibility_blocked` | runs blocked by compatibility mismatch |
 | `runs` | `waiting`, `oldest_wait_started_at`, `max_wait_age_ms` | running runs currently parked at a durable resume point (`status_bucket = 'running'` and `wait_started_at IS NOT NULL`), the earliest `wait_started_at` among them, and the largest wait age in milliseconds. Mirrors the `backlog.oldest_compatibility_blocked_started_at` / `max_compatibility_blocked_age_ms` and `tasks.oldest_lease_expired_at` / `max_lease_expired_age_ms` shapes so operators can answer "how long has the worst-case run been waiting at a signal, update, timer, or compatible-worker arrival?" from the metric alone. The signal is unconditional and includes compatibility-blocked waits; consumers that want the non-compatibility share can subtract `runs.compatibility_blocked` and `backlog.oldest_compatibility_blocked_started_at`. |

src/V2/Support/OperatorMetrics.php

Lines changed: 38 additions & 0 deletions
@@ -93,6 +93,7 @@ private static function matchingRoleSnapshot(): array
     private static function runMetrics(CarbonInterface $now, ?string $namespace): array
     {
         $oldestWaitStartedAt = self::oldestRunWaitStartedAt($namespace);
+        $oldestRepairNeededAt = self::oldestRepairNeededRunAt($namespace);

         return [
             'total' => self::summaryQuery($namespace)->count(),
@@ -104,6 +105,10 @@ private static function runMetrics(CarbonInterface $now, ?string $namespace): ar
             'terminated' => self::summaryQuery($namespace)->where('status', RunStatus::Terminated->value)->count(),
             'archived' => self::summaryQuery($namespace)->whereNotNull('archived_at')->count(),
             'repair_needed' => self::summaryQuery($namespace)->where('liveness_state', 'repair_needed')->count(),
+            'oldest_repair_needed_at' => $oldestRepairNeededAt?->toJSON(),
+            'max_repair_needed_age_ms' => $oldestRepairNeededAt === null
+                ? 0
+                : (int) $oldestRepairNeededAt->diffInMilliseconds($now),
             'claim_failed' => self::claimFailedRuns($namespace),
             'compatibility_blocked' => self::compatibilityBlockedRuns($namespace),
             'waiting' => self::waitingRuns($namespace),
@@ -810,6 +815,39 @@ private static function claimFailedRuns(?string $namespace): int
             ->count();
     }

+    /**
+     * Earliest "stuck since" timestamp across runs whose liveness_state is
+     * exactly `repair_needed`. Rollout-safety surfaces this alongside
+     * `runs.repair_needed` so operators can answer "how long has the
+     * worst-case run been stuck without progress?" from the metric alone,
+     * the canonical stuck-workflow age indicator paired with the
+     * `durable_resume_paths` health check.
+     *
+     * The summary's `updated_at` is sourced by `RunSummaryProjector` from
+     * `WorkflowRun::last_progress_at`, so it advances when the run made
+     * forward progress and stalls when the run stopped progressing. For
+     * runs already pinned at `repair_needed` it is therefore the closest
+     * available proxy for "when this run last made progress before being
+     * marked broken." The summary `updated_at` is preferred; the run's
+     * `started_at` is the fallback when the projection did not record a
+     * progress boundary (a fresh run that was projected straight into
+     * `repair_needed` without a prior progress write).
+     */
+    private static function oldestRepairNeededRunAt(?string $namespace): ?CarbonInterface
+    {
+        /** @var WorkflowRunSummary|null $summary */
+        $summary = self::summaryQuery($namespace)
+            ->where('liveness_state', 'repair_needed')
+            ->orderByRaw('COALESCE(updated_at, started_at) asc')
+            ->first();
+
+        if (! $summary instanceof WorkflowRunSummary) {
+            return null;
+        }
+
+        return $summary->updated_at ?? $summary->started_at;
+    }
+
     /**
      * Open runs that are currently parked at a wait point — running runs whose
      * `RunSummaryProjector` has recorded a `wait_started_at` because they are
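
The selection rule in `oldestRepairNeededRunAt()` (exact `repair_needed` match, `updated_at` preferred over `started_at`, oldest timestamp wins) can be sketched without a database. This is a simplified stand-in over plain arrays, not the project's query: the row shape and both function names are assumptions, and it uses `DateTimeImmutable` instead of Carbon so that it runs on its own. Skipping rows that have neither timestamp is also a choice made for this sketch only.

```php
<?php

declare(strict_types=1);

/**
 * Illustrative stand-in for the summary query; the row shape is assumed.
 *
 * @param list<array{liveness_state: string, updated_at: ?string, started_at: ?string}> $rows
 */
function oldestRepairNeededAt(array $rows): ?DateTimeImmutable
{
    $oldest = null;

    foreach ($rows as $row) {
        // Exact match only: wait states such as
        // `workflow_task_waiting_for_compatible_worker` never contribute.
        if ($row['liveness_state'] !== 'repair_needed') {
            continue;
        }

        // Prefer the summary's updated_at, fall back to started_at.
        $stuckSince = $row['updated_at'] ?? $row['started_at'];
        if ($stuckSince === null) {
            continue; // sketch-only choice: ignore rows with no timestamp at all
        }

        $candidate = new DateTimeImmutable($stuckSince);
        if ($oldest === null || $candidate < $oldest) {
            $oldest = $candidate;
        }
    }

    return $oldest;
}

/** Age of the oldest stuck run in whole milliseconds; 0 when there is none. */
function maxRepairNeededAgeMs(?DateTimeImmutable $oldest, DateTimeImmutable $now): int
{
    if ($oldest === null) {
        return 0;
    }

    // Integer arithmetic at millisecond resolution avoids float rounding.
    return ($now->getTimestamp() - $oldest->getTimestamp()) * 1000
        + ((int) $now->format('v') - (int) $oldest->format('v'));
}
```

With rows stuck 90 s and 30 s ago plus a compatibility-blocked row updated two hours ago, the 90 s row is selected, which matches the scenario that the first new `V2OperatorMetricsTest` case sets up.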

tests/Feature/V2/V2OperatorMetricsTest.php

Lines changed: 103 additions & 0 deletions
@@ -1590,6 +1590,109 @@ public function testSnapshotReportsRunWaitAgeAsZeroWhenNoRunsAreWaiting(): void
         $this->assertSame(0, $snapshot['runs']['max_wait_age_ms']);
     }

+    public function testSnapshotSurfacesRepairNeededAgeFromOldestRepairNeededRun(): void
+    {
+        Carbon::setTestNow('2026-04-09 12:00:00');
+        $this->beforeApplicationDestroyed(static function (): void {
+            Carbon::setTestNow();
+        });
+
+        $now = Carbon::now();
+
+        // Worst-case: repair_needed run whose summary updated_at (sourced from
+        // last_progress_at) was 90s ago — the oldest stuck-since timestamp wins.
+        $stuckLongest = $this->createRunWithSummary(
+            instanceId: 'repair-needed-age-instance-stuck',
+            runId: '01JREPAIRSTKRUN0000000001',
+            status: 'running',
+            statusBucket: 'running',
+            livenessState: 'repair_needed',
+        );
+        WorkflowRunSummary::query()
+            ->whereKey($stuckLongest->id)
+            ->update([
+                'updated_at' => $now->copy()
+                    ->subSeconds(90),
+            ]);
+
+        // Newer repair_needed run — must be counted, but its 30s-ago updated_at
+        // must not win "oldest repair-needed since".
+        $stuckRecent = $this->createRunWithSummary(
+            instanceId: 'repair-needed-age-instance-recent',
+            runId: '01JREPAIRRECRUN0000000002',
+            status: 'running',
+            statusBucket: 'running',
+            livenessState: 'repair_needed',
+        );
+        WorkflowRunSummary::query()
+            ->whereKey($stuckRecent->id)
+            ->update([
+                'updated_at' => $now->copy()
+                    ->subSeconds(30),
+            ]);
+
+        // Compatibility-blocked run — has a repair_needed-shaped liveness state
+        // (`workflow_task_waiting_for_compatible_worker`) but liveness_state is
+        // NOT exactly `repair_needed`, so it must not be counted under
+        // runs.repair_needed even though its updated_at would win the oldest age.
+        $compatibilityBlocked = $this->createRunWithSummary(
+            instanceId: 'repair-needed-age-instance-compat',
+            runId: '01JREPAIRCMPRUN0000000003',
+            status: 'running',
+            statusBucket: 'running',
+            livenessState: 'workflow_task_waiting_for_compatible_worker',
+        );
+        WorkflowRunSummary::query()
+            ->whereKey($compatibilityBlocked->id)
+            ->update([
+                'updated_at' => $now->copy()
+                    ->subHours(2),
+            ]);
+
+        // Healthy running run — must not be counted regardless of updated_at.
+        $this->createRunWithSummary(
+            instanceId: 'repair-needed-age-instance-running',
+            runId: '01JREPAIRRUNRUN0000000004',
+            status: 'running',
+            statusBucket: 'running',
+            livenessState: 'running',
+        );
+
+        $snapshot = OperatorMetrics::snapshot($now);
+
+        $expectedOldestRepairNeededAt = $now->copy()
+            ->subSeconds(90)
+            ->toJSON();
+
+        $this->assertSame(2, $snapshot['runs']['repair_needed']);
+        $this->assertSame($expectedOldestRepairNeededAt, $snapshot['runs']['oldest_repair_needed_at']);
+        $this->assertSame(90 * 1000, $snapshot['runs']['max_repair_needed_age_ms']);
+    }
+
+    public function testSnapshotReportsRepairNeededAgeAsZeroWhenNoRunsAreRepairNeeded(): void
+    {
+        Carbon::setTestNow('2026-04-09 12:00:00');
+        $this->beforeApplicationDestroyed(static function (): void {
+            Carbon::setTestNow();
+        });
+
+        $now = Carbon::now();
+
+        $this->createRunWithSummary(
+            instanceId: 'repair-needed-none-instance',
+            runId: '01JREPAIRNONRUN0000000001',
+            status: 'running',
+            statusBucket: 'running',
+            livenessState: 'running',
+        );
+
+        $snapshot = OperatorMetrics::snapshot($now);
+
+        $this->assertSame(0, $snapshot['runs']['repair_needed']);
+        $this->assertNull($snapshot['runs']['oldest_repair_needed_at']);
+        $this->assertSame(0, $snapshot['runs']['max_repair_needed_age_ms']);
+    }
+
     public function testSnapshotSurfacesRetryingActivityAgeFromOldestRetryingActivity(): void
     {
         Carbon::setTestNow('2026-04-09 12:00:00');

tests/Unit/V2/RolloutSafetyDocumentationTest.php

Lines changed: 13 additions & 0 deletions
@@ -140,6 +140,8 @@ final class RolloutSafetyDocumentationTest extends TestCase
         'unhealthy',
         'oldest_unhealthy_at',
         'max_unhealthy_age_ms',
+        'oldest_repair_needed_at',
+        'max_repair_needed_age_ms',
         'runnable_tasks',
         'delayed_tasks',
         'leased_tasks',
@@ -460,6 +462,17 @@ public function testContractDocumentFreezesUnhealthyAgeRollupRow(): void
         );
     }

+    public function testContractDocumentFreezesRepairNeededRunAgeRow(): void
+    {
+        $contents = $this->documentContents();
+
+        $this->assertMatchesRegularExpression(
+            '/\|\s*`runs`\s*\|[^|]*`oldest_repair_needed_at`[^|]*`max_repair_needed_age_ms`/',
+            $contents,
+            'Rollout safety contract must pin the runs repair-needed age row so operators can read "how long has the worst-case run been stuck without progress?" — the canonical stuck-workflow duplicate-risk age indicator paired with the durable_resume_paths health check — from OperatorMetrics::snapshot() without walking workflow_run_summaries.',
+        );
+    }
+
     public function testContractDocumentFreezesHealthCheckNames(): void
     {
         $contents = $this->documentContents();
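
To see what the dedicated row regex accepts, the sketch below applies the same pattern to a few sample table rows. The pattern is copied from the test above; the wrapper function `pinsRepairNeededAgeRow` and the sample rows are made up for illustration. Because `[^|]*` cannot cross a `|`, both key names must sit in the `Key` cell directly after the `runs` section cell.

```php
<?php

declare(strict_types=1);

/** Hypothetical wrapper around the pattern used by the documentation test. */
function pinsRepairNeededAgeRow(string $markdown): bool
{
    $pattern = '/\|\s*`runs`\s*\|[^|]*`oldest_repair_needed_at`[^|]*`max_repair_needed_age_ms`/';

    return preg_match($pattern, $markdown) === 1;
}

// Both keys in one cell of a `runs` row: accepted.
$pinned = '| `runs` | `oldest_repair_needed_at`, `max_repair_needed_age_ms` | earliest "stuck since" timestamp |';

// Keys split across two cells: rejected, since `[^|]*` stops at the pipe.
$split = '| `runs` | `oldest_repair_needed_at` | `max_repair_needed_age_ms` |';

// Same keys under a different section: rejected.
$wrongSection = '| `tasks` | `oldest_repair_needed_at`, `max_repair_needed_age_ms` | n/a |';

var_dump(
    pinsRepairNeededAgeRow($pinned),
    pinsRepairNeededAgeRow($split),
    pinsRepairNeededAgeRow($wrongSection),
);
```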
