Commit 3711c3b

Surface run-summary projection missing-run age on operator metrics
Freezes `projections.run_summaries.oldest_missing_run_started_at` (ISO-8601 or null) and `projections.run_summaries.max_missing_run_age_ms` (integer ms, `0` when no run is missing a projection) on `OperatorMetrics::snapshot()`. The pair mirrors the existing `repair.oldest_missing_run_started_at` / `repair.max_missing_run_age_ms` shape so operators can answer "how long has the worst-case run been without a run-summary projection?" (the primary projection-lag age indicator on the run-summary path) from the metric alone, without walking `workflow_runs`.

The selector uses `COALESCE(started_at, created_at)` so runs that have not yet recorded a `started_at` still report the backlog age they contribute to the projection lag (previously the signal would have missed them entirely).

The missing-projection set is defined by `RunSummaryProjectionDrift::missingRunQuery()`: runs whose id is not present in `workflow_run_summaries`. Runs that already have a projection, and orphaned or stale summaries, are out of scope for this indicator; they stay visible on the existing `missing`, `orphaned`, `stale`, and `needs_rebuild` counts.

Pins the row on `docs/architecture/rollout-safety.md` and adds a projection-lag bullet, guarded by `RolloutSafetyDocumentationTest::testContractDocumentFreezesRunSummaryProjectionMissingRunAgeRow`.
1 parent c2bb956 commit 3711c3b
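
As a rough illustration of how a consumer might read the new pair, here is a minimal sketch. The function name `runSummaryProjectionLagExceeds`, the threshold values, and the sample arrays are hypothetical and not part of this commit; only the two key names and the null/`0` convention come from the commit message.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical consumer-side check (not part of the commit).
 * $snapshot is assumed to have the shape returned by OperatorMetrics::snapshot().
 */
function runSummaryProjectionLagExceeds(array $snapshot, int $thresholdMs): bool
{
    $runSummaries = $snapshot['projections']['run_summaries'] ?? [];

    // A null timestamp means no run is missing a projection; the age is then 0.
    if (($runSummaries['oldest_missing_run_started_at'] ?? null) === null) {
        return false;
    }

    return (int) ($runSummaries['max_missing_run_age_ms'] ?? 0) > $thresholdMs;
}

// Sample snapshots with made-up values, trimmed to the keys used above.
$healthy = ['projections' => ['run_summaries' => [
    'missing' => 0,
    'oldest_missing_run_started_at' => null,
    'max_missing_run_age_ms' => 0,
]]];

$lagging = ['projections' => ['run_summaries' => [
    'missing' => 3,
    'oldest_missing_run_started_at' => '2026-04-09T11:56:00.000000Z',
    'max_missing_run_age_ms' => 240000,
]]];

var_dump(runSummaryProjectionLagExceeds($healthy, 60000));
var_dump(runSummaryProjectionLagExceeds($lagging, 60000));
```

The point of the pair is that a check like this needs only the snapshot array and never has to query `workflow_runs` itself.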

4 files changed

Lines changed: 179 additions & 2 deletions

File tree

docs/architecture/rollout-safety.md

Lines changed: 1 addition & 0 deletions
@@ -420,6 +420,7 @@ change.
 | `backlog` | `unhealthy_tasks`, `repair_needed_runs`, `claim_failed_runs`, `compatibility_blocked_runs` | stuck/blocked roll-ups |
 | `backlog` | `oldest_compatibility_blocked_started_at`, `max_compatibility_blocked_age_ms` | earliest wait-start timestamp among compatibility-blocked runs and the largest blocked age in milliseconds, mirroring the `repair.oldest_missing_run_started_at` / `max_missing_run_age_ms` shape so operators can answer "how stale is the worst mixed-build block?" from the metric alone |
 | `repair` | `missing_task_candidates`, `selected_missing_task_candidates`, `oldest_missing_run_started_at`, `max_missing_run_age_ms` | stuck-run detectors per `TaskRepairCandidates` |
+| `projections.run_summaries` | `oldest_missing_run_started_at`, `max_missing_run_age_ms` | earliest `COALESCE(workflow_runs.started_at, workflow_runs.created_at)` among runs whose id is not present in `workflow_run_summaries` and the largest missing-projection age in milliseconds, mirroring the `repair.oldest_missing_run_started_at` / `max_missing_run_age_ms` shape so operators can read "how long has the worst-case run been without a run-summary projection?" (the primary projection-lag age indicator on the run-summary path) from the metric alone without walking `workflow_runs` |
 | `workers` | `required_compatibility`, `active_workers`, `active_worker_scopes`, `active_workers_supporting_required` | routing-health signals per `WorkerCompatibilityFleet` |
 | `workers` | `fleet` | per-scope fleet entries (`worker_id`, `namespace`, `connection`, `queue`, `supported`, `supports_required`, `recorded_at`, `expires_at`, `source`, `host`, `process_id`) so mixed-build state is legible to Waterline and other consumers without reinferring it from the summary counts |
 | `schedules` | `active`, `paused`, `missed`, `oldest_overdue_at`, `max_overdue_ms`, `fires_total`, `failures_total` | scheduler-role health: active and paused schedules in namespace, active schedules whose `next_fire_at` is overdue at snapshot time, the earliest overdue `next_fire_at` among them, the largest overdue age in milliseconds, and running totals of fires and failures so scheduler lag and failure trends are legible without reading `workflow_schedules` directly |

src/V2/Support/OperatorMetrics.php

Lines changed: 32 additions & 2 deletions
@@ -48,7 +48,7 @@ public static function snapshot(?CarbonInterface $now = null, ?string $namespace
             'starts' => self::startMetrics($now, $namespace),
             'history' => self::historyMetrics($namespace),
             'command_contracts' => self::commandContractMetrics($namespace),
-            'projections' => self::projectionMetrics($namespace),
+            'projections' => self::projectionMetrics($now, $namespace),
             'schedules' => self::scheduleMetrics($now, $namespace),
             'workers' => self::workerMetrics(),
             'backend' => BackendCapabilities::snapshot($now),
@@ -446,15 +446,20 @@ private static function commandContractMetrics(?string $namespace): array
     /**
      * @return array<string, array<string, int|string|null>>
      */
-    private static function projectionMetrics(?string $namespace): array
+    private static function projectionMetrics(CarbonInterface $now, ?string $namespace): array
     {
         $runSummaries = RunSummaryProjectionDrift::metrics($namespace);
+        $oldestMissingRunStartedAt = self::oldestMissingRunSummaryStartedAt($namespace);
 
         return [
             'run_summaries' => [
                 ...$runSummaries,
                 'oldest_updated_at' => self::jsonTimestamp(self::summaryQuery($namespace)->min('updated_at')),
                 'newest_updated_at' => self::jsonTimestamp(self::summaryQuery($namespace)->max('updated_at')),
+                'oldest_missing_run_started_at' => $oldestMissingRunStartedAt?->toJSON(),
+                'max_missing_run_age_ms' => $oldestMissingRunStartedAt === null
+                    ? 0
+                    : (int) $oldestMissingRunStartedAt->diffInMilliseconds($now),
             ],
             'run_waits' => self::runWaitProjectionMetrics($namespace),
             'run_timeline_entries' => self::runTimelineProjectionMetrics($namespace),
@@ -463,6 +468,31 @@ private static function projectionMetrics(?string $namespace): array
         ];
     }
 
+    /**
+     * Earliest `COALESCE(workflow_runs.started_at, workflow_runs.created_at)`
+     * among runs whose id is not present in `workflow_run_summaries`.
+     * Mirrors the `repair.oldest_missing_run_started_at` shape so
+     * rollout-safety consumers can read "how long has the worst-case run
+     * been without a run-summary projection?" (the primary projection-lag
+     * age indicator on the run-summary path) from the metric alone without
+     * walking `workflow_runs`. Falls back to `created_at` when the run has
+     * not yet recorded a `started_at` so not-yet-started runs still report
+     * the backlog age they contribute to the projection lag.
+     */
+    private static function oldestMissingRunSummaryStartedAt(?string $namespace): ?CarbonInterface
+    {
+        /** @var WorkflowRun|null $run */
+        $run = RunSummaryProjectionDrift::missingRunQuery($namespace)
+            ->orderByRaw('COALESCE(started_at, created_at) asc')
+            ->first();
+
+        if (! $run instanceof WorkflowRun) {
+            return null;
+        }
+
+        return $run->started_at ?? $run->created_at;
+    }
+
     /**
      * @return array<string, int|string|null>
      */
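
The selection rule in `oldestMissingRunSummaryStartedAt()` and the age computation in `projectionMetrics()` can be sketched without a database. This is an in-memory stand-in under stated assumptions, not the code from the commit: rows are plain arrays instead of `WorkflowRun` models, `oldestMissingRunAge` is a hypothetical name, and the age is computed at whole-second resolution for brevity. The fixture values mirror runs A, B, and C from the feature test.

```php
<?php

declare(strict_types=1);

/**
 * Illustrative stand-in for the SQL selector: pick the earliest
 * COALESCE(started_at, created_at) among the given rows and report its age.
 *
 * @param list<array{started_at: ?DateTimeImmutable, created_at: DateTimeImmutable}> $missingRuns
 * @return array{oldest: ?DateTimeImmutable, age_ms: int}
 */
function oldestMissingRunAge(array $missingRuns, DateTimeImmutable $now): array
{
    $oldest = null;

    foreach ($missingRuns as $run) {
        // Same fallback as COALESCE(started_at, created_at).
        $effective = $run['started_at'] ?? $run['created_at'];

        if ($oldest === null || $effective < $oldest) {
            $oldest = $effective;
        }
    }

    if ($oldest === null) {
        // No missing runs: null timestamp and zero age, as in the commit.
        return ['oldest' => null, 'age_ms' => 0];
    }

    // Whole-second resolution only; sub-second precision is dropped here.
    $ageMs = ($now->getTimestamp() - $oldest->getTimestamp()) * 1000;

    return ['oldest' => $oldest, 'age_ms' => $ageMs];
}

$now = new DateTimeImmutable('2026-04-09 12:00:00', new DateTimeZone('UTC'));

$runs = [
    // Run A: started 180s ago, created 200s ago.
    ['started_at' => $now->modify('-180 seconds'), 'created_at' => $now->modify('-200 seconds')],
    // Run B: started and created 30s ago.
    ['started_at' => $now->modify('-30 seconds'), 'created_at' => $now->modify('-30 seconds')],
    // Run C: never started, created 240s ago.
    ['started_at' => null, 'created_at' => $now->modify('-240 seconds')],
];

$result = oldestMissingRunAge($runs, $now);
```

Run C wins because its `created_at` fallback (240s) is older than run A's `started_at` (180s), which is the case the `COALESCE` ordering exists to cover.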

tests/Feature/V2/V2OperatorMetricsTest.php

Lines changed: 135 additions & 0 deletions
@@ -1596,6 +1596,141 @@ public function testSnapshotReportsRetryingActivityAgeAsZeroWhenNoActivitiesAreR
         $this->assertSame(0, $snapshot['activities']['max_retrying_age_ms']);
     }
 
+    public function testSnapshotSurfacesMissingRunSummaryProjectionAgeFromOldestMissingRun(): void
+    {
+        Carbon::setTestNow('2026-04-09 12:00:00');
+        $this->beforeApplicationDestroyed(static function (): void {
+            Carbon::setTestNow();
+        });
+
+        $now = Carbon::now();
+
+        // Control case: this run already has a summary, so it is not
+        // counted as missing. Its started_at must NOT win the "oldest
+        // missing run" selection even if it is the oldest overall.
+        $this->createRunWithSummary(
+            instanceId: 'missing-summary-healthy-i',
+            runId: '01JMISSUMHEALTHYRUN000001',
+            status: 'running',
+            statusBucket: 'running',
+            livenessState: 'running',
+        );
+
+        // Missing-summary run A: started 180s ago, created 200s ago. Oldest
+        // started_at among missing runs, but run C's older created_at wins.
+        $oldestMissingInstance = WorkflowInstance::query()->create([
+            'id' => 'missing-summary-oldest-i',
+            'workflow_class' => 'WorkflowClass',
+            'workflow_type' => 'workflow.test',
+            'run_count' => 1,
+        ]);
+
+        WorkflowRun::query()->create([
+            'id' => '01JMISSUMOLDESTRUN000001',
+            'workflow_instance_id' => $oldestMissingInstance->id,
+            'run_number' => 1,
+            'workflow_class' => 'WorkflowClass',
+            'workflow_type' => 'workflow.test',
+            'status' => 'running',
+            'started_at' => $now->copy()
+                ->subSeconds(180),
+            'created_at' => $now->copy()
+                ->subSeconds(200),
+            'updated_at' => $now->copy()
+                ->subSeconds(180),
+        ]);
+
+        // Missing-summary run B: started 30s ago. Counted as missing, but
+        // must not win the "oldest missing run" selection.
+        $newerMissingInstance = WorkflowInstance::query()->create([
+            'id' => 'missing-summary-newer-i',
+            'workflow_class' => 'WorkflowClass',
+            'workflow_type' => 'workflow.test',
+            'run_count' => 1,
+        ]);
+
+        WorkflowRun::query()->create([
+            'id' => '01JMISSUMNEWERRUN000001',
+            'workflow_instance_id' => $newerMissingInstance->id,
+            'run_number' => 1,
+            'workflow_class' => 'WorkflowClass',
+            'workflow_type' => 'workflow.test',
+            'status' => 'running',
+            'started_at' => $now->copy()
+                ->subSeconds(30),
+            'created_at' => $now->copy()
+                ->subSeconds(30),
+            'updated_at' => $now->copy()
+                ->subSeconds(30),
+        ]);
+
+        // Missing-summary run C: started_at is NULL, created 240s ago.
+        // COALESCE(started_at, created_at) falls back to created_at, so C
+        // reports an age of 240s. That is older than run A's 180s, so C
+        // must win the "oldest missing run" selection and set both
+        // oldest_missing_run_started_at and max_missing_run_age_ms.
+        $nullStartedInstance = WorkflowInstance::query()->create([
+            'id' => 'missing-summary-null-started-i',
+            'workflow_class' => 'WorkflowClass',
+            'workflow_type' => 'workflow.test',
+            'run_count' => 1,
+        ]);
+
+        WorkflowRun::query()->create([
+            'id' => '01JMISSUMNULLRUN00000001',
+            'workflow_instance_id' => $nullStartedInstance->id,
+            'run_number' => 1,
+            'workflow_class' => 'WorkflowClass',
+            'workflow_type' => 'workflow.test',
+            'status' => 'pending',
+            'started_at' => null,
+            'created_at' => $now->copy()
+                ->subSeconds(240),
+            'updated_at' => $now->copy()
+                ->subSeconds(240),
+        ]);
+
+        $snapshot = OperatorMetrics::snapshot($now);
+
+        $expectedOldestMissingAt = $now->copy()
+            ->subSeconds(240)
+            ->toJSON();
+
+        $this->assertSame(3, $snapshot['projections']['run_summaries']['missing']);
+        $this->assertSame(
+            $expectedOldestMissingAt,
+            $snapshot['projections']['run_summaries']['oldest_missing_run_started_at'],
+        );
+        $this->assertSame(
+            240 * 1000,
+            $snapshot['projections']['run_summaries']['max_missing_run_age_ms'],
+        );
+    }
+
+    public function testSnapshotReportsMissingRunSummaryProjectionAgeAsZeroWhenNoRunsAreMissing(): void
+    {
+        Carbon::setTestNow('2026-04-09 12:00:00');
+        $this->beforeApplicationDestroyed(static function (): void {
+            Carbon::setTestNow();
+        });
+
+        $now = Carbon::now();
+
+        $this->createRunWithSummary(
+            instanceId: 'no-missing-summary-i',
+            runId: '01JNOMISSUMRUN0000000001',
+            status: 'running',
+            statusBucket: 'running',
+            livenessState: 'running',
+        );
+
+        $snapshot = OperatorMetrics::snapshot($now);
+
+        $this->assertSame(0, $snapshot['projections']['run_summaries']['missing']);
+        $this->assertNull($snapshot['projections']['run_summaries']['oldest_missing_run_started_at']);
+        $this->assertSame(0, $snapshot['projections']['run_summaries']['max_missing_run_age_ms']);
+    }
+
     public function testSnapshotReportsInWorkerMatchingRoleShapeByDefault(): void
     {
         config()->set('workflows.v2.matching_role.queue_wake_enabled', true);

tests/Unit/V2/RolloutSafetyDocumentationTest.php

Lines changed: 11 additions & 0 deletions
@@ -411,6 +411,17 @@ public function testContractDocumentFreezesRetryingActivityAgeRow(): void
         );
     }
 
+    public function testContractDocumentFreezesRunSummaryProjectionMissingRunAgeRow(): void
+    {
+        $contents = $this->documentContents();
+
+        $this->assertMatchesRegularExpression(
+            '/\|\s*`projections\.run_summaries`\s*\|[^|]*`oldest_missing_run_started_at`[^|]*`max_missing_run_age_ms`/',
+            $contents,
+            'Rollout safety contract must pin the run-summary projection missing-run age row so operators can read "how long has the worst-case run been without a run-summary projection?" (the primary projection-lag age indicator on the run-summary path) from OperatorMetrics::snapshot() without walking workflow_runs.',
+        );
+    }
+
     public function testContractDocumentFreezesHealthCheckNames(): void
     {
         $contents = $this->documentContents();
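
To see what the documentation guard above accepts and rejects, the pattern can be exercised on its own. This is a small sketch, not part of the commit: the pattern is copied from the added test, while the three sample rows are abbreviated or invented for illustration (their description cells are shortened, and `$splitRow` is a hypothetical malformed row).

```php
<?php

declare(strict_types=1);

// Pattern copied verbatim from the test added in this commit.
$pattern = '/\|\s*`projections\.run_summaries`\s*\|[^|]*`oldest_missing_run_started_at`[^|]*`max_missing_run_age_ms`/';

// Abbreviated form of the row pinned in rollout-safety.md.
$pinnedRow = '| `projections.run_summaries` | `oldest_missing_run_started_at`, `max_missing_run_age_ms` | missing-projection age |';

// The `repair` row carries the same key names but a different first cell.
$repairRow = '| `repair` | `missing_task_candidates`, `selected_missing_task_candidates`, `oldest_missing_run_started_at`, `max_missing_run_age_ms` | stuck-run detectors |';

// Hypothetical malformed row: the two keys are split across separate cells.
$splitRow = '| `projections.run_summaries` | `oldest_missing_run_started_at` | `max_missing_run_age_ms` |';

$matchesPinned = preg_match($pattern, $pinnedRow) === 1;
$matchesRepair = preg_match($pattern, $repairRow) === 1;
$matchesSplit = preg_match($pattern, $splitRow) === 1;

var_dump($matchesPinned, $matchesRepair, $matchesSplit);
```

The `[^|]*` segments keep both keys inside the same table cell, so the guard pins the row's group and key cells but not the wording of its description.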
