Commit 044ba07

Surface unhealthy-age rollup on operator metrics
OperatorMetrics::snapshot()['tasks'] now exposes oldest_unhealthy_at and max_unhealthy_age_ms: the earliest of the four contributing per-path timestamps (oldest_dispatch_failed_at, oldest_claim_failed_at, oldest_dispatch_overdue_since, oldest_lease_expired_at) and the largest age in milliseconds across them, computed as the age of that earliest timestamp. When no task is unhealthy, oldest_unhealthy_at is null and max_unhealthy_age_ms is 0. Operators reading the metric snapshot can now answer "how stale is the worst-case duplicate-risk task overall?" from a single key pair instead of taking a max across four per-path age fields.

The rollout-safety contract pins the new keys on the tasks row, V2OperatorMetricsTest pins both the populated and no-unhealthy-tasks paths, and RolloutSafetyDocumentationTest adds a row-shape regression so future doc edits cannot drop the rollup language without tripping the doc test.
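The rollup described above can be sketched in isolation. This is a minimal illustration, not the code in the commit: it assumes plain `DateTimeImmutable` values instead of Carbon, the free functions `earliestTimestamp()` and `unhealthyAgeRollup()` are hypothetical stand-ins for the private helpers on `OperatorMetrics`, and `DateTimeInterface::ATOM` stands in for Carbon's `toJSON()` output format.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical stand-in for the commit's private earliestTimestamp() helper:
 * returns the earliest non-null timestamp, or null if every entry is null.
 *
 * @param array<int, DateTimeImmutable|null> $timestamps
 */
function earliestTimestamp(array $timestamps): ?DateTimeImmutable
{
    $earliest = null;

    foreach ($timestamps as $timestamp) {
        if ($timestamp === null) {
            continue;
        }

        if ($earliest === null || $timestamp < $earliest) {
            $earliest = $timestamp;
        }
    }

    return $earliest;
}

/**
 * Hypothetical rollup: collapses the per-path "oldest at" timestamps into
 * one key pair. The age of the earliest timestamp is also the largest age
 * across all paths, so no separate max() over the paths is needed.
 *
 * @param array<string, DateTimeImmutable|null> $perPathOldest
 * @return array{oldest_unhealthy_at: ?string, max_unhealthy_age_ms: int}
 */
function unhealthyAgeRollup(array $perPathOldest, DateTimeImmutable $now): array
{
    $oldest = earliestTimestamp(array_values($perPathOldest));

    if ($oldest === null) {
        return ['oldest_unhealthy_at' => null, 'max_unhealthy_age_ms' => 0];
    }

    // Integer arithmetic on seconds plus microseconds avoids float rounding.
    $elapsedMicroseconds = ($now->getTimestamp() - $oldest->getTimestamp()) * 1_000_000
        + ((int) $now->format('u') - (int) $oldest->format('u'));

    return [
        // Assumption: ATOM format here; the commit serializes via Carbon's toJSON().
        'oldest_unhealthy_at' => $oldest->format(DateTimeInterface::ATOM),
        'max_unhealthy_age_ms' => intdiv($elapsedMicroseconds, 1000),
    ];
}

// Same offsets as the new feature test: dispatch-failed (-90s) is the worst case.
$now = new DateTimeImmutable('2026-04-09 12:00:00', new DateTimeZone('UTC'));

$rollup = unhealthyAgeRollup([
    'oldest_dispatch_failed_at' => $now->modify('-90 seconds'),
    'oldest_claim_failed_at' => $now->modify('-45 seconds'),
    'oldest_dispatch_overdue_since' => $now->modify('-20 seconds'),
    'oldest_lease_expired_at' => $now->modify('-30 seconds'),
], $now);

$empty = unhealthyAgeRollup(['oldest_dispatch_failed_at' => null], $now);
```

With these inputs `$rollup['max_unhealthy_age_ms']` is `90000`, and `$empty` carries a `null` timestamp with a zero age, which matches the two paths the new tests pin.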
1 parent a0badea commit 044ba07

4 files changed

Lines changed: 161 additions & 1 deletion

docs/architecture/rollout-safety.md

Lines changed: 1 addition & 1 deletion
@@ -414,7 +414,7 @@ change.
 | `tasks` | `oldest_dispatch_overdue_since`, `max_dispatch_overdue_age_ms` | earliest `COALESCE(last_dispatched_at, created_at)` among dispatch-overdue tasks — the timestamp the worst-case ready-but-unclaimed task has been waiting for a successful dispatch wake since (either its last attempted dispatch that didn't stick or its creation time if it was never dispatched) — and the largest age in milliseconds, mirroring the `oldest_ready_due_at` / `max_ready_due_age_ms` shape so operators can read wake-latency ("how long has the oldest ready-but-unclaimed task been waiting for a working dispatch wake?") from the metric alone without walking `workflow_tasks` |
 | `tasks` | `oldest_claim_failed_at`, `max_claim_failed_age_ms` | earliest `last_claim_failed_at` among claim-failed tasks (Ready tasks whose most recent claim attempt recorded an uncleared `last_claim_error`) and the largest claim-failed age in milliseconds, mirroring the `oldest_dispatch_overdue_since` / `max_dispatch_overdue_age_ms` shape for the dispatch path so operators can read "how long has the worst-case task been sitting with an uncleared claim error?" — the primary lease-conflict and duplicate-risk age indicator for the claim path — from the metric alone without walking `workflow_tasks` |
 | `tasks` | `oldest_dispatch_failed_at`, `max_dispatch_failed_age_ms` | earliest `last_dispatch_attempt_at` among dispatch-failed tasks (Ready tasks whose most recent dispatch attempt recorded an uncleared `last_dispatch_error` that has not been superseded by a later successful dispatch) and the largest dispatch-failed age in milliseconds, mirroring the `oldest_claim_failed_at` / `max_claim_failed_age_ms` shape for the claim path so operators can read "how long has the worst-case task been sitting with an uncleared dispatch error?" — the primary transport-failure age indicator on the dispatch path — from the metric alone without walking `workflow_tasks` |
-| `tasks` | `unhealthy` | sum of transport failure and lease expiry counts (the primary duplicate-risk indicator) |
+| `tasks` | `unhealthy`, `oldest_unhealthy_at`, `max_unhealthy_age_ms` | sum of transport failure and lease expiry counts (the primary duplicate-risk indicator), the earliest of `oldest_dispatch_failed_at` / `oldest_claim_failed_at` / `oldest_dispatch_overdue_since` / `oldest_lease_expired_at` (`null` when `unhealthy = 0`), and the largest unhealthy age in milliseconds across those four contributing paths so operators can read "how stale is my worst-case duplicate-risk task overall?" from the metric alone without taking a max over four separate per-path age fields |
 | `activities` | `retrying`, `oldest_retrying_started_at`, `max_retrying_age_ms` | activity executions currently in the retry window (Pending status with `attempt_count > 0`), the earliest `started_at` among them, and the largest retrying age in milliseconds, mirroring the `tasks.oldest_lease_expired_at` / `max_lease_expired_age_ms` shape on the task path so operators can answer "how long has the worst-case activity been chewing retries?" — the primary retry-rate age indicator on the activity path — from the metric alone without walking `activity_executions` |
 | `backlog` | `runnable_tasks`, `delayed_tasks`, `leased_tasks` | authoritative backlog counts |
 | `backlog` | `unhealthy_tasks`, `repair_needed_runs`, `claim_failed_runs`, `compatibility_blocked_runs` | stuck/blocked roll-ups |

src/V2/Support/OperatorMetrics.php

Lines changed: 35 additions & 0 deletions
@@ -124,6 +124,12 @@ private static function taskMetrics(CarbonInterface $now, ?string $namespace): a
         $oldestDispatchOverdueSince = self::oldestDispatchOverdueSince($now, $namespace);
         $oldestClaimFailedAt = self::oldestClaimFailedAt($namespace);
         $oldestDispatchFailedAt = self::oldestDispatchFailedAt($namespace);
+        $oldestUnhealthyAt = self::earliestTimestamp([
+            $oldestDispatchFailedAt,
+            $oldestClaimFailedAt,
+            $oldestDispatchOverdueSince,
+            $oldestLeaseExpiredAt,
+        ]);

         return [
             'open' => self::openTasks($namespace),
@@ -159,6 +165,10 @@ private static function taskMetrics(CarbonInterface $now, ?string $namespace): a
                 + self::claimFailedTasks($namespace)
                 + self::dispatchOverdueTasks($now, $namespace)
                 + self::leaseExpiredTasks($now, $namespace),
+            'oldest_unhealthy_at' => $oldestUnhealthyAt?->toJSON(),
+            'max_unhealthy_age_ms' => $oldestUnhealthyAt === null
+                ? 0
+                : (int) $oldestUnhealthyAt->diffInMilliseconds($now),
         ];
     }

@@ -380,6 +390,31 @@ private static function stringOrNull(mixed $value): ?string
         return is_string($value) ? $value : null;
     }

+    /**
+     * Returns the earliest non-null `CarbonInterface` from the given list,
+     * or `null` if every entry is `null`. Used to roll up multiple per-path
+     * "oldest at" timestamps into a single worst-case duplicate-risk age the
+     * rollout-safety contract pins on `OperatorMetrics::snapshot()`.
+     *
+     * @param array<int, CarbonInterface|null> $timestamps
+     */
+    private static function earliestTimestamp(array $timestamps): ?CarbonInterface
+    {
+        $earliest = null;
+
+        foreach ($timestamps as $timestamp) {
+            if ($timestamp === null) {
+                continue;
+            }
+
+            if ($earliest === null || $timestamp->lessThan($earliest)) {
+                $earliest = $timestamp;
+            }
+        }
+
+        return $earliest;
+    }
+
     /**
      * @return array<string, int>
      */

tests/Feature/V2/V2OperatorMetricsTest.php

Lines changed: 112 additions & 0 deletions
@@ -250,6 +250,13 @@ public function testSnapshotSummarizesDurableBacklogRepairCompatibilityAndWorker
         );
         $this->assertSame(10 * 1000, $snapshot['tasks']['max_claim_failed_age_ms']);
         $this->assertSame(4, $snapshot['tasks']['unhealthy']);
+        $this->assertSame(
+            Carbon::parse('2026-04-09 12:00:00')
+                ->subMinute()
+                ->toJSON(),
+            $snapshot['tasks']['oldest_unhealthy_at'],
+        );
+        $this->assertSame(60 * 1000, $snapshot['tasks']['max_unhealthy_age_ms']);
         $this->assertSame(4, $snapshot['backlog']['runnable_tasks']);
         $this->assertSame(1, $snapshot['backlog']['delayed_tasks']);
         $this->assertSame(2, $snapshot['backlog']['leased_tasks']);
@@ -1454,6 +1461,111 @@ public function testSnapshotReportsDispatchFailedAgeAsZeroWhenNoTasksFailedToDis
         $this->assertSame(0, $taskTransport['data']['max_dispatch_failed_age_ms']);
     }

+    public function testSnapshotSurfacesUnhealthyAgeRollupAsEarliestOfTheFourContributingPaths(): void
+    {
+        Carbon::setTestNow('2026-04-09 12:00:00');
+        $this->beforeApplicationDestroyed(static function (): void {
+            Carbon::setTestNow();
+        });
+
+        $now = Carbon::now();
+
+        $run = $this->createRunWithSummary(
+            instanceId: 'unhealthy-age-rollup-instance',
+            runId: '01JUNHEALRUN00000000000001',
+            status: 'running',
+            statusBucket: 'running',
+            livenessState: 'running',
+        );
+
+        // Lease-expired task (-30s) — newer than the dispatch-failed worst case below.
+        $this->createTask($run, '01JUNHEALTASK0000000000001', TaskStatus::Leased->value, [
+            'leased_at' => $now->copy()
+                ->subSeconds(120),
+            'lease_owner' => 'worker-expired',
+            'lease_expires_at' => $now->copy()
+                ->subSeconds(30),
+            'created_at' => $now->copy()
+                ->subSeconds(120),
+        ]);
+
+        // Claim-failed task (-45s) — newer than the dispatch-failed worst case below.
+        $this->createTask($run, '01JUNHEALTASK0000000000002', TaskStatus::Ready->value, [
+            'available_at' => $now->copy()
+                ->subSecond(),
+            'connection' => 'sync',
+            'last_dispatched_at' => $now->copy()
+                ->subSeconds(60),
+            'last_claim_failed_at' => $now->copy()
+                ->subSeconds(45),
+            'last_claim_error' => 'Workflow v2 backend capabilities are unsupported: [queue_sync_unsupported] sync.',
+            'created_at' => $now->copy()
+                ->subSeconds(60),
+        ]);
+
+        // Dispatch-overdue task (-20s) — newer than the dispatch-failed worst case below.
+        $this->createTask($run, '01JUNHEALTASK0000000000003', TaskStatus::Ready->value, [
+            'available_at' => $now->copy()
+                ->subSeconds(20),
+            'last_dispatched_at' => $now->copy()
+                ->subSeconds(20),
+            'created_at' => $now->copy()
+                ->subSeconds(20),
+        ]);
+
+        // Dispatch-failed task (-90s) — the worst case across all four paths.
+        $this->createTask($run, '01JUNHEALTASK0000000000004', TaskStatus::Ready->value, [
+            'available_at' => $now->copy()
+                ->subSeconds(120),
+            'last_dispatched_at' => null,
+            'last_dispatch_attempt_at' => $now->copy()
+                ->subSeconds(90),
+            'last_dispatch_error' => 'Connection refused while broadcasting workflow task wake.',
+            'created_at' => $now->copy()
+                ->subSeconds(150),
+        ]);
+
+        $snapshot = OperatorMetrics::snapshot($now);
+
+        $this->assertSame(4, $snapshot['tasks']['unhealthy']);
+        $this->assertSame($now->copy()->subSeconds(90)->toJSON(), $snapshot['tasks']['oldest_unhealthy_at']);
+        $this->assertSame(90 * 1000, $snapshot['tasks']['max_unhealthy_age_ms']);
+    }
+
+    public function testSnapshotReportsUnhealthyAgeRollupAsZeroWhenNoTasksAreUnhealthy(): void
+    {
+        Carbon::setTestNow('2026-04-09 12:00:00');
+        $this->beforeApplicationDestroyed(static function (): void {
+            Carbon::setTestNow();
+        });
+
+        $now = Carbon::now();
+
+        $run = $this->createRunWithSummary(
+            instanceId: 'unhealthy-age-none-instance',
+            runId: '01JUNHEALNONRUN00000000001',
+            status: 'running',
+            statusBucket: 'running',
+            livenessState: 'running',
+        );
+
+        // Fresh healthy ready task — no transport failure, no expired lease.
+        $this->createTask($run, '01JUNHEALNONTASK0000000001', TaskStatus::Ready->value, [
+            'available_at' => $now->copy()
+                ->subSecond(),
+            'last_dispatched_at' => $now->copy()
+                ->subSecond(),
+            'created_at' => $now->copy()
+                ->subSecond(),
+        ]);
+
+        $snapshot = OperatorMetrics::snapshot($now);
+
+        $this->assertSame(0, $snapshot['tasks']['unhealthy']);
+        $this->assertNull($snapshot['tasks']['oldest_unhealthy_at']);
+        $this->assertSame(0, $snapshot['tasks']['max_unhealthy_age_ms']);
+    }
+
     public function testSnapshotReportsRunWaitAgeAsZeroWhenNoRunsAreWaiting(): void
     {
         Carbon::setTestNow('2026-04-09 12:00:00');

tests/Unit/V2/RolloutSafetyDocumentationTest.php

Lines changed: 13 additions & 0 deletions
@@ -135,6 +135,8 @@ final class RolloutSafetyDocumentationTest extends TestCase
         'oldest_retrying_started_at',
         'max_retrying_age_ms',
         'unhealthy',
+        'oldest_unhealthy_at',
+        'max_unhealthy_age_ms',
         'runnable_tasks',
         'delayed_tasks',
         'leased_tasks',
@@ -433,6 +435,17 @@ public function testContractDocumentFreezesBackendSeverityRollupRow(): void
         );
     }

+    public function testContractDocumentFreezesUnhealthyAgeRollupRow(): void
+    {
+        $contents = $this->documentContents();
+
+        $this->assertMatchesRegularExpression(
+            '/\|\s*`tasks`\s*\|[^|]*`unhealthy`[^|]*`oldest_unhealthy_at`[^|]*`max_unhealthy_age_ms`/',
+            $contents,
+            'Rollout safety contract must pin the tasks unhealthy-age rollup row so operators can read "how stale is my worst-case duplicate-risk task overall?" from a single OperatorMetrics::snapshot() pair instead of taking a max over the four contributing per-path age fields.',
+        );
+    }
+
     public function testContractDocumentFreezesHealthCheckNames(): void
     {
         $contents = $this->documentContents();
