
Commit 6089777

Surface claim-failed age on operator metrics and task_transport health
Freezes `operator_metrics.tasks.oldest_claim_failed_at` (ISO-8601 string or null) and `operator_metrics.tasks.max_claim_failed_age_ms` (integer milliseconds, 0 when no task is claim-failed) on `OperatorMetrics::snapshot()`. The pair applies the dispatch path's existing `oldest_dispatch_overdue_since` / `max_dispatch_overdue_age_ms` shape to the claim path, so operators can answer "how long has the worst-case task been sitting with an uncleared claim error?" (the primary lease-conflict and duplicate-risk age indicator on the claim path) from the metric alone, without walking `workflow_tasks`.

Forwards the pair, plus `claim_failed_tasks`, in the data returned by `HealthCheck::taskTransportCheck()`.

Pins the row in `docs/architecture/rollout-safety.md` and adds a stuck-detectors bullet for claim-failed tasks, guarded by `RolloutSafetyDocumentationTest::testContractDocumentFreezesClaimFailedAgeRow`.
1 parent 005f392 commit 6089777

6 files changed

Lines changed: 214 additions & 0 deletions


docs/architecture/rollout-safety.md

Lines changed: 13 additions & 0 deletions
@@ -412,6 +412,7 @@ change.
 | `tasks` | `oldest_lease_expired_at`, `max_lease_expired_age_ms` | earliest `lease_expires_at` among leased tasks whose lease has expired at snapshot time and the largest expired-lease age in milliseconds, mirroring the `backlog.oldest_compatibility_blocked_started_at` / `max_compatibility_blocked_age_ms` shape so operators can answer "how long has the worst leased task been expired without redelivery?" (the primary stuck-lease duplicate-risk age indicator) from the metric alone |
 | `tasks` | `oldest_ready_due_at`, `max_ready_due_age_ms` | earliest "ready since" timestamp among ready-due tasks (the effective `COALESCE(available_at, created_at)`: `available_at` when the task was delayed, otherwise the creation time that made it immediately actionable) and the largest ready-age in milliseconds, mirroring the `oldest_lease_expired_at` / `max_lease_expired_age_ms` shape so operators can read queue latency ("how long has the oldest actionable task been waiting to dispatch?") from the metric alone without walking `workflow_tasks` |
 | `tasks` | `oldest_dispatch_overdue_since`, `max_dispatch_overdue_age_ms` | earliest `COALESCE(last_dispatched_at, created_at)` among dispatch-overdue tasks — the timestamp the worst-case ready-but-unclaimed task has been waiting for a successful dispatch wake since (either its last attempted dispatch that didn't stick or its creation time if it was never dispatched) — and the largest age in milliseconds, mirroring the `oldest_ready_due_at` / `max_ready_due_age_ms` shape so operators can read wake-latency ("how long has the oldest ready-but-unclaimed task been waiting for a working dispatch wake?") from the metric alone without walking `workflow_tasks` |
+| `tasks` | `oldest_claim_failed_at`, `max_claim_failed_age_ms` | earliest `last_claim_failed_at` among claim-failed tasks (Ready tasks whose most recent claim attempt recorded an uncleared `last_claim_error`) and the largest claim-failed age in milliseconds, mirroring the `oldest_dispatch_overdue_since` / `max_dispatch_overdue_age_ms` shape for the dispatch path so operators can read "how long has the worst-case task been sitting with an uncleared claim error?" — the primary lease-conflict and duplicate-risk age indicator for the claim path — from the metric alone without walking `workflow_tasks` |
 | `tasks` | `unhealthy` | sum of transport failure and lease expiry counts (the primary duplicate-risk indicator) |
 | `backlog` | `runnable_tasks`, `delayed_tasks`, `leased_tasks` | authoritative backlog counts |
 | `backlog` | `unhealthy_tasks`, `repair_needed_runs`, `claim_failed_runs`, `compatibility_blocked_runs` | stuck/blocked roll-ups |
@@ -497,6 +498,18 @@ are authoritative and how they surface.
   the redispatch decision; the age data is observability so operators
   can tell the difference between "dispatch wake is sporadically
   slow" and "dispatch wake has stalled on this task for minutes".
+- **Claim failed without clearing.** A ready task whose most recent
+  claim attempt recorded an uncleared `last_claim_error` is counted
+  under `tasks.claim_failed`, its worst-case claim-failed timestamp
+  is surfaced through `tasks.oldest_claim_failed_at` and
+  `tasks.max_claim_failed_age_ms`, and all three keys are forwarded
+  on the `task_transport` health check (`claim_failed_tasks`,
+  `oldest_claim_failed_at`, `max_claim_failed_age_ms`). The age data
+  is observability so operators can tell the difference between "one
+  worker briefly rejected a claim" and "the whole fleet has been
+  rejecting this task for minutes" — a lease-conflict and
+  duplicate-risk indicator on the claim path that mirrors the
+  dispatch-path `dispatch_overdue` age signal.
 - **Repair-needed runs.** Runs whose projected state shows
   `liveness_state = repair_needed` are counted under
   `runs.repair_needed` and surface through the
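
The claim-failed bullet in this hunk separates a brief rejection from a fleet-wide stall by age. A minimal sketch of how a consumer might apply that distinction to the `tasks` section of a snapshot is shown here. The function name `classifyClaimFailedAge`, the 60-second threshold, and the three labels are illustrative assumptions, not part of this commit; the package itself only exposes the numbers.

```php
<?php

/**
 * Hypothetical helper: label claim-failed pressure from the `tasks`
 * section of a snapshot array. Threshold and labels are assumptions.
 *
 * @param array<string, mixed> $tasks
 */
function classifyClaimFailedAge(array $tasks, int $stalledAfterMs = 60_000): string
{
    $count = (int) ($tasks['claim_failed'] ?? 0);
    $ageMs = (int) ($tasks['max_claim_failed_age_ms'] ?? 0);

    // No claim-failed tasks: the age pair is null / 0 by contract.
    if ($count === 0) {
        return 'clear';
    }

    // The oldest uncleared claim error decides the label.
    return $ageMs >= $stalledAfterMs ? 'stalled' : 'transient';
}

echo classifyClaimFailedAge([
    'claim_failed' => 2,
    'oldest_claim_failed_at' => '2026-04-09T11:58:30.000000Z',
    'max_claim_failed_age_ms' => 90_000,
]), PHP_EOL; // prints "stalled"
```

A real alerting rule would likely tune the threshold per deployment; the point is only that the age can be read without querying `workflow_tasks`.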

src/V2/Support/HealthCheck.php

Lines changed: 5 additions & 0 deletions
@@ -240,6 +240,11 @@ private static function taskTransportCheck(array $tasks, array $backlog): array
                 ? $tasks['oldest_dispatch_overdue_since']
                 : null,
             'max_dispatch_overdue_age_ms' => self::integer($tasks['max_dispatch_overdue_age_ms'] ?? 0),
+            'claim_failed_tasks' => self::integer($tasks['claim_failed'] ?? 0),
+            'oldest_claim_failed_at' => is_string($tasks['oldest_claim_failed_at'] ?? null)
+                ? $tasks['oldest_claim_failed_at']
+                : null,
+            'max_claim_failed_age_ms' => self::integer($tasks['max_claim_failed_age_ms'] ?? 0),
             'repair_needed_runs' => self::integer($backlog['repair_needed_runs'] ?? 0),
             'claim_failed_runs' => self::integer($backlog['claim_failed_runs'] ?? 0),
             'compatibility_blocked_runs' => self::integer($backlog['compatibility_blocked_runs'] ?? 0),
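
The forwarding above defaults every key, so a `tasks` array that lacks the claim-failed entries still yields a well-formed payload. The standalone sketch below illustrates that behaviour outside the class. `toInteger()` is a hypothetical stand-in for `self::integer()`, whose body is not shown in this diff, and `forwardClaimFailed()` is likewise an illustrative name.

```php
<?php

/**
 * Hypothetical stand-in for HealthCheck's private integer() helper:
 * numeric input is cast to int, anything else becomes 0.
 */
function toInteger(mixed $value): int
{
    return is_numeric($value) ? (int) $value : 0;
}

/**
 * Mirrors the three forwarded keys from the hunk above.
 *
 * @param array<string, mixed> $tasks
 * @return array{claim_failed_tasks: int, oldest_claim_failed_at: ?string, max_claim_failed_age_ms: int}
 */
function forwardClaimFailed(array $tasks): array
{
    return [
        'claim_failed_tasks' => toInteger($tasks['claim_failed'] ?? 0),
        // Only a string timestamp is forwarded; any other type becomes null.
        'oldest_claim_failed_at' => is_string($tasks['oldest_claim_failed_at'] ?? null)
            ? $tasks['oldest_claim_failed_at']
            : null,
        'max_claim_failed_age_ms' => toInteger($tasks['max_claim_failed_age_ms'] ?? 0),
    ];
}

var_dump(forwardClaimFailed([]));
```

With an empty input the count and age come back as integer 0 and the timestamp as null, which matches what the `HealthCheckTest` assertions in this commit expect for a healthy snapshot.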

src/V2/Support/OperatorMetrics.php

Lines changed: 30 additions & 0 deletions
@@ -122,6 +122,7 @@ private static function taskMetrics(CarbonInterface $now, ?string $namespace): a
         $oldestLeaseExpiredAt = self::oldestLeaseExpiredAt($now, $namespace);
         $oldestReadyDueAt = self::oldestReadyDueAt($now, $namespace);
         $oldestDispatchOverdueSince = self::oldestDispatchOverdueSince($now, $namespace);
+        $oldestClaimFailedAt = self::oldestClaimFailedAt($namespace);

         return [
             'open' => self::openTasks($namespace),
@@ -145,6 +146,10 @@ private static function taskMetrics(CarbonInterface $now, ?string $namespace): a
             'max_dispatch_overdue_age_ms' => $oldestDispatchOverdueSince === null
                 ? 0
                 : (int) $oldestDispatchOverdueSince->diffInMilliseconds($now),
+            'oldest_claim_failed_at' => $oldestClaimFailedAt?->toJSON(),
+            'max_claim_failed_age_ms' => $oldestClaimFailedAt === null
+                ? 0
+                : (int) $oldestClaimFailedAt->diffInMilliseconds($now),
             'unhealthy' => self::dispatchFailedTasks($namespace)
                 + self::claimFailedTasks($namespace)
                 + self::dispatchOverdueTasks($now, $namespace)
@@ -982,6 +987,31 @@ private static function oldestDispatchOverdueSince(CarbonInterface $now, ?string
         return $task->last_dispatched_at ?? $task->created_at;
     }

+    /**
+     * Earliest `last_claim_failed_at` among tasks currently counted by
+     * `tasks.claim_failed` (Ready tasks whose most recent claim attempt
+     * recorded an uncleared `last_claim_error`). Rollout-safety surfaces
+     * this alongside `tasks.claim_failed` so operators can read "how long
+     * has the worst-case task been sitting with an uncleared claim error?"
+     * — the primary lease-conflict and duplicate-risk age indicator for
+     * the claim path — from the metric alone, mirroring the existing
+     * `oldest_dispatch_overdue_since` / `max_dispatch_overdue_age_ms` shape
+     * for the dispatch path.
+     */
+    private static function oldestClaimFailedAt(?string $namespace): ?CarbonInterface
+    {
+        /** @var WorkflowTask|null $task */
+        $task = self::claimFailedQuery($namespace)
+            ->orderBy('last_claim_failed_at')
+            ->first();
+
+        if (! $task instanceof WorkflowTask) {
+            return null;
+        }
+
+        return $task->last_claim_failed_at;
+    }
+
     private static function dispatchOverdueQuery(CarbonInterface $now, ?string $namespace)
     {
         $cutoff = $now->copy()
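
To make the selection rule concrete without a database, the sketch below recomputes the pair over in-memory rows using only `DateTimeImmutable`. It is an approximation under stated assumptions: `claimFailedAge()` and `epochMs()` are hypothetical names, the `'ready'` status string is assumed rather than taken from `TaskStatus`, and the filter (Ready status plus a non-empty `last_claim_error`) is inferred from the docblock and tests in this commit, since `claimFailedQuery()` itself is not shown.

```php
<?php

/** Whole milliseconds since the Unix epoch for a timestamp. */
function epochMs(DateTimeImmutable $t): int
{
    return $t->getTimestamp() * 1000 + intdiv((int) $t->format('u'), 1000);
}

/**
 * @param list<array{status: string, last_claim_error: ?string, last_claim_failed_at: ?DateTimeImmutable}> $tasks
 * @return array{oldest_claim_failed_at: ?string, max_claim_failed_age_ms: int}
 */
function claimFailedAge(array $tasks, DateTimeImmutable $now): array
{
    $oldest = null;

    foreach ($tasks as $task) {
        $error = $task['last_claim_error'] ?? null;
        $failedAt = $task['last_claim_failed_at'] ?? null;

        // Assumed filter: Ready tasks with an uncleared (non-empty) claim error.
        if ($task['status'] !== 'ready' || $error === null || $error === '' || $failedAt === null) {
            continue;
        }

        if ($oldest === null || $failedAt < $oldest) {
            $oldest = $failedAt;
        }
    }

    if ($oldest === null) {
        return ['oldest_claim_failed_at' => null, 'max_claim_failed_age_ms' => 0];
    }

    return [
        'oldest_claim_failed_at' => $oldest->format(DATE_ATOM),
        'max_claim_failed_age_ms' => epochMs($now) - epochMs($oldest),
    ];
}

$now = new DateTimeImmutable('2026-04-09T12:00:00+00:00');
$tasks = [
    ['status' => 'ready', 'last_claim_error' => 'unsupported backend', 'last_claim_failed_at' => $now->modify('-90 seconds')],
    ['status' => 'ready', 'last_claim_error' => 'no compatible worker', 'last_claim_failed_at' => $now->modify('-15 seconds')],
    ['status' => 'ready', 'last_claim_error' => '', 'last_claim_failed_at' => $now->modify('-300 seconds')],
    ['status' => 'leased', 'last_claim_error' => 'stale error', 'last_claim_failed_at' => $now->modify('-400 seconds')],
];

print_r(claimFailedAge($tasks, $now));
```

The four rows follow the shape of the feature test's fixture: the cleared-error and leased rows carry older timestamps but are skipped, so the 90-second row wins and the age is 90000 ms. Note that the sketch formats the timestamp with `DATE_ATOM`, whereas the commit uses Carbon's `toJSON()`.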

tests/Feature/V2/V2OperatorMetricsTest.php

Lines changed: 150 additions & 0 deletions
@@ -240,6 +240,13 @@ public function testSnapshotSummarizesDurableBacklogRepairCompatibilityAndWorker
             $snapshot['tasks']['oldest_dispatch_overdue_since'],
         );
         $this->assertSame(10 * 1000, $snapshot['tasks']['max_dispatch_overdue_age_ms']);
+        $this->assertSame(
+            Carbon::parse('2026-04-09 12:00:00')
+                ->subSeconds(10)
+                ->toJSON(),
+            $snapshot['tasks']['oldest_claim_failed_at'],
+        );
+        $this->assertSame(10 * 1000, $snapshot['tasks']['max_claim_failed_age_ms']);
         $this->assertSame(4, $snapshot['tasks']['unhealthy']);
         $this->assertSame(4, $snapshot['backlog']['runnable_tasks']);
         $this->assertSame(1, $snapshot['backlog']['delayed_tasks']);
@@ -1146,6 +1153,149 @@ public function testSnapshotReportsDispatchOverdueAgeAsZeroWhenNoTasksAreOverdue
         $this->assertSame(0, $snapshot['tasks']['max_dispatch_overdue_age_ms']);
     }

+    public function testSnapshotSurfacesClaimFailedAgeFromOldestClaimFailure(): void
+    {
+        Carbon::setTestNow('2026-04-09 12:00:00');
+        $this->beforeApplicationDestroyed(static function (): void {
+            Carbon::setTestNow();
+        });
+
+        $now = Carbon::now();
+
+        $run = $this->createRunWithSummary(
+            instanceId: 'claim-failed-age-instance',
+            runId: '01JCLMFAILRUN0000000000001',
+            status: 'running',
+            statusBucket: 'running',
+            livenessState: 'running',
+        );
+
+        // Worst-case: ready task whose last claim failed 90s ago with an
+        // uncleared error.
+        $this->createTask($run, '01JCLMFAILTASK000000000001', TaskStatus::Ready->value, [
+            'available_at' => $now->copy()
+                ->subSeconds(120),
+            'last_dispatched_at' => $now->copy()
+                ->subSeconds(100),
+            'last_claim_failed_at' => $now->copy()
+                ->subSeconds(90),
+            'last_claim_error' => 'Workflow v2 backend capabilities are unsupported: [queue_sync_unsupported] sync.',
+            'created_at' => $now->copy()
+                ->subSeconds(150),
+        ]);
+
+        // Newer claim failure — counted but must not win the "oldest at".
+        $this->createTask($run, '01JCLMFAILTASK000000000002', TaskStatus::Ready->value, [
+            'available_at' => $now->copy()
+                ->subSeconds(30),
+            'last_dispatched_at' => $now->copy()
+                ->subSeconds(20),
+            'last_claim_failed_at' => $now->copy()
+                ->subSeconds(15),
+            'last_claim_error' => 'No compatible worker available for required build id.',
+            'created_at' => $now->copy()
+                ->subSeconds(30),
+        ]);
+
+        // Healthy ready task — not counted, and its created_at must not win.
+        $this->createTask($run, '01JCLMFAILTASK000000000003', TaskStatus::Ready->value, [
+            'available_at' => $now->copy()
+                ->subSecond(),
+            'last_dispatched_at' => $now->copy()
+                ->subSecond(),
+            'created_at' => $now->copy()
+                ->subSeconds(200),
+        ]);
+
+        // Claim error cleared (empty string) — excluded by applyClaimHealthy.
+        $this->createTask($run, '01JCLMFAILTASK000000000004', TaskStatus::Ready->value, [
+            'available_at' => $now->copy()
+                ->subSeconds(60),
+            'last_dispatched_at' => $now->copy()
+                ->subSeconds(50),
+            'last_claim_failed_at' => $now->copy()
+                ->subSeconds(300),
+            'last_claim_error' => '',
+            'created_at' => $now->copy()
+                ->subSeconds(60),
+        ]);
+
+        // Leased task with an older last_claim_failed_at — excluded because
+        // the claim-failed query requires status=Ready.
+        $this->createTask($run, '01JCLMFAILTASK000000000005', TaskStatus::Leased->value, [
+            'available_at' => $now->copy()
+                ->subSeconds(60),
+            'leased_at' => $now->copy()
+                ->subSeconds(5),
+            'lease_owner' => 'worker-leased',
+            'lease_expires_at' => $now->copy()
+                ->addSeconds(10),
+            'last_claim_failed_at' => $now->copy()
+                ->subSeconds(400),
+            'last_claim_error' => 'Previous claim attempt failed before lease grant.',
+            'created_at' => $now->copy()
+                ->subSeconds(60),
+        ]);
+
+        $snapshot = OperatorMetrics::snapshot($now);
+
+        $expectedOldestClaimFailedAt = $now->copy()
+            ->subSeconds(90)
+            ->toJSON();
+
+        $this->assertSame(2, $snapshot['tasks']['claim_failed']);
+        $this->assertSame($expectedOldestClaimFailedAt, $snapshot['tasks']['oldest_claim_failed_at']);
+        $this->assertSame(90 * 1000, $snapshot['tasks']['max_claim_failed_age_ms']);
+
+        $healthSnapshot = HealthCheck::snapshot($now);
+        $taskTransport = collect($healthSnapshot['checks'])->firstWhere('name', 'task_transport');
+        $this->assertNotNull($taskTransport);
+        $this->assertSame(2, $taskTransport['data']['claim_failed_tasks']);
+        $this->assertSame($expectedOldestClaimFailedAt, $taskTransport['data']['oldest_claim_failed_at']);
+        $this->assertSame(90 * 1000, $taskTransport['data']['max_claim_failed_age_ms']);
+    }
+
+    public function testSnapshotReportsClaimFailedAgeAsZeroWhenNoTasksFailedToClaim(): void
+    {
+        Carbon::setTestNow('2026-04-09 12:00:00');
+        $this->beforeApplicationDestroyed(static function (): void {
+            Carbon::setTestNow();
+        });
+
+        $now = Carbon::now();
+
+        $run = $this->createRunWithSummary(
+            instanceId: 'claim-failed-none-instance',
+            runId: '01JCLMFNONRUN0000000000001',
+            status: 'running',
+            statusBucket: 'running',
+            livenessState: 'running',
+        );
+
+        // Fresh healthy ready task — never failed to claim.
+        $this->createTask($run, '01JCLMFNONTASK000000000001', TaskStatus::Ready->value, [
+            'available_at' => $now->copy()
+                ->subSecond(),
+            'last_dispatched_at' => $now->copy()
+                ->subSecond(),
+            'created_at' => $now->copy()
+                ->subSecond(),
+        ]);
+
+        $snapshot = OperatorMetrics::snapshot($now);
+
+        $this->assertSame(0, $snapshot['tasks']['claim_failed']);
+        $this->assertNull($snapshot['tasks']['oldest_claim_failed_at']);
+        $this->assertSame(0, $snapshot['tasks']['max_claim_failed_age_ms']);
+
+        $healthSnapshot = HealthCheck::snapshot($now);
+        $taskTransport = collect($healthSnapshot['checks'])->firstWhere('name', 'task_transport');
+        $this->assertNotNull($taskTransport);
+        $this->assertSame(0, $taskTransport['data']['claim_failed_tasks']);
+        $this->assertNull($taskTransport['data']['oldest_claim_failed_at']);
+        $this->assertSame(0, $taskTransport['data']['max_claim_failed_age_ms']);
+    }
+
     public function testSnapshotReportsRunWaitAgeAsZeroWhenNoRunsAreWaiting(): void
     {
         Carbon::setTestNow('2026-04-09 12:00:00');

tests/Unit/V2/HealthCheckTest.php

Lines changed: 3 additions & 0 deletions
@@ -555,6 +555,9 @@ public function testSnapshotWarnsWhenOpenRunHasNoDurableResumePath(): void
         $this->assertSame(0, $taskTransport['data']['dispatch_overdue_tasks']);
         $this->assertNull($taskTransport['data']['oldest_dispatch_overdue_since']);
         $this->assertSame(0, $taskTransport['data']['max_dispatch_overdue_age_ms']);
+        $this->assertSame(0, $taskTransport['data']['claim_failed_tasks']);
+        $this->assertNull($taskTransport['data']['oldest_claim_failed_at']);
+        $this->assertSame(0, $taskTransport['data']['max_claim_failed_age_ms']);
         $this->assertSame(1, $taskTransport['data']['repair_needed_runs']);
         $this->assertSame('warning', $resumePaths['status']);
         $this->assertSame(1, $resumePaths['data']['repair_needed_runs']);

tests/Unit/V2/RolloutSafetyDocumentationTest.php

Lines changed: 13 additions & 0 deletions
@@ -128,6 +128,8 @@ final class RolloutSafetyDocumentationTest extends TestCase
         'max_lease_expired_age_ms',
         'oldest_ready_due_at',
         'max_ready_due_age_ms',
+        'oldest_claim_failed_at',
+        'max_claim_failed_age_ms',
         'unhealthy',
         'runnable_tasks',
         'delayed_tasks',
@@ -372,6 +374,17 @@ public function testContractDocumentFreezesDispatchOverdueAgeRow(): void
         );
     }

+    public function testContractDocumentFreezesClaimFailedAgeRow(): void
+    {
+        $contents = $this->documentContents();
+
+        $this->assertMatchesRegularExpression(
+            '/\|\s*`tasks`\s*\|[^|]*`oldest_claim_failed_at`[^|]*`max_claim_failed_age_ms`/',
+            $contents,
+            'Rollout safety contract must pin the tasks claim-failed age row so operators can read "how long has the worst-case task been sitting with an uncleared claim error?" — a lease-conflict and duplicate-risk indicator on the claim path — from OperatorMetrics::snapshot() without walking workflow_tasks.',
+        );
+    }
+
     public function testContractDocumentFreezesHealthCheckNames(): void
     {
         $contents = $this->documentContents();
0 commit comments

Comments
 (0)