Skip to content

Commit b43d992

Browse files
Surface activity timeout-overdue age on operator metrics
Adds `activities.timeout_overdue`, `activities.oldest_timeout_overdue_at`, and `activities.max_timeout_overdue_age_ms` to `OperatorMetrics::snapshot()` so operators can read "how long has the worst-case activity been past a timeout deadline without enforcement?" — the primary stuck-activity duplicate-risk age indicator on the activity path — from the metric alone without walking `activity_executions`. The `timeout_overdue` predicate mirrors `ActivityTimeoutEnforcer::expiredExecutionIds()` exactly, so the count is the operator-visible view of the same enforcement backlog: it covers open activity executions whose `schedule_deadline_at` (Pending), `close_deadline_at` (Running), `schedule_to_close_deadline_at` (Pending or Running), or `heartbeat_deadline_at` (Running) has passed and is therefore waiting for the timeout sweep to act. Sustained non-zero readings indicate the activity-timeout sweep is lagging or stalled and that worker liveness via heartbeat or start-to-close has stopped on at least one execution. The signal is the activity-path counterpart of `tasks.lease_expired` — both surface stuck work that the corresponding sweep has not yet reclaimed. Pinned in `docs/architecture/rollout-safety.md` Frozen metric keys table and asserted by `RolloutSafetyDocumentationTest` (frozen-keys list plus a dedicated row regex). Covered end-to-end by two new `V2OperatorMetricsTest` cases that verify the four-deadline predicate matches the enforcer (each enforcement-relevant deadline and status guard, status-only-applicable deadline columns excluded, Completed executions excluded) and that the signal returns to zero when no activity is overdue.
1 parent d5e1fc1 commit b43d992

4 files changed

Lines changed: 312 additions & 0 deletions

File tree

docs/architecture/rollout-safety.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -416,6 +416,7 @@ change.
416416
| `tasks` | `oldest_dispatch_failed_at`, `max_dispatch_failed_age_ms` | earliest `last_dispatch_attempt_at` among dispatch-failed tasks (Ready tasks whose most recent dispatch attempt recorded an uncleared `last_dispatch_error` that has not been superseded by a later successful dispatch) and the largest dispatch-failed age in milliseconds, mirroring the `oldest_claim_failed_at` / `max_claim_failed_age_ms` shape for the claim path so operators can read "how long has the worst-case task been sitting with an uncleared dispatch error?" — the primary transport-failure age indicator on the dispatch path — from the metric alone without walking `workflow_tasks` |
417417
| `tasks` | `unhealthy`, `oldest_unhealthy_at`, `max_unhealthy_age_ms` | sum of transport failure and lease expiry counts (the primary duplicate-risk indicator), the earliest of `oldest_dispatch_failed_at` / `oldest_claim_failed_at` / `oldest_dispatch_overdue_since` / `oldest_lease_expired_at` (`null` when `unhealthy = 0`), and the largest unhealthy age in milliseconds across those four contributing paths so operators can read "how stale is my worst-case duplicate-risk task overall?" from the metric alone without taking a max over four separate per-path age fields |
418418
| `activities` | `retrying`, `oldest_retrying_started_at`, `max_retrying_age_ms` | activity executions currently in the retry window (Pending status with `attempt_count > 0`), the earliest `started_at` among them, and the largest retrying age in milliseconds, mirroring the `tasks.oldest_lease_expired_at` / `max_lease_expired_age_ms` shape on the task path so operators can answer "how long has the worst-case activity been chewing retries?" — the primary retry-rate age indicator on the activity path — from the metric alone without walking `activity_executions` |
419+
| `activities` | `timeout_overdue`, `oldest_timeout_overdue_at`, `max_timeout_overdue_age_ms` | open activity executions whose `schedule_deadline_at` (Pending), `close_deadline_at` (Running), `schedule_to_close_deadline_at`, or `heartbeat_deadline_at` (Running) is `<= $now` and is therefore waiting for `ActivityTimeoutEnforcer` to enforce the timeout, the earliest such expired deadline timestamp across the four enforcement-relevant columns, and the largest overdue age in milliseconds. The `timeout_overdue` predicate mirrors `ActivityTimeoutEnforcer::expiredExecutionIds()` exactly, so the count is the operator-visible view of the same enforcement backlog — the activity-path counterpart of `tasks.lease_expired`. Both surface stuck work that the corresponding sweep has not yet reclaimed; sustained non-zero readings indicate the activity-timeout sweep is lagging or stalled and that worker liveness via heartbeat or start-to-close has stopped on at least one execution. The age data lets operators read "how long has the worst-case activity been past a timeout deadline without enforcement?" — the primary stuck-activity duplicate-risk age indicator on the activity path — from the metric alone without walking `activity_executions` |
419420
| `backlog` | `runnable_tasks`, `delayed_tasks`, `leased_tasks` | authoritative backlog counts |
420421
| `backlog` | `unhealthy_tasks`, `repair_needed_runs`, `claim_failed_runs`, `compatibility_blocked_runs` | stuck/blocked roll-ups |
421422
| `backlog` | `oldest_compatibility_blocked_started_at`, `max_compatibility_blocked_age_ms` | earliest wait-start timestamp among compatibility-blocked runs and the largest blocked age in milliseconds, mirroring the `repair.oldest_missing_run_started_at` / `max_missing_run_age_ms` shape so operators can answer "how stale is the worst mixed-build block?" from the metric alone |

src/V2/Support/OperatorMetrics.php

Lines changed: 115 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -206,6 +206,7 @@ private static function backlogMetrics(CarbonInterface $now, ?string $namespace)
206206
private static function activityMetrics(CarbonInterface $now, ?string $namespace): array
207207
{
208208
$oldestRetryingStartedAt = self::oldestRetryingActivityStartedAt($namespace);
209+
$oldestTimeoutOverdueAt = self::oldestActivityTimeoutOverdueAt($now, $namespace);
209210

210211
return [
211212
'open' => self::scopedRunModelQuery(self::activityExecutionModel(), $namespace)
@@ -222,6 +223,11 @@ private static function activityMetrics(CarbonInterface $now, ?string $namespace
222223
'max_retrying_age_ms' => $oldestRetryingStartedAt === null
223224
? 0
224225
: (int) $oldestRetryingStartedAt->diffInMilliseconds($now),
226+
'timeout_overdue' => self::timeoutOverdueActivities($now, $namespace),
227+
'oldest_timeout_overdue_at' => $oldestTimeoutOverdueAt?->toJSON(),
228+
'max_timeout_overdue_age_ms' => $oldestTimeoutOverdueAt === null
229+
? 0
230+
: (int) $oldestTimeoutOverdueAt->diffInMilliseconds($now),
225231
'failed_attempts' => self::scopedRunModelQuery(self::activityAttemptModel(), $namespace)
226232
->where('status', ActivityAttemptStatus::Failed->value)
227233
->count(),
@@ -856,6 +862,115 @@ private static function retryingActivities(?string $namespace): int
856862
->count();
857863
}
858864

865+
/**
866+
* Open activity executions whose schedule-to-start, start-to-close,
867+
* schedule-to-close, or heartbeat deadline has passed and is therefore
868+
* waiting for `ActivityTimeoutEnforcer` to enforce the timeout.
869+
*
870+
* The predicate mirrors `ActivityTimeoutEnforcer::expiredExecutionIds()`
871+
* exactly so the count is the operator-visible view of the same
872+
* sweep backlog: a non-zero value means at least one activity has hit
873+
* a deadline that the enforcement pass has not yet acted on. When the
874+
* sweep is healthy this count returns to zero between passes; sustained
875+
* non-zero readings indicate the activity-timeout sweep is lagging or
876+
* stalled and that worker liveness via heartbeat or start-to-close has
877+
* stopped on at least one execution. The signal is the activity-path
878+
* counterpart of `tasks.lease_expired` — both surface stuck work that
879+
* the corresponding sweep has not yet reclaimed.
880+
*/
881+
private static function timeoutOverdueActivities(CarbonInterface $now, ?string $namespace): int
882+
{
883+
return self::scopedRunModelQuery(self::activityExecutionModel(), $namespace)
884+
->whereIn('status', [ActivityStatus::Pending->value, ActivityStatus::Running->value])
885+
->where(static function ($query) use ($now): void {
886+
$query->where(static function ($schedule) use ($now): void {
887+
$schedule->where('status', ActivityStatus::Pending->value)
888+
->whereNotNull('schedule_deadline_at')
889+
->where('schedule_deadline_at', '<=', $now);
890+
})->orWhere(static function ($close) use ($now): void {
891+
$close->where('status', ActivityStatus::Running->value)
892+
->whereNotNull('close_deadline_at')
893+
->where('close_deadline_at', '<=', $now);
894+
})->orWhere(static function ($scheduleToClose) use ($now): void {
895+
$scheduleToClose->whereNotNull('schedule_to_close_deadline_at')
896+
->where('schedule_to_close_deadline_at', '<=', $now);
897+
})->orWhere(static function ($heartbeat) use ($now): void {
898+
$heartbeat->where('status', ActivityStatus::Running->value)
899+
->whereNotNull('heartbeat_deadline_at')
900+
->where('heartbeat_deadline_at', '<=', $now);
901+
});
902+
})
903+
->count();
904+
}
905+
906+
/**
907+
* Earliest deadline timestamp across activity executions whose
908+
* schedule-to-start, start-to-close, schedule-to-close, or heartbeat
909+
* deadline has already passed. Rollout-safety surfaces this alongside
910+
* `activities.timeout_overdue` so operators can answer "how long has
911+
* the worst-case activity been past a timeout deadline without
912+
* enforcement?" — the primary stuck-activity duplicate-risk age
913+
* indicator on the activity path — from the metric alone, mirroring
914+
* the `tasks.oldest_lease_expired_at` / `max_lease_expired_age_ms`
915+
* shape on the task path. The earliest expired deadline among the
916+
* four enforcement-relevant deadline columns wins.
917+
*/
918+
private static function oldestActivityTimeoutOverdueAt(CarbonInterface $now, ?string $namespace): ?CarbonInterface
919+
{
920+
return self::earliestTimestamp([
921+
self::firstActivityDeadlineAt(
922+
$namespace,
923+
'schedule_deadline_at',
924+
$now,
925+
[ActivityStatus::Pending->value],
926+
),
927+
self::firstActivityDeadlineAt($namespace, 'close_deadline_at', $now, [ActivityStatus::Running->value]),
928+
self::firstActivityDeadlineAt(
929+
$namespace,
930+
'schedule_to_close_deadline_at',
931+
$now,
932+
[ActivityStatus::Pending->value, ActivityStatus::Running->value],
933+
),
934+
self::firstActivityDeadlineAt(
935+
$namespace,
936+
'heartbeat_deadline_at',
937+
$now,
938+
[ActivityStatus::Running->value],
939+
),
940+
]);
941+
}
942+
943+
/**
944+
* Earliest expired deadline timestamp on $column among activity
945+
* executions in one of $statuses, or null when none are expired. Used
946+
* by `oldestActivityTimeoutOverdueAt()` to roll the four
947+
* enforcement-relevant deadline columns into a single worst-case age.
948+
*
949+
* @param list<string> $statuses
950+
*/
951+
private static function firstActivityDeadlineAt(
952+
?string $namespace,
953+
string $column,
954+
CarbonInterface $now,
955+
array $statuses,
956+
): ?CarbonInterface {
957+
/** @var ActivityExecution|null $execution */
958+
$execution = self::scopedRunModelQuery(self::activityExecutionModel(), $namespace)
959+
->whereIn('status', $statuses)
960+
->whereNotNull($column)
961+
->where($column, '<=', $now)
962+
->orderBy($column)
963+
->first();
964+
965+
if (! $execution instanceof ActivityExecution) {
966+
return null;
967+
}
968+
969+
$value = $execution->getAttribute($column);
970+
971+
return $value instanceof CarbonInterface ? $value : null;
972+
}
973+
859974
/**
860975
* Earliest `started_at` across activity executions currently in the
861976
* retry window — Pending status with `attempt_count > 0`. Rollout-safety

tests/Feature/V2/V2OperatorMetricsTest.php

Lines changed: 182 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1709,6 +1709,188 @@ public function testSnapshotReportsRetryingActivityAgeAsZeroWhenNoActivitiesAreR
17091709
$this->assertSame(0, $snapshot['activities']['max_retrying_age_ms']);
17101710
}
17111711

1712+
public function testSnapshotSurfacesActivityTimeoutOverdueAgeFromEarliestExpiredDeadline(): void
1713+
{
1714+
Carbon::setTestNow('2026-04-09 12:00:00');
1715+
$this->beforeApplicationDestroyed(static function (): void {
1716+
Carbon::setTestNow();
1717+
});
1718+
1719+
$now = Carbon::now();
1720+
1721+
$run = $this->createRunWithSummary(
1722+
instanceId: 'timeout-overdue-instance',
1723+
runId: '01JTIMEOUTOVDRUN000000001',
1724+
status: 'running',
1725+
statusBucket: 'running',
1726+
livenessState: 'running',
1727+
);
1728+
1729+
// Worst-case: Pending activity whose schedule_deadline_at expired
1730+
// 240s ago — earliest expired deadline across the four enforcement
1731+
// columns, so it wins both the count and the oldest selection.
1732+
$this->createActivityExecution($run, '01JTIMEOUTOVDEXEC0000001', [
1733+
'sequence' => 1,
1734+
'status' => ActivityStatus::Pending->value,
1735+
'attempt_count' => 0,
1736+
'started_at' => $now->copy()
1737+
->subSeconds(300),
1738+
'schedule_deadline_at' => $now->copy()
1739+
->subSeconds(240),
1740+
]);
1741+
1742+
// Running activity whose start-to-close (close_deadline_at) expired
1743+
// 90s ago — counted, but a newer expiry than the worst-case above.
1744+
$this->createActivityExecution($run, '01JTIMEOUTOVDEXEC0000002', [
1745+
'sequence' => 2,
1746+
'status' => ActivityStatus::Running->value,
1747+
'attempt_count' => 1,
1748+
'started_at' => $now->copy()
1749+
->subSeconds(180),
1750+
'close_deadline_at' => $now->copy()
1751+
->subSeconds(90),
1752+
]);
1753+
1754+
// Running activity whose heartbeat_deadline_at expired 60s ago —
1755+
// counted, but newer than 240s and 90s above.
1756+
$this->createActivityExecution($run, '01JTIMEOUTOVDEXEC0000003', [
1757+
'sequence' => 3,
1758+
'status' => ActivityStatus::Running->value,
1759+
'attempt_count' => 1,
1760+
'started_at' => $now->copy()
1761+
->subSeconds(120),
1762+
'heartbeat_deadline_at' => $now->copy()
1763+
->subSeconds(60),
1764+
]);
1765+
1766+
// Pending activity whose schedule_to_close_deadline_at expired 30s
1767+
// ago — counted (schedule_to_close applies to both Pending and
1768+
// Running), still newer than the worst-case.
1769+
$this->createActivityExecution($run, '01JTIMEOUTOVDEXEC0000004', [
1770+
'sequence' => 4,
1771+
'status' => ActivityStatus::Pending->value,
1772+
'attempt_count' => 0,
1773+
'started_at' => null,
1774+
'schedule_to_close_deadline_at' => $now->copy()
1775+
->subSeconds(30),
1776+
]);
1777+
1778+
// Running activity with deadlines that are all in the future — NOT
1779+
// counted; the enforcement sweep has nothing to do here.
1780+
$this->createActivityExecution($run, '01JTIMEOUTOVDEXEC0000005', [
1781+
'sequence' => 5,
1782+
'status' => ActivityStatus::Running->value,
1783+
'attempt_count' => 0,
1784+
'started_at' => $now->copy()
1785+
->subSeconds(15),
1786+
'close_deadline_at' => $now->copy()
1787+
->addSeconds(60),
1788+
'heartbeat_deadline_at' => $now->copy()
1789+
->addSeconds(30),
1790+
]);
1791+
1792+
// Closed activity whose deadlines are all in the past — Completed
1793+
// executions are not selected by the enforcer, so they are NOT
1794+
// counted even with deadlines well in the past.
1795+
$this->createActivityExecution($run, '01JTIMEOUTOVDEXEC0000006', [
1796+
'sequence' => 6,
1797+
'status' => ActivityStatus::Completed->value,
1798+
'attempt_count' => 1,
1799+
'started_at' => $now->copy()
1800+
->subSeconds(420),
1801+
'close_deadline_at' => $now->copy()
1802+
->subSeconds(360),
1803+
'closed_at' => $now->copy()
1804+
->subSeconds(300),
1805+
]);
1806+
1807+
// Running activity with a Pending-only deadline (schedule_deadline_at)
1808+
// expired in the past — schedule_deadline_at only counts when the
1809+
// status is Pending, so this row is NOT counted by the enforcer
1810+
// predicate even though the deadline column has an expired value.
1811+
$this->createActivityExecution($run, '01JTIMEOUTOVDEXEC0000007', [
1812+
'sequence' => 7,
1813+
'status' => ActivityStatus::Running->value,
1814+
'attempt_count' => 1,
1815+
'started_at' => $now->copy()
1816+
->subSeconds(150),
1817+
'schedule_deadline_at' => $now->copy()
1818+
->subSeconds(360),
1819+
]);
1820+
1821+
// Pending activity with a Running-only deadline (heartbeat_deadline_at)
1822+
// expired in the past — heartbeat_deadline_at only counts when the
1823+
// status is Running, so this row is NOT counted by the enforcer
1824+
// predicate even though the deadline column has an expired value.
1825+
$this->createActivityExecution($run, '01JTIMEOUTOVDEXEC0000008', [
1826+
'sequence' => 8,
1827+
'status' => ActivityStatus::Pending->value,
1828+
'attempt_count' => 0,
1829+
'started_at' => null,
1830+
'heartbeat_deadline_at' => $now->copy()
1831+
->subSeconds(360),
1832+
]);
1833+
1834+
$snapshot = OperatorMetrics::snapshot($now);
1835+
1836+
$expectedOldestTimeoutOverdueAt = $now->copy()
1837+
->subSeconds(240)
1838+
->toJSON();
1839+
1840+
$this->assertSame(4, $snapshot['activities']['timeout_overdue']);
1841+
$this->assertSame($expectedOldestTimeoutOverdueAt, $snapshot['activities']['oldest_timeout_overdue_at']);
1842+
$this->assertSame(240 * 1000, $snapshot['activities']['max_timeout_overdue_age_ms']);
1843+
}
1844+
1845+
public function testSnapshotReportsActivityTimeoutOverdueAgeAsZeroWhenNoActivityIsOverdue(): void
1846+
{
1847+
Carbon::setTestNow('2026-04-09 12:00:00');
1848+
$this->beforeApplicationDestroyed(static function (): void {
1849+
Carbon::setTestNow();
1850+
});
1851+
1852+
$now = Carbon::now();
1853+
1854+
$run = $this->createRunWithSummary(
1855+
instanceId: 'timeout-none-instance',
1856+
runId: '01JTIMEOUTNONERUN00000001',
1857+
status: 'running',
1858+
statusBucket: 'running',
1859+
livenessState: 'running',
1860+
);
1861+
1862+
// Pending activity with no enforcement deadlines yet recorded —
1863+
// not counted; nothing to enforce.
1864+
$this->createActivityExecution($run, '01JTIMEOUTNONEEXEC000001', [
1865+
'sequence' => 1,
1866+
'status' => ActivityStatus::Pending->value,
1867+
'attempt_count' => 0,
1868+
'started_at' => null,
1869+
]);
1870+
1871+
// Running activity with deadlines safely in the future — not
1872+
// counted; the enforcement sweep has nothing to do.
1873+
$this->createActivityExecution($run, '01JTIMEOUTNONEEXEC000002', [
1874+
'sequence' => 2,
1875+
'status' => ActivityStatus::Running->value,
1876+
'attempt_count' => 1,
1877+
'started_at' => $now->copy()
1878+
->subSeconds(15),
1879+
'close_deadline_at' => $now->copy()
1880+
->addSeconds(120),
1881+
'heartbeat_deadline_at' => $now->copy()
1882+
->addSeconds(60),
1883+
'schedule_to_close_deadline_at' => $now->copy()
1884+
->addSeconds(900),
1885+
]);
1886+
1887+
$snapshot = OperatorMetrics::snapshot($now);
1888+
1889+
$this->assertSame(0, $snapshot['activities']['timeout_overdue']);
1890+
$this->assertNull($snapshot['activities']['oldest_timeout_overdue_at']);
1891+
$this->assertSame(0, $snapshot['activities']['max_timeout_overdue_age_ms']);
1892+
}
1893+
17121894
public function testSnapshotSurfacesMissingRunSummaryProjectionAgeFromOldestMissingRun(): void
17131895
{
17141896
Carbon::setTestNow('2026-04-09 12:00:00');

tests/Unit/V2/RolloutSafetyDocumentationTest.php

Lines changed: 14 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -134,6 +134,9 @@ final class RolloutSafetyDocumentationTest extends TestCase
134134
'max_dispatch_failed_age_ms',
135135
'oldest_retrying_started_at',
136136
'max_retrying_age_ms',
137+
'timeout_overdue',
138+
'oldest_timeout_overdue_at',
139+
'max_timeout_overdue_age_ms',
137140
'unhealthy',
138141
'oldest_unhealthy_at',
139142
'max_unhealthy_age_ms',
@@ -413,6 +416,17 @@ public function testContractDocumentFreezesRetryingActivityAgeRow(): void
413416
);
414417
}
415418

419+
public function testContractDocumentFreezesActivityTimeoutOverdueAgeRow(): void
420+
{
421+
$contents = $this->documentContents();
422+
423+
$this->assertMatchesRegularExpression(
424+
'/\|\s*`activities`\s*\|[^|]*`timeout_overdue`[^|]*`oldest_timeout_overdue_at`[^|]*`max_timeout_overdue_age_ms`/',
425+
$contents,
426+
'Rollout safety contract must pin the activities timeout-overdue age row so operators can read "how long has the worst-case activity been past a timeout deadline without enforcement?" — the primary stuck-activity duplicate-risk age indicator on the activity path — from OperatorMetrics::snapshot() without walking activity_executions.',
427+
);
428+
}
429+
416430
public function testContractDocumentFreezesRunSummaryProjectionMissingRunAgeRow(): void
417431
{
418432
$contents = $this->documentContents();

0 commit comments

Comments
 (0)