Commit f067a68

Classify v2 health checks as correctness or acceleration
Every entry returned by Workflow\V2\Support\HealthCheck::snapshot() now carries an explicit category of correctness or acceleration, and the snapshot exposes a categories rollup so operator surfaces can answer two questions independently: is work being discovered from the durable dispatch substrate, and is the acceleration layer propagating through the configured cache or wake backend.

Add a dedicated long_poll_wake_acceleration check that reports the configured cache backend's multi-node capability via LongPollCacheValidator. Per the scheduler correctness contract, this check never escalates above warning even when the wake backend is unreachable: the acceleration layer is optional, and durable discovery continues via bounded polling when wake signalling is degraded. The existing eight checks retain their semantics and stay in the correctness category.

Pin the new contract language in docs/architecture/scheduler-correctness.md and docs/architecture/operational-liveness.md, and extend SchedulerCorrectnessDocumentationTest plus OperationalLivenessDocumentationTest so the category field, the acceleration check name, and the no-escalation rule cannot silently drift. HealthCheckTest gains three scenarios that cover every check being categorized, the category rollup being consistent with the check list, and the acceleration check reporting warning rather than error when the cache backend is incompatible with the configured multi-node topology.
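As a rough illustration of how an operator surface might read the `categories` rollup described above, here is a minimal sketch. The function `answerOperatorQuestions()` and the sample array are hypothetical and not part of this commit; only the `categories`, `status`, and `check_count` keys come from the diff. Treating anything other than `error` as "work is still being discovered" is an assumption modelled on the snapshot's existing `healthy` flag.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of the commit): answers the two operator
 * questions from a snapshot-shaped array without re-aggregating `checks`.
 *
 * @param array<string, mixed> $snapshot Shape mirrors HealthCheck::snapshot().
 * @return array{discovering: bool, accelerating: bool}
 */
function answerOperatorQuestions(array $snapshot): array
{
    $categories = $snapshot['categories'] ?? [];

    // Assumption: a missing rollup is read pessimistically, never as healthy.
    $correctness = $categories['correctness']['status'] ?? 'error';
    $acceleration = $categories['acceleration']['status'] ?? 'warning';

    return [
        // Assumption: durable discovery is only in doubt on `error`.
        'discovering' => $correctness !== 'error',
        // Acceleration counts as propagating only when its rollup is `ok`.
        'accelerating' => $acceleration === 'ok',
    ];
}

// Hand-built sample in the shape the commit describes; values are illustrative.
$sample = [
    'status' => 'warning',
    'healthy' => true,
    'categories' => [
        'correctness' => ['status' => 'ok', 'check_count' => 8],
        'acceleration' => ['status' => 'warning', 'check_count' => 1],
    ],
];

$answers = answerOperatorQuestions($sample);
```

With the sample above, a degraded wake backend changes only the `accelerating` answer, which is the separation the commit message is aiming for.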
1 parent f1efe0d commit f067a68

6 files changed

Lines changed: 269 additions & 5 deletions

docs/architecture/operational-liveness.md

Lines changed: 9 additions & 1 deletion
@@ -660,7 +660,7 @@ liveness (sub-keys under the named groups):
 surfaced as metrics so a dashboard can read them without reading
 config.

-The eight frozen health check names under `HealthCheck::snapshot()`:
+The nine frozen health check names under `HealthCheck::snapshot()`:

 - `backend_capabilities`
 - `run_summary_projection`
@@ -670,6 +670,14 @@ The eight frozen health check names under `HealthCheck::snapshot()`:
 - `task_transport`
 - `durable_resume_paths`
 - `worker_compatibility`
+- `long_poll_wake_acceleration`
+
+Every check entry also carries a `category` of `correctness` or
+`acceleration`. `long_poll_wake_acceleration` is the only entry
+whose `category` is `acceleration`; it reports wake-layer
+reachability and the acceleration contract (see
+`docs/architecture/scheduler-correctness.md`) forbids it from
+escalating above `warning`.

 Rules:

docs/architecture/scheduler-correctness.md

Lines changed: 16 additions & 3 deletions
@@ -434,9 +434,14 @@ through the following surfaces.
 correctness healthy?"
 - `Workflow\V2\Support\HealthCheck::snapshot()` — reports
 `backend_capabilities`, `task_transport`, `durable_resume_paths`,
-and `worker_compatibility` check status. Acceleration-layer
-issues appear here as `warning`; correctness-substrate issues
-appear as `error`.
+`worker_compatibility`, and `long_poll_wake_acceleration` check
+status. Each check carries an explicit `category` field whose
+value is either `correctness` or `acceleration`. The snapshot
+also exposes a `categories` map with a rolled-up `status` per
+category so a single response answers the two operator
+questions below without re-aggregating the check list.
+Acceleration-layer issues appear as `warning`;
+correctness-substrate issues appear as `error`.
 - `Workflow\V2\Support\OperatorQueueVisibility::forNamespace()`
 and `::forQueue()` — per-partition depth and claim state
 derived from `workflow_tasks`.
@@ -458,6 +463,14 @@ Guarantees:
 - Operators MUST be able to answer "is work being discovered?"
 and "is the acceleration layer propagating?" as separate
 questions, from separate metrics.
+- Every `HealthCheck::snapshot()` check entry MUST carry a
+`category` of `correctness` or `acceleration`. The
+`long_poll_wake_acceleration` check is the acceleration-layer
+surface and MUST NOT raise its status above `warning` even
+when the configured backend is unreachable, because the
+acceleration layer is optional by contract. Correctness
+checks remain free to report `error` when the durable
+substrate is broken.

 ## Test strategy alignment
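The two MUST rules added to the guarantees above can be read as a small lint over the check list. The following is a sketch only: `categoryContractViolations()` and the sample entries (including `legacy_check`) are hypothetical, and the real enforcement in this commit lives in the documentation tests and `HealthCheckTest`, not in a helper like this.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical contract linter (illustrative only, not part of the commit).
 *
 * @param list<array<string, mixed>> $checks
 * @return list<string> Human-readable violations; empty when the contract holds.
 */
function categoryContractViolations(array $checks): array
{
    $allowed = ['correctness', 'acceleration'];
    $violations = [];

    foreach ($checks as $check) {
        $name = (string) ($check['name'] ?? 'unknown');
        $category = $check['category'] ?? null;

        // Rule 1: every entry carries one of the two known categories.
        if (! in_array($category, $allowed, true)) {
            $violations[] = sprintf('%s: category must be correctness or acceleration', $name);
            continue;
        }

        // Rule 2: acceleration checks may report `ok` or `warning`, never `error`.
        if ($category === 'acceleration' && ($check['status'] ?? null) === 'error') {
            $violations[] = sprintf('%s: acceleration checks must not exceed warning', $name);
        }
    }

    return $violations;
}

// Illustrative input: one valid correctness error, one escalated
// acceleration check, and one entry with no category at all.
$violations = categoryContractViolations([
    ['name' => 'task_transport', 'status' => 'error', 'category' => 'correctness'],
    ['name' => 'long_poll_wake_acceleration', 'status' => 'error', 'category' => 'acceleration'],
    ['name' => 'legacy_check', 'status' => 'ok'],
]);
```

Note that the correctness entry reporting `error` is not a violation; only the escalated acceleration entry and the uncategorized entry are flagged.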

src/V2/Support/HealthCheck.php

Lines changed: 125 additions & 1 deletion
@@ -5,9 +5,15 @@
 namespace Workflow\V2\Support;

 use Carbon\CarbonInterface;
+use Illuminate\Contracts\Cache\Repository as CacheRepository;
+use Illuminate\Support\Facades\App;

 final class HealthCheck
 {
+    public const CATEGORY_CORRECTNESS = 'correctness';
+
+    public const CATEGORY_ACCELERATION = 'acceleration';
+
     /**
      * @return array<string, mixed>
      */
@@ -24,6 +30,7 @@ public static function snapshot(?CarbonInterface $now = null): array
             self::taskTransportCheck($metrics['tasks'] ?? [], $metrics['backlog'] ?? []),
             self::durableResumePathCheck($metrics['backlog'] ?? [], $metrics['repair'] ?? []),
             self::workerCompatibilityCheck($metrics['workers'] ?? []),
+            self::longPollWakeAccelerationCheck(),
         ];
         $status = self::status($checks);

@@ -32,6 +39,7 @@ public static function snapshot(?CarbonInterface $now = null): array
             'status' => $status,
             'healthy' => $status !== 'error',
             'checks' => $checks,
+            'categories' => self::categorySummary($checks),
             'operator_metrics' => $metrics,
             'structural_limits' => StructuralLimits::snapshot(),
         ];
@@ -60,6 +68,7 @@ private static function backendCheck(array $backend): array
             $supported
                 ? 'The configured database, queue, and cache backends satisfy the v2 capability contract.'
                 : 'One or more configured v2 backend capabilities are unsupported.',
+            self::CATEGORY_CORRECTNESS,
             [
                 'issue_count' => count($issues),
                 'issues' => $issues,
@@ -81,6 +90,7 @@ private static function runSummaryProjectionCheck(array $projection): array
             $needsRebuild === 0
                 ? 'Run-summary projections are aligned with durable v2 runs.'
                 : 'Run-summary projections are missing, stale, schema-outdated, or orphaned; rebuild them before trusting Waterline lists.',
+            self::CATEGORY_CORRECTNESS,
             [
                 'needs_rebuild' => $needsRebuild,
                 'missing' => self::integer($projection['missing'] ?? 0),
@@ -120,6 +130,7 @@ private static function selectedRunProjectionCheck(array $projections): array
             $needsRebuild === 0
                 ? 'Selected-run wait, timeline, timer, and lineage projections are aligned with durable v2 detail.'
                 : 'Selected-run wait, timeline, timer, or lineage projections need rebuild before trusting Waterline detail.',
+            self::CATEGORY_CORRECTNESS,
             [
                 'needs_rebuild' => $needsRebuild,
                 'run_waits_needs_rebuild' => $waitNeedsRebuild,
@@ -160,6 +171,7 @@ private static function historyRetentionInvariantCheck(array $history): array
             $orphaned === 0
                 ? 'Workflow history events all reference retained workflow runs.'
                 : 'Workflow history events exist without retained workflow runs; retention cleanup must reconcile them.',
+            self::CATEGORY_CORRECTNESS,
             [
                 'history_orphan_total' => $orphaned,
                 'events' => self::integer($history['events'] ?? 0),
@@ -181,6 +193,7 @@ private static function commandContractCheck(array $metrics): array
             $needed === 0
                 ? 'WorkflowStarted command-contract snapshots are complete.'
                 : 'Some WorkflowStarted command-contract snapshots need backfill before operators can trust command forms.',
+            self::CATEGORY_CORRECTNESS,
             [
                 'backfill_needed_runs' => $needed,
                 'backfill_available_runs' => self::integer($metrics['backfill_available_runs'] ?? 0),
@@ -204,6 +217,7 @@ private static function taskTransportCheck(array $tasks, array $backlog): array
             $unhealthyTasks === 0
                 ? 'No unhealthy durable task transport state is currently projected.'
                 : 'One or more durable tasks have unhealthy transport, claim, dispatch, or lease state.',
+            self::CATEGORY_CORRECTNESS,
             [
                 'unhealthy_tasks' => $unhealthyTasks,
                 'repair_needed_runs' => self::integer($backlog['repair_needed_runs'] ?? 0),
@@ -228,6 +242,7 @@ private static function durableResumePathCheck(array $backlog, array $repair): a
             $repairNeededRuns === 0
                 ? 'Every open v2 run has a projected durable resume path.'
                 : 'One or more open v2 runs are missing their durable next-resume source and need repair.',
+            self::CATEGORY_CORRECTNESS,
             [
                 'repair_needed_runs' => $repairNeededRuns,
                 'missing_task_candidates' => self::integer($repair['missing_task_candidates'] ?? 0),
@@ -258,6 +273,7 @@ private static function workerCompatibilityCheck(array $workers): array
             $required === null
                 ? 'No current v2 compatibility marker is required.'
                 : 'At least one active worker heartbeat advertises the current v2 compatibility marker.',
+            self::CATEGORY_CORRECTNESS,
             [
                 'required_compatibility' => $required,
                 'active_workers' => self::integer($workers['active_workers'] ?? 0),
@@ -271,6 +287,7 @@ private static function workerCompatibilityCheck(array $workers): array
             'worker_compatibility',
             'warning',
             'No active worker heartbeat advertises the current v2 compatibility marker.',
+            self::CATEGORY_CORRECTNESS,
             [
                 'required_compatibility' => $required,
                 'active_workers' => self::integer($workers['active_workers'] ?? 0),
@@ -280,6 +297,112 @@ private static function workerCompatibilityCheck(array $workers): array
         );
     }

+    /**
+     * Acceleration-layer health for the long-poll wake surface.
+     *
+     * The wake layer is optional by contract: correctness continues even
+     * when this check reports `warning`. The check exists so operators
+     * can answer "is the acceleration layer propagating?" as a separate
+     * question from "is work being discovered?".
+     *
+     * @return array<string, mixed>
+     */
+    private static function longPollWakeAccelerationCheck(): array
+    {
+        $multiNode = (bool) config('workflows.v2.long_poll.multi_node', false);
+        $data = [
+            'multi_node' => $multiNode,
+            'backend' => null,
+            'capable' => null,
+            'safe' => null,
+            'reason' => null,
+        ];
+
+        $cache = self::resolveCacheRepository();
+
+        if ($cache === null) {
+            return self::check(
+                'long_poll_wake_acceleration',
+                'warning',
+                'Cache repository is not resolvable; wake acceleration may be disabled. Durable discovery continues via bounded polling.',
+                self::CATEGORY_ACCELERATION,
+                $data,
+            );
+        }
+
+        $validator = new LongPollCacheValidator();
+        $capability = $validator->validateMultiNodeCapable($cache);
+        $safety = $validator->checkMultiNodeSafety($cache, $multiNode);
+
+        $data['backend'] = is_string($capability['backend'] ?? null) ? $capability['backend'] : null;
+        $data['capable'] = (bool) ($capability['capable'] ?? false);
+        $data['safe'] = (bool) ($safety['safe'] ?? true);
+        $data['reason'] = is_string($safety['message'] ?? null)
+            ? $safety['message']
+            : (is_string($capability['reason'] ?? null) ? $capability['reason'] : null);
+
+        if ($data['safe'] === true) {
+            return self::check(
+                'long_poll_wake_acceleration',
+                'ok',
+                $multiNode
+                    ? 'Wake acceleration backend is multi-node capable; dispatch discovery benefits from sub-second signalling.'
+                    : 'Wake acceleration backend is configured; dispatch discovery benefits from sub-second signalling.',
+                self::CATEGORY_ACCELERATION,
+                $data,
+            );
+        }
+
+        return self::check(
+            'long_poll_wake_acceleration',
+            'warning',
+            $data['reason'] ?? 'Wake acceleration layer is degraded; durable discovery continues via bounded polling.',
+            self::CATEGORY_ACCELERATION,
+            $data,
+        );
+    }
+
+    private static function resolveCacheRepository(): ?CacheRepository
+    {
+        try {
+            return App::make(CacheRepository::class);
+        } catch (\Throwable) {
+            return null;
+        }
+    }
+
+    /**
+     * Summarize check status per category so operators can answer
+     * "is work being discovered?" (correctness) and "is the
+     * acceleration layer propagating?" (acceleration) as separate
+     * questions without re-aggregating the check list.
+     *
+     * @param list<array<string, mixed>> $checks
+     * @return array<string, array<string, mixed>>
+     */
+    private static function categorySummary(array $checks): array
+    {
+        $categories = [
+            self::CATEGORY_CORRECTNESS => [],
+            self::CATEGORY_ACCELERATION => [],
+        ];
+
+        foreach ($checks as $check) {
+            $category = $check['category'] ?? self::CATEGORY_CORRECTNESS;
+            $categories[$category][] = $check;
+        }
+
+        $summaries = [];
+        foreach ($categories as $name => $entries) {
+            $summaries[$name] = [
+                'status' => self::status($entries),
+                'check_count' => count($entries),
+            ];
+        }
+
+        return $summaries;
+    }
+
     /**
      * @param list<array<string, mixed>> $checks
      */
@@ -302,11 +425,12 @@ private static function status(array $checks): string
      * @param array<string, mixed> $data
      * @return array<string, mixed>
      */
-    private static function check(string $name, string $status, string $message, array $data): array
+    private static function check(string $name, string $status, string $message, string $category, array $data): array
     {
         return [
             'name' => $name,
             'status' => $status,
+            'category' => $category,
             'message' => $message,
             'data' => $data,
         ];
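The per-category rollup added in this file can be tried outside the framework with a standalone sketch. The body of `HealthCheck::status()` is not shown in this diff, so `worstStatus()` below is an assumption (worst status wins, `error` over `warning` over `ok`); both function names are hypothetical. The grouping and the `status` / `check_count` shape follow `categorySummary()` as shown above.

```php
<?php

declare(strict_types=1);

/**
 * Assumed aggregation: the worst status wins (error > warning > ok).
 * The real HealthCheck::status() body is not part of this diff.
 *
 * @param list<array<string, mixed>> $checks
 */
function worstStatus(array $checks): string
{
    $rank = ['ok' => 0, 'warning' => 1, 'error' => 2];
    $worst = 'ok';

    foreach ($checks as $check) {
        $status = (string) ($check['status'] ?? 'ok');
        // Unknown statuses rank as `ok` in this sketch.
        if (($rank[$status] ?? 0) > $rank[$worst]) {
            $worst = $status;
        }
    }

    return $worst;
}

/**
 * Groups checks by category and rolls each group up.
 *
 * @param list<array<string, mixed>> $checks
 * @return array<string, array{status: string, check_count: int}>
 */
function summarizeByCategory(array $checks): array
{
    $groups = ['correctness' => [], 'acceleration' => []];

    foreach ($checks as $check) {
        // Uncategorized entries fall back to correctness, as in the diff.
        $groups[$check['category'] ?? 'correctness'][] = $check;
    }

    $summary = [];
    foreach ($groups as $name => $entries) {
        $summary[$name] = [
            'status' => worstStatus($entries),
            'check_count' => count($entries),
        ];
    }

    return $summary;
}

// Illustrative input; names match the diff, statuses are made up.
$summary = summarizeByCategory([
    ['name' => 'task_transport', 'status' => 'ok', 'category' => 'correctness'],
    ['name' => 'worker_compatibility', 'status' => 'warning', 'category' => 'correctness'],
    ['name' => 'long_poll_wake_acceleration', 'status' => 'ok', 'category' => 'acceleration'],
    ['name' => 'uncategorized_example', 'status' => 'ok'],
]);
```

Seeding both category keys before the loop means each category appears in the result even when no check belongs to it, which keeps the response shape stable for dashboards.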

tests/Unit/V2/HealthCheckTest.php

Lines changed: 82 additions & 0 deletions
@@ -617,4 +617,86 @@ public function testSnapshotWarnsWhenRunSummaryProjectionSchemaIsOutdated(): voi
         $this->assertSame(1, $projection['data']['schema_outdated']);
         $this->assertSame(RunSummaryProjector::SCHEMA_VERSION, $projection['data']['projection_schema_version']);
     }
+
+    public function testSnapshotClassifiesEveryCheckAsCorrectnessOrAcceleration(): void
+    {
+        config()->set('queue.default', 'redis');
+        config()->set('queue.connections.redis.driver', 'redis');
+        config()->set('cache.default', 'array');
+        config()->set('cache.stores.array.driver', 'array');
+
+        $snapshot = HealthCheck::snapshot();
+
+        $this->assertNotEmpty($snapshot['checks']);
+        $allowed = [HealthCheck::CATEGORY_CORRECTNESS, HealthCheck::CATEGORY_ACCELERATION];
+
+        foreach ($snapshot['checks'] as $check) {
+            $this->assertArrayHasKey('category', $check, sprintf(
+                'HealthCheck entry %s must expose a category field so operators can separate correctness from acceleration.',
+                $check['name'] ?? 'unknown',
+            ));
+            $this->assertContains($check['category'], $allowed, sprintf(
+                'HealthCheck entry %s has invalid category %s; must be correctness or acceleration.',
+                $check['name'] ?? 'unknown',
+                (string) ($check['category'] ?? ''),
+            ));
+        }
+
+        $this->assertArrayHasKey('categories', $snapshot);
+        $this->assertArrayHasKey('correctness', $snapshot['categories']);
+        $this->assertArrayHasKey('acceleration', $snapshot['categories']);
+
+        $correctnessChecks = collect($snapshot['checks'])
+            ->where('category', HealthCheck::CATEGORY_CORRECTNESS)
+            ->count();
+        $accelerationChecks = collect($snapshot['checks'])
+            ->where('category', HealthCheck::CATEGORY_ACCELERATION)
+            ->count();
+
+        $this->assertSame($correctnessChecks, $snapshot['categories']['correctness']['check_count']);
+        $this->assertSame($accelerationChecks, $snapshot['categories']['acceleration']['check_count']);
+        $this->assertGreaterThan(0, $correctnessChecks);
+        $this->assertGreaterThan(0, $accelerationChecks);
+
+        $wake = collect($snapshot['checks'])->firstWhere('name', 'long_poll_wake_acceleration');
+        $this->assertNotNull($wake, 'HealthCheck must expose a long_poll_wake_acceleration check.');
+        $this->assertSame(HealthCheck::CATEGORY_ACCELERATION, $wake['category']);
+    }
+
+    public function testLongPollWakeAccelerationCheckNeverEscalatesAboveWarning(): void
+    {
+        config()->set('queue.default', 'redis');
+        config()->set('queue.connections.redis.driver', 'redis');
+        config()->set('cache.default', 'file');
+        config()->set('cache.stores.file.driver', 'file');
+        config()->set('workflows.v2.long_poll.multi_node', true);
+
+        $snapshot = HealthCheck::snapshot();
+        $wake = collect($snapshot['checks'])->firstWhere('name', 'long_poll_wake_acceleration');
+
+        $this->assertNotNull($wake);
+        $this->assertContains(
+            $wake['status'],
+            ['ok', 'warning'],
+            'Acceleration-layer check must never escalate to error; correctness remains independent of the acceleration layer.',
+        );
+        $this->assertSame('warning', $wake['status']);
+        $this->assertFalse($wake['data']['safe']);
+        $this->assertNotNull($wake['data']['reason']);
+    }
+
+    public function testLongPollWakeAccelerationReportsOkWhenBackendIsCapable(): void
+    {
+        config()->set('cache.default', 'array');
+        config()->set('cache.stores.array.driver', 'array');
+        config()->set('workflows.v2.long_poll.multi_node', false);
+
+        $snapshot = HealthCheck::snapshot();
+        $wake = collect($snapshot['checks'])->firstWhere('name', 'long_poll_wake_acceleration');
+
+        $this->assertNotNull($wake);
+        $this->assertSame('ok', $wake['status']);
+        $this->assertSame(HealthCheck::CATEGORY_ACCELERATION, $wake['category']);
+        $this->assertArrayHasKey('backend', $wake['data']);
+    }
 }
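The last two scenarios above exercise a three-way decision that can be restated as a pure function. This is a simplified sketch, not the committed code: `wakeAccelerationStatus()` is a hypothetical name, a `null` argument stands in for "cache repository not resolvable", and the messages are shortened. The point it illustrates is that no branch can produce `error`.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical pure-function restatement of the wake check's branching.
 *
 * @param bool|null   $safe   null models "cache repository not resolvable".
 * @param string|null $reason Validator message, when one is available.
 * @return array{status: string, message: string}
 */
function wakeAccelerationStatus(?bool $safe, ?string $reason = null): array
{
    if ($safe === null) {
        return ['status' => 'warning', 'message' => 'Cache repository is not resolvable.'];
    }

    if ($safe) {
        return ['status' => 'ok', 'message' => 'Wake acceleration backend is configured.'];
    }

    // Degraded backends stop at `warning`; there is deliberately no `error` branch.
    return ['status' => 'warning', 'message' => $reason ?? 'Wake acceleration layer is degraded.'];
}

// Illustrative calls covering all three branches; the reason text is made up.
$unresolvable = wakeAccelerationStatus(null);
$capable = wakeAccelerationStatus(true);
$incompatible = wakeAccelerationStatus(false, 'example validator message');
```

Keeping the decision free of framework calls like this would make the no-escalation rule checkable without configuring a cache store, though the committed tests take the integration route through `HealthCheck::snapshot()`.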

tests/Unit/V2/OperationalLivenessDocumentationTest.php

Lines changed: 1 addition & 0 deletions
@@ -191,6 +191,7 @@ final class OperationalLivenessDocumentationTest extends TestCase
         'task_transport',
         'durable_resume_paths',
         'worker_compatibility',
+        'long_poll_wake_acceleration',
     ];

     private const REQUIRED_MIGRATIONS = [
