Skip to content

Commit 44c357f

Browse files
GitHub #581: server + workflow: v2 architecture Phase 3: dedicated task matching and dispatch (#513)
1 parent f57ffc3 commit 44c357f

2 files changed

Lines changed: 117 additions & 0 deletions

File tree

docs/architecture/task-matching.md

Lines changed: 39 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -481,6 +481,30 @@ layer healthy" without reading task rows directly:
481481
poller heartbeats, and stale-worker detection.
482482
- The standalone server exposes the metrics snapshot at
483483
`GET /api/system/metrics`.
484+
- `Workflow\V2\Support\HealthCheck::snapshot()` returns the
485+
`routing_health` check as the matching role's aggregate health
486+
surface. The check rolls `backlog.compatibility_blocked_runs`,
487+
`tasks.dispatch_overdue`, and `tasks.claim_failed` together with
488+
the matching-role shape fields above (`queue_wake_enabled`,
489+
`matching_shape`, `wake_owner`, `task_dispatch_mode`) and the
490+
worker-fleet coverage triple (`required_compatibility`,
491+
`active_workers`, `active_workers_supporting_required`) plus the
492+
derived `fleet_supports_required` flag (true when no marker is
493+
required, or when at least one active worker advertises the
494+
required marker). Operators read this single check to tell
495+
whether work is blocked on compatibility coverage, dispatch wake
496+
latency, or claim churn — and whether the active fleet can take
497+
the required marker at all — without re-aggregating the
498+
matching-role metrics by hand. The check escalates only on
499+
visible drain signals (`compatibility_blocked_runs`,
500+
`dispatch_overdue`, `claim_failed`); the worker-coverage triple
501+
is observability so the dedicated `worker_compatibility` check
502+
stays the sole owner of fleet-admission escalation. An
503+
`active_workers_supporting_required = 0` reading paired with a
504+
non-zero `compatibility_blocked_runs` count is the canonical
505+
"the fleet cannot take this work" reading on the matching
506+
surface, mirroring the dedicated `worker_compatibility` check
507+
without duplicating its escalation logic.
484508

485509
Guarantees:
486510

@@ -496,6 +520,21 @@ Guarantees:
496520
populated `last_dispatch_error` is reported through
497521
`tasks.dispatch_failed` and remains discoverable to pollers, so
498522
a transport outage does not drop work.
523+
- Wake-latency on the dispatch path is visible without walking
524+
task rows. `tasks.oldest_dispatch_overdue_since` /
525+
`tasks.max_dispatch_overdue_age_ms` and the symmetric
526+
`tasks.oldest_claim_failed_at` / `tasks.max_claim_failed_age_ms`
527+
pair are forwarded onto `routing_health` as
528+
`oldest_dispatch_overdue_since` /
529+
`max_dispatch_overdue_age_ms` and `oldest_claim_failed_at` /
530+
`max_claim_failed_age_ms` so a stuck dispatch wake or stuck
531+
claim error is observable as an age, not just a count.
532+
- Compatibility-block age on the routing path is visible the same
533+
way. `backlog.oldest_compatibility_blocked_started_at` and
534+
`backlog.max_compatibility_blocked_age_ms` are forwarded onto
535+
`routing_health` as `oldest_compatibility_blocked_started_at`
536+
and `max_compatibility_blocked_age_ms` so operators can size
537+
the block by age, not only by count.
499538

500539
## Coupling boundaries with durable history
501540

tests/Unit/V2/TaskMatchingDocumentationTest.php

Lines changed: 78 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -102,6 +102,7 @@ final class TaskMatchingDocumentationTest extends TestCase
102102
'CacheLongPollWakeStore',
103103
'OperatorMetrics',
104104
'OperatorQueueVisibility',
105+
'HealthCheck',
105106
'RunSummaryProjector',
106107
'LifecycleEventDispatcher',
107108
'ActivityLease',
@@ -495,6 +496,83 @@ public function testContractDocumentExposesMatchingRoleShapeOnOperatorSnapshot()
495496
}
496497
}
497498

499+
public function testContractDocumentNamesRoutingHealthAggregate(): void
500+
{
501+
$contents = $this->documentContents();
502+
503+
$this->assertStringContainsString(
504+
'`routing_health`',
505+
$contents,
506+
'Task matching contract must name the routing_health health check as the matching role aggregate health surface.',
507+
);
508+
$this->assertStringContainsString(
509+
'HealthCheck::snapshot()',
510+
$contents,
511+
'Task matching contract must cite HealthCheck::snapshot() as the source of the routing_health aggregate.',
512+
);
513+
514+
foreach ([
515+
'`backlog.compatibility_blocked_runs`',
516+
'`tasks.dispatch_overdue`',
517+
'`tasks.claim_failed`',
518+
] as $rolledKey) {
519+
$this->assertStringContainsString(
520+
$rolledKey,
521+
$contents,
522+
sprintf(
523+
'Task matching contract must name the %s metric key rolled into routing_health so the drain signals are reviewable as a single aggregate.',
524+
$rolledKey,
525+
),
526+
);
527+
}
528+
529+
foreach ([
530+
'`required_compatibility`',
531+
'`active_workers`',
532+
'`active_workers_supporting_required`',
533+
'`fleet_supports_required`',
534+
] as $coverageField) {
535+
$this->assertStringContainsString(
536+
$coverageField,
537+
$contents,
538+
sprintf(
539+
'Task matching contract must name the %s field on the routing_health worker-coverage triple so operators can read fleet capability without re-aggregating worker_compatibility.',
540+
$coverageField,
541+
),
542+
);
543+
}
544+
545+
$this->assertMatchesRegularExpression(
546+
'/active_workers_supporting_required = 0[\s\S]{0,300}compatibility_blocked_runs[\s\S]{0,300}canonical[\s\S]{0,300}fleet cannot take this work/i',
547+
$contents,
548+
'Task matching contract must name the canonical "the fleet cannot take this work" reading on the routing surface (active_workers_supporting_required = 0 paired with a non-zero compatibility_blocked_runs count).',
549+
);
550+
551+
$this->assertMatchesRegularExpression(
552+
'/sole owner of fleet-admission escalation/i',
553+
$contents,
554+
'Task matching contract must state that worker_compatibility remains the sole owner of fleet-admission escalation so routing_health does not duplicate escalation logic.',
555+
);
556+
557+
foreach ([
558+
'`tasks.oldest_dispatch_overdue_since`',
559+
'`tasks.max_dispatch_overdue_age_ms`',
560+
'`tasks.oldest_claim_failed_at`',
561+
'`tasks.max_claim_failed_age_ms`',
562+
'`backlog.oldest_compatibility_blocked_started_at`',
563+
'`backlog.max_compatibility_blocked_age_ms`',
564+
] as $ageKey) {
565+
$this->assertStringContainsString(
566+
$ageKey,
567+
$contents,
568+
sprintf(
569+
'Task matching contract must name the %s age metric key forwarded onto routing_health so wake-latency is observable as an age, not just a count.',
570+
$ageKey,
571+
),
572+
);
573+
}
574+
}
575+
498576
private function documentContents(): string
499577
{
500578
$path = dirname(__DIR__, 3) . '/' . self::DOCUMENT;

0 commit comments

Comments
 (0)