Commit 7bc33d1
committed
Freeze v2 scheduler correctness and cache independence contract
Freeze v2 scheduler correctness and cache independence contract
Issue #583 opens Phase 5 of the v2 multi-node architecture roadmap,
which removes shared cache from the scheduler correctness path so
notifier or wake propagation can improve latency without being
required for correct multi-node scheduling. Today the cache-backed
wake store is conventionally acceleration-only, but "acceleration
vs correctness" is not a documented contract; any subsystem that
treats a cache read as a correctness signal is silently out of
contract and any future backend swap risks regressing a behavior
no one has frozen.
This lands the contract doc and a pinning test.
docs/architecture/scheduler-correctness.md:
- scopes the contract to durable dispatch state, the acceleration
layer, degraded-mode behavior, bounded discovery latency,
scheduler fire evaluation, lease expiry and redelivery, backend
classification, misconfiguration detection, config surface,
migration path, operator-visible state, and test strategy
alignment
- freezes the correctness substrate as six durable tables
(workflow_tasks, workflow_runs, workflow_instances,
activity_executions, activity_attempts, workflow_schedules)
whose contents are necessary and sufficient for scheduler
correctness; forbids gating reads of these rows on
acceleration-layer state and forbids skipping or deferring
writes based on acceleration-signal presence
- classifies Laravel cache stores (redis, database, memcached)
as permitted acceleration-only backends and file/array as
unsupported for multi-node acceleration, pinning the
LongPollCacheValidator classification to the contract
- classifies the database drivers (mysql, pgsql, sqlite, sqlsrv)
as the only permitted correctness substrate backends and
pins BackendCapabilities validation as the authority
- enumerates five degraded-mode scenarios (wake backend
unreachable, partitioned, lost-signal, compatibility
heartbeat cache down, cache permanently unavailable) with
explicit guaranteed outcomes; forbids nodes from refusing to
make progress or fabricating state when acceleration is
degraded
- pins bounded discovery latency to durable knobs
(DEFAULT_LONG_POLL_TIMEOUT, MAX_LONG_POLL_TIMEOUT,
task_repair.redispatch_after_seconds,
task_repair.loop_throttle_seconds, and
ActivityLease::DURATION_MINUTES) so the guaranteed
discovery-latency ceiling holds whether or not wake
propagation is healthy
- freezes schedule fire evaluation as a pure durable-state
read of workflow_schedules.next_fire_at and
buffered_actions; the scheduler does not read cache to
decide which schedules are due, and the per-schedule row
lock is the only correctness seam for multi-replica
scheduling until Phase 6 (#584) hardens leader election
- freezes lease expiry and redelivery as DB-only via
TaskRepairPolicy::leaseExpired() and TaskRepair, so a
cache outage never prevents redelivery or masks stuck
leases
- pins the detection surfaces: boot-time validation via
LongPollCacheValidator, backend health via
BackendCapabilities and HealthCheck, dispatch and repair
metrics via OperatorMetrics, and partition depth via
OperatorQueueVisibility, with an explicit rule that
acceleration-layer issues MUST NOT escalate to error on
the correctness path and MUST NOT mask a correctness
failure
- names the config/env surface that tunes validation and
redelivery cadence (DW_V2_MULTI_NODE,
DW_V2_VALIDATE_CACHE_BACKEND,
DW_V2_CACHE_VALIDATION_MODE, DW_V2_TASK_DISPATCH_MODE,
and the four DW_V2_TASK_REPAIR_* bounds)
- documents a reversible five-step migration path that
audits cache-as-correctness callers, turns on validation
in warn mode, verifies degraded-mode behavior in staging,
introduces a stronger acceleration primitive through the
LongPollWakeStore contract, and tightens validation to
fail mode so future misconfiguration surfaces at boot
- explicitly defers Phase 6 (#584) scheduler leader
election and rollout-safety enforcement so Phase 5
is not forced to solve both at once
tests/Unit/V2/SchedulerCorrectnessDocumentationTest.php:
- asserts every named heading, term, referenced class,
durable table, config key, validation mode, acceleration
backend, correctness driver, and degraded-mode scenario
is present so operator, CLI, Waterline, cloud, and SDK
coverage can rely on the contract's vocabulary
- pins the cache-not-correctness rule, the
no-refuse-to-make-progress rule, the
no-fabrication-from-acceleration-signals rule, the
after-commit-deferral rule, the 60-second signal TTL
upper bound, the 5-minute activity lease duration, the
snapshot/changed/signal wake primitive names, the
file/array unsupported-for-multi-node classification,
the "split is a topology change, not a correctness
change" guarantee from Phase 4, and the cite of
Phases 1-3 as foundation
- requires Phase 6 (#584) deferral so future phases
extend this contract instead of silently redefining it
The acceleration vs correctness split is a contract
change, not a code change: the CacheLongPollWakeStore
binding, the BackendCapabilities health surface, and the
TaskRepair redelivery loop are preserved verbatim, and
this document adds the rules that govern what each is
allowed to be responsible for.
Verified:
- bash scripts/check-public-boundary.sh (exit 0)
- vendor/bin/phpunit tests/Unit/V2/SchedulerCorrectnessDocumentationTest.php
(23 tests, 124 assertions, OK) against PHP 8.2
- vendor/bin/ecs check docs/architecture/scheduler-correctness.md
tests/Unit/V2/SchedulerCorrectnessDocumentationTest.php
(no errors)1 parent 2f2bc91 commit 7bc33d1
2 files changed
Lines changed: 1002 additions & 0 deletions
File tree
- docs/architecture
- tests/Unit/V2
0 commit comments