Skip to content

Commit 7bc33d1

Browse files
Freeze v2 scheduler correctness and cache independence contract
Freeze v2 scheduler correctness and cache independence contract Issue #583 opens Phase 5 of the v2 multi-node architecture roadmap, which removes shared cache from the scheduler correctness path so notifier or wake propagation can improve latency without being required for correct multi-node scheduling. Today the cache-backed wake store is conventionally acceleration-only, but "acceleration vs correctness" is not a documented contract; any subsystem that treats a cache read as a correctness signal is silently out of contract and any future backend swap risks regressing a behavior no one has frozen. This lands the contract doc and a pinning test. docs/architecture/scheduler-correctness.md: - scopes the contract to durable dispatch state, the acceleration layer, degraded-mode behavior, bounded discovery latency, scheduler fire evaluation, lease expiry and redelivery, backend classification, misconfiguration detection, config surface, migration path, operator-visible state, and test strategy alignment - freezes the correctness substrate as six durable tables (workflow_tasks, workflow_runs, workflow_instances, activity_executions, activity_attempts, workflow_schedules) whose contents are necessary and sufficient for scheduler correctness; forbids gating reads of these rows on acceleration-layer state and forbids skipping or deferring writes based on acceleration-signal presence - classifies Laravel cache stores (redis, database, memcached) as permitted acceleration-only backends and file/array as unsupported for multi-node acceleration, pinning the LongPollCacheValidator classification to the contract - classifies the database drivers (mysql, pgsql, sqlite, sqlsrv) as the only permitted correctness substrate backends and pins BackendCapabilities validation as the authority - enumerates five degraded-mode scenarios (wake backend unreachable, partitioned, lost-signal, compatibility heartbeat cache down, cache permanently unavailable) with explicit guaranteed outcomes; forbids nodes from refusing to make progress or fabricating state when acceleration is degraded - pins bounded discovery latency to durable knobs (DEFAULT_LONG_POLL_TIMEOUT, MAX_LONG_POLL_TIMEOUT, task_repair.redispatch_after_seconds, task_repair.loop_throttle_seconds, and ActivityLease::DURATION_MINUTES) so the guaranteed discovery-latency ceiling holds whether or not wake propagation is healthy - freezes schedule fire evaluation as a pure durable-state read of workflow_schedules.next_fire_at and buffered_actions; the scheduler does not read cache to decide which schedules are due, and the per-schedule row lock is the only correctness seam for multi-replica scheduling until Phase 6 (#584) hardens leader election - freezes lease expiry and redelivery as DB-only via TaskRepairPolicy::leaseExpired() and TaskRepair, so a cache outage never prevents redelivery or masks stuck leases - pins the detection surfaces: boot-time validation via LongPollCacheValidator, backend health via BackendCapabilities and HealthCheck, dispatch and repair metrics via OperatorMetrics, and partition depth via OperatorQueueVisibility, with an explicit rule that acceleration-layer issues MUST NOT escalate to error on the correctness path and MUST NOT mask a correctness failure - names the config/env surface that tunes validation and redelivery cadence (DW_V2_MULTI_NODE, DW_V2_VALIDATE_CACHE_BACKEND, DW_V2_CACHE_VALIDATION_MODE, DW_V2_TASK_DISPATCH_MODE, and the four DW_V2_TASK_REPAIR_* bounds) - documents a reversible five-step migration path that audits cache-as-correctness callers, turns on validation in warn mode, verifies degraded-mode behavior in staging, introduces a stronger acceleration primitive through the LongPollWakeStore contract, and tightens validation to fail mode so future misconfiguration surfaces at boot - explicitly defers Phase 6 (#584) scheduler leader election and rollout-safety enforcement so Phase 5 is not forced to solve both at once tests/Unit/V2/SchedulerCorrectnessDocumentationTest.php: - asserts every named heading, term, referenced class, durable table, config key, validation mode, acceleration backend, correctness driver, and degraded-mode scenario is present so operator, CLI, Waterline, cloud, and SDK coverage can rely on the contract's vocabulary - pins the cache-not-correctness rule, the no-refuse-to-make-progress rule, the no-fabrication-from-acceleration-signals rule, the after-commit-deferral rule, the 60-second signal TTL upper bound, the 5-minute activity lease duration, the snapshot/changed/signal wake primitive names, the file/array unsupported-for-multi-node classification, the "split is a topology change, not a correctness change" guarantee from Phase 4, and the cite of Phases 1-3 as foundation - requires Phase 6 (#584) deferral so future phases extend this contract instead of silently redefining it The acceleration vs correctness split is a contract change, not a code change: the CacheLongPollWakeStore binding, the BackendCapabilities health surface, and the TaskRepair redelivery loop are preserved verbatim, and this document adds the rules that govern what each is allowed to be responsible for. Verified: - bash scripts/check-public-boundary.sh (exit 0) - vendor/bin/phpunit tests/Unit/V2/SchedulerCorrectnessDocumentationTest.php (23 tests, 124 assertions, OK) against PHP 8.2 - vendor/bin/ecs check docs/architecture/scheduler-correctness.md tests/Unit/V2/SchedulerCorrectnessDocumentationTest.php (no errors)
1 parent 2f2bc91 commit 7bc33d1

2 files changed

Lines changed: 1002 additions & 0 deletions

File tree

0 commit comments

Comments
 (0)