
Commit abc11d5

Freeze v2 rollout-safety and coordination-health contract
Issue #584 opens Phase 6 of the v2 multi-node architecture roadmap, which replaces rollout discipline with in-system enforcement for boot-time admission, mixed-build safety, routing drains, stuck-work detection, and coordination health. Today those surfaces exist as independent classes (BackendCapabilities, LongPollCacheValidator, WorkflowModeGuard, OperatorMetrics, HealthCheck, WorkerCompatibilityFleet, TaskRepair, WorkflowDefinitionFingerprint, build-id rollouts), but "what counts as rollout-safe," "which metric keys operators can depend on," "which Waterline surfaces render which signal," and "which env vars are the canonical safety knobs" are not frozen as one product contract. Any subsystem that treats these signals as advisory, renames a metric key, or swaps a fail-closed check for a warn-only check is silently out of contract; any future role split or rollout-tooling change risks regressing a behavior no one has pinned.

This lands the contract doc and a pinning test.

docs/architecture/rollout-safety.md:

- scopes the contract to admission checks, mixed-build safety, schema fencing, routing drains, coordination health metrics, Waterline observability, config surface, and the migration path
- freezes the four boot-time admission authorities (BackendCapabilities::snapshot, StructuralLimits::snapshot, LongPollCacheValidator::checkMultiNodeSafety, WorkflowModeGuard::check, ReadinessContract::definition) and states that an error-severity check MUST NOT be silently downgraded at runtime even if the configured mode is warn
- freezes the mixed-build enforcement contract: WorkerCompatibilityFleet::detailsForNamespace is the authority on live fleet compatibility, and the matching role MUST refuse a claim whose worker WorkflowDefinitionFingerprint does not match the run's recorded fingerprint under DW_V2_PIN_TO_RECORDED_FINGERPRINT
- freezes the schema-fencing surface: WaterlineEngineSource::status and the ReadinessContract v2_operator_surface_available readiness key are the two authorities, the workflow_definition_fingerprints migration is named, the workflow_worker_build_id_rollouts migration is named with drain_intent / drained_at as the drain observability surface, and every v2 migration must be reversible via the standard Laravel down() path
- freezes routing safety: ActivityTaskClaimer::claimDetailed is the authority on claim decisions, the reason codes compatibility_unsupported / backend_unsupported / compatibility_blocked are pinned, and a ready task whose required compatibility has no live worker MUST remain ready (no fabricated claimer, no dropped task, no exhausted retries)
- freezes coordination health: the frozen OperatorMetrics::snapshot keys (runs.claim_failed, runs.compatibility_blocked, tasks.ready, tasks.unhealthy, backlog.repair_needed_runs, backlog.compatibility_blocked_runs, repair.missing_task_candidates, repair.oldest_missing_run_started_at, workers.required_compatibility, workers.active_workers_supporting_required, repair_policy.*), the eight frozen HealthCheck check names (backend_capabilities, run_summary_projection, selected_run_projection, history_retention_invariant, command_contract, task_transport, durable_resume_paths, worker_compatibility), and the per-partition OperatorQueueVisibility shape
- freezes stuck-detector behavior: TaskRepairCandidates is the authority on missing_task_candidates, TaskRepair::leaseExpired on redelivery, TaskRepairPolicy::readyTaskNeedsRedispatch on ready-but-unclaimed redispatch, liveness_state = repair_needed on repair-needed runs, with the explicit rules that a stuck condition is never silent and the repair path never fabricates work
- freezes the Waterline surfaces that render rollout-safety signals (dashboard, workers, flows/index, flows/flow, WorkerHealth, ScheduleView) and the rule that every frozen metric key is rendered somewhere in Waterline so coordination health stays observable to humans
- freezes the config surface (DW_V2_NAMESPACE, compatibility and fingerprint pin knobs, DW_V2_GUARDRAILS_BOOT modes, DW_V2_MULTI_NODE, DW_V2_VALIDATE_CACHE_BACKEND, DW_V2_CACHE_VALIDATION_MODE, DW_V2_TASK_DISPATCH_MODE, four DW_V2_TASK_REPAIR_* bounds) and preserves the legacy WORKFLOW_V2_* env var names through Env::dw() so existing deployments continue to resolve
- describes a reversible six-step migration path from today's rollout-discipline posture (warn everywhere) to in-system fail-closed enforcement without a hard cutover
- preserves the Phase 5 (#583) "cache-not-correctness" guarantee explicitly: rollout safety MUST NOT reintroduce cache as a correctness dependency, and Phase 4's (#582) role split is preserved unchanged (Phase 6 does not move authority between roles)
- defers scheduler leader election across replicas, cross-region coordination, automatic migration rollback, and client-side rollout tooling (blue/green, canary, Helm overlays) so future roadmap phases extend this contract instead of silently redefining it
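
The never-lose-tasks routing rule in the list above can be illustrated with a small, language-neutral sketch. This is not the package's ActivityTaskClaimer::claimDetailed; the `Task`, `ClaimResult`, and `claim` names, the `leased` status, and the choice of `compatibility_blocked` as the reason for this particular case are all assumptions made for illustration. Only the rule itself (the task stays ready, no claimer is fabricated, no retry is consumed) comes from the contract summary.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class Task:
    """Hypothetical stand-in for a queued activity task."""
    id: str
    required_compatibility: str
    status: str = "ready"
    attempts: int = 0


@dataclass
class ClaimResult:
    """Hypothetical claim outcome: who claimed it, or why nobody did."""
    claimed: bool
    worker_id: Optional[str] = None
    reason: Optional[str] = None


def claim(task: Task, live_workers: Dict[str, Set[str]]) -> ClaimResult:
    """Hand the task to a live worker that supports its compatibility marker.

    live_workers maps a worker id to the set of markers that worker supports.
    """
    for worker_id, supported in sorted(live_workers.items()):
        if task.required_compatibility in supported:
            task.status = "leased"  # assumed status name
            task.attempts += 1
            return ClaimResult(claimed=True, worker_id=worker_id)

    # No live worker qualifies. The task is left exactly as it was:
    # still ready, no attempt consumed, no claimer invented.
    # Which pinned reason code the real claimer reports here is an assumption.
    return ClaimResult(claimed=False, reason="compatibility_blocked")
```

With a fleet of `{"w1": {"build-a"}}` and a task requiring `build-b`, the call returns an unclaimed result and the task keeps `status == "ready"` and `attempts == 0`; once a worker supporting `build-b` joins the fleet, the same task is claimed normally.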

tests/Unit/V2/RolloutSafetyDocumentationTest.php:

- asserts every named heading, term, referenced class, HTTP route, config variable, validation mode, frozen metric key, frozen health check name, Waterline surface, rollout-safety migration, claim reason code, and migration-path step is present so operator, CLI, Waterline, cloud, and SDK coverage can rely on the contract's vocabulary
- pins the no-silent-downgrade admission rule, the never-lose-tasks routing rule, the no-fabricated-claimer rule, the stuck-condition-is-never-silent rule, the repair-never-fabricates rule, the Waterline-renders-every-frozen-key rule, the reads-through-API-not-DB rule, the fingerprint-pinning refuse-the-claim rule, the migration-reversibility rule, the protocol-version single-step enforcement rule, the Env::dw legacy name preservation rule, and the Phase 5 cache-not-correctness rule
- requires cites of the Phase 1 (#579) execution-guarantees, Phase 2 (#580) worker-compatibility, Phase 3 (#581) task-matching, Phase 4 (#582) control-plane-split, and Phase 5 (#583) scheduler-correctness contracts, and requires all six roadmap issue numbers (#579-#584) to be named so the phase lineage cannot silently decouple
- requires explicit deferral of scheduler leader election so a future phase extends this contract rather than redefining it

The in-system enforcement shift is a contract change, not a code change: BackendCapabilities, LongPollCacheValidator, WorkflowModeGuard, ReadinessContract, OperatorMetrics, HealthCheck, WorkerCompatibilityFleet, ActivityTaskClaimer, TaskRepair, and the workflow_definition_fingerprints and workflow_worker_build_id_rollouts tables are preserved verbatim, and this document adds the rules that govern which of their signals are contract, which are advisory, and how they combine into the rollout-safety envelope.
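
The documentation-pinning idea behind the test can be sketched as a plain "every frozen term must appear in the doc" check. This is a minimal illustration, not the contents of RolloutSafetyDocumentationTest.php: the `missing_terms` helper and the `FROZEN_METRIC_KEYS` constant are hypothetical names, and the real test is a PHPUnit class. The ten keys are copied from the commit message above; the `repair_policy.*` wildcard is left out because it does not name a single literal key.

```python
from typing import Iterable, List

# Frozen OperatorMetrics::snapshot keys as listed in the commit message.
FROZEN_METRIC_KEYS = (
    "runs.claim_failed",
    "runs.compatibility_blocked",
    "tasks.ready",
    "tasks.unhealthy",
    "backlog.repair_needed_runs",
    "backlog.compatibility_blocked_runs",
    "repair.missing_task_candidates",
    "repair.oldest_missing_run_started_at",
    "workers.required_compatibility",
    "workers.active_workers_supporting_required",
)


def missing_terms(doc_text: str, terms: Iterable[str]) -> List[str]:
    """Return the terms that do not occur verbatim in the document text."""
    return [term for term in terms if term not in doc_text]


if __name__ == "__main__":
    # In a real check the text would be read from the contract document;
    # an inline sample keeps this sketch self-contained.
    sample = "The dashboard renders tasks.ready and runs.claim_failed."
    gaps = missing_terms(sample, FROZEN_METRIC_KEYS)
    print(f"{len(gaps)} frozen keys missing from the sample text")
```

A check of this shape fails as soon as a key is renamed in the document, which is how a vocabulary-level contract can be pinned without touching runtime code.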

Verified:

- bash scripts/check-public-boundary.sh (exit 0)
- vendor/bin/phpunit tests/Unit/V2/RolloutSafetyDocumentationTest.php (25 tests, 184 assertions, OK) against PHP 8.1, 8.2, and 8.4
- vendor/bin/ecs check docs/architecture/rollout-safety.md tests/Unit/V2/RolloutSafetyDocumentationTest.php (no errors)

1 parent 7bc33d1 commit abc11d5

2 files changed

Lines changed: 1245 additions & 0 deletions
