Commit abc11d5
Freeze v2 rollout-safety and coordination-health contract
Issue #584 opens Phase 6 of the v2 multi-node architecture roadmap,
which replaces rollout discipline with in-system enforcement for
boot-time admission, mixed-build safety, routing drains, stuck-work
detection, and coordination health. Today those surfaces exist as
independent classes (BackendCapabilities, LongPollCacheValidator,
WorkflowModeGuard, OperatorMetrics, HealthCheck,
WorkerCompatibilityFleet, TaskRepair, WorkflowDefinitionFingerprint,
build-id rollouts), but "what counts as rollout-safe," "which
metric keys operators can depend on," "which Waterline surfaces
render which signal," and "which env vars are the canonical
safety knobs" are not frozen as one product contract. Any
subsystem that treats these signals as advisory, renames a metric
key, or swaps a fail-closed check for a warn-only check is
silently out of contract; any future role split or rollout-tooling
change risks regressing a behavior no one has pinned.
This lands the contract doc and a pinning test.
docs/architecture/rollout-safety.md:
- scopes the contract to admission checks, mixed-build safety,
schema fencing, routing drains, coordination health metrics,
Waterline observability, config surface, and the migration path
- freezes the five boot-time admission authorities
(BackendCapabilities::snapshot, StructuralLimits::snapshot,
LongPollCacheValidator::checkMultiNodeSafety,
WorkflowModeGuard::check, ReadinessContract::definition) and
states that an error-severity check MUST NOT be silently
downgraded at runtime even if the configured mode is warn
- freezes the mixed-build enforcement contract:
WorkerCompatibilityFleet::detailsForNamespace is the authority
on live fleet compatibility, and the matching role MUST refuse
a claim whose worker WorkflowDefinitionFingerprint does not
match the run's recorded fingerprint under
DW_V2_PIN_TO_RECORDED_FINGERPRINT
- freezes the schema-fencing surface: WaterlineEngineSource::status
and the ReadinessContract v2_operator_surface_available readiness
key are the two authorities, the workflow_definition_fingerprints
migration is named, the workflow_worker_build_id_rollouts
migration is named with drain_intent / drained_at as the drain
observability surface, and every v2 migration must be
reversible via the standard Laravel down() path
- freezes routing safety: ActivityTaskClaimer::claimDetailed is
the authority on claim decisions, the reason codes
compatibility_unsupported / backend_unsupported /
compatibility_blocked are pinned, and a ready task whose
required compatibility has no live worker MUST remain ready
(no fabricated claimer, no dropped task, no exhausted retries)
- freezes coordination health: the frozen OperatorMetrics::snapshot
keys (runs.claim_failed, runs.compatibility_blocked, tasks.ready,
tasks.unhealthy, backlog.repair_needed_runs,
backlog.compatibility_blocked_runs, repair.missing_task_candidates,
repair.oldest_missing_run_started_at, workers.required_compatibility,
workers.active_workers_supporting_required, repair_policy.*),
the eight frozen HealthCheck check names
(backend_capabilities, run_summary_projection,
selected_run_projection, history_retention_invariant,
command_contract, task_transport, durable_resume_paths,
worker_compatibility), and the per-partition
OperatorQueueVisibility shape
- freezes stuck-detector behavior: TaskRepairCandidates is the
authority on missing_task_candidates, TaskRepair::leaseExpired
on redelivery, TaskRepairPolicy::readyTaskNeedsRedispatch on
ready-but-unclaimed redispatch, liveness_state = repair_needed
on repair-needed runs, with the explicit rules that a stuck
condition is never silent and the repair path never fabricates
work
- freezes the Waterline surfaces that render rollout-safety signals
(dashboard, workers, flows/index, flows/flow, WorkerHealth,
ScheduleView) and the rule that every frozen metric key is
rendered somewhere in Waterline so coordination health stays
observable to humans
- freezes the config surface (DW_V2_NAMESPACE, compatibility and
fingerprint pin knobs, DW_V2_GUARDRAILS_BOOT modes,
DW_V2_MULTI_NODE, DW_V2_VALIDATE_CACHE_BACKEND,
DW_V2_CACHE_VALIDATION_MODE, DW_V2_TASK_DISPATCH_MODE, four
DW_V2_TASK_REPAIR_* bounds) and preserves the legacy
WORKFLOW_V2_* env var names through Env::dw() so existing
deployments continue to resolve
- describes a reversible six-step migration path from today's
rollout-discipline posture (warn everywhere) to in-system
fail-closed enforcement without a hard cutover
- preserves the Phase 5 (#583) "cache-not-correctness" guarantee
explicitly: rollout safety MUST NOT reintroduce cache as a
correctness dependency, and Phase 4's (#582) role split is
preserved unchanged (Phase 6 does not move authority between
roles)
- defers scheduler leader election across replicas, cross-region
coordination, automatic migration rollback, and client-side
rollout tooling (blue/green, canary, Helm overlays) so future
roadmap phases extend this contract instead of silently
redefining it
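
The routing and fingerprint-pinning rules above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the project's PHP implementation: the Task, Worker, ClaimDecision, and claim names are hypothetical, the fingerprint_mismatch reason is invented for the example, and mapping a compatibility mismatch to compatibility_blocked is an assumption rather than something the contract text above confirms.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    required_compatibility: str
    recorded_fingerprint: Optional[str]
    status: str = "ready"
    attempts: int = 0


@dataclass(frozen=True)
class Worker:
    compatibility: str
    fingerprint: str


@dataclass(frozen=True)
class ClaimDecision:
    claimed: bool
    reason: Optional[str] = None


def claim(task: Task, worker: Worker, pin_to_recorded_fingerprint: bool = True) -> ClaimDecision:
    """Decide whether this worker may claim the task (hypothetical helper)."""
    # An incompatible worker is refused. The task is left untouched:
    # still ready, no attempt consumed, nothing dropped.
    if worker.compatibility != task.required_compatibility:
        return ClaimDecision(False, "compatibility_blocked")  # assumed mapping

    # With pinning enabled, a worker whose fingerprint differs from the
    # run's recorded fingerprint is refused as well.
    if (
        pin_to_recorded_fingerprint
        and task.recorded_fingerprint is not None
        and worker.fingerprint != task.recorded_fingerprint
    ):
        return ClaimDecision(False, "fingerprint_mismatch")  # invented reason

    # Only a successful claim mutates the task.
    task.status = "claimed"
    task.attempts += 1
    return ClaimDecision(True)
```

The point of the sketch is that a refusal is a pure read: the task stays ready with its retry budget intact until a compatible worker appears.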
tests/Unit/V2/RolloutSafetyDocumentationTest.php:
- asserts every named heading, term, referenced class, HTTP route,
config variable, validation mode, frozen metric key, frozen
health check name, Waterline surface, rollout-safety migration,
claim reason code, and migration-path step is present so
operator, CLI, Waterline, cloud, and SDK coverage can rely on
the contract's vocabulary
- pins the no-silent-downgrade admission rule, the never-lose-tasks
routing rule, the no-fabricated-claimer rule, the
stuck-condition-is-never-silent rule, the repair-never-fabricates
rule, the Waterline-renders-every-frozen-key rule, the
reads-through-API-not-DB rule, the fingerprint-pinning refuse-
the-claim rule, the migration-reversibility rule, the protocol-
version single-step enforcement rule, the Env::dw legacy name
preservation rule, and the Phase 5 cache-not-correctness rule
- requires citations of the Phase 1 (#579) execution-guarantees,
Phase 2 (#580) worker-compatibility, Phase 3 (#581)
task-matching, Phase 4 (#582) control-plane-split, and Phase 5
(#583) scheduler-correctness contracts, and requires all six
roadmap issue numbers (#579-#584) to be named so the phase
lineage cannot silently decouple
- requires explicit deferral of scheduler leader election so a
future phase extends this contract rather than redefining it
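
The core of a documentation pinning test like the one described above is a presence check over the contract's vocabulary. The following is a minimal sketch in Python, not the actual PHPUnit test: missing_terms and FROZEN_TERMS are hypothetical names, and the term list is a small sample taken from this message rather than the full set the test asserts.

```python
from typing import Iterable, List

# A small sample of frozen vocabulary named in this commit message.
FROZEN_TERMS = (
    "runs.claim_failed",
    "tasks.ready",
    "backend_capabilities",
    "worker_compatibility",
    "compatibility_blocked",
    "DW_V2_PIN_TO_RECORDED_FINGERPRINT",
)


def missing_terms(document: str, required: Iterable[str]) -> List[str]:
    """Return the required terms that do not appear verbatim in the document."""
    return [term for term in required if term not in document]
```

A test would read the text of docs/architecture/rollout-safety.md, pass it to missing_terms, and fail if the result is non-empty, so that renaming a metric key or dropping a check name from the doc breaks the build.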
The in-system enforcement shift is a contract change, not a code
change: BackendCapabilities, LongPollCacheValidator,
WorkflowModeGuard, ReadinessContract, OperatorMetrics, HealthCheck,
WorkerCompatibilityFleet, ActivityTaskClaimer, TaskRepair, and the
workflow_definition_fingerprints and workflow_worker_build_id_rollouts
tables are preserved verbatim, and this document adds the rules
that govern which of their signals are contract, which are
advisory, and how they combine into the rollout-safety envelope.
Verified:
- bash scripts/check-public-boundary.sh (exit 0)
- vendor/bin/phpunit tests/Unit/V2/RolloutSafetyDocumentationTest.php
(25 tests, 184 assertions, OK) against PHP 8.1, 8.2, and 8.4
- vendor/bin/ecs check docs/architecture/rollout-safety.md
tests/Unit/V2/RolloutSafetyDocumentationTest.php
(no errors)
1 parent 7bc33d1 commit abc11d5
2 files changed
Lines changed: 1245 additions & 0 deletions