
Commit 8342eb8

Freeze v2 operational liveness and transport repair contract
Issue #68 closes the last open piece of the v2 architecture foundation: freezing the liveness contract for operator tooling, documenting transport repair behavior, and pinning health-check integration for stuck-task detection as one product contract. Today those surfaces exist as independent classes (TaskRepair, TaskRepairPolicy, TaskRepairCandidates, ActivityTaskClaimer, ActivityLease, TaskDispatcher, RunWorkflowTask / RunActivityTask / RunTimerTask, RunSummaryProjector, OperatorMetrics, HealthCheck, OperatorQueueVisibility, RepairBlockedReason), but "what counts as liveness-safe", "which transport payload is legal", "when redelivery applies vs when repair applies", "which liveness_state values are reachable", and "which env vars tune repair cadence" are not frozen as one product contract. Any subsystem that widens a transport payload, quietly renames a liveness_state, skips the run-level lock for a bulk ingress, or lets a sweeper interpret workflow code is silently out of contract. This lands the contract doc and a pinning test. 

docs/architecture/operational-liveness.md:

- scopes the contract to bootstrap, transport-job shape, lease management, redelivery vs repair, repair cadence, heartbeat renewal, durable-next-resume source, worker-loss recovery, sweeper scope, compatibility preservation, ingress serialization, stuck-state observability, and the config surface
- freezes the bootstrap rule: the start command materialises workflow_instances, workflow_runs, workflow_tasks, and the RunSummaryProjector row atomically and dispatches the first transport job via DB::afterCommit(), so no second daemon is required for the start path to make progress
- freezes the transport-job shape: RunWorkflowTask, RunActivityTask, and RunTimerTask each accept only a task_id and load the durable row under the fresh lease; carrying the task row, run summary, workflow snapshot, or authoring closure is a protocol change
- freezes the claim decision surface: ActivityTaskClaimer::claimDetailed is the authority on claim transitions and pins the reason codes task_not_found, task_not_activity, task_not_ready, task_not_due, activity_execution_missing, activity_execution_not_found, workflow_run_missing, backend_unsupported, compatibility_unsupported
- freezes the lease authority: ActivityLease::DURATION_MINUTES is pinned at 5 minutes and ActivityLease::expiresAt is the sole authority on activity attempt lease duration
- freezes redelivery vs repair as two distinct flows: redelivery reuses the task row through TaskRepair::recoverExistingTask without writing a new row; repair re-creates a missing task through TaskRepair::repairRun under the run-level lock; neither fabricates history
- freezes the RepairBlockedReason catalog (unsupported_history, waiting_for_compatible_worker, selected_run_not_current, run_closed, repair_not_needed) so a blocked repair is always diagnosable
- freezes repair cadence: the four TaskRepairPolicy knobs (redispatch_after_seconds default 3, loop_throttle_seconds default 5, scan_limit default 25, failure_backoff_max_seconds default 60) and the two strategy strings (SCAN_STRATEGY = scope_fair_round_robin, FAILURE_BACKOFF_STRATEGY = exponential_by_repair_count) with their DW_V2_TASK_REPAIR_* env vars preserved through Env::dw()
- freezes the heartbeat rule: a heartbeat renews lease_expires_at on the owning attempt and never creates a new activity_attempts row; a heartbeat from a non-owning worker is rejected
- freezes the durable next-resume source enumeration: every non-terminal run projects exactly one liveness_state from the 17 frozen values (closed, repair_needed, workflow_replay_blocked, activity_running_without_task, waiting_for_condition, waiting_for_signal, waiting_for_child, activity_task_waiting_for_compatible_worker, activity_task_claim_failed, activity_task_leased, activity_task_ready, workflow_task_waiting_for_compatible_worker, workflow_task_claim_failed, workflow_task_leased, workflow_task_ready, timer_task_leased, timer_scheduled) and at most one next_task_id, with resume_source_kind and resume_source_id consistent with liveness_state
- freezes worker-loss recovery as lease-driven: lease expiry and reassignment never mutate run status; a replacement worker claims the same activity_execution_id under a new activity_attempt_id subject to Phase 1's at-least-once contract
- bounds sweeper scope to TaskRepair entry points: a sweeper never instantiates a WorkflowExecutor, never interprets workflow code, and respects TaskRepairPolicy::loopThrottleSeconds; a deployment with no sweeper at all is still correct
- requires repair-driven redispatch to preserve the Phase 2 compatibility markers (WorkflowDefinitionFingerprint, required_compatibility, TaskCompatibility-managed backend markers) and forbids widening a redispatched task to a different compatibility scope
- freezes ingress serialization: every run-mutating call takes lockForUpdate on workflow_runs; bulk cross-run operations take the locks in stable run-id order
- restates stuck-state observability under OperatorMetrics::snapshot (runs.repair_needed, runs.claim_failed, runs.compatibility_blocked, tasks.ready, tasks.ready_due, tasks.delayed, tasks.leased, tasks.dispatch_failed, tasks.claim_failed, tasks.dispatch_overdue, tasks.lease_expired, tasks.unhealthy, backlog.*, repair.*, workers.*, repair_policy.*) and the eight frozen HealthCheck names (backend_capabilities, run_summary_projection, selected_run_projections, history_retention_invariant, command_contract_snapshots, task_transport, durable_resume_paths, worker_compatibility), and requires Waterline to render every frozen signal somewhere in its operator UI
- names the two admin HTTP routes load-bearing for liveness (/api/system/repair/pass and /api/system/activity-timeouts/pass) and the bounded semantics (scan_limit, loop_throttle_seconds, idempotent response shape)
- names the four v2 migrations that own liveness columns and requires each to be reversible by the standard Laravel down() path (workflow_tasks, workflow_run_summaries, activity_attempts, worker_compatibility_heartbeats)
- describes a six-step migration path from lease-only recovery to the full liveness-and-repair envelope without a hard cutover
- defers scheduler leader election, cross-region coordination, automatic migration rollback, client-side rollout tooling, and queue-backend-specific repair optimisations to follow-on roadmap issues so future work extends this contract rather than silently redefining it

tests/Unit/V2/OperationalLivenessDocumentationTest.php:

- asserts every heading, term, referenced class, admin HTTP route, config variable, repair policy knob, claim reason code, repair blocked reason, liveness_state value, frozen metric key, frozen health check name, migration, and migration-path step is present so operator, CLI, Waterline, cloud, and SDK coverage can rely on the contract's vocabulary
- pins the transactional-bootstrap rule, the transport-payload-is-only-task-id rule, the redelivery-reuses-row rule, the repair-is-concurrency-safe rule, the repair-never-fabricates-history rule, the heartbeat-never-creates-a-new-attempt rule, the rejected-non-owning-heartbeat rule, the every-non-terminal-run-projects-exactly-one-liveness_state rule, the resume_source_kind/id consistency rule, the worker-loss-through-lease-only rule, the sweeper-never-runs-authoring-layer rule, the deployment-with-no-sweeper-is-correct rule, the repair-preserves-compatibility rule, the no-widening-on-redispatch rule, the ingress-serialises-through-run-level-lock rule, the stable-lock-order rule, the stuck-is-never-silent rule, the ActivityLease-5-minutes rule, the scope_fair_round_robin fairness rule, the TaskRepairPolicy::snapshot exposes-full-cadence rule, the migration-reversibility rule, the legacy WORKFLOW_V2_TASK_REPAIR_* preservation rule, and the Phase 5 cache-not-correctness rule
- requires citations of the Phase 1 (#579) execution-guarantees, Phase 2 (#580) worker-compatibility, Phase 3 (#581) task-matching, Phase 4 (#582) control-plane-split, Phase 5 (#583) scheduler-correctness, and Phase 6 (#584) rollout-safety contracts, and requires all six roadmap issue numbers (#579-584) to be named so the phase lineage cannot silently decouple
- requires explicit deferral of scheduler leader election so a future phase extends this contract rather than redefining it

The liveness enforcement shift is a contract change, not a code change: TaskRepair, TaskRepairPolicy, TaskRepairCandidates, ActivityTaskClaimer, ActivityLease, TaskDispatcher, RunWorkflowTask, RunActivityTask, RunTimerTask, RunSummaryProjector, OperatorMetrics, HealthCheck, OperatorQueueVisibility, RepairBlockedReason, and the workflow_tasks, workflow_run_summaries, activity_attempts, and worker_compatibility_heartbeats tables are preserved verbatim, and this document adds the rules that govern which of their signals are contract, which are advisory, and how they combine into the operational liveness envelope.

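For readers unfamiliar with two of the frozen rules, here is a minimal sketch of what a capped exponential backoff keyed on repair count and a stable run-id lock order might look like. It is illustrative only: `repairBackoffSeconds` and `stableLockOrder` are hypothetical helpers, not functions in this package. The base delay of 3 seconds is an assumption borrowed from the `redispatch_after_seconds` default, and the commit does not state how `exponential_by_repair_count` is actually computed or how run ids are compared.

```php
<?php
declare(strict_types=1);

// Hypothetical helper, not part of the package: capped exponential backoff
// keyed on how many times a task has already been repaired.
// ASSUMPTION: the base delay (3) mirrors the redispatch_after_seconds default;
// the cap (60) mirrors the failure_backoff_max_seconds default.
function repairBackoffSeconds(int $repairCount, int $baseSeconds = 3, int $maxSeconds = 60): int
{
    // Clamp the exponent so very large counts cannot overflow the integer.
    $exponent = max(0, min($repairCount, 30));

    return min($maxSeconds, $baseSeconds * (2 ** $exponent));
}

// Hypothetical helper: order run ids deterministically before locking, so two
// bulk operations touching the same runs acquire their locks in the same
// sequence and cannot deadlock on each other.
// ASSUMPTION: run ids are strings compared bytewise.
function stableLockOrder(array $runIds): array
{
    $ordered = array_values(array_unique($runIds));
    sort($ordered, SORT_STRING);

    return $ordered;
}
```

Under these assumptions a task repaired four times would wait 48 seconds and any later repair would wait the 60-second cap, while a bulk operation would request its run-level locks in the order returned by `stableLockOrder`, one `lockForUpdate` per run.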
Verified:

- bash scripts/check-public-boundary.sh (exit 0)
- vendor/bin/phpunit tests/Unit/V2/OperationalLivenessDocumentationTest.php (32 tests, 221 assertions, OK) against PHP 8.4
- vendor/bin/phpunit tests/Unit/V2/OperationalLivenessDocumentationTest.php tests/Unit/V2/RolloutSafetyDocumentationTest.php (57 tests, 405 assertions, OK)
- vendor/bin/ecs check docs/architecture/operational-liveness.md tests/Unit/V2/OperationalLivenessDocumentationTest.php (no errors)

Refs: #68
1 parent abc11d5 commit 8342eb8

2 files changed

Lines changed: 1622 additions & 0 deletions
