Commit 8342eb8
Freeze v2 operational liveness and transport repair contract

Issue #681 closes the last open piece of the v2 architecture
foundation: freezing the liveness contract for operator tooling,
documenting transport repair behavior, and pinning health-check
integration for stuck-task detection as one product contract. Today
those surfaces exist as independent classes (TaskRepair,
TaskRepairPolicy, TaskRepairCandidates, ActivityTaskClaimer,
ActivityLease, TaskDispatcher, RunWorkflowTask / RunActivityTask /
RunTimerTask, RunSummaryProjector, OperatorMetrics, HealthCheck,
OperatorQueueVisibility, RepairBlockedReason), but "what counts as
liveness-safe", "which transport payload is legal", "when
redelivery applies vs when repair applies", "which liveness_state
values are reachable", and "which env vars tune repair cadence" are
not frozen as one product contract. Any subsystem that widens a
transport payload, quietly renames a liveness_state, skips the
run-level lock for a bulk ingress, or lets a sweeper interpret
workflow code is silently out of contract.

This lands the contract doc and a pinning test.

docs/architecture/operational-liveness.md:
- scopes the contract to bootstrap, transport-job shape, lease
management, redelivery vs repair, repair cadence, heartbeat
renewal, durable-next-resume source, worker-loss recovery,
sweeper scope, compatibility preservation, ingress serialization,
stuck-state observability, and the config surface
- freezes the bootstrap rule: the start command materialises
workflow_instances, workflow_runs, workflow_tasks, and the
RunSummaryProjector row atomically and dispatches the first
transport job via DB::afterCommit(), so no second daemon is
required for the start path to make progress
- freezes the transport-job shape: RunWorkflowTask,
RunActivityTask, and RunTimerTask each accept only a task_id
and load the durable row under a fresh lease; carrying the
task row, run summary, workflow snapshot, or authoring closure
is a protocol change
- freezes the claim decision surface: ActivityTaskClaimer::
claimDetailed is the authority on claim transitions and pins
the reason codes task_not_found, task_not_activity,
task_not_ready, task_not_due, activity_execution_missing,
activity_execution_not_found, workflow_run_missing,
backend_unsupported, compatibility_unsupported
- freezes the lease authority: ActivityLease::DURATION_MINUTES
is pinned at 5 minutes and ActivityLease::expiresAt is the
sole authority on activity attempt lease duration
- freezes redelivery vs repair as two distinct flows: redelivery
reuses the task row through TaskRepair::recoverExistingTask
without writing a new row; repair re-creates a missing task
through TaskRepair::repairRun under the run-level lock; neither
fabricates history
- freezes the RepairBlockedReason catalog (unsupported_history,
waiting_for_compatible_worker, selected_run_not_current,
run_closed, repair_not_needed) so a blocked repair is always
diagnosable
- freezes repair cadence: the four TaskRepairPolicy knobs
(redispatch_after_seconds default 3, loop_throttle_seconds
default 5, scan_limit default 25, failure_backoff_max_seconds
default 60) and the two strategy strings
(SCAN_STRATEGY = scope_fair_round_robin,
FAILURE_BACKOFF_STRATEGY = exponential_by_repair_count) with
their DW_V2_TASK_REPAIR_* env vars preserved through Env::dw()
- freezes the heartbeat rule: a heartbeat renews
lease_expires_at on the owning attempt and never creates a new
activity_attempts row; a heartbeat from a non-owning worker is
rejected
- freezes the durable next-resume source enumeration: every
non-terminal run projects exactly one liveness_state from the
17 frozen values (closed, repair_needed, workflow_replay_blocked,
activity_running_without_task, waiting_for_condition,
waiting_for_signal, waiting_for_child,
activity_task_waiting_for_compatible_worker,
activity_task_claim_failed, activity_task_leased,
activity_task_ready, workflow_task_waiting_for_compatible_worker,
workflow_task_claim_failed, workflow_task_leased,
workflow_task_ready, timer_task_leased, timer_scheduled) and
at most one next_task_id, with resume_source_kind and
resume_source_id consistent with liveness_state
- freezes worker-loss recovery as lease-driven: lease expiry and
reassignment never mutate run status; a replacement worker
claims the same activity_execution_id under a new
activity_attempt_id subject to Phase 1's at-least-once contract
- bounds sweeper scope to TaskRepair entry points: a sweeper
never instantiates a WorkflowExecutor, never interprets
workflow code, and respects
TaskRepairPolicy::loopThrottleSeconds; a deployment with no
sweeper at all is still correct
- requires repair-driven redispatch to preserve the Phase 2
compatibility markers (WorkflowDefinitionFingerprint,
required_compatibility, TaskCompatibility-managed backend
markers) and forbids widening a redispatched task to a
different compatibility scope
- freezes ingress serialization: every run-mutating call takes
lockForUpdate on workflow_runs; bulk cross-run operations take
the locks in stable run-id order
- restates stuck-state observability under OperatorMetrics::
snapshot (runs.repair_needed, runs.claim_failed,
runs.compatibility_blocked, tasks.ready, tasks.ready_due,
tasks.delayed, tasks.leased, tasks.dispatch_failed,
tasks.claim_failed, tasks.dispatch_overdue, tasks.lease_expired,
tasks.unhealthy, backlog.*, repair.*, workers.*,
repair_policy.*) and the eight frozen HealthCheck names
(backend_capabilities, run_summary_projection,
selected_run_projections, history_retention_invariant,
command_contract_snapshots, task_transport, durable_resume_paths,
worker_compatibility), and requires Waterline to render every
frozen signal somewhere in its operator UI
- names the two admin HTTP routes load-bearing for liveness
(/api/system/repair/pass and /api/system/activity-timeouts/pass)
and the bounded semantics (scan_limit, loop_throttle_seconds,
idempotent response shape)
- names the four v2 migrations that own liveness columns and
requires each to be reversible by the standard Laravel down()
path (workflow_tasks, workflow_run_summaries,
activity_attempts, worker_compatibility_heartbeats)
- describes a six-step migration path from lease-only recovery to
the full liveness-and-repair envelope without a hard cutover
- defers scheduler leader election, cross-region coordination,
automatic migration rollback, client-side rollout tooling, and
queue-backend-specific repair optimisations to follow-on
roadmap issues so future work extends this contract rather than
silently redefining it
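
The heartbeat and lease rules above can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the project's PHP code: the Attempt class, the heartbeat function, and the worker ids are hypothetical names, and only the 5-minute duration, the renew-in-place behaviour, and the rejection of non-owning heartbeats come from the contract. How an already-expired lease is handled is not modelled here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Mirrors the pinned ActivityLease::DURATION_MINUTES value from the contract.
LEASE_DURATION_MINUTES = 5


@dataclass
class Attempt:
    """Hypothetical stand-in for one activity_attempts row."""
    attempt_id: str
    worker_id: str
    lease_expires_at: datetime


def expires_at(now: datetime) -> datetime:
    """Single place that computes the lease expiry (assumed shape)."""
    return now + timedelta(minutes=LEASE_DURATION_MINUTES)


def heartbeat(attempt: Attempt, worker_id: str, now: datetime) -> bool:
    """Renew the lease on the owning attempt; reject non-owning workers.

    The existing Attempt is mutated in place, so no new attempt is created.
    """
    if worker_id != attempt.worker_id:
        return False  # non-owning heartbeat: rejected, lease left untouched
    attempt.lease_expires_at = expires_at(now)
    return True


start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
attempt = Attempt("attempt-1", "worker-a", expires_at(start))

later = start + timedelta(minutes=3)
owner_ok = heartbeat(attempt, "worker-a", later)
stranger_ok = heartbeat(attempt, "worker-b", later + timedelta(minutes=1))
```

In this sketch the owning worker's heartbeat moves lease_expires_at to five minutes after the heartbeat time, while the second call returns False and changes nothing.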

tests/Unit/V2/OperationalLivenessDocumentationTest.php:
- asserts every heading, term, referenced class, admin HTTP route,
config variable, repair policy knob, claim reason code,
repair blocked reason, liveness_state value, frozen metric key,
frozen health check name, migration, and migration-path step is
present so operator, CLI, Waterline, cloud, and SDK coverage
can rely on the contract's vocabulary
- pins the transactional-bootstrap rule, the transport-payload-
is-only-task-id rule, the redelivery-reuses-row rule, the
repair-is-concurrency-safe rule, the repair-never-fabricates-
history rule, the heartbeat-never-creates-a-new-attempt rule,
the rejected-non-owning-heartbeat rule, the every-non-terminal-
run-projects-exactly-one-liveness_state rule, the
resume_source_kind/id consistency rule, the worker-loss-through-
lease-only rule, the sweeper-never-runs-authoring-layer rule,
the deployment-with-no-sweeper-is-correct rule, the
repair-preserves-compatibility rule, the no-widening-on-
redispatch rule, the ingress-serialises-through-run-level-lock
rule, the stable-lock-order rule, the stuck-is-never-silent rule,
the ActivityLease-5-minutes rule, the scope_fair_round_robin
fairness rule, the TaskRepairPolicy::snapshot exposes-full-
cadence rule, the migration-reversibility rule, the legacy
WORKFLOW_V2_TASK_REPAIR_* preservation rule, and the Phase 5
cache-not-correctness rule
- requires citations of the Phase 1 (#579) execution-guarantees,
Phase 2 (#580) worker-compatibility, Phase 3 (#581)
task-matching, Phase 4 (#582) control-plane-split, Phase 5
(#583) scheduler-correctness, and Phase 6 (#584) rollout-safety
contracts, and requires all six roadmap issue numbers (#579-584)
to be named so the phase lineage cannot silently decouple
- requires explicit deferral of scheduler leader election so a
future phase extends this contract rather than redefining it
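
The vocabulary-pinning idea can be sketched as follows. This is a Python illustration under stated assumptions, not the PHPUnit test itself: the missing_terms helper and the sample strings are hypothetical, and only the 17 liveness_state values are taken from this commit. Terms are matched as whole identifiers so that, for example, run_closed does not satisfy the check for closed.

```python
import re

# The 17 frozen liveness_state values named in this commit message.
LIVENESS_STATES = [
    "closed", "repair_needed", "workflow_replay_blocked",
    "activity_running_without_task", "waiting_for_condition",
    "waiting_for_signal", "waiting_for_child",
    "activity_task_waiting_for_compatible_worker",
    "activity_task_claim_failed", "activity_task_leased",
    "activity_task_ready",
    "workflow_task_waiting_for_compatible_worker",
    "workflow_task_claim_failed", "workflow_task_leased",
    "workflow_task_ready", "timer_task_leased", "timer_scheduled",
]


def missing_terms(doc_text: str, terms: list[str]) -> list[str]:
    """Return the terms that never appear in doc_text as whole identifiers."""
    return [
        term for term in terms
        if not re.search(rf"\b{re.escape(term)}\b", doc_text)
    ]


# Hypothetical document bodies: one complete, one with a renamed state.
complete_doc = "liveness_state values: " + ", ".join(LIVENESS_STATES)
drifted_doc = complete_doc.replace("timer_scheduled", "timer_pending")

complete_result = missing_terms(complete_doc, LIVENESS_STATES)
drifted_result = missing_terms(drifted_doc, LIVENESS_STATES)
```

A test built this way fails as soon as a state is renamed in the document, which is the kind of silent drift the pinning test is meant to catch.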

The liveness enforcement shift is a contract change, not a code
change: TaskRepair, TaskRepairPolicy, TaskRepairCandidates,
ActivityTaskClaimer, ActivityLease, TaskDispatcher, RunWorkflowTask,
RunActivityTask, RunTimerTask, RunSummaryProjector, OperatorMetrics,
HealthCheck, OperatorQueueVisibility, RepairBlockedReason, and the
workflow_tasks, workflow_run_summaries, activity_attempts, and
worker_compatibility_heartbeats tables are preserved verbatim, and
this document adds the rules that govern which of their signals
are contract, which are advisory, and how they combine into the
operational liveness envelope.
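
TaskRepairPolicy is one of the preserved surfaces, and its four frozen knobs lend themselves to a short illustration. The sketch below is in Python and rests on a loudly stated assumption: the commit names the strategy exponential_by_repair_count and the defaults, but not the formula, so doubling redispatch_after_seconds per repair and capping at failure_backoff_max_seconds is a guess at the shape. RepairPolicy and failure_backoff_seconds are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepairPolicy:
    """Hypothetical holder for the four knobs, with the defaults from this commit."""
    redispatch_after_seconds: int = 3
    loop_throttle_seconds: int = 5
    scan_limit: int = 25
    failure_backoff_max_seconds: int = 60


def failure_backoff_seconds(policy: RepairPolicy, repair_count: int) -> int:
    """Assumed formula: base delay doubled per repair, capped at the maximum.

    The real exponential_by_repair_count strategy may differ in detail.
    """
    if repair_count < 0:
        raise ValueError("repair_count must be non-negative")
    delay = policy.redispatch_after_seconds * (2 ** repair_count)
    return min(delay, policy.failure_backoff_max_seconds)


policy = RepairPolicy()
delays = [failure_backoff_seconds(policy, n) for n in range(7)]
```

Under this assumed formula the delay grows from the redispatch base and stays at the 60-second cap once it is reached, so repeated repair failures cannot push a task's next attempt arbitrarily far out.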

Verified:
- bash scripts/check-public-boundary.sh (exit 0)
- vendor/bin/phpunit tests/Unit/V2/OperationalLivenessDocumentationTest.php
  (32 tests, 221 assertions, OK) against PHP 8.4
- vendor/bin/phpunit tests/Unit/V2/OperationalLivenessDocumentationTest.php
  tests/Unit/V2/RolloutSafetyDocumentationTest.php
  (57 tests, 405 assertions, OK)
- vendor/bin/ecs check docs/architecture/operational-liveness.md
  tests/Unit/V2/OperationalLivenessDocumentationTest.php
  (no errors)

Refs: #681

Parent: abc11d5
2 files changed
Lines changed: 1622 additions & 0 deletions