|
| 1 | +# Plan: execution_id node state (TEMPORARY — do not commit) |
| 2 | + |
| 3 | +> This plan file is scratch. Delete `docs/exec-id-plan.md` before the feature |
| 4 | +> branch is finalized (last step below). It must not land in history. |
| 5 | +
|
| 6 | +## Goal |
| 7 | + |
| 8 | +Replace the `node-state-model` PR's stored `status_reason` approach with an |
| 9 | +`execution_id`-based model where **physical** statuses are written by the owning |
| 10 | +orchestration and **implicit** statuses (`skipped`, `cancelled`) are *derived at |
| 11 | +read time* in `df.instance_nodes` from each node's ancestors — with no duroxide |
| 12 | +dependency in the read path. |
| 13 | + |
| 14 | +--- |
| 15 | + |
| 16 | +## Step 1 — Schema: `status_details JSONB` on `df.nodes` |
| 17 | + |
| 18 | +- Add `status_details JSONB` (nullable) to `df.nodes` in [src/lib.rs](src/lib.rs) |
| 19 | + `create_tables` DDL. |
| 20 | +- Stores the writer's `execution_id` (the ordered path; see Step 4) plus room |
| 21 | + for small read-path payloads we already need (the IF decision; the RACE |
| 22 | + winner — see Step 6). |
| 23 | +- Incidental cleanup agreed earlier: **drop the dead `error` column** and stop |
| 24 | + overloading `result` for non-result data. |
| 25 | +- Keep `nodes_status_chk` limited to the **physical** statuses |
| 26 | + (`pending`/`running`/`completed`/`failed`). `skipped`/`cancelled` are never |
| 27 | + stored, so they must NOT be added to the check constraint. |
| 28 | +- Grants: add `status_details` to nothing in the user INSERT column list (worker |
| 29 | + writes it); remove `error` from any column lists in `df.grant_usage` / |
| 30 | + `df.revoke_usage`. |
| 31 | +- **Upgrade script** `sql/pg_durable--0.2.4--0.2.5.sql` (next version): `ADD |
| 32 | + COLUMN status_details`, `DROP COLUMN error`, constraint + grant adjustments. |
| 33 | +- **Binary backward-compat (B1):** the new `.so` must still run against older |
| 34 | + schemas that lack `status_details`/still have `error`. The write path (Step 2) |
| 35 | + and read path (Step 3) must degrade gracefully when the column is absent |
| 36 | + (runtime column detection, or a guarded query) — verify with |
| 37 | + `scripts/test-upgrade.sh`. |
| 38 | + |
| 39 | +## Step 2 — Write path: fence status updates on `execution_id` |
| 40 | + |
| 41 | +- In [src/activities/update_node_status.rs](src/activities/update_node_status.rs): |
| 42 | + accept the caller's `execution_id`, write it to `status_details`, and make the |
| 43 | + `UPDATE` **conditional**: apply only when the incoming `execution_id` is **not |
| 44 | + superseded** by the stored one (`incoming >= stored`, equal allowed for |
| 45 | + running→terminal within one execution). |
| 46 | +- Ordering: JSONB does not compare the way we need. Store/compare via a |
| 47 | + **comparable form** (e.g. an `int[]` of the execution numbers) so Postgres' |
| 48 | + native lexicographic array order gives "first differing segment, larger number |
| 49 | + wins". The JSONB can carry the human-readable path; the gate compares the |
| 50 | + array form (column, generated column, or extracted in the `WHERE`). |
| 51 | +- **Fire-and-forget / determinism:** the activity returns `Ok` regardless of |
| 52 | + whether the row was updated; the orchestration must never branch on the |
| 53 | + outcome (keep `let _ = ctx.schedule_activity(...)`). |
| 54 | + |
| 55 | +## Step 3 — Read path: `df.instance_nodes` infers status from `df.nodes` only |
| 56 | + |
| 57 | +- Rewrite [src/monitoring.rs](src/monitoring.rs) `instance_nodes` to: |
| 58 | + 1. **Drop the duroxide dependency** — read solely from `df.nodes` (one |
| 59 | + `SELECT` of the instance's rows). Remove `list_executions` / the fabricated |
| 60 | + `execution_id` cross join. |
| 61 | + 2. For each node, **leave `status` and the stored `execution_id` untouched** |
| 62 | + and instead **add inferred fields to the returned `status_details`**: |
| 63 | + - `inferred_status` — the *effective* state at read time (`skipped`, |
| 64 | + `cancelled`, `pending`, or the node's own physical status when no |
| 65 | + ancestor overrides it). |
| 66 | + - `inferred_status_from_ancestor_id` — the node whose state *determined* |
| 67 | + `inferred_status` (the first ancestor whose `execution_id` is not a |
| 68 | + prefix of this node's). Makes the derivation explainable per row. |
| 69 | + - We deliberately do NOT add `inferred_status_execution_id`: it is just the |
| 70 | + `execution_id` of `inferred_status_from_ancestor_id`'s row, so a reader |
| 71 | + can join to that node if needed. |
| 72 | + - Rationale: the read path **must not compete with the write path**. By |
| 73 | + augmenting (not overwriting) `status_details`, the helper never races the |
| 74 | + orchestration's fenced writes. (If we later materialize this at the DB |
| 75 | + level it would compete — defer that decision.) |
| 76 | + 3. **Derivation (leaf→root climb):** find the first ancestor whose |
| 77 | + `execution_id` is not a prefix of the node's; that ancestor's state sets |
| 78 | + `inferred_status`. Ancestor `failed` ⇒ `skipped`; `running`/newer execution |
| 79 | + ⇒ `pending`; `completed` but branch/iteration not taken ⇒ `skipped`; RACE |
| 80 | + loser ⇒ `cancelled`. Use the IF decision and RACE winner recorded in |
| 81 | + `status_details` (Step 6) to know which branch/iteration is "taken". |
| 82 | + 4. Children are reachable via `left_node`/`right_node`; the tree is immutable |
| 83 | + after `df.start()`, so no parent column is required (optional safe |
| 84 | + denormalization only). |
| 85 | +- Output columns: `status` and `status_details` keep their stored values; the |
| 86 | + inferred fields live inside the returned `status_details` JSONB only, never in |
| 87 | + the base table. |
| 88 | + |
| 89 | +## Step 4 — Document the `execution_id` |
| 90 | + |
| 91 | +- Fold the structure from [docs/execution-id.md](docs/execution-id.md) into the |
| 92 | + permanent doc (see Step 7): ordered path `root:n₀ : seg₁:n₁ : … : segₖ:nₖ`; |
| 93 | + segment added only at sub-orchestration boundaries (loop iteration, join/race |
| 94 | + branch); branch numbers = 1, loop numbers = 1..N; same node across executions |
| 95 | + shares segment identities+length, differing only in numbers ⇒ total order. |
| 96 | + |
| 97 | +## Step 5 — Document the state-transition graph + helper |
| 98 | + |
| 99 | +- Add a diagram of node states with the **physical** transitions |
| 100 | + (`pending→running→completed|failed`) drawn solid and the **implicit, never |
| 101 | + materialized** ones (`→skipped`, `→cancelled`, re-entry `→pending` on a new |
| 102 | + execution) drawn dashed, with a clear note that the dashed ones exist only in |
| 103 | + `df.instance_nodes` output. |
| 104 | +- Document the inference helper used by `instance_nodes` (the ancestor-climb / |
| 105 | + derivation function): inputs, the prefix rule, each case, and that it emits |
| 106 | + `inferred_status` + `inferred_status_from_ancestor_id` into `status_details` |
| 107 | + rather than mutating the stored `status`/`execution_id`. |
| 108 | + |
| 109 | +## Step 6 — RACE winner recording (small behavioral add) |
| 110 | + |
| 111 | +- The root→leaves / branch-taken inference needs the RACE node to record its |
| 112 | + **winning branch** in `status_details`. Add this write in the race |
| 113 | + orchestration if not already present. |
| 114 | + |
| 115 | +## Step 7 — Docs cleanup |
| 116 | + |
| 117 | +- Delete [docs/execution-id.md](docs/execution-id.md) and |
| 118 | + [docs/node-state-model.md](docs/node-state-model.md); their content is folded |
| 119 | + into the permanent doc/diagram (Steps 4–5). |
| 120 | +- Delete this plan file `docs/exec-id-plan.md` (must not be committed on the |
| 121 | + feature branch). |
| 122 | + |
| 123 | +--- |
| 124 | + |
| 125 | +## Open dependency you may be assuming implicitly |
| 126 | + |
| 127 | +**Loop boundary (prerequisite, not in your list).** Clean per-iteration |
| 128 | +`execution_id` segments require a nested loop to run as a **sub-orchestration |
| 129 | +rooted at the loop node**. Today the loop uses `ctx.continue_as_new` from |
| 130 | +`graph.root_node_id` (the only `continue_as_new` caller), which both produces the |
| 131 | +existing nested-loop "restarts wrong root" bug and prevents a nested loop from |
| 132 | +owning its own segment. This must be addressed (or explicitly deferred with |
| 133 | +top-level-loops-only scope) for Steps 2–3 to be correct. Track it as a separate |
| 134 | +work item. |
0 commit comments