Skip to content

Commit 1a216ef

Browse files
authored
Merge pull request #925 from Pipelex/release/v0.29.0
Release v0.29.0
2 parents 6d81b1d + bfcb011 commit 1a216ef

567 files changed

Lines changed: 46332 additions & 8044 deletions

File tree

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

.badges/tests.json

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,7 @@
11
{
22
"schemaVersion": 1,
33
"label": "tests",
4-
"message": "5112",
4+
"message": "6089",
55
"color": "blue",
66
"cacheSeconds": 300
77
}

.claude/settings.json

Lines changed: 4 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -41,5 +41,8 @@
4141
]
4242
}
4343
]
44+
},
45+
"enabledPlugins": {
46+
"temporal@temporal-marketplace": true
4447
}
45-
}
48+
}

.claude/skills/temporal-e2e-validate/SKILL.md

Lines changed: 170 additions & 603 deletions
Large diffs are not rendered by default.
Lines changed: 154 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,154 @@
1+
# Temporal E2E — Mode 2 setup: server + worker processes
2+
3+
> Reference file for the **temporal-e2e-validate** skill — Mode 2, Steps 1–2. **Read this first** before any other Mode 2 reference.
4+
> The **Timeouts policy** and the **surface-results-immediately** rule in `SKILL.md` apply to every command here.
5+
>
6+
> Mode 2 reference files (read as the work requires):
7+
>
8+
> - `mode-2-setup.md` — Steps 1–2: Temporal server + worker processes (this file)
9+
> - `mode-2-tiers.md` — Steps 3–7: Tiers 1–14, graph, isolation, codec, final report
10+
> - `routing-battery.md` — Step 8: v1 `activity_queues` routing battery
11+
> - `queue-options-battery.md` — Step 9: v2 queue options + worker-runtime profiles
12+
13+
## Why Mode 2: True 3-Process Validation
14+
15+
This is the real deployment test. Three separate OS processes — Temporal server, worker,
16+
and submitter — with no shared memory. The worker has its own Python runtime, its own
17+
`sys.modules`, its own ClassRegistry. This is the only way to catch bugs like:
18+
19+
- The worker can't deserialize the PipeJob because concept classes aren't registered yet
20+
- The Kajson decoder bypasses class lookup when `__module__="builtins"`
21+
- The Temporal data converter silently drops fields during encoding/decoding
22+
23+
Each command runs a pipeline through Temporal with `--graph`, which also validates that
24+
the worker emits NDJSON trace events and the submitter assembles them into a GraphSpec
25+
with an interactive ReactFlow HTML visualization.
26+
27+
### Run mode: dry-run vs live
28+
29+
By default, Mode 2 runs in **dry-run** mode (`--dry-run --mock-inputs`) — fast, no LLM
30+
costs, validates serialization and crate propagation. But dry-run produces tiny mock
31+
data, so it **cannot catch payload size issues** (e.g. large image payloads blowing up
32+
Temporal's data converter).
33+
34+
Ask the user which mode they want. If they say "live", simply omit `--dry-run --mock-inputs`
35+
from all commands below. The `pipelex run bundle` CLI has no `--pipe-run-mode` flag — live
36+
is the default when neither `--dry-run` nor `--mock-inputs` is specified (note: pytest in
37+
Mode 1 does accept `--pipe-run-mode live`, but the CLI does not). Live mode makes real LLM and image
38+
generation calls — it costs money and is slower, but it's the only way to validate that
39+
real-sized payloads (especially images) flow correctly through Temporal.
40+
41+
**This matters most for Tiers 4 and 5** (image generation and image flow). In dry-run,
42+
mock images are trivially small, so payload size bugs don't surface. In live mode,
43+
generated images are hundreds of KB to several MB — exactly the size that breaks Temporal
44+
if activity-level storage isn't implemented.
45+
46+
### Step 1: Ensure Temporal server is running
47+
48+
Same as Mode 1 Step 1 in `SKILL.md` — start `temporal server start-dev` in a `temporal-server` tmux session if `curl -s http://localhost:8233` shows it is not already up.
49+
50+
**Search-attribute registration on a fresh dev server.** Worker boot performs a
51+
hard-fail audit (`check_required_search_attributes`) and refuses to start if any
52+
of the Pipelex custom attributes (defined in `BUILTIN_SEARCH_ATTRIBUTES`) is
53+
missing from the namespace. The error message includes the exact
54+
`pipelex setup-temporal-namespace` command to run. For a freshly started
55+
`temporal server start-dev`, register them once (idempotent):
56+
57+
```bash
58+
.venv/bin/pipelex setup-temporal-namespace
59+
```
60+
61+
This wraps the same helper the test conftest uses, so the set stays in sync
62+
with `BUILTIN_SEARCH_ATTRIBUTES` automatically. (If you need to invoke the raw
63+
Temporal CLI for some reason — e.g. an environment where the Pipelex CLI is
64+
unavailable — the command shape is
65+
`temporal operator search-attribute create --namespace <ns> --name <Name> --type Keyword`,
66+
one per attribute name from `BUILTIN_SEARCH_ATTRIBUTES`.)
67+
68+
Mode 1 (pytest) handles this automatically via the test conftest. Mode 2 needs
69+
it done before worker startup.
70+
71+
### Step 2: Start the worker processes
72+
73+
The workers should NOT have test bundles in their PIPELEXPATH — the whole point is
74+
that bundles arrive via the LibraryCrate in the PipeJob.
75+
76+
**Two scoped workers (cross-process regression setup — recommended).**
77+
Splits responsibilities so workflows and activities run in different processes:
78+
- `router` worker registers all workflows (`WfPipeRouter`, `WfPipeRun`, …),
79+
`disable_all_activities = true`.
80+
- `runner` worker registers all activities (`act_deliver`, `act_llm_*`,
81+
`act_assemble_graph`, `act_flush_trace_events`, …), `disable_all_workflows = true`.
82+
83+
This forces every activity to be picked up by a *different* Python process than the
84+
workflow that scheduled it. The runner process never executes `WfPipeRouter`, so it
85+
never loads the LibraryCrate and its global `ClassRegistry` stays cold for any
86+
dynamic concept defined in the bundle.
87+
88+
> ⚠️ **Important — what this setup does NOT reproduce on its own.**
89+
> Plain `pipelex run bundle` (Tiers 1–3 below) does NOT trigger the runner-side
90+
> registry decode bug, because:
91+
> - `WfPipeRouter` dehydrates `pipe_output` via `prepare_for_temporal()` before
92+
> returning, so workflow-level transit carries raw dicts (no class lookup).
93+
> - `WfPipeRun` rehydrates back on the *router* (same process that loaded the crate).
94+
> - Activities the runner actually executes in dry-run (`act_assemble_graph`,
95+
> `act_flush_trace_events`) operate on raw event records — no dynamic class needed.
96+
> - `act_deliver` is **only scheduled when `delivery_assignment is not None`**
97+
> (`wf_pipe_run.py:79`), and `pipelex run bundle` does not pass one.
98+
>
99+
> To deterministically force a hydrated `pipe_output` across the process boundary,
100+
> run **Tier 2b** below (mirrors the cloud / `start_pipeline` + webhook path).
101+
102+
```bash
103+
tmux has-session -t temporal-worker-router 2>/dev/null && tmux kill-session -t temporal-worker-router
104+
tmux has-session -t temporal-worker-runner 2>/dev/null && tmux kill-session -t temporal-worker-runner
105+
tmux new-session -d -c "$PWD" -s temporal-worker-router \
106+
'.venv/bin/python -m pipelex.temporal.worker_cli --is-not-sandboxed --scope router'
107+
tmux new-session -d -c "$PWD" -s temporal-worker-runner \
108+
'.venv/bin/python -m pipelex.temporal.worker_cli --is-not-sandboxed --scope runner'
109+
# Bounded wait: fail fast with captured logs if a worker can't start within 30s
110+
# (boot is normally <5s — anything past that means a stuck config validator,
111+
# missing search attributes, or the worker crashed early).
112+
for session in temporal-worker-router temporal-worker-runner; do
113+
for attempt in $(seq 1 30); do
114+
if tmux capture-pane -t "$session" -p 2>/dev/null | grep -q "Temporal Worker started"; then break; fi
115+
if [ "$attempt" -eq 30 ]; then
116+
echo "TIMEOUT: $session did not report 'Temporal Worker started' within 30s — last 50 lines:"
117+
tmux capture-pane -t "$session" -p -S -50
118+
exit 1
119+
fi
120+
sleep 1
121+
done
122+
done
123+
tmux capture-pane -t temporal-worker-router -p -S -30 | grep -E "scope|started for|search-attribute" | head -5
124+
tmux capture-pane -t temporal-worker-runner -p -S -30 | grep -E "scope|started for|search-attribute" | head -5
125+
```
126+
127+
Look for `Temporal Worker started for 'temporal_task_queue'` in each session, plus
128+
`profile='default' scope='router'` and `'runner'` respectively. If a worker fails with
129+
`SearchAttributeRegistrationError`, register the missing attributes per Step 1 and
130+
restart the worker — do **not** bump the timeout and retry blind.
131+
132+
**Alternative — single full worker** (simpler, but masks distributed-execution bugs;
133+
use only when you don't need the regression coverage):
134+
135+
```bash
136+
tmux has-session -t temporal-worker 2>/dev/null && tmux kill-session -t temporal-worker
137+
tmux new-session -d -c "$PWD" -s temporal-worker \
138+
'.venv/bin/python -m pipelex.temporal.worker_cli --is-not-sandboxed'
139+
# Bounded wait: fail fast with captured logs if the worker can't start within 30s
140+
for attempt in $(seq 1 30); do
141+
if tmux capture-pane -t temporal-worker -p 2>/dev/null | grep -q "Temporal Worker started"; then break; fi
142+
if [ "$attempt" -eq 30 ]; then
143+
echo "TIMEOUT: temporal-worker did not report 'Temporal Worker started' within 30s — last 50 lines:"
144+
tmux capture-pane -t temporal-worker -p -S -50
145+
exit 1
146+
fi
147+
sleep 1
148+
done
149+
tmux capture-pane -t temporal-worker -p -S -30
150+
```
151+
152+
When using two scoped workers, replace any later capture commands like
153+
`tmux capture-pane -t temporal-worker ...` with the appropriate session name
154+
(`temporal-worker-router` or `temporal-worker-runner`).

0 commit comments

Comments
 (0)