| 1 | +# Temporal E2E — Mode 2 setup: server + worker processes |
| 2 | + |
| 3 | +> Reference file for the **temporal-e2e-validate** skill — Mode 2, Steps 1–2. **Read this first** before any other Mode 2 reference. |
| 4 | +> The **Timeouts policy** and the **surface-results-immediately** rule in `SKILL.md` apply to every command here. |
| 5 | +> |
| 6 | +> Mode 2 reference files (read as the work requires): |
| 7 | +> |
| 8 | +> - `mode-2-setup.md` — Steps 1–2: Temporal server + worker processes (this file) |
| 9 | +> - `mode-2-tiers.md` — Steps 3–7: Tiers 1–14, graph, isolation, codec, final report |
| 10 | +> - `routing-battery.md` — Step 8: v1 `activity_queues` routing battery |
| 11 | +> - `queue-options-battery.md` — Step 9: v2 queue options + worker-runtime profiles |
| 12 | + |
| 13 | +## Why Mode 2: True 3-Process Validation |
| 14 | + |
| 15 | +This is the real deployment test. Three separate OS processes — Temporal server, worker, |
| 16 | +and submitter — with no shared memory. The worker has its own Python runtime, its own |
| 17 | +`sys.modules`, its own ClassRegistry. This is the only way to catch bugs like: |
| 18 | + |
| 19 | +- The worker can't deserialize the PipeJob because concept classes aren't registered yet |
| 20 | +- The Kajson decoder bypasses class lookup when `__module__="builtins"` |
| 21 | +- The Temporal data converter silently drops fields during encoding/decoding |
| 22 | + |
| 23 | +Each command runs a pipeline through Temporal with `--graph`, which also validates that |
| 24 | +the worker emits NDJSON trace events and the submitter assembles them into a GraphSpec |
| 25 | +with an interactive ReactFlow HTML visualization. |
| 26 | + |
| 27 | +### Run mode: dry-run vs live |
| 28 | + |
| 29 | +By default, Mode 2 runs in **dry-run** mode (`--dry-run --mock-inputs`) — fast, no LLM |
| 30 | +costs, validates serialization and crate propagation. But dry-run produces tiny mock |
| 31 | +data, so it **cannot catch payload size issues** (e.g. large image payloads blowing up |
| 32 | +Temporal's data converter). |
| 33 | + |
| 34 | +Ask the user which mode they want. If they say "live", omit `--dry-run --mock-inputs` |
| 35 | +from every Mode 2 command. The `pipelex run bundle` CLI has no `--pipe-run-mode` flag: live |
| 36 | +is the default when neither `--dry-run` nor `--mock-inputs` is specified. (Pytest in Mode 1 |
| 37 | +does accept `--pipe-run-mode live`; the CLI does not.) Live mode makes real LLM and image |
| 38 | +generation calls, so it costs money and is slower, but it is the only way to validate that |
| 39 | +real-sized payloads (especially images) flow correctly through Temporal. |
| 40 | + |
| 41 | +**This matters most for Tiers 4 and 5** (image generation and image flow). In dry-run, |
| 42 | +mock images are trivially small, so payload size bugs don't surface. In live mode, |
| 43 | +generated images are hundreds of KB to several MB — exactly the size that breaks Temporal |
| 44 | +if activity-level storage isn't implemented. |
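The dry-run/live choice can be captured once and reused across commands. The helper below is a minimal sketch: only `--dry-run` and `--mock-inputs` come from this file, while the function name and the `dry-run`/`live` keywords are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the run-mode flags for a Mode 2 command.
# Only --dry-run and --mock-inputs come from this document; the function
# name and the mode keywords are illustrative assumptions.
run_mode_flags() {
  case "${1:-dry-run}" in
    dry-run) echo "--dry-run --mock-inputs" ;;
    live)    echo "" ;;  # live is the CLI default, so no flag is needed
    *)       echo "unknown run mode: $1 (expected dry-run or live)" >&2
             return 2 ;;
  esac
}
```

Capture the result with `flags=$(run_mode_flags live)` and append `$flags` unquoted to each `pipelex run bundle` invocation, so that an empty value adds no argument.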
| 45 | + |
| 46 | +### Step 1: Ensure Temporal server is running |
| 47 | + |
| 48 | +Same as Mode 1 Step 1 in `SKILL.md` — start `temporal server start-dev` in a `temporal-server` tmux session if `curl -s http://localhost:8233` shows it is not already up. |
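The check-then-start logic described above might look like the following sketch. The URL, the session name, and the server command are taken from this step; the function name and the 30-second bound are assumptions, and the authoritative commands remain the ones in `SKILL.md`.

```shell
#!/usr/bin/env bash
# Sketch: start the dev server only when the UI port does not answer.
# URL, session name and server command come from Step 1; the function
# name and the 30s bound are assumptions.
ensure_temporal_server() {
  local url="http://localhost:8233" attempt
  if curl -s -o /dev/null "$url"; then
    echo "Temporal dev server already up"
    return 0
  fi
  # Drop a stale session so new-session does not fail on a duplicate name.
  tmux has-session -t temporal-server 2>/dev/null && tmux kill-session -t temporal-server
  tmux new-session -d -s temporal-server 'temporal server start-dev'
  for attempt in $(seq 1 30); do
    if curl -s -o /dev/null "$url"; then
      echo "Temporal dev server started"
      return 0
    fi
    sleep 1
  done
  echo "TIMEOUT: Temporal dev server did not answer on $url within 30s" >&2
  tmux capture-pane -t temporal-server -p -S -50
  return 1
}
```

The function returns a status instead of calling `exit`, so it can be sourced into an interactive shell without closing it on failure.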
| 49 | + |
| 50 | +**Search-attribute registration on a fresh dev server.** Worker boot performs a |
| 51 | +hard-fail audit (`check_required_search_attributes`) and refuses to start if any |
| 52 | +of the Pipelex custom attributes (defined in `BUILTIN_SEARCH_ATTRIBUTES`) is |
| 53 | +missing from the namespace. The error message includes the exact |
| 54 | +`pipelex setup-temporal-namespace` command to run. For a freshly started |
| 55 | +`temporal server start-dev`, register them once (idempotent): |
| 56 | + |
| 57 | +```bash |
| 58 | +.venv/bin/pipelex setup-temporal-namespace |
| 59 | +``` |
| 60 | + |
| 61 | +This wraps the same helper the test conftest uses, so the set stays in sync |
| 62 | +with `BUILTIN_SEARCH_ATTRIBUTES` automatically. (If you need to invoke the raw |
| 63 | +Temporal CLI for some reason — e.g. an environment where the Pipelex CLI is |
| 64 | +unavailable — the command shape is |
| 65 | +`temporal operator search-attribute create --namespace <ns> --name <Name> --type Keyword`, |
| 66 | +one per attribute name from `BUILTIN_SEARCH_ATTRIBUTES`.) |
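For the raw-CLI fallback, the per-attribute command can be wrapped in a loop. This is a sketch only: the function name and the `DRY_RUN` switch are hypothetical, and the attribute names must be copied from `BUILTIN_SEARCH_ATTRIBUTES` (the names in the usage note are placeholders, not real Pipelex attributes).

```shell
#!/usr/bin/env bash
# Sketch of the raw-CLI fallback: one `create` call per attribute name.
# Attribute names must come from BUILTIN_SEARCH_ATTRIBUTES; this function
# does not know them. DRY_RUN=1 (the default here, an assumption) only
# prints the commands instead of running them.
register_search_attributes() {
  local namespace="$1"; shift
  local name
  for name in "$@"; do
    local cmd=(temporal operator search-attribute create
               --namespace "$namespace" --name "$name" --type Keyword)
    if [ "${DRY_RUN:-1}" = "1" ]; then
      printf '%s\n' "${cmd[*]}"
    else
      "${cmd[@]}"
    fi
  done
}
```

For example, `register_search_attributes default ExampleAttrOne ExampleAttrTwo` prints two commands for review; set `DRY_RUN=0` to execute them. The namespace `default` is an assumption here; use whichever namespace the workers are configured for.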
| 67 | + |
| 68 | +Mode 1 (pytest) handles this automatically via the test conftest. Mode 2 needs |
| 69 | +it done before worker startup. |
| 70 | + |
| 71 | +### Step 2: Start the worker processes |
| 72 | + |
| 73 | +The workers should NOT have test bundles in their PIPELEXPATH — the whole point is |
| 74 | +that bundles arrive via the LibraryCrate in the PipeJob. |
| 75 | + |
| 76 | +**Two scoped workers (cross-process regression setup — recommended).** |
| 77 | +Splits responsibilities so workflows and activities run in different processes: |
| 78 | +- `router` worker: registers all workflows (`WfPipeRouter`, `WfPipeRun`, …), |
| 79 | + with `disable_all_activities = true`. |
| 80 | +- `runner` worker: registers all activities (`act_deliver`, `act_llm_*`, |
| 81 | + `act_assemble_graph`, `act_flush_trace_events`, …), with `disable_all_workflows = true`. |
| 82 | + |
| 83 | +This forces every activity to be picked up by a *different* Python process than the |
| 84 | +workflow that scheduled it. The runner process never executes `WfPipeRouter`, so it |
| 85 | +never loads the LibraryCrate and its global `ClassRegistry` stays cold for any |
| 86 | +dynamic concept defined in the bundle. |
| 87 | + |
| 88 | +> ⚠️ **Important — what this setup does NOT reproduce on its own.** |
| 89 | +> Plain `pipelex run bundle` (Tiers 1–3 in `mode-2-tiers.md`) does NOT trigger the runner-side |
| 90 | +> registry decode bug, because: |
| 91 | +> - `WfPipeRouter` dehydrates `pipe_output` via `prepare_for_temporal()` before |
| 92 | +> returning, so workflow-level transit carries raw dicts (no class lookup). |
| 93 | +> - `WfPipeRun` rehydrates back on the *router* (same process that loaded the crate). |
| 94 | +> - Activities the runner actually executes in dry-run (`act_assemble_graph`, |
| 95 | +> `act_flush_trace_events`) operate on raw event records — no dynamic class needed. |
| 96 | +> - `act_deliver` is **only scheduled when `delivery_assignment is not None`** |
| 97 | +> (`wf_pipe_run.py:79`), and `pipelex run bundle` does not pass one. |
| 98 | +> |
| 99 | +> To deterministically force a hydrated `pipe_output` across the process boundary, |
| 100 | +> run **Tier 2b** in `mode-2-tiers.md` (mirrors the cloud / `start_pipeline` + webhook path). |
| 101 | + |
| 102 | +```bash |
| 103 | +tmux has-session -t temporal-worker-router 2>/dev/null && tmux kill-session -t temporal-worker-router |
| 104 | +tmux has-session -t temporal-worker-runner 2>/dev/null && tmux kill-session -t temporal-worker-runner |
| 105 | +tmux new-session -d -c "$PWD" -s temporal-worker-router \ |
| 106 | + '.venv/bin/python -m pipelex.temporal.worker_cli --is-not-sandboxed --scope router' |
| 107 | +tmux new-session -d -c "$PWD" -s temporal-worker-runner \ |
| 108 | + '.venv/bin/python -m pipelex.temporal.worker_cli --is-not-sandboxed --scope runner' |
| 109 | +# Bounded wait: fail fast with captured logs if a worker can't start within 30s |
| 110 | +# (boot is normally <5s — anything past that means a stuck config validator, |
| 111 | +# missing search attributes, or the worker crashed early). |
| 112 | +for session in temporal-worker-router temporal-worker-runner; do |
| 113 | + for attempt in $(seq 1 30); do |
| 114 | + if tmux capture-pane -t "$session" -p 2>/dev/null | grep -q "Temporal Worker started"; then break; fi |
| 115 | + if [ "$attempt" -eq 30 ]; then |
| 116 | + echo "TIMEOUT: $session did not report 'Temporal Worker started' within 30s — last 50 lines:" |
| 117 | + tmux capture-pane -t "$session" -p -S -50 |
| 118 | + exit 1 |
| 119 | + fi |
| 120 | + sleep 1 |
| 121 | + done |
| 122 | +done |
| 123 | +tmux capture-pane -t temporal-worker-router -p -S -30 | grep -E "scope|started for|search-attribute" | head -5 |
| 124 | +tmux capture-pane -t temporal-worker-runner -p -S -30 | grep -E "scope|started for|search-attribute" | head -5 |
| 125 | +``` |
| 126 | + |
| 127 | +Look for `Temporal Worker started for 'temporal_task_queue'` in each session, plus |
| 128 | +`profile='default' scope='router'` in the router session and `scope='runner'` in the runner |
| 129 | +session. If a worker fails with `SearchAttributeRegistrationError`, register the missing |
| 130 | +attributes per Step 1 and restart the worker. Do **not** bump the timeout and retry blind. |
| 131 | + |
| 132 | +**Alternative — single full worker** (simpler, but masks distributed-execution bugs; |
| 133 | +use only when you don't need the regression coverage): |
| 134 | + |
| 135 | +```bash |
| 136 | +tmux has-session -t temporal-worker 2>/dev/null && tmux kill-session -t temporal-worker |
| 137 | +tmux new-session -d -c "$PWD" -s temporal-worker \ |
| 138 | + '.venv/bin/python -m pipelex.temporal.worker_cli --is-not-sandboxed' |
| 139 | +# Bounded wait: fail fast with captured logs if the worker can't start within 30s |
| 140 | +for attempt in $(seq 1 30); do |
| 141 | + if tmux capture-pane -t temporal-worker -p 2>/dev/null | grep -q "Temporal Worker started"; then break; fi |
| 142 | + if [ "$attempt" -eq 30 ]; then |
| 143 | + echo "TIMEOUT: temporal-worker did not report 'Temporal Worker started' within 30s — last 50 lines:" |
| 144 | + tmux capture-pane -t temporal-worker -p -S -50 |
| 145 | + exit 1 |
| 146 | + fi |
| 147 | + sleep 1 |
| 148 | +done |
| 149 | +tmux capture-pane -t temporal-worker -p -S -30 |
| 150 | +``` |
| 151 | + |
| 152 | +When using two scoped workers, replace any later capture commands like |
| 153 | +`tmux capture-pane -t temporal-worker ...` with the appropriate session name |
| 154 | +(`temporal-worker-router` or `temporal-worker-runner`). |
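The bounded wait appears in both worker setups above, so it can be factored into one function. The following is a sketch under assumptions: the function name and the optional timeout argument are hypothetical, while the session names and the `Temporal Worker started` marker are the ones used in Step 2.

```shell
#!/usr/bin/env bash
# Sketch: the Step 2 bounded wait as a reusable function. It returns
# non-zero instead of calling `exit`, so it is safe to source into an
# interactive shell. Function name and timeout argument are assumptions.
wait_for_worker() {
  local session="$1" timeout="${2:-30}" attempt
  for attempt in $(seq 1 "$timeout"); do
    if tmux capture-pane -t "$session" -p 2>/dev/null | grep -q "Temporal Worker started"; then
      return 0
    fi
    sleep 1
  done
  echo "TIMEOUT: $session did not report 'Temporal Worker started' within ${timeout}s, last 50 lines:"
  tmux capture-pane -t "$session" -p -S -50
  return 1
}
```

With two scoped workers, call it once per session, for example `wait_for_worker temporal-worker-router && wait_for_worker temporal-worker-runner`; with the single full worker, `wait_for_worker temporal-worker` is enough.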