feat(journeys): journey-execution queue-health observability + dispatch fail-fast (EVO-1764) by daniloleonecarneiro · Pull Request #80 · evolution-foundation/evo-flow-community

daniloleonecarneiro · 2026-06-23T13:03:12Z

Summary

When the journey-execution Temporal task queue has no executor (worker not deployed, RUN_MODE misconfigured, queue-name drift), triggered journeys silently sit unstarted until the 30d workflowExecutionTimeout — with no signal and no guard. Split from EVO-1758 (which owns the worker reconnection fix; the worker already auto-recovers from transient Temporal restarts). This card adds the detection signal and the fail-fast guard, recording the fail-fast-vs-silent-timeout decision.

The real target is a genuinely worker-less queue, not a benign restart (which self-heals), so both the indicator and the guard gate on sustained zero pollers, and the guard tolerates the reconnect window.

Changes

Poller — single source of truth (temporal/services/journey-execution-poller.service.ts, new):

Periodic describeTaskQueue (WORKFLOW + ACTIVITY poller types). A workflow needs a WORKFLOW poller to run its first line, so that count is authoritative. Tracks a zeroSince clock; stale (Temporal unreachable) is never read as "no worker".
Lives in a lean TemporalQueueHealthModule (imports only ProcessingModule), imported by both the always-on HealthModule and JourneysModule → one shared singleton, no heavy/conditional TemporalModule, no DI cycle.
isQueueUnexecutable is cached-first: trusts the background snapshot and only confirms live when it already suspects zero (zero RPC on a healthy dispatch); no-op when not monitoring (e.g. campaign worker). A forceLive mode serves the low-frequency manual path. poll() de-duplicates in-flight calls so the interval + on-demand callers never race the shared connection.

Readiness + metrics:

TemporalTaskQueueIndicator on /ready for temporal-worker only (sustained zero → 503). shouldServeHttp opts the dedicated worker into the probe listener so /ready + /metrics are reachable there (it was listener-less). single is intentionally excluded so a journey-queue dip cannot 503 the co-located API; it still gets the gauges.
Gauges evo_temporal_task_queue_pollers{task_queue,poller_type} and evo_temporal_task_queue_zero_poller_seconds{task_queue}.

Fail-fast guard — both dispatch paths:

Kafka trigger (journey-trigger-processor) and manual startJourney: when the queue has no executor, terminate the just-started workflow and persist a failed session (an upsert via set(), not the update-only status path — the row does not exist yet on the trigger path) so the journey is visible/auditable instead of a phantom ACTIVE that would permanently block future triggers. Kafka path commits the offset (no retry storm); manual path returns started:false.

Cleanup: centralize the journey-execution literal behind TEMPORAL_TASK_QUEUES.

Test plan

npx jest src/modules/temporal/services/journey-execution-poller.service.spec.ts src/health src/modules/journeys/services — 94 tests pass (new: poller state machine — sustained-zero/recovery/stale/forceLive/cached-first/in-flight; indicator up/down/never-throws; dispatch guard on both Kafka and manual paths; per-mode indicator gating).
tsc -b clean (project refs).
Config (env-overridable): poll 15s / sustained-zero 60s / dispatch-grace 45s.
Manual smoke (SINGLE): boot → /metrics gauge > 0; scale journey worker to 0 → after 60s the gauge climbs and a triggered journey is marked failed (not active), workflow terminated; restore worker → new triggers dispatch.

Notas

Card said 24h, code is 30d (workflowExecutionTimeout). We surface fast via the guard rather than shorten the global timeout (legit long waited journeys rely on it).
Worker reconnection/watchdog is out of scope → EVO-1758. Infra-side Prometheus/Grafana alert-rule authoring on the new gauges/503 is an infra handoff.
Accepted residual risk (timing-dependent, narrowed by the cached-first confirm): a saturated/rolling-restart worker reporting zero pollers > grace could drop a real journey without retry on the Kafka path.
Follow-ups in EVO-1859: pre-existing dispatch-side 'active' status is update-only/best-effort; whether /ready should also reflect Temporal connectivity; the infra alert-rule.

Closes EVO-1764.

…ch fail-fast (EVO-1764) Surface the "no executor" condition on the journey-execution Temporal task queue and fail triggered-but-unexecutable journeys fast, instead of letting them sit unstarted until the 30d execution timeout. - Poller (JourneyExecutionPollerService): periodic describeTaskQueue (WORKFLOW + ACTIVITY) tracking sustained-zero WORKFLOW pollers — the single source of truth for the readiness indicator, the dispatch guard, and the Prometheus gauges. Lives in a lean TemporalQueueHealthModule (no heavy, conditional TemporalModule; no DI cycle). isQueueUnexecutable is cached-first (no per-dispatch RPC on the hot Kafka path; only confirms live when the background poller already suspects zero) with a forceLive mode for the manual path; poll() de-duplicates in-flight calls so the interval and on-demand callers never race the shared connection. Stale (Temporal unreachable) is never read as "no worker". - Readiness: TemporalTaskQueueIndicator on /ready for temporal-worker only (sustained zero WORKFLOW pollers -> 503). shouldServeHttp opts the dedicated worker into the probe listener so /ready + /metrics are reachable there. SINGLE deliberately excludes the indicator (keeps /metrics gauges) so a journey-queue dip cannot 503 the co-located API. - Metrics: evo_temporal_task_queue_pollers{task_queue,poller_type} and evo_temporal_task_queue_zero_poller_seconds{task_queue} gauges. - Fail-fast guard on both dispatch paths (Kafka trigger processor + manual startJourney): when the queue has no executor, terminate the just-started workflow and persist a failed session (an upsert via set(), not the update-only status path) so the journey is visible and auditable instead of a phantom ACTIVE that silently blocks every future trigger for that contact. The Kafka path commits the offset (no retry storm); the manual path returns started:false so the API surfaces the error. - Centralize the journey-execution task-queue name behind TEMPORAL_TASK_QUEUES. Tests: poller, indicator, and dispatch-guard specs (both paths); tsc -b clean.

sourcery-ai

Sorry @daniloleonecarneiro, you have reached your weekly rate limit of 500000 diff characters.

Please try again later or upgrade to continue using Sourcery

dpaes

Review — EVO-1764: journey-execution queue-health observability + dispatch fail-fast

Verdict: Request changes. The branch is clean/mergeable on develop and the design is sound, but the PR ships a deterministically-failing committed test and a dispatch-guard coverage gap that must be fixed before merge.

Local verification (this repo's CI runs neither jest nor tsc -b, so neither is gated):

tsc -b --noEmit → clean (exit 0).
jest on the touched files → 1 failed / 51 passed (see B1). The "94 tests green" claim does not hold.

🔴 Blocker

B1 — A committed test is RED, and it is the canonical guard for the exact behavior this PR changes.
src/app-factory.ts:35 adds RunMode.TEMPORAL_WORKER to shouldServeHttp() (correct & intentional — the dedicated worker must expose /ready + /metrics). But src/app-factory.serve-http.spec.ts:29 still lists TEMPORAL_WORKER under noServe, and :37-38 asserts shouldServeHttp() === false:

FAIL src/app-factory.serve-http.spec.ts
  ● AppFactory.shouldServeHttp › temporal-worker stays listener-less
    Expected: false  Received: true
Tests: 1 failed, 51 passed, 52 total

Not inherited from develop (the PR touched only the source, not the spec; branch is 0 commits behind; the spec passed on develop). Because CI does not run jest, the red is invisible in CI — a sign the full suite was not run locally.
Fix: move TEMPORAL_WORKER from noServe to serve in the spec, then run the full jest suite and reconcile the real count. The 6 new specs all pass; only this pre-existing one fails.

🟠 Required change

A1 — CAMPAIGN_WORKER: the fail-fast guard is a silent no-op (split-brain dispatch).
The journey-triggers Kafka consumer starts under shouldStartTemporalWorker() = {SINGLE, TEMPORAL_WORKER, CAMPAIGN_WORKER} (journey-trigger-processor.service.ts:124-127), but the poller's monitoring flag is set only under shouldStartJourneyWorker() = {SINGLE, TEMPORAL_WORKER} (app-factory.ts:57-64,75-81). On the hot path, isQueueUnexecutable() returns {unexecutable:false} immediately when !monitoring (journey-execution-poller.service.ts:170-173).

Both modes share the consumer group temporal-workers (:147), so when a TEMPORAL_WORKER and a CAMPAIGN_WORKER run concurrently, Kafka splits the journey-triggers partitions across them → the share landing on CAMPAIGN_WORKER is dispatched without the guard. During a real zero-poller window, that workflow sits unstarted until the 30d execution timeout with no app-side FAILED session (the Kafka path does not pre-create a row; the residue is an unstarted workflow that Temporal eventually marks TimedOut) — the exact silent stall the guard exists to eliminate, for that partition share.

describeTaskQueue is cluster-wide, so the fix is cheap — either (a) gate monitoring on shouldStartTemporalWorker() (same set that starts the consumer), or (b) gate the journey-triggers consumer subscription on shouldStartJourneyWorker() (cleaner — a campaign worker has no business consuming journey triggers). Please add a spec exercising triggerJourneyExecution with monitoring=false to lock the chosen behavior.

🟡 Non-blocking

M1 AC1 ships the two gauges + the /ready 503 but no committed Prometheus alert rule, runbook, or env-var docs. The infra alert rule is already deferred to EVO-1859 (noted), so AC1's "an alert fires" rests on the /ready 503 + gauge for now. Please document the 3 new env vars (TEMPORAL_QUEUE_POLL_INTERVAL_MS, TEMPORAL_ZERO_POLLER_SUSTAINED_MS, TEMPORAL_DISPATCH_GRACE_MS) and the 2 gauges in README/.env.example to match the repo's 1-rule+1-runbook-per-signal convention.
L1 AC1's schedule_to_start_latency arm is not implemented (only the zero-pollers arm). Defensible as met by the stronger arm (a workflow needs a WORKFLOW poller to run its first line) — please just confirm the partial coverage is intentional.
L2 / L3 No test covers the CAMPAIGN_WORKER cross-mode no-op (A1), and the guard wiring is tested only via {} as any constructors + overwritten privates (no DI/module test). The module graph resolves correctly (no cycle; ProcessingModule exports PrometheusMetrics); these are coverage nits.
L4 onModuleInit does await this.poll() with no connectTimeout (SDK default 10s) → a journey-worker cold start can block up to ~10s when Temporal is unreachable at boot. Cosmetic.

✅ Verified sound

Poller is a single DI singleton (no double-instantiation / no cycle); /ready + /metrics are reachable in TEMPORAL_WORKER (main.ts:472, @Public controllers, bare paths since the global prefix is skipped in worker mode); the IndicatorResult shape matches the controller's 503 aggregation; the stale ≠ down rule is correct (a Temporal outage is not read as "no worker"); the SINGLE exclusion from /ready is deliberate (the gauge still publishes); AC3 (no duplication of the EVO-1226 readiness surface) is a clean extension, not a parallel path; the manual startJourney path is correct (forceLive is independent of monitoring, and the caller throws BadRequestException on !started).

The Linear card is moved In Review → Todo. This PR stays open for the changes above.

…ttp spec (EVO-1764) Review fixes for PR #80: B1: app-factory.serve-http.spec kept TEMPORAL_WORKER in noServe while shouldServeHttp() now serves it (probe listener for /ready + /metrics) — the committed spec was red. Move it to the serve list. A1: the journey-triggers consumer gated on shouldStartTemporalWorker() (SINGLE, TEMPORAL_WORKER, CAMPAIGN_WORKER) while the fail-fast guard's queue-health poller only runs in shouldStartJourneyWorker() modes. A CAMPAIGN_WORKER sharing the temporal-workers consumer group would take its share of journey-triggers partitions and dispatch them with the guard permanently no-op (monitoring=false → executable) — a silent split-brain stall. Gate the consumer on shouldStartJourneyWorker(); the campaign worker has no reason to consume journey-triggers. getProcessorStatus isRunning mirrors the same gate. New spec locks the gating choice.

sourcery-ai Bot reviewed Jun 23, 2026

View reviewed changes

dpaes requested changes Jun 23, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(journeys): journey-execution queue-health observability + dispatch fail-fast (EVO-1764)#80

feat(journeys): journey-execution queue-health observability + dispatch fail-fast (EVO-1764)#80
daniloleonecarneiro wants to merge 2 commits into
developfrom
danilocarneiro/evo-1764-journey-runtime-alert-on-journey-execution-zero-pollers-high

daniloleonecarneiro commented Jun 23, 2026

Uh oh!

sourcery-ai Bot left a comment

Uh oh!

dpaes left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

daniloleonecarneiro commented Jun 23, 2026

Summary

Changes

Test plan

Notas

Uh oh!

sourcery-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

dpaes left a comment

Choose a reason for hiding this comment

Review — EVO-1764: journey-execution queue-health observability + dispatch fail-fast

🔴 Blocker

🟠 Required change

🟡 Non-blocking

✅ Verified sound

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants