feat(journeys): journey-execution queue-health observability + dispatch fail-fast (EVO-1764)#80
Conversation
…ch fail-fast (EVO-1764)
Surface the "no executor" condition on the journey-execution Temporal task
queue and fail triggered-but-unexecutable journeys fast, instead of letting
them sit unstarted until the 30d execution timeout.
- Poller (JourneyExecutionPollerService): periodic describeTaskQueue
(WORKFLOW + ACTIVITY) tracking sustained-zero WORKFLOW pollers — the single
source of truth for the readiness indicator, the dispatch guard, and the
Prometheus gauges. Lives in a lean TemporalQueueHealthModule (no heavy,
conditional TemporalModule; no DI cycle). isQueueUnexecutable is cached-first
(no per-dispatch RPC on the hot Kafka path; only confirms live when the
background poller already suspects zero) with a forceLive mode for the manual
path; poll() de-duplicates in-flight calls so the interval and on-demand
callers never race the shared connection. Stale (Temporal unreachable) is
never read as "no worker".
- Readiness: TemporalTaskQueueIndicator on /ready for temporal-worker only
(sustained zero WORKFLOW pollers -> 503). shouldServeHttp opts the dedicated
worker into the probe listener so /ready + /metrics are reachable there.
SINGLE deliberately excludes the indicator (keeps /metrics gauges) so a
journey-queue dip cannot 503 the co-located API.
- Metrics: evo_temporal_task_queue_pollers{task_queue,poller_type} and
evo_temporal_task_queue_zero_poller_seconds{task_queue} gauges.
- Fail-fast guard on both dispatch paths (Kafka trigger processor + manual
startJourney): when the queue has no executor, terminate the just-started
workflow and persist a failed session (an upsert via set(), not the
update-only status path) so the journey is visible and auditable instead of a
phantom ACTIVE that silently blocks every future trigger for that contact.
The Kafka path commits the offset (no retry storm); the manual path returns
started:false so the API surfaces the error.
- Centralize the journey-execution task-queue name behind TEMPORAL_TASK_QUEUES.
Tests: poller, indicator, and dispatch-guard specs (both paths); tsc -b clean.
There was a problem hiding this comment.
Sorry @daniloleonecarneiro, you have reached your weekly rate limit of 500000 diff characters.
Please try again later or upgrade to continue using Sourcery
dpaes
left a comment
There was a problem hiding this comment.
Review — EVO-1764: journey-execution queue-health observability + dispatch fail-fast
Verdict: Request changes. The branch is clean/mergeable on develop and the design is sound, but the PR ships a deterministically-failing committed test and a dispatch-guard coverage gap that must be fixed before merge.
Local verification (this repo's CI runs neither jest nor tsc -b, so neither is gated):
tsc -b --noEmit→ clean (exit 0).jeston the touched files → 1 failed / 51 passed (see B1). The "94 tests green" claim does not hold.
🔴 Blocker
B1 — A committed test is RED, and it is the canonical guard for the exact behavior this PR changes.
src/app-factory.ts:35 adds RunMode.TEMPORAL_WORKER to shouldServeHttp() (correct & intentional — the dedicated worker must expose /ready + /metrics). But src/app-factory.serve-http.spec.ts:29 still lists TEMPORAL_WORKER under noServe, and :37-38 asserts shouldServeHttp() === false:
FAIL src/app-factory.serve-http.spec.ts
● AppFactory.shouldServeHttp › temporal-worker stays listener-less
Expected: false Received: true
Tests: 1 failed, 51 passed, 52 total
Not inherited from develop (the PR touched only the source, not the spec; branch is 0 commits behind; the spec passed on develop). Because CI does not run jest, the red is invisible in CI — a sign the full suite was not run locally.
Fix: move TEMPORAL_WORKER from noServe to serve in the spec, then run the full jest suite and reconcile the real count. The 6 new specs all pass; only this pre-existing one fails.
🟠 Required change
A1 — CAMPAIGN_WORKER: the fail-fast guard is a silent no-op (split-brain dispatch).
The journey-triggers Kafka consumer starts under shouldStartTemporalWorker() = {SINGLE, TEMPORAL_WORKER, CAMPAIGN_WORKER} (journey-trigger-processor.service.ts:124-127), but the poller's monitoring flag is set only under shouldStartJourneyWorker() = {SINGLE, TEMPORAL_WORKER} (app-factory.ts:57-64,75-81). On the hot path, isQueueUnexecutable() returns {unexecutable:false} immediately when !monitoring (journey-execution-poller.service.ts:170-173).
Both modes share the consumer group temporal-workers (:147), so when a TEMPORAL_WORKER and a CAMPAIGN_WORKER run concurrently, Kafka splits the journey-triggers partitions across them → the share landing on CAMPAIGN_WORKER is dispatched without the guard. During a real zero-poller window, that workflow sits unstarted until the 30d execution timeout with no app-side FAILED session (the Kafka path does not pre-create a row; the residue is an unstarted workflow that Temporal eventually marks TimedOut) — the exact silent stall the guard exists to eliminate, for that partition share.
describeTaskQueue is cluster-wide, so the fix is cheap — either (a) gate monitoring on shouldStartTemporalWorker() (same set that starts the consumer), or (b) gate the journey-triggers consumer subscription on shouldStartJourneyWorker() (cleaner — a campaign worker has no business consuming journey triggers). Please add a spec exercising triggerJourneyExecution with monitoring=false to lock the chosen behavior.
🟡 Non-blocking
- M1 AC1 ships the two gauges + the
/ready503 but no committed Prometheus alert rule, runbook, or env-var docs. The infra alert rule is already deferred to EVO-1859 (noted), so AC1's "an alert fires" rests on the/ready503 + gauge for now. Please document the 3 new env vars (TEMPORAL_QUEUE_POLL_INTERVAL_MS,TEMPORAL_ZERO_POLLER_SUSTAINED_MS,TEMPORAL_DISPATCH_GRACE_MS) and the 2 gauges in README/.env.example to match the repo's 1-rule+1-runbook-per-signal convention. - L1 AC1's
schedule_to_start_latencyarm is not implemented (only the zero-pollers arm). Defensible as met by the stronger arm (a workflow needs a WORKFLOW poller to run its first line) — please just confirm the partial coverage is intentional. - L2 / L3 No test covers the CAMPAIGN_WORKER cross-mode no-op (A1), and the guard wiring is tested only via
{} as anyconstructors + overwritten privates (no DI/module test). The module graph resolves correctly (no cycle;ProcessingModuleexportsPrometheusMetrics); these are coverage nits. - L4
onModuleInitdoesawait this.poll()with noconnectTimeout(SDK default 10s) → a journey-worker cold start can block up to ~10s when Temporal is unreachable at boot. Cosmetic.
✅ Verified sound
Poller is a single DI singleton (no double-instantiation / no cycle); /ready + /metrics are reachable in TEMPORAL_WORKER (main.ts:472, @Public controllers, bare paths since the global prefix is skipped in worker mode); the IndicatorResult shape matches the controller's 503 aggregation; the stale ≠ down rule is correct (a Temporal outage is not read as "no worker"); the SINGLE exclusion from /ready is deliberate (the gauge still publishes); AC3 (no duplication of the EVO-1226 readiness surface) is a clean extension, not a parallel path; the manual startJourney path is correct (forceLive is independent of monitoring, and the caller throws BadRequestException on !started).
The Linear card is moved In Review → Todo. This PR stays open for the changes above.
…ttp spec (EVO-1764) Review fixes for PR #80: B1: app-factory.serve-http.spec kept TEMPORAL_WORKER in noServe while shouldServeHttp() now serves it (probe listener for /ready + /metrics) — the committed spec was red. Move it to the serve list. A1: the journey-triggers consumer gated on shouldStartTemporalWorker() (SINGLE, TEMPORAL_WORKER, CAMPAIGN_WORKER) while the fail-fast guard's queue-health poller only runs in shouldStartJourneyWorker() modes. A CAMPAIGN_WORKER sharing the temporal-workers consumer group would take its share of journey-triggers partitions and dispatch them with the guard permanently no-op (monitoring=false → executable) — a silent split-brain stall. Gate the consumer on shouldStartJourneyWorker(); the campaign worker has no reason to consume journey-triggers. getProcessorStatus isRunning mirrors the same gate. New spec locks the gating choice.
Summary
When the
journey-executionTemporal task queue has no executor (worker not deployed,RUN_MODEmisconfigured, queue-name drift), triggered journeys silently sit unstarted until the 30dworkflowExecutionTimeout— with no signal and no guard. Split from EVO-1758 (which owns the worker reconnection fix; the worker already auto-recovers from transient Temporal restarts). This card adds the detection signal and the fail-fast guard, recording the fail-fast-vs-silent-timeout decision.The real target is a genuinely worker-less queue, not a benign restart (which self-heals), so both the indicator and the guard gate on sustained zero pollers, and the guard tolerates the reconnect window.
Changes
Poller — single source of truth (
temporal/services/journey-execution-poller.service.ts, new):describeTaskQueue(WORKFLOW + ACTIVITY poller types). A workflow needs a WORKFLOW poller to run its first line, so that count is authoritative. Tracks azeroSinceclock;stale(Temporal unreachable) is never read as "no worker".TemporalQueueHealthModule(imports onlyProcessingModule), imported by both the always-onHealthModuleandJourneysModule→ one shared singleton, no heavy/conditionalTemporalModule, no DI cycle.isQueueUnexecutableis cached-first: trusts the background snapshot and only confirms live when it already suspects zero (zero RPC on a healthy dispatch); no-op when not monitoring (e.g. campaign worker). AforceLivemode serves the low-frequency manual path.poll()de-duplicates in-flight calls so the interval + on-demand callers never race the shared connection.Readiness + metrics:
TemporalTaskQueueIndicatoron/readyfor temporal-worker only (sustained zero → 503).shouldServeHttpopts the dedicated worker into the probe listener so/ready+/metricsare reachable there (it was listener-less).singleis intentionally excluded so a journey-queue dip cannot 503 the co-located API; it still gets the gauges.evo_temporal_task_queue_pollers{task_queue,poller_type}andevo_temporal_task_queue_zero_poller_seconds{task_queue}.Fail-fast guard — both dispatch paths:
journey-trigger-processor) and manualstartJourney: when the queue has no executor, terminate the just-started workflow and persist a failed session (an upsert viaset(), not the update-only status path — the row does not exist yet on the trigger path) so the journey is visible/auditable instead of a phantom ACTIVE that would permanently block future triggers. Kafka path commits the offset (no retry storm); manual path returnsstarted:false.Cleanup: centralize the
journey-executionliteral behindTEMPORAL_TASK_QUEUES.Test plan
npx jest src/modules/temporal/services/journey-execution-poller.service.spec.ts src/health src/modules/journeys/services— 94 tests pass (new: poller state machine — sustained-zero/recovery/stale/forceLive/cached-first/in-flight; indicator up/down/never-throws; dispatch guard on both Kafka and manual paths; per-mode indicator gating).tsc -bclean (project refs)./metricsgauge > 0; scale journey worker to 0 → after 60s the gauge climbs and a triggered journey is markedfailed(notactive), workflow terminated; restore worker → new triggers dispatch.Notas
30d(workflowExecutionTimeout). We surface fast via the guard rather than shorten the global timeout (legit long waited journeys rely on it).'active'status is update-only/best-effort; whether/readyshould also reflect Temporal connectivity; the infra alert-rule.Closes EVO-1764.