Skip to content

feat(journeys): journey-execution queue-health observability + dispatch fail-fast (EVO-1764)#80

Open
daniloleonecarneiro wants to merge 2 commits into
developfrom
danilocarneiro/evo-1764-journey-runtime-alert-on-journey-execution-zero-pollers-high
Open

feat(journeys): journey-execution queue-health observability + dispatch fail-fast (EVO-1764)#80
daniloleonecarneiro wants to merge 2 commits into
developfrom
danilocarneiro/evo-1764-journey-runtime-alert-on-journey-execution-zero-pollers-high

Conversation

@daniloleonecarneiro

Copy link
Copy Markdown

Summary

When the journey-execution Temporal task queue has no executor (worker not deployed, RUN_MODE misconfigured, queue-name drift), triggered journeys silently sit unstarted until the 30d workflowExecutionTimeout — with no signal and no guard. Split from EVO-1758 (which owns the worker reconnection fix; the worker already auto-recovers from transient Temporal restarts). This card adds the detection signal and the fail-fast guard, recording the fail-fast-vs-silent-timeout decision.

The real target is a genuinely worker-less queue, not a benign restart (which self-heals), so both the indicator and the guard gate on sustained zero pollers, and the guard tolerates the reconnect window.

Changes

Poller — single source of truth (temporal/services/journey-execution-poller.service.ts, new):

  • Periodic describeTaskQueue (WORKFLOW + ACTIVITY poller types). A workflow needs a WORKFLOW poller to run its first line, so that count is authoritative. Tracks a zeroSince clock; stale (Temporal unreachable) is never read as "no worker".
  • Lives in a lean TemporalQueueHealthModule (imports only ProcessingModule), imported by both the always-on HealthModule and JourneysModule → one shared singleton, no heavy/conditional TemporalModule, no DI cycle.
  • isQueueUnexecutable is cached-first: trusts the background snapshot and only confirms live when it already suspects zero (zero RPC on a healthy dispatch); no-op when not monitoring (e.g. campaign worker). A forceLive mode serves the low-frequency manual path. poll() de-duplicates in-flight calls so the interval + on-demand callers never race the shared connection.

Readiness + metrics:

  • TemporalTaskQueueIndicator on /ready for temporal-worker only (sustained zero → 503). shouldServeHttp opts the dedicated worker into the probe listener so /ready + /metrics are reachable there (it was listener-less). single is intentionally excluded so a journey-queue dip cannot 503 the co-located API; it still gets the gauges.
  • Gauges evo_temporal_task_queue_pollers{task_queue,poller_type} and evo_temporal_task_queue_zero_poller_seconds{task_queue}.

Fail-fast guard — both dispatch paths:

  • Kafka trigger (journey-trigger-processor) and manual startJourney: when the queue has no executor, terminate the just-started workflow and persist a failed session (an upsert via set(), not the update-only status path — the row does not exist yet on the trigger path) so the journey is visible/auditable instead of a phantom ACTIVE that would permanently block future triggers. Kafka path commits the offset (no retry storm); manual path returns started:false.

Cleanup: centralize the journey-execution literal behind TEMPORAL_TASK_QUEUES.

Test plan

  • npx jest src/modules/temporal/services/journey-execution-poller.service.spec.ts src/health src/modules/journeys/services94 tests pass (new: poller state machine — sustained-zero/recovery/stale/forceLive/cached-first/in-flight; indicator up/down/never-throws; dispatch guard on both Kafka and manual paths; per-mode indicator gating).
  • tsc -b clean (project refs).
  • Config (env-overridable): poll 15s / sustained-zero 60s / dispatch-grace 45s.
  • Manual smoke (SINGLE): boot → /metrics gauge > 0; scale journey worker to 0 → after 60s the gauge climbs and a triggered journey is marked failed (not active), workflow terminated; restore worker → new triggers dispatch.

Notas

  • Card said 24h, code is 30d (workflowExecutionTimeout). We surface fast via the guard rather than shorten the global timeout (legit long waited journeys rely on it).
  • Worker reconnection/watchdog is out of scope → EVO-1758. Infra-side Prometheus/Grafana alert-rule authoring on the new gauges/503 is an infra handoff.
  • Accepted residual risk (timing-dependent, narrowed by the cached-first confirm): a saturated/rolling-restart worker reporting zero pollers > grace could drop a real journey without retry on the Kafka path.
  • Follow-ups in EVO-1859: pre-existing dispatch-side 'active' status is update-only/best-effort; whether /ready should also reflect Temporal connectivity; the infra alert-rule.

Closes EVO-1764.

…ch fail-fast (EVO-1764)

Surface the "no executor" condition on the journey-execution Temporal task
queue and fail triggered-but-unexecutable journeys fast, instead of letting
them sit unstarted until the 30d execution timeout.

- Poller (JourneyExecutionPollerService): periodic describeTaskQueue
  (WORKFLOW + ACTIVITY) tracking sustained-zero WORKFLOW pollers — the single
  source of truth for the readiness indicator, the dispatch guard, and the
  Prometheus gauges. Lives in a lean TemporalQueueHealthModule (no heavy,
  conditional TemporalModule; no DI cycle). isQueueUnexecutable is cached-first
  (no per-dispatch RPC on the hot Kafka path; only confirms live when the
  background poller already suspects zero) with a forceLive mode for the manual
  path; poll() de-duplicates in-flight calls so the interval and on-demand
  callers never race the shared connection. Stale (Temporal unreachable) is
  never read as "no worker".

- Readiness: TemporalTaskQueueIndicator on /ready for temporal-worker only
  (sustained zero WORKFLOW pollers -> 503). shouldServeHttp opts the dedicated
  worker into the probe listener so /ready + /metrics are reachable there.
  SINGLE deliberately excludes the indicator (keeps /metrics gauges) so a
  journey-queue dip cannot 503 the co-located API.

- Metrics: evo_temporal_task_queue_pollers{task_queue,poller_type} and
  evo_temporal_task_queue_zero_poller_seconds{task_queue} gauges.

- Fail-fast guard on both dispatch paths (Kafka trigger processor + manual
  startJourney): when the queue has no executor, terminate the just-started
  workflow and persist a failed session (an upsert via set(), not the
  update-only status path) so the journey is visible and auditable instead of a
  phantom ACTIVE that silently blocks every future trigger for that contact.
  The Kafka path commits the offset (no retry storm); the manual path returns
  started:false so the API surfaces the error.

- Centralize the journey-execution task-queue name behind TEMPORAL_TASK_QUEUES.

Tests: poller, indicator, and dispatch-guard specs (both paths); tsc -b clean.

@sourcery-ai sourcery-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @daniloleonecarneiro, you have reached your weekly rate limit of 500000 diff characters.

Please try again later or upgrade to continue using Sourcery

@dpaes dpaes left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — EVO-1764: journey-execution queue-health observability + dispatch fail-fast

Verdict: Request changes. The branch is clean/mergeable on develop and the design is sound, but the PR ships a deterministically-failing committed test and a dispatch-guard coverage gap that must be fixed before merge.

Local verification (this repo's CI runs neither jest nor tsc -b, so neither is gated):

  • tsc -b --noEmitclean (exit 0).
  • jest on the touched files → 1 failed / 51 passed (see B1). The "94 tests green" claim does not hold.

🔴 Blocker

B1 — A committed test is RED, and it is the canonical guard for the exact behavior this PR changes.
src/app-factory.ts:35 adds RunMode.TEMPORAL_WORKER to shouldServeHttp() (correct & intentional — the dedicated worker must expose /ready + /metrics). But src/app-factory.serve-http.spec.ts:29 still lists TEMPORAL_WORKER under noServe, and :37-38 asserts shouldServeHttp() === false:

FAIL src/app-factory.serve-http.spec.ts
  ● AppFactory.shouldServeHttp › temporal-worker stays listener-less
    Expected: false  Received: true
Tests: 1 failed, 51 passed, 52 total

Not inherited from develop (the PR touched only the source, not the spec; branch is 0 commits behind; the spec passed on develop). Because CI does not run jest, the red is invisible in CI — a sign the full suite was not run locally.
Fix: move TEMPORAL_WORKER from noServe to serve in the spec, then run the full jest suite and reconcile the real count. The 6 new specs all pass; only this pre-existing one fails.

🟠 Required change

A1 — CAMPAIGN_WORKER: the fail-fast guard is a silent no-op (split-brain dispatch).
The journey-triggers Kafka consumer starts under shouldStartTemporalWorker() = {SINGLE, TEMPORAL_WORKER, CAMPAIGN_WORKER} (journey-trigger-processor.service.ts:124-127), but the poller's monitoring flag is set only under shouldStartJourneyWorker() = {SINGLE, TEMPORAL_WORKER} (app-factory.ts:57-64,75-81). On the hot path, isQueueUnexecutable() returns {unexecutable:false} immediately when !monitoring (journey-execution-poller.service.ts:170-173).

Both modes share the consumer group temporal-workers (:147), so when a TEMPORAL_WORKER and a CAMPAIGN_WORKER run concurrently, Kafka splits the journey-triggers partitions across them → the share landing on CAMPAIGN_WORKER is dispatched without the guard. During a real zero-poller window, that workflow sits unstarted until the 30d execution timeout with no app-side FAILED session (the Kafka path does not pre-create a row; the residue is an unstarted workflow that Temporal eventually marks TimedOut) — the exact silent stall the guard exists to eliminate, for that partition share.

describeTaskQueue is cluster-wide, so the fix is cheap — either (a) gate monitoring on shouldStartTemporalWorker() (same set that starts the consumer), or (b) gate the journey-triggers consumer subscription on shouldStartJourneyWorker() (cleaner — a campaign worker has no business consuming journey triggers). Please add a spec exercising triggerJourneyExecution with monitoring=false to lock the chosen behavior.

🟡 Non-blocking

  • M1 AC1 ships the two gauges + the /ready 503 but no committed Prometheus alert rule, runbook, or env-var docs. The infra alert rule is already deferred to EVO-1859 (noted), so AC1's "an alert fires" rests on the /ready 503 + gauge for now. Please document the 3 new env vars (TEMPORAL_QUEUE_POLL_INTERVAL_MS, TEMPORAL_ZERO_POLLER_SUSTAINED_MS, TEMPORAL_DISPATCH_GRACE_MS) and the 2 gauges in README/.env.example to match the repo's 1-rule+1-runbook-per-signal convention.
  • L1 AC1's schedule_to_start_latency arm is not implemented (only the zero-pollers arm). Defensible as met by the stronger arm (a workflow needs a WORKFLOW poller to run its first line) — please just confirm the partial coverage is intentional.
  • L2 / L3 No test covers the CAMPAIGN_WORKER cross-mode no-op (A1), and the guard wiring is tested only via {} as any constructors + overwritten privates (no DI/module test). The module graph resolves correctly (no cycle; ProcessingModule exports PrometheusMetrics); these are coverage nits.
  • L4 onModuleInit does await this.poll() with no connectTimeout (SDK default 10s) → a journey-worker cold start can block up to ~10s when Temporal is unreachable at boot. Cosmetic.

✅ Verified sound

Poller is a single DI singleton (no double-instantiation / no cycle); /ready + /metrics are reachable in TEMPORAL_WORKER (main.ts:472, @Public controllers, bare paths since the global prefix is skipped in worker mode); the IndicatorResult shape matches the controller's 503 aggregation; the stale ≠ down rule is correct (a Temporal outage is not read as "no worker"); the SINGLE exclusion from /ready is deliberate (the gauge still publishes); AC3 (no duplication of the EVO-1226 readiness surface) is a clean extension, not a parallel path; the manual startJourney path is correct (forceLive is independent of monitoring, and the caller throws BadRequestException on !started).


The Linear card is moved In Review → Todo. This PR stays open for the changes above.

…ttp spec (EVO-1764)

Review fixes for PR #80:

B1: app-factory.serve-http.spec kept TEMPORAL_WORKER in noServe while
shouldServeHttp() now serves it (probe listener for /ready + /metrics) —
the committed spec was red. Move it to the serve list.

A1: the journey-triggers consumer gated on shouldStartTemporalWorker()
(SINGLE, TEMPORAL_WORKER, CAMPAIGN_WORKER) while the fail-fast guard's
queue-health poller only runs in shouldStartJourneyWorker() modes. A
CAMPAIGN_WORKER sharing the temporal-workers consumer group would take
its share of journey-triggers partitions and dispatch them with the
guard permanently no-op (monitoring=false → executable) — a silent
split-brain stall. Gate the consumer on shouldStartJourneyWorker(); the
campaign worker has no reason to consume journey-triggers. getProcessorStatus
isRunning mirrors the same gate. New spec locks the gating choice.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants