
feat(url-inspector-refresh): worker handler + PostgREST client (LLMO-4563) #280

Draft: JayKid wants to merge 1 commit into main from feat/ongoing-refresh-strategy-for-url-LLMO-4563

JayKid commented May 18, 2026

Summary

Adds the task-processor worker side of the LLMO-4563 ongoing-refresh strategy: a new url-inspector-refresh handler that, given { type, siteId }, calls the per-site staleness RPC and refreshes any stale (site, month) slices via the existing wrpc_refresh_url_inspector_domain_stats. The companion every-30-min dispatcher (which feeds this queue) and the SQL migration land in separate PRs.

What's in here

| File | Purpose |
| --- | --- |
| src/utils/postgrest-client.js (new) | ~180-line fetch-based PostgREST client, single .rpc(name, params) method |
| src/tasks/url-inspector-refresh/handler.js (new) | The worker handler: staleness query + per-month refresh loop |
| src/index.js | Registers 'url-inspector-refresh': urlInspectorRefresh in HANDLERS |
| test/utils/postgrest-client.test.js (new) | 24 unit tests, 100% coverage on the client |
| test/tasks/url-inspector-refresh/url-inspector-refresh.test.js (new) | 16 unit tests, 100% coverage on the handler |

Design notes

Why a new mini PostgREST client instead of @adobe/spacecat-shared-data-access v3

The shared v3 package is activated by setting DATA_SERVICE_PROVIDER=postgres in the runtime env, but task-processor today is in DynamoDB mode for every other handler — disable-import-audit-processor, opportunity-status-processor, agent-executor, etc. all read dataAccess.Site/dataAccess.Configuration from DDB via src/support/data-access.js. Flipping the global flag would change all of them in one step.

postgrest-client.js lets a single handler opt into PostgREST without that side effect. It exposes the subset of the supabase-js shape that the api-service already uses (const { data, error } = await client.rpc(...)), so a future global migration to v3 is a one-line swap.
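
As a rough illustration of the shape described above, here is a minimal sketch of a fetch-based client with a single .rpc(name, params) method that resolves to { data, error } and never rejects. It is not the code in src/utils/postgrest-client.js: the factory name createPostgrestClient, the fetchImpl and timeoutMs options, and the shape of the error object are assumptions made for this example. The headers follow the commit message (Content-Profile, Accept-Profile, Authorization: Bearer).

```javascript
// Hypothetical sketch of a supabase-shaped PostgREST RPC client.
// createPostgrestClient, fetchImpl, and timeoutMs are illustrative names.
function createPostgrestClient({
  url,
  apiKey,
  schema = 'public',
  fetchImpl = globalThis.fetch,
  timeoutMs = 30000,
}) {
  const baseUrl = url.replace(/\/+$/, ''); // tolerate a trailing slash

  return {
    // Resolves to { data, error }; network, timeout, and parse failures
    // are reported through `error` instead of being thrown.
    async rpc(name, params = {}) {
      const controller = new AbortController();
      const timer = setTimeout(() => controller.abort(), timeoutMs);
      try {
        const response = await fetchImpl(`${baseUrl}/rpc/${name}`, {
          method: 'POST',
          headers: {
            'Content-Type': 'application/json',
            Authorization: `Bearer ${apiKey}`,
            'Content-Profile': schema,
            'Accept-Profile': schema,
          },
          body: JSON.stringify(params),
          signal: controller.signal,
        });
        const text = await response.text();
        const body = text ? JSON.parse(text) : null;
        if (!response.ok) {
          const message = body?.message ?? response.statusText;
          return { data: null, error: { status: response.status, message } };
        }
        return { data: body, error: null };
      } catch (e) {
        return { data: null, error: { status: 0, message: e.message } };
      } finally {
        clearTimeout(timer);
      }
    },
  };
}
```

A caller would then write const { data, error } = await client.rpc(...) and branch on error, which matches the calling convention quoted above.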

Failure model: no throws, no DLQ — leans on the next 30-min tick instead

The spacecat-task-processor-jobs queue runs with maxReceiveCount=1 (spacecat-infrastructure/modules/sqs/queues.tf:31). Combined with the fact that processTask in src/index.js catches all handler throws and returns internalServerError() (which from SQS's perspective is a successful Lambda invocation), there is no path by which a thrown error here would actually reach the DLQ.

So the handler is built to never throw to processTask:

  • Per-RPC retry: the staleness query and every per-month refresh call retry up to PER_RPC_ATTEMPTS=2 times in-handler with backoff.
  • Per-month isolation: a failed month is logged + counted + skipped; the loop continues with the next month. The failed month stays "stale" in url_inspector_domain_stats (the refresh RPC DELETEs before INSERT-ing under pg_advisory_xact_lock), so the next 30-min schedule tick will see it again and retry naturally.
  • Catastrophic staleness failure: same shape — log error, return ok({ stalenessFailed: true }), let the next tick re-detect and re-attempt.
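
The retry and isolation rules above can be sketched as follows. This is an illustration under assumptions, not the handler's code: rpcWithRetry, refreshStaleMonths, baseDelayMs, and the RPC parameter names p_site_id and p_month_start are hypothetical, and PER_RPC_ATTEMPTS is read here as the total number of attempts per call.

```javascript
// Assumption: 2 is the total number of attempts, not the number of retries.
const PER_RPC_ATTEMPTS = 2;

const sleepMs = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Calls client.rpc until it succeeds or the attempts run out.
// Always resolves to { data, error, attempts }; it never throws on RPC errors.
async function rpcWithRetry(client, name, params, {
  attempts = PER_RPC_ATTEMPTS,
  sleep = sleepMs,
  baseDelayMs = 500, // hypothetical backoff base
} = {}) {
  let last = { data: null, error: { message: 'not attempted' } };
  for (let attempt = 1; attempt <= attempts; attempt += 1) {
    last = await client.rpc(name, params);
    if (!last.error) return { ...last, attempts: attempt };
    if (attempt < attempts) await sleep(baseDelayMs * 2 ** (attempt - 1));
  }
  return { ...last, attempts };
}

// Refreshes each stale month independently: a failed month is counted and
// skipped, and the loop moves on to the next one.
async function refreshStaleMonths(client, siteId, months, options = {}) {
  const summary = { refreshed: 0, failed: 0, failedMonths: [] };
  for (const monthStart of months) {
    const { error } = await rpcWithRetry(
      client,
      'wrpc_refresh_url_inspector_domain_stats',
      { p_site_id: siteId, p_month_start: monthStart }, // hypothetical names
      options,
    );
    if (error) {
      summary.failed += 1;
      summary.failedMonths.push(monthStart);
    } else {
      summary.refreshed += 1;
    }
  }
  return summary;
}
```

Because a failed month is only counted, never rethrown, the caller can always return ok(...) with the summary and leave the retry to the next schedule tick.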

The whole pipeline is idempotent + self-healing: a site whose refresh fails falls behind by one 30-min tick before the next run catches it up. CloudWatch alarms on the structured per-month error log lines (see below) cover the "we silently stopped refreshing" failure mode.

Per-invocation budget

SQS visibility timeout on this queue is 900s. We cap wall time at PER_INVOCATION_BUDGET_MS = 12 min and defer remaining stale months to the next schedule tick. This bounds the worst case (very-stale site after a long outage) at (staleness RPC limit) × (next tick) instead of an unbounded queue-time runaway.
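
A minimal sketch of the budget check, assuming the deadline is tested before each month is started. processWithinBudget, refreshOne, and the injectable now clock are hypothetical names for this example. With this shape, 12 min is 720 s, which leaves 180 s of headroom under the 900 s visibility timeout for a refresh that is already in flight when the deadline passes.

```javascript
const PER_INVOCATION_BUDGET_MS = 12 * 60 * 1000; // 720 000 ms

// Processes months in order until the wall-time budget is spent, then
// reports the rest as deferred so the next schedule tick can pick them up.
async function processWithinBudget(months, refreshOne, {
  budgetMs = PER_INVOCATION_BUDGET_MS,
  now = Date.now, // injectable clock, for testing
} = {}) {
  const deadline = now() + budgetMs;
  const processed = [];
  for (let i = 0; i < months.length; i += 1) {
    if (now() >= deadline) {
      return { processed, deferred: months.slice(i) };
    }
    await refreshOne(months[i]);
    processed.push(months[i]);
  }
  return { processed, deferred: [] };
}
```

Deferred months need no bookkeeping of their own in this sketch: they are still stale in the table, so the staleness RPC reports them again on the next tick.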

Observability via structured log lines, not a CloudWatch SDK call

Each per-month outcome emits a single JSON-stringified log line:

{"event":"url-inspector-refresh.refresh","siteId":"...","month_start":"2026-04-01","status":"ok","attempts":1,"durationMs":2150}

A CloudWatch metric filter (next infra PR, tf_alarms todo) turns these into RefreshCalls{result=ok|error} counters + RefreshDurationMs distributions without an @aws-sdk/client-cloudwatch dep on the hot path. Same approach for the staleness-failed and dispatch-summary log lines.
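
For illustration, a per-month log line with the fields shown above could be emitted like this. logRefreshOutcome is a hypothetical helper name, and routing "ok" to log.info and everything else to log.error is an assumption; the description does not say which log level the handler uses.

```javascript
// Emits one JSON-stringified line per (site, month) outcome so that a
// metric filter can match on the "event" and "status" fields.
function logRefreshOutcome(log, {
  siteId, monthStart, status, attempts, durationMs,
}) {
  const line = JSON.stringify({
    event: 'url-inspector-refresh.refresh',
    siteId,
    month_start: monthStart,
    status,
    attempts,
    durationMs,
  });
  // Assumption: failures are logged at error level, successes at info.
  if (status === 'ok') {
    log.info(line);
  } else {
    log.error(line);
  }
  return line;
}
```

Keeping every field at the top level of a single JSON object means a filter can select on event and status and extract durationMs without parsing free text.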

Runtime config required

These env vars must be present in the task-processor Lambda for this handler to function. They are NOT currently set (task-processor has no /helix-deploy/spacecat-services/task-processor/latest secret at all — task-processor reads everything from Lambda env + the shared catalog via vaultSecrets). They will be provisioned in a separate ops step (tp_provision_secret):

| Var | Source | Why |
| --- | --- | --- |
| POSTGREST_URL | /helix-deploy/spacecat-services/all | Where to send /rpc/... POSTs |
| POSTGREST_API_KEY | /helix-deploy/spacecat-services/all (writer JWT) | Auth: wrpc_refresh_* requires postgrest_writer role |
| POSTGREST_SCHEMA | /helix-deploy/spacecat-services/all (default public) | Schema selector |

⚠️ Deliberately not setting DATA_SERVICE_PROVIDER=postgres — that would flip every existing handler to v3 PostgREST and is out of scope.
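
One way the handler could read this configuration is sketched below. readPostgrestConfig is a hypothetical helper, and reporting missing variables in the return value rather than throwing is an assumption chosen to match the handler's never-throw contract.

```javascript
// Reads the PostgREST trio from an env-like object.
// Returns { config, missing }; config is null when a required var is absent.
function readPostgrestConfig(env = process.env) {
  const required = ['POSTGREST_URL', 'POSTGREST_API_KEY'];
  const missing = required.filter((name) => !env[name]);
  if (missing.length > 0) {
    return { config: null, missing };
  }
  return {
    config: {
      url: env.POSTGREST_URL,
      apiKey: env.POSTGREST_API_KEY,
      schema: env.POSTGREST_SCHEMA || 'public', // default schema per the table
    },
    missing: [],
  };
}
```

Until the secret is provisioned, a handler built this way would log the missing names and return without calling PostgREST.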

Test plan

  • npm run lint clean
  • npm test — 482 passing (up from 466), 100/100/100/100 coverage on both new files
  • Probe invocation in dev with a synthetic SQS payload { type: 'url-inspector-refresh', siteId: '9ae8877a-bbf3-407d-9adb-d6a72ce3c5e3' } once the dev secret is provisioned + the SQL migration is applied → verify a successful run end-to-end against real Aurora
  • CI deploy-dev succeeds
  • Post-deploy: trigger a synthetic BP ingest on adobe.com and verify <30min freshness on url_inspector_domain_stats

Out of scope (follow-ups, in separate PRs)

  • Every-30-min dispatcher in spacecat-jobs-dispatcher that fans this queue out per-site
  • aws_scheduler_schedule (every-30-min cron) + task-processor secret shell in spacecat-infrastructure (adobe/spacecat-infrastructure#531)
  • New secrets/secrets.env for task-processor + npm run deploy-secrets to expose the PostgREST trio above
  • CloudWatch metric filters + alarms on the structured log lines (DispatchRuns, RefreshFailures, StalenessQueryFailures, RefreshSuccesses) (adobe/spacecat-infrastructure#531)
  • Seeding feature_flags(product=LLMO, flag_name=url_inspector_pg, flag_value=true) for the adobe.com pilot org


…4563)

Implements the task-processor side of LLMO-4563 Strategy B: a new
url-inspector-refresh handler that, given { type, siteId }, calls the per-site
staleness RPC and refreshes any stale (site, month) slices via the existing
wrpc_refresh_url_inspector_domain_stats. The companion dispatcher (every-30-min
fan-out from spacecat-jobs-dispatcher) and SQL migration land in separate PRs.

Depends on the staleness RPC introduced in
adobe/mysticat-data-service#611 (rpc_url_inspector_stale_slices_for_site).

src/utils/postgrest-client.js (new): minimal ~180-line fetch-based client with
a single .rpc(name, params) method. Supabase-shaped { data, error } return,
Content-Profile / Accept-Profile / Authorization: Bearer headers, per-request
AbortController with composable external signal, never-throws contract. Built
this instead of importing @adobe/spacecat-shared-data-access v3 because
flipping DATA_SERVICE_PROVIDER=postgres would change every other handler's
data backend from DynamoDB to PostgREST.

src/tasks/url-inspector-refresh/handler.js (new): validates siteId UUID,
queries staleness with 2x retry, loops stale months with per-month isolation
and 12-min wall-time budget (SQS visibility timeout is 900s), and emits one
structured log line per outcome for downstream CloudWatch metric filters. Does
NOT throw on per-month failures or on staleness errors: the queue runs
maxReceiveCount=1 and processTask in src/index.js swallows handler throws
anyway, so throwing would not produce DLQ messages. Instead, failed months
stay stale in the DB and the next 30-min schedule tick retries them
naturally — leaning on the per-site advisory lock in
wrpc_refresh_url_inspector_domain_stats for idempotency.

src/index.js: registers url-inspector-refresh in HANDLERS.

Tests: unit tests for the client (24 tests) and handler (16 tests) cover
happy path, retry-then-success, retry-exhausted (no throw), per-month failure
isolation, budget-exhausted deferral, structured log line emission, and the
default-sleep fallback. 100% statements / branches / functions / lines on
both new files; full task-processor suite up from 466 to 482 passing.
JayKid temporarily deployed to dev-branches May 18, 2026 14:31 with GitHub Actions (Inactive)

codecov Bot commented May 18, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

