feat(url-inspector-refresh): worker handler + PostgREST client (LLMO-4563) #280
Draft
JayKid wants to merge 1 commit into
Implements the task-processor side of LLMO-4563 Strategy B: a new
url-inspector-refresh handler that, given { type, siteId }, calls the per-site
staleness RPC and refreshes any stale (site, month) slices via the existing
wrpc_refresh_url_inspector_domain_stats. The companion dispatcher (every-30-min
fan-out from spacecat-jobs-dispatcher) and SQL migration land in separate PRs.
Depends on the staleness RPC introduced in
adobe/mysticat-data-service#611 (rpc_url_inspector_stale_slices_for_site).
src/utils/postgrest-client.js (new): minimal ~180-line fetch-based client with
a single .rpc(name, params) method. Supabase-shaped { data, error } return,
Content-Profile / Accept-Profile / Authorization: Bearer headers, per-request
AbortController with composable external signal, never-throws contract. Built
this instead of importing @adobe/spacecat-shared-data-access v3 because
flipping DATA_SERVICE_PROVIDER=postgres would change every other handler's
data backend from DynamoDB to PostgREST.
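A minimal sketch of what a client with this contract could look like, assuming Node 18+ globals (fetch, AbortController). The factory name createPostgrestClient, the option names, and the error object shape are illustrative assumptions, not the code in this PR:

```javascript
// Hypothetical sketch of a fetch-based PostgREST RPC client.
// Option names, defaults, and the error shape are assumptions for illustration.
function createPostgrestClient({
  url, apiKey, schema = 'public', timeoutMs = 30000, fetchImpl = fetch,
}) {
  return {
    // Resolves to { data, error }; never rejects.
    async rpc(name, params = {}, { signal } = {}) {
      // Per-request controller, aborted on timeout or by the external signal.
      const controller = new AbortController();
      const timer = setTimeout(() => controller.abort(new Error('timeout')), timeoutMs);
      const onAbort = () => controller.abort(signal.reason);
      if (signal) {
        if (signal.aborted) onAbort();
        else signal.addEventListener('abort', onAbort, { once: true });
      }
      try {
        const res = await fetchImpl(`${url}/rpc/${name}`, {
          method: 'POST',
          headers: {
            'Content-Type': 'application/json',
            'Content-Profile': schema,
            'Accept-Profile': schema,
            Authorization: `Bearer ${apiKey}`,
          },
          body: JSON.stringify(params),
          signal: controller.signal,
        });
        const text = await res.text();
        const body = text ? JSON.parse(text) : null;
        if (!res.ok) {
          const message = body?.message ?? res.statusText;
          return { data: null, error: { status: res.status, message } };
        }
        return { data: body, error: null };
      } catch (e) {
        // Network failures, aborts, and JSON parse errors all land here.
        return { data: null, error: { status: null, message: String(e?.message ?? e) } };
      } finally {
        clearTimeout(timer);
        signal?.removeEventListener('abort', onAbort);
      }
    },
  };
}
```

Callers would then use the same destructuring as with supabase-js, for example const { data, error } = await client.rpc(name, params), and branch on error instead of wrapping the call in try/catch.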
src/tasks/url-inspector-refresh/handler.js (new): validates siteId UUID,
queries staleness with 2x retry, loops stale months with per-month isolation
and 12-min wall-time budget (SQS visibility timeout is 900s), and emits one
structured log line per outcome for downstream CloudWatch metric filters. Does
NOT throw on per-month failures or on staleness errors: the queue runs
maxReceiveCount=1 and processTask in src/index.js swallows handler throws
anyway, so throwing would not produce DLQ messages. Instead, failed months
stay stale in the DB and the next 30-min schedule tick retries them
naturally — leaning on the per-site advisory lock in
wrpc_refresh_url_inspector_domain_stats for idempotency.
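The per-month loop described above can be sketched roughly as follows. This is an illustration under assumptions, not the handler in this PR: refreshStaleMonths, the injected refreshMonth callback, the summary shape, and the injectable clock are hypothetical, and the log line carries only a subset of the fields the real handler emits.

```javascript
// Hypothetical sketch: per-month isolation, a wall-time budget, and one
// structured log line per outcome. Names and return shape are assumptions.
const PER_INVOCATION_BUDGET_MS = 12 * 60 * 1000; // 12 min, below the 900 s visibility timeout

async function refreshStaleMonths({
  siteId, months, refreshMonth, log, now = Date.now,
}) {
  const startedAt = now();
  const summary = { refreshed: 0, failed: 0, deferred: 0 };
  for (const [i, monthStart] of months.entries()) {
    // Budget check: leave the remaining months stale for the next tick.
    if (now() - startedAt >= PER_INVOCATION_BUDGET_MS) {
      summary.deferred = months.length - i;
      break;
    }
    const t0 = now();
    let status = 'ok';
    try {
      // refreshMonth is assumed to resolve to { data, error }.
      const { error } = await refreshMonth(siteId, monthStart);
      if (error) status = 'error';
    } catch (e) {
      // Isolation: one failing month must not abort the others.
      status = 'error';
    }
    if (status === 'ok') summary.refreshed += 1;
    else summary.failed += 1;
    log(JSON.stringify({
      event: 'url-inspector-refresh.refresh',
      siteId,
      month_start: monthStart,
      status,
      durationMs: now() - t0,
    }));
  }
  return summary;
}
```

Returning a summary instead of throwing keeps the invocation successful from the queue's point of view, and the stale rows themselves act as the retry queue for the next tick.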
src/index.js: registers url-inspector-refresh in HANDLERS.
Tests: unit tests for the client (24 tests) and handler (16 tests) cover
happy path, retry-then-success, retry-exhausted (no throw), per-month failure
isolation, budget-exhausted deferral, structured log line emission, and the
default-sleep fallback. 100% statements / branches / functions / lines on
both new files; full task-processor suite up from 466 to 482 passing.
Codecov Report: All modified and coverable lines are covered by tests.
Summary
Adds the task-processor worker side of the LLMO-4563 ongoing-refresh strategy: a new url-inspector-refresh handler that, given { type, siteId }, calls the per-site staleness RPC and refreshes any stale (site, month) slices via the existing wrpc_refresh_url_inspector_domain_stats. The companion every-30-min dispatcher (which feeds this queue) and the SQL migration land in separate PRs:
- SQL migration (rpc_url_inspector_stale_slices_for_site + supporting index + dev runbook): adobe/mysticat-data-service#611. It must merge first, or this handler 404s on the staleness RPC.
What's in here
- src/utils/postgrest-client.js (new): fetch-based PostgREST client, single .rpc(name, params) method
- src/tasks/url-inspector-refresh/handler.js (new)
- src/index.js: 'url-inspector-refresh': urlInspectorRefresh in HANDLERS
- test/utils/postgrest-client.test.js (new)
- test/tasks/url-inspector-refresh/url-inspector-refresh.test.js (new)
Design notes
Why a new mini PostgREST client instead of @adobe/spacecat-shared-data-access v3
The shared v3 package is activated by setting DATA_SERVICE_PROVIDER=postgres in the runtime env, but task-processor today is in DynamoDB mode for every other handler: disable-import-audit-processor, opportunity-status-processor, agent-executor, etc. all read dataAccess.Site / dataAccess.Configuration from DDB via src/support/data-access.js. Flipping the global flag would change all of them in one step.
postgrest-client.js lets a single handler opt into PostgREST without that side effect. It exposes the subset of the supabase-js shape that the api-service already uses (const { data, error } = await client.rpc(...)), so a future global migration to v3 is a one-line swap.
Failure model: no throws, no DLQ (leans on the next 30-min tick instead)
The spacecat-task-processor-jobs queue runs with maxReceiveCount=1 (spacecat-infrastructure/modules/sqs/queues.tf:31). Combined with the fact that processTask in src/index.js catches all handler throws and returns internalServerError() (which from SQS's perspective is a successful Lambda invocation), there is no path by which a thrown error here would actually reach the DLQ.
So the handler is built to never throw to processTask:
- Each RPC is attempted up to PER_RPC_ATTEMPTS=2 times in-handler with backoff.
- A month whose refresh fails stays stale in url_inspector_domain_stats (the refresh RPC DELETEs before INSERT-ing under pg_advisory_xact_lock), so the next 30-min schedule tick will see it again and retry naturally.
- If the staleness query itself fails, the handler returns ok({ stalenessFailed: true }) and lets the next tick re-detect and re-attempt.
The whole pipeline is idempotent and self-healing: at most one site falls behind by one 30-min tick before catch-up. CloudWatch alarms on the structured per-month error log lines (see below) cover the "we silently stopped refreshing" failure mode.
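To make the in-handler retry concrete, here is a minimal sketch of a bounded retry wrapper that resolves to a { data, error } result and never throws. The helper name rpcWithRetry, the exponential backoff schedule, and the baseDelayMs default are assumptions for illustration; only PER_RPC_ATTEMPTS=2 and the never-throw contract come from the description above.

```javascript
// Hypothetical sketch of a bounded, never-throwing retry around client.rpc().
// The backoff schedule and option names are assumptions.
const PER_RPC_ATTEMPTS = 2;

const defaultSleep = (ms) => new Promise((resolve) => { setTimeout(resolve, ms); });

async function rpcWithRetry(client, name, params, {
  attempts = PER_RPC_ATTEMPTS, baseDelayMs = 500, sleep = defaultSleep,
} = {}) {
  let last = { data: null, error: { message: 'no attempts made' } };
  for (let attempt = 1; attempt <= attempts; attempt += 1) {
    try {
      last = await client.rpc(name, params);
    } catch (e) {
      // Defensive: convert an unexpected throw into the { data, error } shape.
      last = { data: null, error: { message: String(e?.message ?? e) } };
    }
    if (!last.error) return { ...last, attempts: attempt };
    // Back off before the next attempt, but not after the final one.
    if (attempt < attempts) await sleep(baseDelayMs * 2 ** (attempt - 1));
  }
  return { ...last, attempts };
}
```

A caller that still sees error after the last attempt would log the failure and return a normal response, leaving the slice stale so the next schedule tick picks it up again.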
Per-invocation budget
SQS visibility timeout on this queue is 900s. We cap wall time at PER_INVOCATION_BUDGET_MS = 12 min and defer remaining stale months to the next schedule tick. This bounds the worst case (very-stale site after a long outage) at (staleness RPC limit) × (next tick) instead of an unbounded queue-time runaway.
Observability via structured log lines, not a CloudWatch SDK call
Each per-month outcome emits a single JSON-stringified log line:
{"event":"url-inspector-refresh.refresh","siteId":"...","month_start":"2026-04-01","status":"ok","attempts":1,"durationMs":2150}
A CloudWatch metric filter (next infra PR, tf_alarms todo) turns these into RefreshCalls{result=ok|error} counters + RefreshDurationMs distributions without an @aws-sdk/client-cloudwatch dep on the hot path. Same approach for the staleness-failed and dispatch-summary log lines.
Runtime config required
These env vars must be present in the task-processor Lambda for this handler to function. They are NOT currently set (task-processor has no /helix-deploy/spacecat-services/task-processor/latest secret at all; it reads everything from Lambda env + the shared catalog via vaultSecrets). They will be provisioned in a separate ops step (tp_provision_secret):
- POSTGREST_URL: /helix-deploy/spacecat-services/all; /rpc/... POSTs
- POSTGREST_API_KEY: /helix-deploy/spacecat-services/all (writer JWT); wrpc_refresh_* requires postgrest_writer role
- POSTGREST_SCHEMA: /helix-deploy/spacecat-services/all (default public)
Test plan
- npm run lint clean
- npm test: 482 passing (up from 466), 100/100/100/100 coverage on both new files
- { type: 'url-inspector-refresh', siteId: '9ae8877a-bbf3-407d-9adb-d6a72ce3c5e3' } once the dev secret is provisioned + the SQL migration is applied → verify a successful run end-to-end against real Aurora
- <30min freshness on url_inspector_domain_stats
Out of scope (follow-ups, in separate PRs)
- spacecat-jobs-dispatcher that fans this queue out per-site
- aws_scheduler_schedule (every-30-min cron) + task-processor secret shell in spacecat-infrastructure: adobe/spacecat-infrastructure#531
- secrets/secrets.env for task-processor + npm run deploy-secrets to expose the PostgREST trio above
- (DispatchRuns, RefreshFailures, StalenessQueryFailures, RefreshSuccesses): adobe/spacecat-infrastructure#531
- feature_flags(product=LLMO, flag_name=url_inspector_pg, flag_value=true) for the adobe.com pilot org
Related
Made with Cursor