The Slack webhook endpoint used to dispatch events synchronously inside
the HTTP handler. A bot crash, restart, or even a slow tick on the local
VM would silently drop the event — Slack gets a 200 for the retry, but
our state has nothing.
Replace the inline dispatch with a queue:
- bot/webhook-queue.ts: thin abstraction over a `webhook_queue`
collection in MongoDB. Inserts are durable; a TTL index on createdAt
expires docs after 24h (capped collections cap on size, not time,
hence a regular collection + TTL index). Tail with a changeStream
filtered by source so multiple bot processes can later split work by
source.
- /slack/events handler now: verify sig → dedup (existing 1h
webhook-event-* TTL state) → enqueue → 200. The actual handleSlackEvent
call moves into the consumer.
- startSlackBot wires startWebhookConsumer({ sources: ["slack"], ... })
with drainBacklog=true so any docs still marked unprocessed from a
previous crash get replayed in createdAt order on boot.
- markWebhookProcessed flips `processed=true` (and stashes the truncated
error stack on failure) so changeStream resume after restart doesn't
reprocess what already finished.
Atlas (mongodb+srv://) provides a replica set automatically, so the
changeStream just works; if we ever switch to a standalone instance
we'd need a polling fallback.
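
For reference, a compact sketch of the two primitives this relies on (the 24h TTL index and the source-filtered changeStream), written against the stock `mongodb` driver. Only the `webhook_queue` collection name, the TTL, and the insert `$match` filter come from this change; the doc type and helper names are illustrative.

```ts
import type { Db } from "mongodb";

type WebhookQueueDoc = {
  source: string;
  eventId: string;
  payload: unknown;
  processed: boolean;
  createdAt: Date;
};

async function ensureWebhookQueueIndexes(db: Db) {
  const col = db.collection<WebhookQueueDoc>("webhook_queue");
  // Regular collection + TTL index: docs expire ~24h after createdAt.
  await col.createIndex({ createdAt: 1 }, { expireAfterSeconds: 24 * 60 * 60 });
}

function tailWebhookQueue(db: Db, sources: string[]) {
  const col = db.collection<WebhookQueueDoc>("webhook_queue");
  // Insert-only changeStream filtered by source so consumers can split work.
  return col.watch(
    [{ $match: { operationType: "insert", "fullDocument.source": { $in: sources } } }],
    { fullDocument: "updateLookup" },
  );
}
```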
Verified end-to-end: signed test webhook → insert in webhook_queue →
consumer fires within ~70ms → processed=true.
Also adds docs/cloudrun-vs-gke-vs-hybrid.md (the deployment options
analysis that prompted this lighter-weight design choice).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move the Slack webhook ingest off the local VM bot and onto the
existing comfy-pr Vercel project so the receiver stays up while the
bot is restarting/down. The bot tails MongoDB's webhook_queue via
changeStream, so the failure mode "Slack retries 3× while bot is
restarting → 200 + drop" goes away.
- app/api/webhook/slack/route.ts: HMAC-SHA256 signature verification
(5-min replay window), URL verification challenge passthrough, and
enqueueWebhook() into source="slack". `nodejs` runtime so the
mongodb driver works. (The signature check is sketched after this
list.)
- Edge-level dedup via `webhook_edge_dedup` collection: insert
`_id="slack:${eventId}"` with 1h TTL so retry storms don't pile up
duplicate queue docs. The actual queue stays content-deduped by the
bot consumer.
- bot/webhook-queue.ts: add the matching TTL index on
webhook_edge_dedup so the dedup keys auto-expire.
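
As a reference point, a minimal version of the signature check from the first bullet, following Slack's documented v0 signing scheme (the helper name is illustrative; the real route may structure this differently):

```ts
import crypto from "node:crypto";

// Verify x-slack-signature over the raw body, with a 5-minute replay window.
function verifySlackSignature(rawBody: string, headers: Headers): boolean {
  const timestamp = headers.get("x-slack-request-timestamp") ?? "";
  const signature = headers.get("x-slack-signature") ?? "";
  if (Math.abs(Date.now() / 1000 - Number(timestamp)) > 5 * 60) return false;
  const expected =
    "v0=" +
    crypto
      .createHmac("sha256", process.env.SLACK_SIGNING_SECRET ?? "")
      .update(`v0:${timestamp}:${rawBody}`)
      .digest("hex");
  return (
    signature.length === expected.length &&
    crypto.timingSafeEqual(Buffer.from(expected), Buffer.from(signature))
  );
}
```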
Slack app A078498JA5T Event Subscriptions Request URL has been moved
from prbot.stukivx.xyz/slack/events (VM caddy reverse-proxy) to
https://comfy-pr.vercel.app/api/webhook/slack. Verified end-to-end
with a real DM: Slack → Vercel Function (200) → Mongo insert →
changeStream → VM consumer → processed=true within ~26s.
The local bot's /slack/events endpoint stays around as a fallback
during the cutover; we can remove it once a few days of Vercel
traffic passes without issues.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Moves Slack webhook ingestion from the VM bot HTTP handler to a serverless Vercel endpoint backed by a MongoDB queue, with the VM bot consuming queued webhooks via MongoDB change streams to survive restarts and reduce dropped events.
Changes:
- Added a MongoDB-backed webhook queue with TTL + changeStream consumer.
- Updated the VM bot’s `/slack/events` handler to enqueue Slack events instead of dispatching inline, and started a queue consumer at bot startup.
- Added a Vercel Next.js route to verify Slack signatures, dedup at the edge, and enqueue events; plus deployment/architecture documentation.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| docs/cloudrun-vs-gke-vs-hybrid.md | Adds deployment option analysis and rationale for the queue-based approach. |
| bot/webhook-queue.ts | Introduces MongoDB webhook_queue + TTL indexes + changeStream consumer/drain logic. |
| bot/slack-bot.ts | Changes Slack Events ingestion to enqueue into Mongo and starts a queue consumer to dispatch events. |
| app/api/webhook/slack/route.ts | Adds Vercel Slack receiver with signature verification, edge dedup, enqueue, and a health/status GET. |
```ts
export async function GET() {
  try {
    await db.admin().ping();
    const col = db.collection("webhook_queue");
    const total = await col.countDocuments({ source: "slack" });
    const pending = await col.countDocuments({ source: "slack", processed: false });
    return NextResponse.json({
      status: "ok",
      collection: "webhook_queue",
      slack: { total, pending },
    });
  } catch (error) {
    return NextResponse.json(
      {
        status: "error",
        message: error instanceof Error ? error.message : String(error),
      },
      { status: 500 },
    );
  }
}
```
GET /api/webhook/slack exposes database connectivity and queue counts without any authentication. Since this route is publicly reachable on Vercel, it can leak operational details (and can be used for low-effort DB probing). Consider restricting it (e.g., require an admin token header, limit to Vercel cron/health checks via a secret, or remove the endpoint in production).
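
One possible guard, sketched with a hypothetical `ADMIN_STATUS_TOKEN` env var and `x-admin-token` header (neither is part of this PR):

```ts
export async function GET(req: Request) {
  // Hypothetical shared-secret check before revealing queue stats.
  const token = req.headers.get("x-admin-token");
  if (!process.env.ADMIN_STATUS_TOKEN || token !== process.env.ADMIN_STATUS_TOKEN) {
    return new NextResponse("not found", { status: 404 });
  }
  // ...existing ping + countDocuments logic from the snippet above...
  return NextResponse.json({ status: "ok" });
}
```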
```ts
const alreadySeen = await SlackBotState.get(`webhook-event-${eventId}`);
if (alreadySeen) {
  logger.info(
    `Ignoring duplicate webhook event ${eventId} (retry=${retryNum ?? "0"}, reason=${retryReason ?? "-"}, firstSeenAt=${new Date(alreadySeen.receivedAt).toISOString()})`,
  );
  return new Response("", { status: 200 });
}

// TTL 1h — Slack retries up to ~30min, so 1h covers worst case.
// This is independent of the queue's 24h TTL: the dedup key only
// lives on the http-edge to suppress retry storms; queue docs
// are authoritative for replay.
await SlackBotState.set(
  `webhook-event-${eventId}`,
  { receivedAt: Date.now(), retryNum, retryReason },
  60 * 60 * 1000,
);

// Slack Events API puts the workspace id on the *envelope*
// (`payload.team_id`), not on the inner event in some shapes.
// Forward it onto the event so downstream zod schemas that
// require `team` (zAppMentionEvent, zSlackMessage filter) don't
// silently reject webhook-delivered mentions/DMs.
const event = {
  ...payload.event,
  team: payload.event?.team || payload.team_id,
};
handleSlackEvent(event).catch((err) =>
  logger.error("Webhook event handler error", { err, eventId }),
);
await enqueueWebhook({
  source: "slack",
  eventId,
  payload,
  meta: {
    retryNum: retryNum ?? null,
    retryReason: retryReason ?? null,
    receivedAt: Date.now(),
  },
}).catch((err) => logger.error("Webhook enqueue failed", { err, eventId }));
}
```
Slack webhook dedup key (webhook-event-${eventId}) is written before the queue insert, and enqueue failures are swallowed. If Mongo insert fails, the handler still returns 200 and subsequent Slack retries will be ignored due to the dedup key, causing permanent event loss. Consider only setting the dedup key after a successful enqueue, and on enqueue failure return a non-2xx (and avoid marking the event as seen) so Slack retries can recover.
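
A sketch of the suggested reordering, reusing the handler's own identifiers: enqueue first, return a non-2xx on failure so Slack retries, and only then record the dedup key.

```ts
try {
  await enqueueWebhook({
    source: "slack",
    eventId,
    payload,
    meta: {
      retryNum: retryNum ?? null,
      retryReason: retryReason ?? null,
      receivedAt: Date.now(),
    },
  });
} catch (err) {
  logger.error("Webhook enqueue failed", { err, eventId });
  return new Response("", { status: 500 }); // let Slack retry later
}
// Mark the event as seen only after the insert is durable.
await SlackBotState.set(
  `webhook-event-${eventId}`,
  { receivedAt: Date.now(), retryNum, retryReason },
  60 * 60 * 1000,
);
return new Response("", { status: 200 });
```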
```ts
};
await handleSlackEvent(event);
```
The queue consumer dispatches Slack events without any dedup keyed on doc.eventId. If the same Slack event ends up enqueued multiple times (e.g., edge-dedup failure, manual replay, multiple ingesters), handleSlackEvent can run more than once. Consider adding a consumer-level idempotency guard (e.g., reuse the existing webhook-event-${eventId} state key, or a unique index/upsert pattern) before calling handleSlackEvent.
Suggested change:

```ts
};
const eventId = doc.eventId;
const processedStateKey = eventId ? `webhook-event-${eventId}` : null;
if (processedStateKey && (await SlackBotState.get(processedStateKey))) {
  logger.info("Skipping duplicate Slack webhook event", { eventId });
  return;
}
await handleSlackEvent(event);
if (processedStateKey) {
  await SlackBotState.set(processedStateKey, {
    processedAt: new Date().toISOString(),
  });
}
```
```ts
if (drainBacklog) {
  const filter = {
    processed: false,
    ...(sources ? { source: { $in: sources } } : {}),
  };
  const backlog = await col.find(filter).sort({ createdAt: 1 }).toArray();
  if (backlog.length) {
    log.info(`webhook-queue: draining ${backlog.length} unprocessed docs`);
  }
  for (const doc of backlog) {
    try {
      await opts.consume(doc);
      await markWebhookProcessed(doc._id!);
    } catch (err) {
      log.error(`webhook-queue: backlog consume failed for ${doc._id}`, { err });
      await markWebhookProcessed(doc._id!, err as Error);
    }
  }
}

const pipeline = sources
  ? [{ $match: { operationType: "insert", "fullDocument.source": { $in: sources } } }]
  : [{ $match: { operationType: "insert" } }];

let stream: ChangeStream<WebhookQueueDoc> | null = col.watch(pipeline, {
  fullDocument: "updateLookup",
});
```
startWebhookConsumer() drains the backlog first and only then opens the changeStream. Any docs inserted after the backlog find() runs but before col.watch() starts can be missed by both (not in the backlog snapshot, and changeStream starts after the insert), leaving them processed=false until the next restart. To avoid this race, open the changeStream first (capturing a resume token/operationTime) and then drain up to that point, or use a resume token persisted in state so you can safely resume without gaps.
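
One possible shape for a stream-first startup, as a sketch only (`consumeAndMark` stands in for the existing `opts.consume` + `markWebhookProcessed` pair; the fresh `findOne` keeps the drain and the stream from dispatching the same doc twice):

```ts
// Open the stream before the backlog query so no insert can fall in the gap.
const stream = col.watch(pipeline, { fullDocument: "updateLookup" });

const backlog = await col.find(filter).sort({ createdAt: 1 }).toArray();
for (const doc of backlog) await consumeAndMark(doc);

for await (const change of stream) {
  if (change.operationType !== "insert") continue;
  // Re-read so anything the drain already handled is skipped.
  const fresh = await col.findOne({ _id: change.documentKey._id });
  if (!fresh || fresh.processed) continue;
  await consumeAndMark(fresh);
}
```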
```ts
 * - Resume-from-_id on the changeStream means a clean restart picks up
 *   exactly where it left off with no duplicates beyond Mongo's
 *   at-least-once semantics.
```
The header comment mentions “Resume-from-_id on the changeStream”, but the implementation doesn’t store or pass any resume token (resumeAfter/startAfter/startAtOperationTime) when calling col.watch(). Either implement resume-token persistence (e.g., store the last processed resumeToken in Mongo/Keyv and resume from it), or update the comment to match the current backlog-drain approach.
Suggested change:

```ts
 * - Restarts recover by draining any queued documents that still have
 *   `processed=false`; the changeStream handles newly inserted docs after
 *   startup, with duplicates still possible under Mongo's at-least-once semantics.
```
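
If resume-token persistence is ever wanted instead, a minimal sketch (the `ResumeState` store is hypothetical; `startAfter` and `ChangeStream.resumeToken` are standard driver features):

```ts
const savedToken = await ResumeState.get("webhook_queue.resumeToken");
const stream = col.watch(pipeline, {
  fullDocument: "updateLookup",
  ...(savedToken ? { startAfter: savedToken } : {}),
});
for await (const change of stream) {
  if (change.operationType !== "insert") continue;
  await opts.consume(change.fullDocument);
  // resumeToken identifies the last change this stream returned.
  await ResumeState.set("webhook_queue.resumeToken", stream.resumeToken);
}
```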
```ts
// Edge-level dedup so a Slack retry storm doesn't insert N copies.
// The actual queue is content-deduped by the bot consumer.
try {
  const dedupCol = db.collection("webhook_edge_dedup");
  // _id is a string; the unique constraint is built-in, so a duplicate
  // throws code 11000.
  await dedupCol.insertOne({
    _id: `slack:${eventId}` as unknown as never,
    createdAt: new Date(),
  });
} catch (err) {
  const code = (err as { code?: number }).code;
  if (code === 11000) {
    console.log(`slack webhook dedup hit for ${eventId} (retry=${retryNum})`);
    return new NextResponse("", { status: 200 });
  }
  console.warn("dedup insert failed, falling through", err);
}

await enqueueWebhook({
  source: "slack",
  eventId,
  payload,
  meta: {
    retryNum: retryNum ?? null,
    retryReason: retryReason ?? null,
    receivedAt: Date.now(),
  },
});
}
```
Edge dedup is recorded before enqueuing the webhook. If enqueueWebhook() fails (e.g., transient Mongo outage), the request will 500 and Slack will retry, but the retry will hit the dedup record and return 200 without enqueuing — permanently dropping the event. Consider only inserting the dedup marker after a successful enqueue, or delete/rollback the dedup marker when enqueue fails so retries can proceed.
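
A sketch of the rollback variant, using the route's own identifiers: delete the dedup marker when the enqueue fails so the next Slack retry can re-enter the pipeline.

```ts
try {
  await enqueueWebhook({
    source: "slack",
    eventId,
    payload,
    meta: {
      retryNum: retryNum ?? null,
      retryReason: retryReason ?? null,
      receivedAt: Date.now(),
    },
  });
} catch (err) {
  // Undo the edge dedup marker so the retry isn't swallowed by the 11000 path.
  // (Typed collection used here for the string _id, as suggested below.)
  await db
    .collection<{ _id: string; createdAt: Date }>("webhook_edge_dedup")
    .deleteOne({ _id: `slack:${eventId}` })
    .catch(() => {});
  console.error("slack webhook enqueue failed", err);
  return new NextResponse("enqueue failed", { status: 500 });
}
return new NextResponse("", { status: 200 });
```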
```ts
const dedupCol = db.collection("webhook_edge_dedup");
// _id is a string; the unique constraint is built-in, so a duplicate
// throws code 11000.
await dedupCol.insertOne({
  _id: `slack:${eventId}` as unknown as never,
```
The _id: slack:${eventId} as unknown as never cast is unnecessary and obscures the schema for webhook_edge_dedup docs. Prefer giving the collection a simple type (e.g., { _id: string; createdAt: Date }) and inserting _id as a plain string so type checking stays meaningful.
Suggested change:

```ts
const dedupCol = db.collection<{ _id: string; createdAt: Date }>("webhook_edge_dedup");
// _id is a string; the unique constraint is built-in, so a duplicate
// throws code 11000.
await dedupCol.insertOne({
  _id: `slack:${eventId}`,
```
Round 2 of the webhook-queue work — finish the multi-source story so
github and notion deliveries get the same restart-resilient pipeline
slack just got.
- app/api/webhook/github/route.ts: keep the legacy
GithubWebhookEvents collection (for back-compat with existing
setup-indexes.ts / webhook-events.ts helpers) and *also* enqueue
into the unified webhook_queue keyed on x-github-delivery, so the
VM bot consumer can react to github events alongside slack. Failure
to enqueue is logged but does not fail the request — the legacy
insert still succeeds. (Sketched after this list.)
- app/api/webhook/notion/route.ts: new endpoint. Two phases:
1. Verification handshake — Notion POSTs `{verification_token}`
once when you set the URL. Persist the token to
webhook_notion_verification so the operator can register it as
NOTION_WEBHOOK_VERIFICATION_TOKEN without digging through logs.
2. Steady-state — verify X-Notion-Signature HMAC over the raw body,
dedup by event id, enqueue source="notion".
- bot/slack-bot.ts: extend the queue consumer to subscribe to slack +
github + notion. Slack still dispatches to handleSlackEvent;
github / notion currently log-and-mark-processed since no
downstream handler is wired yet. This keeps the queue tidy (docs
don't accumulate as `processed: false` forever) and lets us see
via logs that webhooks are landing.
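
A rough sketch of the dual-write described in the first bullet above. The `GithubWebhookEvents` document shape shown here is an assumption; `x-github-delivery` / `x-github-event` are GitHub's standard delivery headers, and `db` / `enqueueWebhook` are the same helpers used by the slack route.

```ts
export async function POST(req: Request) {
  const deliveryId = req.headers.get("x-github-delivery");
  const eventName = req.headers.get("x-github-event") ?? "unknown";
  if (!deliveryId) return new Response("missing x-github-delivery", { status: 400 });
  const payload = await req.json();

  // Legacy collection keeps the existing setup-indexes.ts / webhook-events.ts helpers working.
  await db.collection("GithubWebhookEvents").insertOne({
    deliveryId,
    eventName,
    payload,
    receivedAt: new Date(),
  });

  // Also enqueue into the unified queue; failure is logged, not surfaced to GitHub.
  try {
    await enqueueWebhook({ source: "github", eventId: deliveryId, payload, meta: { eventName } });
  } catch (err) {
    console.error("github webhook enqueue failed", err);
  }
  return new Response("", { status: 200 });
}
```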
Verified end-to-end against the deployed comfy-pr.vercel.app:
- POST /api/webhook/notion ping → 200, GET shows {total:0,pending:0}
- POST /api/webhook/github ping with x-github-delivery → 200,
webhook_queue doc created, bot consumer logged "github event ...
(ping) — no handler wired yet", processed=true within seconds.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two real Slack DMs to @comfy-pr-bot today both failed silently with
"An error occurred while processing this request, I will try it again
later". The pm2 log showed only:

[error] Agent SDK error: { "err": {} }
[error] claude-yes process for task ... exited with code 1

The empty err object was winston's serialization of an Error instance
hiding the real message. Running the same spawn manually surfaced it:

Configuration error in /tmp/task-…/.claude.json: JSON Parse error: Unexpected EOF

The Claude Agent SDK CLI (v2.1.114) bails out at startup when it finds
a `~/.claude.json` that exists but isn't valid JSON. The CLI itself
truncates the file mid-write on aborted runs, leaving a 0-byte file
that poisons every subsequent task that reuses the same task user
(useradd is idempotent, so the home dir persists across tasks).

Two fixes:
- bot/task-user.ts: ensureClaudeConfig() — write `{}` whenever the
  file is missing, empty, or fails JSON.parse, before chowning to the
  task user. Runs on both create-fresh and resume-existing paths.
- bot/slack-bot.ts: pull message/code/stack out of caught Error before
  passing to winston so the next regression isn't invisible.

Verified: with the fix applied to the existing 0-byte file, the SDK
CLI runs `--version` cleanly under sudo as the task user.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
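
A minimal sketch of the `ensureClaudeConfig()` guard described in the commit above (the real helper also runs before chowning the home dir to the task user; path handling here is simplified):

```ts
import fs from "node:fs";
import path from "node:path";

// Reset ~/.claude.json to "{}" if it is missing, empty, or not valid JSON,
// so a file truncated by an aborted run can't poison later tasks.
function ensureClaudeConfig(homeDir: string): void {
  const configPath = path.join(homeDir, ".claude.json");
  let valid = false;
  try {
    const raw = fs.readFileSync(configPath, "utf8");
    JSON.parse(raw); // throws on empty or truncated content
    valid = true;
  } catch {
    valid = false; // missing file, empty file, or JSON parse failure
  }
  if (!valid) fs.writeFileSync(configPath, "{}\n");
}
```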
End-to-end debugging of "An error occurred while processing this
request" on Slack DMs (2026-04-30) found three bugs stacked on top of
each other. Each one masked the next, and the only error visible in
pm2 logs was an empty `Agent SDK error: { err: {} }`. Until disk
exhaustion was found, every fix uncovered the next layer.
1. SDK picks the musl binary on a glibc host
The Claude Agent SDK v0.2.114 enumerates platform binaries in the
order [linux-x64-musl, linux-x64] and resolves the first that is
`require.resolve()`-able. Both are installed via the optionalDeps,
so it always picks musl, which then fails on Debian glibc with
"sudo: unable to execute …-musl/claude: No such file or directory".
Fix: pass `pathToClaudeCodeExecutable` explicitly pointing at the
`…linux-x64/claude` glibc binary.
2. sudo --preserve-env (no list) silently drops every env var
--preserve-env without an explicit list only keeps env_keep
defaults (HOME, PATH, TERM). ANTHROPIC_API_KEY, GH_TOKEN,
MONGODB_URI, etc. were stripped before reaching the agent
subprocess, which then died at startup. Fix: build the
--preserve-env=KEY1,KEY2,... list from the resolved childEnv keys
(sketched below).
3. Mirror agent stderr so the next regression isn't invisible
The SDK's `stderr` callback only fires for SDK-format messages.
Plain stderr from the agent's startup path (e.g. "Credit balance
is too low", "JSON Parse error" on a corrupt config) was getting
swallowed by our SpawnedProcess wrapper, which doesn't expose
stderr. Mirror it unconditionally to bot stderr so pm2 catches it.
Plus a small DEBUG_SPAWN dump (off by default) for spawn-args/cwd/env-key
sniffing if anything regresses again.
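
A hedged sketch of how fix (2) assembles the explicit `--preserve-env` list; the variable names, the task user, and the binary path are illustrative, only the sudo flag semantics come from the commit.

```ts
import { spawn } from "node:child_process";

// childEnv: whatever the agent subprocess actually needs (illustrative).
const childEnv: NodeJS.ProcessEnv = {
  ...process.env,
  // ANTHROPIC_API_KEY, GH_TOKEN, MONGODB_URI, ... resolved upstream
};

// Explicit key list so sudo keeps exactly these vars (fix 2),
// plus the explicit glibc binary path (fix 1).
const preserveEnv = `--preserve-env=${Object.keys(childEnv).join(",")}`;
const claudeBin = "/path/to/@anthropic-ai/.../linux-x64/claude"; // illustrative

const child = spawn("sudo", [preserveEnv, "-u", "task-user", claudeBin, "--version"], {
  env: childEnv,
  stdio: ["ignore", "inherit", "inherit"], // mirror stdout/stderr so pm2 sees startup errors (fix 3)
});
```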
After all three fixes the spawn now actually reaches the Anthropic
backend cleanly. Today's outage is now a billing problem (low credit
balance), not a code problem.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Move the Slack webhook ingest off the local VM bot and onto a serverless Vercel Function backed by a MongoDB queue. The bot tails the queue via changeStream, so Slack retries no longer get dropped while the bot is restarting/down.
What's new
- `bot/webhook-queue.ts` — MongoDB-backed queue
  - `enqueueWebhook()` inserts a doc into `webhook_queue` (Atlas, replica set ⇒ changeStream supported)
  - `startWebhookConsumer()` drains any unprocessed backlog at boot, then tails the changeStream filtered by source
  - `markWebhookProcessed()` flips `processed=true` (and stashes truncated error stack on failure) so resume after restart doesn't reprocess
  - A TTL index (`expireAfterSeconds: 86400`) on `createdAt` gives a 24h time-cap. Capped collections cap on bytes and don't allow per-doc updates, so we use a regular collection + TTL index instead.
  - Edge dedup collection (`webhook_edge_dedup`) with 1h TTL — used by the Vercel route to suppress retry storms before they reach the queue.
- `bot/slack-bot.ts` — wire bot consumer
  - The `/slack/events` HTTP handler now just verifies signature, dedups, and `enqueueWebhook()` — no inline dispatch.
  - `startSlackBot()` calls `startWebhookConsumer({ sources: ["slack"] })` so the existing `handleSlackEvent` runs against queue docs instead of inbound HTTP.
- `app/api/webhook/slack/route.ts` — Vercel receiver
  - `event_callback` → edge dedup → `enqueueWebhook({ source: "slack" })` → 200
  - `runtime: "nodejs"` so the mongodb driver works (Edge runtime can't open Atlas connections)
  - Deployed at https://comfy-pr.vercel.app/api/webhook/slack
- `docs/cloudrun-vs-gke-vs-hybrid.md` — Background analysis comparing GKE vs Cloud Run vs Hybrid deployment options that led to this lighter-weight queue design.
Cutover
Slack app A078498JA5T Event Subscriptions Request URL has been moved:
- from `prbot.stukivx.xyz/slack/events` (VM caddy reverse-proxy)
- to https://comfy-pr.vercel.app/api/webhook/slack

Slack verified the new URL successfully. The local `/slack/events` endpoint remains as a fallback during the cutover; safe to remove after a few days of clean Vercel traffic.

Test plan
- `bunx tsc --noEmit`
- `bun test bot/taskInputFlow.spec.ts` (4/4 pass)
- Signed `event_callback` → bot HTTP `/slack/events` → `webhook_queue` insert → consumer dispatch → `processed=true` (~70ms)
- URL verification: 200 `{challenge:...}` after `SLACK_SIGNING_SECRET` env var set
- Real Slack DM → Vercel Function (200) → `webhook_queue` (source=slack) → VM consumer → `processed=true` within ~26s
- Follow-up: remove the local `/slack/events` HTTP handler and stop redirecting `prbot.stukivx.xyz` to the VM

🤖 Generated with Claude Code
bunx tsc --noEmitbun test bot/taskInputFlow.spec.ts(4/4 pass)event_callback→ bot HTTP /slack/events →webhook_queueinsert → consumer dispatch →processed=true(~70ms)200 {challenge:...}afterSLACK_SIGNING_SECRETenv var setwebhook_queue(source=slack) → VM consumer →processed=truewithin ~26s/slack/eventsHTTP handler and stop redirectingprbot.stukivx.xyzto the VM🤖 Generated with Claude Code