feat(cli): add the 'clerk webhooks' command group #323
Draft
rafa-thayto wants to merge 42 commits into
🦋 Changeset detected. Latest commit: dba0f5b. The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package.
!snapshot

Snapshot published: `npm install -g clerk@2.0.1-snapshot.9f8329d`
71a9dc7 to 7a249e6
The flaky E2E failures were Nuxt's beforeAll hitting the 300s budget. Two distinct stalls shared one opaque "hook timed out" signature: one CI run hung in `clerk link` (an untimed `fetch()` to the production Clerk API), another in `git init`.

- Add a default 60s timeout to `loggedFetch`, composed with any caller signal via `AbortSignal.any` so tighter budgets (keyless's 15s) still win. A stalled connection now fails fast across every CLI command, not just in tests.
- Wrap each fixture setup step (git / clerk link / clerk init / npm ci) in a per-step timeout that fails with a labeled error instead of silently eating the whole 300s budget.
- Cap e2e `--parallel=4` to cut startup contention; add an explicit afterEach cleanup budget and `npm ci --no-audit --no-fund`.
- Drop noisy success-path debug traces; keep failure diagnostics.

Claude-Session: https://claude.ai/code/session_01V1YkHZ2Ad1okwkX9bxTYsd
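The signal composition described in the first bullet can be sketched as follows. This is a minimal illustration, not the CLI's actual `loggedFetch`: the helper names `composeSignal` and `fetchWithTimeout` are hypothetical, and only `AbortSignal.any`, `AbortSignal.timeout`, and the 60s default come from the commit message.

```typescript
// Hypothetical sketch: the real loggedFetch may be structured differently.
const DEFAULT_TIMEOUT_MS = 60_000; // default budget named in the commit message

// Combine the default timeout with an optional caller signal.
// Whichever signal aborts first wins, so a tighter caller budget
// (for example a 15s AbortSignal.timeout) still takes precedence.
function composeSignal(
  callerSignal?: AbortSignal,
  timeoutMs: number = DEFAULT_TIMEOUT_MS,
): AbortSignal {
  const timeout = AbortSignal.timeout(timeoutMs);
  return callerSignal ? AbortSignal.any([callerSignal, timeout]) : timeout;
}

// Thin wrapper around fetch() that always carries a bounded signal.
async function fetchWithTimeout(
  url: string,
  init: RequestInit = {},
): Promise<Response> {
  const { signal, ...rest } = init;
  return fetch(url, { ...rest, signal: composeSignal(signal ?? undefined) });
}
```

A caller that passes `AbortSignal.timeout(15_000)` aborts at 15s, while a caller that passes nothing is still bounded by the 60s default.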
Address PR review feedback:

- `runStep` now spawns each setup step via `Bun.spawn` with an `AbortSignal` (Bun.$ can't be cancelled), so a timed-out git/clerk/npm step is killed instead of orphaned and left to race teardown. Adds runStep unit tests.
- fetch timeout test now fails if `loggedFetch` resolves instead of rejecting (no more false pass via swallowed error).
- Trim verbose comments.

Claude-Session: https://claude.ai/code/session_01V1YkHZ2Ad1okwkX9bxTYsd
The Bun.spawn rewrite (5ce158a) regressed the E2E job: 3 fixtures hung the full 300s in beforeAll with no per-step timeout recovering, because reading a killed child's piped stderr to EOF can block when a grandchild keeps the pipe open.

Restore the prior approach, which passed E2E in 52s:

- setup steps use Bun.$ again, wrapped in the Promise.race `withStepTimeout` (a timed-out step's subprocess is left to settle; beforeAll is never retried, so it can't cascade).
- drop the runStep Bun.spawn helper and its unit test.

The real root-cause fix (the 60s loggedFetch timeout that bounds a stalled clerk link/init network call at the source) is unchanged.

Claude-Session: https://claude.ai/code/session_01V1YkHZ2Ad1okwkX9bxTYsd
The Bun.spawn `runStep` rewrite (5ce158a) regressed CI. `clerk init` runs an internal `npm install` with inherited stderr (init/heuristics.ts installSdk), so when the per-step AbortSignal SIGKILLed the CLI, the npm grandchild survived holding the stderr pipe open: `new Response(proc.stderr).text()` never EOF'd, the timeout never threw, and the 300s beforeAll fired instead. 3 fixtures hung.

Root realization: `clerk init` and `npm ci` do package installs whose duration scales with CI contention, so any fixed per-step budget false-fails under load (clerk init blew past its 90s budget in the failing run). You can't fix contention-driven flakiness by capping variable-duration install work tighter.

Fix: remove per-step timeouts entirely. The real root-cause fix (the 60s loggedFetch timeout) still bounds the only thing that can truly hang (network calls); `--parallel=4` cuts contention; the 300s beforeAll is the backstop. Setup steps return to plain Bun.$ (as on main). Removes runStep and its test.

Claude-Session: https://claude.ai/code/session_01V1YkHZ2Ad1okwkX9bxTYsd
The remaining flake is npm, not the CLI. `npm ci`'s default `fetch-timeout` is 300000ms, identical to the test's 300s beforeAll budget, so a single stalled npm registry connection hangs setup until the hook times out. (clerk init's installSdk skips here because the isolated env has no PATH, so npm ci is the only unbounded npm install.)

- npm ci: add --fetch-timeout=60000 --fetch-retries=5 so a stalled fetch aborts at 60s and retries, mirroring the CLI's loggedFetch timeout.
- Restore the debug-gated git/link/init/npm step markers so any residual hang names the exact step instead of an opaque "hook timed out".

Claude-Session: https://claude.ai/code/session_01V1YkHZ2Ad1okwkX9bxTYsd
The persistent 300s beforeAll hang was npm, not the CLI. npm's default fetch-timeout is 300000ms, so one stalled registry connection during either npm operation in setup blocks until the test budget expires. The previous commit bounded `npm ci` but missed the other one: `clerk init` runs an internal `npm install @clerk/<sdk>` (installSdk), which was still unbounded; that's what hung the Vue fixture at 300007ms.

Write a project `.npmrc` (fetch-timeout=30s, fetch-retries=3) before any npm runs. Both `clerk init`'s install and `npm ci` use projectDir as cwd, so it covers both: a stalled fetch now aborts in 30s and retries on a fresh connection instead of waiting 5 minutes. Worst case ~120s, safely under the 300s budget. Drops the redundant per-command npm flags.

Claude-Session: https://claude.ai/code/session_01V1YkHZ2Ad1okwkX9bxTYsd
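The "~120s" worst case follows from one initial attempt plus three retries at 30s each. A small sketch of that arithmetic and of the `.npmrc` content, assuming the two npm config keys named above and ignoring npm's backoff delay between retries; `NpmFetchPolicy`, `renderNpmrc`, and `worstCaseStallMs` are hypothetical names, not helpers from the PR:

```typescript
// Hypothetical shape for the two npm settings the commit writes to .npmrc.
interface NpmFetchPolicy {
  fetchTimeoutMs: number; // npm's fetch-timeout, in milliseconds
  fetchRetries: number; // npm's fetch-retries
}

// Render the policy as .npmrc lines (key=value, one per line).
function renderNpmrc(policy: NpmFetchPolicy): string {
  return [
    `fetch-timeout=${policy.fetchTimeoutMs}`,
    `fetch-retries=${policy.fetchRetries}`,
    "",
  ].join("\n");
}

// Upper bound on time spent in stalled fetches for one request:
// the initial attempt plus every retry, each capped by fetch-timeout.
// Assumption: backoff pauses between retries are not counted here.
function worstCaseStallMs(policy: NpmFetchPolicy): number {
  return policy.fetchTimeoutMs * (1 + policy.fetchRetries);
}

const policy: NpmFetchPolicy = { fetchTimeoutMs: 30_000, fetchRetries: 3 };
console.log(renderNpmrc(policy));
console.log(`worst case: ${worstCaseStallMs(policy) / 1000}s`);
```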
Across four CI runs the 300s beforeAll hang moved randomly between fixtures AND steps, including `git init`, a local, near-instant, near-silent command. That rules out npm, the network, loggedFetch and the earlier Bun.spawn pipe deadlock: the only thing that explains a trivial `git` subprocess hanging 300s intermittently and only under `--parallel` is Bun.$ subprocess spawning/reaping stalling under high concurrent load (each of 4 workers spawns git + 2 `bun` CLIs + npm + a dev server + chromium at once).

Run fixtures serially (`--parallel=1`, still isolated) so at most one fixture's subprocesses run at a time. Bump the E2E job timeout 30->45m for the slower serial run. Keeps the .npmrc fetch-timeout and loggedFetch fixes.

Claude-Session: https://claude.ai/code/session_01V1YkHZ2Ad1okwkX9bxTYsd
Serializing fixtures fixed the contention-driven setup hangs, but exposed a second, independent flake: `clerk link` (and `init`) intermittently hang ~300s in a non-fetch path the CLI's loggedFetch timeout can't bound: in human mode they shell out to git and can stall on a git subprocess or prompt. It lands on a different fixture each run, so it's transient, not deterministic.

Wrap both CLI steps in withRetry: a stall trips a hard timeout (90s/120s, above loggedFetch's 60s so genuinely-slow API calls aren't pre-empted) and the retry runs a fresh subprocess. Promise.race abandons the hung process (no stream deadlock); beforeAll isn't retried so the orphan can't cascade.
Harden the setup against the intermittent Bun.$ subprocess stall (a spawned git/clerk/npm step occasionally never resolves; verified a Promise.race timeout still fires during the hang, so a retry recovers it).

- withRetry now wraps every step: git init, clerk link, clerk init, npm ci. A hung attempt is abandoned at its budget and a fresh subprocess retried.
- Tighten the project .npmrc (fetch-timeout 30s->20s, retries 3->2) so a real npm stall resolves well under the step budgets and can't false-trip them.
- Restore --parallel=4 (retry absorbs the higher hang frequency) and revert the E2E job timeout to 30m.

Keeps the loggedFetch 60s request timeout (bounds the CLI's own API calls).

Claude-Session: https://claude.ai/code/session_01V1YkHZ2Ad1okwkX9bxTYsd
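The retry-around-a-race pattern described above might look roughly like the sketch below. It is an illustration under assumptions, not the fixture code: the signature of `withRetry`, the `StepTimeoutError` class, and the choice to retry only on timeouts are all hypothetical. The hung attempt is abandoned rather than killed, matching the "Promise.race abandons the hung process" description.

```typescript
// Hypothetical error type used to tell a budget overrun from a real failure.
class StepTimeoutError extends Error {
  constructor(label: string, ms: number) {
    super(`${label} timed out after ${ms}ms`);
    this.name = "StepTimeoutError";
  }
}

// Race one attempt against a timer. The losing attempt is not cancelled;
// its promise is simply ignored once the timer rejects.
async function withTimeout<T>(
  label: string,
  ms: number,
  run: () => Promise<T>,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new StepTimeoutError(label, ms)), ms);
  });
  try {
    return await Promise.race([run(), timeout]);
  } finally {
    clearTimeout(timer); // avoid a stray timer keeping the process alive
  }
}

// Retry a step when an attempt exceeds its budget.
// Assumption: non-timeout errors are treated as real failures and rethrown.
async function withRetry<T>(
  label: string,
  ms: number,
  attempts: number,
  run: () => Promise<T>,
): Promise<T> {
  let lastError: unknown;
  for (let i = 1; i <= attempts; i++) {
    try {
      return await withTimeout(`${label} (attempt ${i}/${attempts})`, ms, run);
    } catch (err) {
      if (!(err instanceof StepTimeoutError)) throw err;
      lastError = err;
    }
  }
  throw lastError;
}
```

Because each retry calls `run()` again, the step must be safe to repeat; a step that leaves state behind on a hung first attempt will see that state on the second.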
99624a2 to 0bcbca4
The retry on `clerk link` was making things worse: attempt 1 writes the profile
then the process intermittently hangs (a lingering handle after setProfile, not
a fetch — confirmed AbortSignal.timeout is unref'd), so withRetry kills it at
90s; attempt 2 then ran `clerk link --mode human` on the now-linked project,
hit the interactive "re-link?" confirm prompt, and failed with "Already linked"
(3/3 rerun sample failed this way).
Run link in `--mode agent`: on an already-linked project it prints status and
exits 0 instead of prompting, so the retry's second attempt succeeds. `clerk
init` is already idempotent on re-run ("Clerk is already set up" -> exit 0).
…; gate listen deliveries until setup completes
…tdin pipes

- delete / secret --rotate / replay --since now run the --yes/prompt gate before resolveAppContext, so agent mode gets the deterministic usage error without a network round-trip (and regardless of key validity)
- the implicit piped-stdin --input-json expansion now stands down when a literal '-' is in argv, fixing 'verify --delivery -' / '--payload -' which previously had their stdin consumed and rejected as nested JSON
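The stand-down rule in the second bullet can be expressed as a small predicate. This is a sketch of the decision only, with a hypothetical function name and signature; how the CLI actually detects piped stdin and performs the expansion is not shown in the PR.

```typescript
// Decide whether piped stdin should be implicitly treated as --input-json.
// Hypothetical helper: argv is the user-supplied argument list, and
// stdinIsTTY is true when nothing is piped in.
function shouldExpandPipedStdin(
  argv: readonly string[],
  stdinIsTTY: boolean,
): boolean {
  if (stdinIsTTY) return false; // nothing piped, nothing to expand
  // A literal "-" means some option (e.g. --delivery - or --payload -)
  // wants to read stdin itself, so the implicit expansion must not consume it.
  return !argv.includes("-");
}

console.log(shouldExpandPipedStdin(["verify", "--delivery", "-"], false));
console.log(shouldExpandPipedStdin(["create"], false));
```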
…li-core/undefined/
…he inbox URL

Live-relay verified: play.svix.com returns 400 'Invalid token' for unprefixed tokens, and the relay only registers an inbox when the start frame carries the same c_ token. With c_ in both, a POST to the inbox round-trips through the WebSocket and the reply frame is accepted, proven end-to-end against the real relay with no PLAPI involvement.

Reverses spec change #12 (recorded as spec change #27).
…li-program

- Import createOption from @commander-js/extra-typings (used by webhooks messages --status)
- Import parseIntegerOption from lib/option-parsers (used by webhooks list and messages --limit)
- Remove stray conflict-marker text fragments left by conflict resolution
…ndling

- splitCommaList now returns undefined for empty/whitespace-only values so callers treat them as "not provided" rather than sending an empty array
- list now prints the iterator hint when paginating
- add relay-client tests plus list/replay/update/verify coverage
- README: clarify keepalive probe timing and JSON-mode type discriminator

Claude-Session: https://claude.ai/code/session_01Mwcxk4pmfNYtmvjwWs9jUE
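The `splitCommaList` behaviour described in the first bullet could be implemented along these lines. Only the function name and the "empty means undefined" contract come from the commit message; the trimming and the dropping of empty segments are assumptions.

```typescript
// Sketch of the described contract: an empty or whitespace-only value is
// reported as undefined ("not provided") instead of an empty array.
// Assumption: items are trimmed and empty segments are dropped.
function splitCommaList(value: string | undefined): string[] | undefined {
  if (value === undefined) return undefined;
  const items = value
    .split(",")
    .map((item) => item.trim())
    .filter((item) => item.length > 0);
  return items.length > 0 ? items : undefined;
}

console.log(splitCommaList("user.created, user.deleted"));
console.log(splitCommaList("   "));
```

Returning `undefined` lets a caller omit the field from a request body entirely, which is different from sending `[]` when an API treats an empty array as "clear all values".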
Adversarial audit follow-ups on the unreleased webhooks group:

- verify: reject an explicit empty --payload as a usage error instead of hashing an `undefined` pre-image and silently failing (exit 2, not 1).
- create: propagate an AuthError from the post-create secret fetch instead of masking it as "secret unavailable"; tag the genuine partial-failure with the new webhook_secret_fetch_failed code for agent branching.
- replay: `--until` alone now points at the missing --since rather than emitting the vaguer "pass <msg_id> or --since" hint.
- relay-client: route the 1008 token-collision redial through the standard reconnect backoff (no zero-delay storm) and guard onopen against a stop() that races socket construction.
- README: document at-least-once redelivery on reconnect (handlers must key on svix-id).

Adds tests for each fix. Full suite 1934 pass / 0 fail.

Claude-Session: https://claude.ai/code/session_015J6Sduw5KeHz6SxLEBfViF
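The relay-client bullet is about never redialing with zero delay. A minimal sketch of a capped exponential backoff that every redial path, including the 1008 token-collision case, could share; the constants and the function name are hypothetical, since the PR does not state the client's real values or whether it adds jitter.

```typescript
// Hypothetical constants; the relay client's real values are not given in the PR.
const BASE_DELAY_MS = 500;
const MAX_DELAY_MS = 30_000;

// Delay before reconnect attempt number `attempt` (0-based).
// Doubles each time and is capped, so repeated failures never produce
// a zero-delay reconnect loop.
function reconnectDelayMs(attempt: number): number {
  const exponential = BASE_DELAY_MS * 2 ** Math.max(0, attempt);
  return Math.min(MAX_DELAY_MS, exponential);
}

for (let attempt = 0; attempt < 8; attempt++) {
  console.log(`attempt ${attempt}: wait ${reconnectDelayMs(attempt)}ms`);
}
```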
d96c2ca to a1352d3
…ant pattern

Run the local relay tunnel with no Clerk backend via `listen --relay-only` (skips PLAPI endpoint provisioning and the group auth gate, forces verification off). Persist the relay token per instance so the relay URL is stable across restarts, and add `--token <c_…>` to pin a deterministic, shareable URL.

Move the webhooks command tree into `registerWebhooks(program)` exported from commands/webhooks/index.ts and wire it through cli-program's registrants array, matching the project's command-registration pattern. Add a .claude rule that documents the pattern when cli-program.ts or a command index.ts is edited.

Harden the command group's edge cases (svix_app_missing handling, friendly API errors, --limit validation, header forwarding) and extract the SIGINT handler into lib/signals.ts.

Claude-Session: https://claude.ai/code/session_01SYYJBsRxBQjCAuNbQiiLma
Summary
Adds the full `clerk webhooks` command group (13 commands) per the final spec: CRUD, delivery inspection, local forwarding via the Svix relay, replay, offline signature verification, and portal open.

- `list`, `get`, `create`, `update`, `delete`, `secret [--rotate]`, `event-types`, `messages`
- `listen` (Svix relay WebSocket, persistent per-instance endpoint, HMAC verification, local forwarding with per-delivery diagnostics), `trigger` (validates event type first), `replay` (single message or bulk `--since [--until]` recovery)
- `verify`: pure HMAC-SHA256 check, no auth gate, consumes `listen` NDJSON event lines via `--delivery @file|-`
- `ERROR_CODE` entries, per-instance `relay` config, typed PLAPI functions for the 13 new routes (`--iterator` → `starting_after` wire translation), group-level `--app`/`--instance`/`--json` with an auth `preAction` hook that exempts `verify`

Agent contract: bare domain JSON on stdout via `log.data()`, structured `{"error":{code,…}}` on stderr, exit codes 0/1/2/130, NDJSON for `listen`. Destructive commands (`delete`, `secret --rotate`, `replay --since`) prompt in human mode and require `--yes` in agent mode, validated before any network call.

Notable fixes that came out of the verification passes:

- `trigger` validates the event type before endpoint resolution so agents always get `unknown_event_type`
- `listen` gates delivery processing until the signing secret is fetched (no false verification warnings during startup)
- `--input-json` expansion stands down when a literal `-` is in argv, unblocking `verify --delivery -` pipes

The 13 PLAPI routes are being built in parallel in clerk_go; unit tests mock the PLAPI layer.
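For readers unfamiliar with what a "pure HMAC-SHA256 check" involves, here is a minimal sketch assuming the publicly documented Svix signing scheme: the signed content is `<svix-id>.<svix-timestamp>.<body>`, the key is the base64 part of the `whsec_` secret, and the `svix-signature` header holds space-separated `v1,<base64>` entries. The function name is hypothetical and this is not the CLI's `verify` implementation; in particular it skips any timestamp tolerance check.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical helper illustrating a Svix-style signature check.
// Assumptions: whsec_-prefixed base64 secret, "v1,<base64>" header entries,
// and no timestamp freshness check.
function verifySvixSignature(
  secret: string,
  svixId: string,
  svixTimestamp: string,
  payload: string,
  signatureHeader: string,
): boolean {
  const key = Buffer.from(secret.replace(/^whsec_/, ""), "base64");
  const expected = createHmac("sha256", key)
    .update(`${svixId}.${svixTimestamp}.${payload}`)
    .digest();

  // The header may carry several signatures (e.g. during secret rotation);
  // any matching v1 entry is enough.
  return signatureHeader.split(" ").some((entry) => {
    const [version, signature] = entry.split(",");
    if (version !== "v1" || !signature) return false;
    const candidate = Buffer.from(signature, "base64");
    return (
      candidate.length === expected.length &&
      timingSafeEqual(candidate, expected)
    );
  });
}
```

The payload must be the exact raw request body: re-serializing parsed JSON changes the bytes and the check fails.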
Test plan
- `bun run format` / `lint` / `typecheck` clean
- `bun run test`: 1846 tests pass (187 in the webhooks group)
- (`CLERK_MODE=agent`, isolated `CLERK_CONFIG_DIR`): verify success/mismatch/usage errors, stdin pipes, fail-fast `--yes` gates, structured API/auth error shapes