feat: add orb-webhooks skill#63

Merged
leggetter merged 2 commits into main from feat/orb-webhooks
May 14, 2026

@leggetter

Collaborator

Summary

Adds a complete orb-webhooks provider skill for Orb (usage-based billing). Verification is manual HMAC-SHA256 over the unusual signed-content format v1:{X-Orb-Timestamp}:{rawBody} (literal v1 prefix, ISO-8601 timestamp, colon separators).

What's included

  • skills/orb-webhooks/SKILL.md — entry point with frontmatter and the verification core
  • skills/orb-webhooks/references/ — overview (event taxonomy + summary-webhooks variant), setup (dashboard config + per-endpoint secret), verification (signature algorithm, gotchas, idempotency recommendation)
  • skills/orb-webhooks/examples/ — Express, Next.js App Router, FastAPI handlers with tests
  • Integration: providers.yaml, README.md, .claude-plugin/marketplace.json (both as a standalone plugin and added to the webhook-skills bundle)

Notes

  • Header pair: X-Orb-Signature: v1=<hex> carries the HMAC; X-Orb-Timestamp: <ISO-8601> carries the timestamp separately.
  • Signed content: v1:{X-Orb-Timestamp}:{raw-body} — literal v1, colon, ISO timestamp (as a string, not a Unix epoch), colon, raw body bytes. Pass the raw request body; don't JSON.parse and re-serialize.
  • Signing key: per-endpoint signing secret from the Orb dashboard. Each webhook endpoint gets its own secret (NOT the account API key).
  • Replay protection: the docs don't mandate a tolerance window — Orb just delivers X-Orb-Timestamp and recommends consumers pick a threshold. The skill recommends a 5-minute window in handlers plus event-id idempotency for at-least-once delivery safety.
  • Common events: customer (customer.created, customer.credit_balance_dropped), subscriptions (subscription.created / .started / .ended / .plan_changed / .edited / .usage_exceeded), invoices (invoice.issued / .payment_succeeded / .payment_failed / .edited), data exports (data_exports.transfer_success).
  • Summary webhooks: opt-in variant covering the same events with smaller payloads (line_items omitted from invoices; customer/plan minified to identification fields). Same signature scheme. Skill recommends fetching full resources via API when detail is needed.
  • SDKs: orb-billing on both npm and PyPI (same package name on both). Neither SDK exposes a Stripe-style unwrap()/constructEvent() helper at the time of writing — manual HMAC verification is the canonical path. The SDK is declared in providers.yaml's sdks field so future review runs will catch stale pins.
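
The header pair, signed-content format, and replay window described in these notes can be sketched as follows. This is a minimal illustration, not the code shipped in the skill's examples: the function names `compute_signature` and `verify_orb_webhook` are hypothetical, the secret is assumed to be used as UTF-8 bytes, and the 5-minute tolerance is the skill's recommendation rather than an Orb requirement.

```python
import hashlib
import hmac
from datetime import datetime, timedelta, timezone

# Recommended by the skill; Orb's docs leave the threshold to the consumer.
TOLERANCE = timedelta(minutes=5)


def compute_signature(secret: str, timestamp: str, raw_body: bytes) -> str:
    """Build the expected X-Orb-Signature value: "v1=" + hex HMAC-SHA256."""
    # Signed content is the literal "v1", the ISO-8601 timestamp string,
    # and the raw body bytes, joined by colons.
    signed = b"v1:" + timestamp.encode("utf-8") + b":" + raw_body
    digest = hmac.new(secret.encode("utf-8"), signed, hashlib.sha256).hexdigest()
    return "v1=" + digest


def verify_orb_webhook(secret, signature_header, timestamp_header, raw_body, now=None):
    """Return True only if the signature matches and the timestamp is recent."""
    expected = compute_signature(secret, timestamp_header, raw_body)
    # Constant-time comparison on bytes avoids timing leaks.
    if not hmac.compare_digest(expected.encode(), signature_header.encode()):
        return False
    try:
        # Older Python versions do not accept a trailing "Z" in fromisoformat.
        sent_at = datetime.fromisoformat(timestamp_header.replace("Z", "+00:00"))
    except ValueError:
        return False
    if sent_at.tzinfo is None:
        # Assumption: a timestamp without an offset is treated as UTC.
        sent_at = sent_at.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return abs(now - sent_at) <= TOLERANCE
```

The handler has to pass the request body exactly as received; parsing and re-serializing the JSON first would change the bytes and break the comparison.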

Test plan

  • cd skills/orb-webhooks/examples/express && npm install && npm test
  • cd skills/orb-webhooks/examples/nextjs && npm install && npm test
  • cd skills/orb-webhooks/examples/fastapi && python3 -m venv venv && source venv/bin/activate && pip install -r requirements.txt && pytest test_webhook.py -v
  • Verify the signature helpers reproduce the exact format from https://docs.withorb.com/integrations-and-exports/webhooks ("v1=" + HMAC-SHA256(secret, "v1:" + iso_ts + ":" + body).hexdigest())
  • Confirm event names match the live docs across both regular and summary webhook variants
  • Confirm the webhook-skills marketplace bundle now lists 38 skill paths (37 → 38)

Generation details

  • Generated via ./scripts/generate-skills.sh generate orb --config providers.yaml --model claude-opus-4-7
  • 1 iteration (initial generation passed review on first pass)
  • Locally: npx hookdeck-cli listen 3000 orb --path /webhooks/orb

https://claude.ai/code/session_01NNTgQRJss1V7gyzzJ9rjnB


Generated by Claude Code

claude added 2 commits May 13, 2026 18:34
Adds a webhook skill for Orb (usage-based billing) with HMAC-SHA256
manual verification over `v1:{X-Orb-Timestamp}:{rawBody}`, plus
runnable Express, Next.js, and FastAPI examples with tests.

https://claude.ai/code/session_01NNTgQRJss1V7gyzzJ9rjnB
…tplace.json

- README.md: add Orb row (alphabetically between OpenClaw and Paddle), linkified to official docs
- providers.yaml: add orb entry with HMAC-SHA256/`v1:{ts}:{body}` scheme notes, common events, summary-webhooks variant, and `orb-billing` SDK declared for both npm and pip so the version-tracker covers it
- .claude-plugin/marketplace.json: add `orb-webhooks` plugin entry (matching the per-skill pattern from PR #62) and append `./skills/orb-webhooks` to the `webhook-skills` bundle

Skill content (skills/orb-webhooks/) landed in the previous commit via the
generator.

https://claude.ai/code/session_01NNTgQRJss1V7gyzzJ9rjnB
@leggetter leggetter merged commit 22697e2 into main May 14, 2026
6 checks passed
@leggetter leggetter deleted the feat/orb-webhooks branch May 14, 2026 09:47
leggetter added a commit that referenced this pull request Jun 5, 2026
… merged since count was last set)

The README claimed "38 skills" in two places (bundle install copy) but
the bundle in .claude-plugin/marketplace.json now lists 40. The two
extra skills (knock-webhooks #64, orb-webhooks #63) were merged to main
after the bundle count was last updated, and the webhook-dx-audit PR
branch picked them up via merge from main without re-syncing the count.

Test plan item from PR #67 calls for "the bundle still totals 38
skills" - that's now 40. README updated in both occurrences (lines 111,
120) to match the actual bundle contents.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leggetter added a commit that referenced this pull request Jun 5, 2026
* feat: add webhook-dx-audit skill

Adds a meta/audit skill that reviews the developer experience of any
platform sending outbound webhooks and produces a scored review with
prioritized recommendations across signing, retries, event catalog,
observability, local dev, and agent readiness.

This is a different category from the existing repo skills (which
receive, send, or verify webhooks): the audit evaluates how other
platforms expose their webhook DX. README adds a new "Webhook DX
& Audit Skills" section to make the distinction clear, and the
bundle includes the new skill.

* docs: broaden audit skill scope to event destinations

The skill audits webhooks AND event destinations (SQS, RabbitMQ,
Pub/Sub, EventBridge, Kafka), not just webhooks. Surface the
distinction in the README section title and marketplace metadata,
and add an event-destinations keyword for discovery.

* feat(webhook-dx-audit): align with Event Destinations initiative

The scope is webhooks AND event destinations, not just webhooks.
Industry terminology is shifting (Stripe "event destinations" with
direct EventBridge/Event Grid delivery, Shopify "Event
Subscriptions"). Update the rubric and SKILL.md to assess the
broader concept and benchmark against the Event Destinations
initiative (https://eventdestinations.org).

Changes:
- SKILL.md: add scope paragraph referencing the spec and the
  terminology shift; clarify that the audit applies regardless of
  what the platform calls it.
- rubric.md intro: add the spec reference and a summary of its
  required/recommended capabilities.
- rubric category 5 (Security & authentication): split into
  webhook-specific signing criteria and a new "Destination-native
  auth" criterion covering IAM, service accounts, managed
  identities, and SASL/mTLS for non-HTTP destinations. Webhook
  criteria become Not assessed for queue-only platforms.
- rubric category 6 (Delivery semantics & reliability): add a
  "Destination type breadth" criterion - the spec's central
  required capability.
- rubric category 8 (SDKs & verification libraries): clarify that
  this stays webhook-focused because webhooks remain the most
  common destination and hand-rolled HMAC is where integrators
  get burned; non-HTTP destinations use native SDK auth and are
  scored under category 5.
- methodology.md: add a step 0 to identify destination types up
  front, since that determines which criteria apply.

* fix(webhook-dx-audit): tighten rubric and methodology from Stripe test run

Test-drove the skill against Stripe (Pass 1, public surface only;
84/B). Stripe's three-destination story (webhooks + EventBridge +
Event Grid) surfaced anchor ambiguities and methodology gaps. This
commit applies all 22 ranked fixes from that test, batched because
they cohere and each is small.

rubric.md (10 fixes):
- Cat 4: machine-readable spec anchors name OpenAPI 3.1 webhooks
  block, AsyncAPI, and per-event JSON Schema explicitly; a single
  polymorphic event envelope scores 1, not 2.
- Cat 5: add explicit "if both webhooks AND native destinations
  are offered, score all six criteria" rule at the top.
- Cat 5 destination-auth-options: clarify it scores configurable
  bearer/headers/OAuth2/mTLS independently of the signature scheme.
- Cat 6 failure handling & auto-disable: split anchors into the
  two distinct gaps (post-retry behavior docs, auto-disable
  feature with reactivation).
- Cat 6 failure alerting: limit scoring to push channels; dashboard
  widgets count under cat 10 (observability) instead.
- Cat 6 manual replay: anchors recognize partial coverage
  (sandbox-only, UI-only, CLI-only) at level 1.
- Cat 7 IaC: split community-vs-official cliff; community provider
  with current coverage can score 1, vendor-maintained scores 2.
- Cat 10 latency: spell out the three signals (attempt count,
  next-retry time, per-attempt response latency).
- Cat 12 push-to-agent: add a 1 anchor for partial coverage.
- Cat 12 MCP: 1 anchor names "agent SDK or function-calling
  toolkit"; 2 requires MCP or a deliberate scoped surface.

methodology.md (5 fixes):
- Read-what-a-human-reads: distinguish evidence collection (any
  source) from scoring (HTML page).
- Step 0 destination types: expand search-term list (endpoint,
  partner event source, stream, etc.).
- Step 2 specs: name OpenAPI 3.1 webhooks block, AsyncAPI,
  per-event JSON Schema as the three things to look for.
- Step 9 agent readiness: define "scoped sensibly" for llms.txt.
- What good looks like: handle calibration circularity when the
  audit subject is itself a reference platform (calibrate against
  the broader Event Destinations bar instead).

SKILL.md (2 fixes):
- Add evidence-vs-scoring distinction for .md exports.
- Document the Pass-1-only exit path: skip the human checklist,
  mark gated criteria Not assessed, proceed to scoring.

scoring.md (2 fixes):
- Add a second worked example with a Not-Assessed exclusion.
- Add the renormalization formula and a worked example for when
  a category is fully dropped.

report-template.md (3 fixes):
- Access field examples signal Pass 1 vs Pass 1+2.
- Caption under the scorecard clarifies Overall is weight-adjusted.
- Findings section: always list every criterion, mark unreached
  ones Not assessed inline.

program-mapping.md (3 fixes):
- New row: endpoint health (auto-disable, alerting, reactivation)
  -> Hookdeck Event Gateway in front of the consumer endpoint.
  Addresses the highest-impact cat 6 gap most platforms have.
- New row: OpenAPI lacks webhooks block -> webhook skill in
  hookdeck/webhook-skills as an agent-shaped substitute.
- Broaden the existing webhook-skill row to acknowledge it can
  also surface cat 3/6 docs gaps to agent consumers, not just cat 12.

Test artifacts (Stripe audit + findings doc) saved in /tmp;
not committed.

* feat(webhook-dx-audit): add workflow/scenario simulation criterion

Investigation of Stripe, Shopify, and Paddle revealed a real
maturity differentiator the rubric was not capturing:

- Paddle ships named "Scenarios" (subscription_creation = 12
  events, renewal = 7, etc.) that fire curated lifecycle sequences
  in one trigger.
- Stripe has implicit prerequisite chaining (firing
  payment_intent.succeeded also fires payment_intent.created)
  plus CLI fixtures for scripted multi-step composition.
- Shopify's webhook trigger is explicitly single-event only with
  fixed payload, recommending real Shopify actions for end-to-end
  tests.

Add a "Workflow / scenario simulation" criterion under category 11
(local dev / testing) as a sibling to test/sandbox parity. Framed
as a maturity differentiator, not a baseline: 0 is acceptable for
most platforms; 1 covers Stripe-style fixtures or implicit chains;
2 covers Paddle-style named lifecycle scenarios.

Update methodology step 8 with search terms (scenario, fixture,
lifecycle, workflow, trigger sequence) and the three platform
patterns as calibration anchors.

* fix(webhook-dx-audit): three-state taxonomy + dual-score aggregation

The rubric was collapsing three different states into one "Not assessed"
label, which caused Pass-1-only grades to inflate (the Stripe re-audit
went from 84/B to 85/A largely because of this). Split the states and
adjust the math so the labels mean what they say.

Three states (rubric.md):
- Not Supported: capability should exist but doesn't. Score 0;
  numerator 0, full weight in denominator. (Existing 0 behavior;
  the label clarifies intent in evidence.)
- Not Applicable: a logical rule excludes the criterion *as a
  concept* (e.g. Cat 5 destination-native auth on a webhook-only
  platform — there are no non-HTTP destinations to score auth
  for). Drop from both numerator and denominator. Critically NOT
  for "the platform should have this but doesn't" cases — those
  are Not Supported = 0.
- Not Assessed: should assess but cannot reach right now (HITL
  gap, gated dashboard). Treated differently across the two
  roll-ups below; signals HITL would lift the score.

Two roll-ups from the same per-criterion data (scoring.md):
- Public-scope grade. "How good are the parts we could see?"
  Drops both Not Applicable and Not Assessed from numerator and
  denominator. Honest score over what was reachable.
- Provisional minimum. "What's the floor if HITL never runs?"
  Drops Not Applicable only. Treats Not Assessed as 0 in
  numerator with full weight in denominator. HITL Pass 2 can
  only raise this number.

When HITL completes (no Not Assessed criteria remain), the two
scores converge on a single final grade.
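
The two roll-ups above can be sketched for a single category as follows. This is an illustration under stated assumptions, not the formula text from scoring.md: each criterion is scored 0, 1, or 2, every criterion carries equal weight, and the names `rollups`, `NOT_APPLICABLE`, and `NOT_ASSESSED` are hypothetical. Category weights and the overall renormalization are left out.

```python
NOT_APPLICABLE = "not-applicable"
NOT_ASSESSED = "not-assessed"
MAX_SCORE = 2  # each criterion is scored 0, 1, or 2


def rollups(results):
    """Return (public_scope_pct, provisional_minimum_pct) for one category.

    `results` holds one entry per criterion: an int score (a Not Supported
    criterion is simply 0), NOT_APPLICABLE, or NOT_ASSESSED.
    """
    scored = [r for r in results if isinstance(r, int)]
    not_assessed = sum(1 for r in results if r == NOT_ASSESSED)
    earned = sum(scored)

    # Public-scope: drop Not Applicable and Not Assessed from both sides.
    public_max = MAX_SCORE * len(scored)
    # Provisional minimum: drop Not Applicable only; Not Assessed counts
    # as 0 in the numerator with full weight in the denominator.
    provisional_max = MAX_SCORE * (len(scored) + not_assessed)

    def pct(numerator, denominator):
        return round(100 * numerator / denominator) if denominator else None

    return pct(earned, public_max), pct(earned, provisional_max)


print(rollups([2, 2, NOT_ASSESSED, NOT_APPLICABLE, 2, 0]))  # (75, 60)
```

With no Not Assessed entries the two denominators are equal, so the two percentages coincide, which matches the convergence described above.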

Report template (assets/report-template.md):
- Scorecard now shows both columns per category and overall.
- Header shows the headline number twice: Public-scope leads
  when no HITL is planned; Provisional minimum leads when HITL
  is planned (conservative bound the customer can rely on).
- Coverage line under the scorecard counts how many criteria
  landed in each state.
- Optional "Context" line in the frontmatter for audits that
  are existing-customer deliverables.
- Recommendations template encourages "Concrete change (platform
  side) / Hookdeck offering (already available or in path)"
  framing for existing customers.

Cat 5 (rubric.md):
- Header now spells out three branches (webhook-only, non-HTTP-
  only, multi-destination). Per-criterion N/A clauses encode the
  logical rules. Stripe and Shopify both cited as multi-
  destination examples (Shopify ships HTTP + EventBridge + Pub/
  Sub destinations).

Cat 12 CLI for agents:
- Was incorrectly allowed an N/A escape hatch in the prior draft.
  Reverted: a CLI is a recommended capability for any developer
  platform; absence is a gap (Not Supported = 0), not a logical
  exclusion. The 0 anchor language now makes this explicit.

Other updates:
- SKILL.md scope paragraph reflects Shopify is multi-destination,
  not webhook-only. Adds a one-paragraph summary of the three
  states + two roll-ups.
- methodology.md adds a "Pick the right label" note explaining
  when to use each state and why the arithmetic differs.

No customer-specific content; this work is generic to the skill.

* fix(webhook-dx-audit): correct N/A definition examples (Cat 12 CLI is 0, not N/A)

* feat(webhook-dx-audit): N/A logic table as single source of truth

Add an explicit N/A logic table after the Categories list. Apply
mechanically based on the destination types identified at
methodology step 0; do not re-derive N/A from per-criterion text.

The table lists the four possible step-0 facts and which criteria
become N/A for each. Currently seven criteria (Cat 5 x 6 + Cat 8
x 1) can be N/A, all driven by two boolean facts (offers webhooks?
offers non-HTTP destinations?).

To make the table the single source of truth:
- Stripped the redundant per-criterion "(Not Applicable if X)"
  clauses from Cat 5 (5 criteria) and Cat 8 (1 criterion).
- Trimmed the Cat 5 header: removed the three-branch list
  (webhook-only / non-HTTP-only / multi-destination) since that's
  now encoded in the table. Kept the security-philosophy paragraph
  because it explains why the criteria differ by destination type.
- Cat 8 verification helper 0 anchor reframed to acknowledge the
  upstream Cat 5 dependency (no signature scheme to verify ->
  the helper question is downstream of that gap).

New criteria with N/A conditions should add a row to the table
rather than introducing a new inline clause. The commit message of
any future change should reference the affected table row.

* feat(webhook-dx-audit): access-level table as source of truth for Not Assessed

Add a deterministic table tagging which criteria require account-level
(L1) or active-usage (L2) access to score, alongside the existing N/A
logic table. Pass-1 audits at L0 (public only) now have a mechanical
rule for which criteria become Not Assessed; the agent does not have
to derive it per-criterion.

Three access levels (rubric.md):
- L0: public docs, SDK source, machine specs, llms.txt
- L1: logged-in session; can read dashboard, settings, account-gated
  docs
- L2: L1 plus at least one delivered event observed; delivery logs,
  retries, alerting visible in practice

How L1 or L2 was obtained does not matter to the rubric. The auditor
may have signed up themselves, used agent-driven signup (e.g. Stripe
Projects, https://projects.dev), or been given access by the
platform's operator. Future-proof: as agent-signup capabilities
mature, more audits can declare L1/L2 without changing the rubric.

~12 criteria are tagged with required access levels (mostly Cat 1,
2, 7, 9, 10, 11). A few are "L1 or L0 if docs are thorough enough" -
those remain agent judgment within a tighter frame.

Other changes:
- Report template's Access line dropped the "customer-provided
  access" wording (that was an Outpost-audit context leak) and now
  uses the L0/L1/L2 levels directly. A note clarifies that the
  means of obtaining access does not matter, only the level
  reached.
- methodology.md "What good looks like" adds Stripe Projects
  (projects.dev) as an agent-driven provisioning calibration
  anchor for Cat 12 Action-layer scoring.

This makes Not Assessed deterministic at the level the framework
can reasonably enforce. The remaining agent judgment is limited to
the few criteria explicitly tagged "L1 or L0 if ..." in the table.

* fix(webhook-dx-audit): remove Hookdeck Outpost / Svix from calibration anchors

Hookdeck Outpost and Svix are webhook delivery products platforms
use to send events. Naming them as calibration anchors for
sender-DX scoring was the wrong reference frame: integrators
typically experience the *platform* (its docs, signing scheme,
dashboard), not the delivery infrastructure embedded behind it.
And this skill lives in hookdeck/webhook-skills, so naming Hookdeck
specifically as a benchmark would be a conflict of interest.

Use platforms integrators directly experience and benchmark
against: Stripe as the primary anchor; SendGrid (ECDSA signing),
GitHub (event taxonomy), Twilio (per-attempt status callbacks)
for specific features. The Event Destinations initiative
(eventdestinations.org) sets the broader floor.

Hookdeck Outpost stays in program-mapping.md as a gap-closing
recommendation for the platform side. Hookdeck Event Gateway
tools stay in the "Hookdeck tooling" section of methodology.md
as evidence-gathering aids during the audit (Console test URLs
for inspecting payloads, CLI for receiving on localhost) -
those are ingestion tools for the auditor, not benchmarks.

* fix(webhook-dx-audit): rule for L0 scoring from absence; HITL headroom; modern docs platforms

Three small refinements surfaced by the customer audit re-run.

A1: rubric.md access-level table now explicitly authorizes scoring
from L0 absence-of-documentation. If public docs are completely
silent on a capability tagged L1 or L2, score 0 (Not Supported)
from L0 rather than Not Assessed. The access-level requirement is
for VERIFICATION of a documented capability; confirming
non-existence is an L0 finding. Removes the only judgment call I
had to make by interpretation during the re-run.

A2: report-template.md scorecard now surfaces "HITL headroom: NN
points" prominently between the table and the renormalization
caption. Small headroom means HITL won't materially change the
grade; large headroom means HITL is load-bearing. Easier to see
than the gap in the dual-score columns.

A3: Cat 12 push-to-agent criterion now defaults to Not Assessed
(not 0) for docs hosted on modern platforms (Mintlify 2025+,
Docusaurus 3+, GitBook, ReadMe) where Copy-as-Markdown and
Open-in-X are typically JS-rendered. A non-browser fetch may not
see the buttons; the right call is to defer to HITL rather than
score 0 from rendering blindness.

The first customer audit had to interpret all three of these rules;
the framework now encodes them.
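
Rule A1, combined with the L0/L1/L2 access levels, can be sketched as a small decision function. This is a hypothetical illustration of the rule as described in this commit message, not code from the skill; the names `Access` and `resolve_label` and the returned label strings are assumptions.

```python
from enum import IntEnum


class Access(IntEnum):
    L0 = 0  # public docs, SDK source, machine specs, llms.txt
    L1 = 1  # logged-in session
    L2 = 2  # L1 plus at least one delivered event observed


def resolve_label(required: Access, reached: Access, documented: bool) -> str:
    """Decide how a criterion is handled at the audit's access level."""
    if reached >= required:
        return "score"  # enough access: score 0/1/2 as usual
    if not documented:
        # Public docs are completely silent: confirming non-existence is
        # an L0 finding, so this is Not Supported (score 0).
        return "not-supported"
    # Documented but not verifiable at this access level: defer to HITL.
    return "not-assessed"
```

For example, a criterion tagged L2 in an L0 audit resolves to "not-assessed" when the public docs describe the capability and to "not-supported" when they say nothing about it.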

* fix(webhook-dx-audit): six refinements from HITL audit feedback

All surfaced by the HITL Pass 2 of a real customer audit. Each
addresses a real ambiguity or editorial leak in the rubric.

scoring.md grade bands:
- Dropped the editorial "Reading" column entirely. Grade letters
  alone; the "band is a headline, not the point" note already
  carried the framing. Per audit feedback that "painful or
  risky"-style language doesn't belong in audit output.
- Added explicit "do not write qualitative judgments of the grade
  into the audit report" line.
- Added boundary-zone note for 28-32 (F/D) and 83-87 (B/A) — these
  are sanity-check zones where rounding shifts the band.

Cat 4 payload shape guidance:
- Relaxed the 2 anchor. Was "explicit thin-vs-fat rationale, OR
  standard envelope like CloudEvents". Now "envelope is consistent
  across all event types and documented". CloudEvents alignment
  and thin-vs-fat rationale moved to bonus signals worth citing
  in evidence but not required for 2. Most platforms with strong
  event catalogs don't formally address the meta-framing; the
  prior anchor over-penalized them.

Cat 1 free/test access:
- Reworked anchors to handle two underlying questions (does free
  tier reach config? are test deliveries free?) as a sliding
  scale. 1 covers the partial case (e.g. paid plan required for
  config but test deliveries free once configured, the audited
  customer's shape) which the prior binary 0/2 anchor missed.

Cat 5 destination auth options:
- Requires auth framing in docs for any score above 0. A platform
  shipping an arbitrary header passthrough field without
  documenting it as an auth mechanism now correctly scores 0;
  previously a strict reading allowed 1 for mere field existence.

Audience scoping (new N/A logic Table 2):
- Two audiences: developer-platform (default, where integrators
  are software engineers) and no-code-saas (where integrators are
  power users in a UI). For no-code-saas, Cat 7 IaC, Cat 11 workflow
  simulation, and local-to-production transition become
  N/A. The third option "mixed" defaults to scoring all criteria
  unless the platform clearly serves one exclusively.
- Audience declared at methodology step 0 and in report
  frontmatter alongside Access level.
- Existing destination-type N/A logic becomes Table 1; audience
  becomes Table 2.

Methodology audit voice guidance:
- Explicit "stay factual, no editorial" rule. Per-category prose
  describes observation; reactions and synthesis go in the summary
  and recommendations. Examples cited: don't use "surprising",
  "impressive", "disappointing", "painful" in per-criterion or
  per-category text.

These changes are derived from real auditor experience; the
customer audit itself stays at /tmp (not committed).

* fix(webhook-dx-audit): Cat 1 reframe - discoverability only, no business model

The previous Cat 1 "Free/test access" criterion conflated two
distinct concerns: (a) business model (is the platform/feature
free to access), and (b) DX (can you test webhooks without
producing real production activity). Per repeated audit feedback,
business model is not a DX question and shouldn't penalize
platforms that offer webhooks behind a paid plan. The testability
question is already covered by Cat 2 Test event / trigger and Cat
11 Test / sandbox parity.

Cat 1 changes:
- Removed "Free/test access" criterion entirely.
- Reframed "Signup friction to webhook config" as "In-product
  discoverability of webhook configuration". Explicitly handles
  plan-gating: plan-gated features are fine as long as the
  configuration surface is visible in product navigation.
- Findability of webhook docs criterion now distinguishes deep-nav
  (1) from top-level (2) explicitly.

Cat 1 now has 2 criteria, both focused on discoverability:
1. Can a developer find the webhook docs from the top-level docs
   or product nav? (the "docs side" of discovery)
2. From a signed-in account on any tier, can a user discover that
   the platform offers webhooks and find where they would be
   configured? (the "in-product side" of discovery)

Also added a note clarifying what is NOT scored in Cat 1: pre-
purchase evaluation (business model) and production-data isolation
(covered in Cat 2 and Cat 11).

Updated the access-level table to reflect the criterion name
change and removal. Total rubric criterion count drops by one.

Derived from real auditor experience on a paid-plan-gated platform
where the prior rubric incorrectly penalized the business model.

* feat(webhook-dx-audit): add idempotency criteria (Cat 3 + Cat 4)

Cat 3 Documentation quality:
- Added "Idempotency guidance" criterion. Scores whether the docs
  (a) identify the unique delivery ID developers should dedupe on
  (a top-level event ID in the payload, a webhook-id-style header,
  or equivalent), and (b) explain the high-level dedup pattern
  (check ID -> process -> store ID -> return success for
  duplicates). 0/1/2.
- Removed "idempotency" from the Best-practices coverage anchor
  list since it now has its own criterion. Best-practices now
  covers: out-of-order delivery, consumer-side retries, timeouts.

Cat 4 Event catalog & schema:
- Added "Per-event unique ID" criterion. Scores whether the
  platform delivers a documented per-event unique delivery ID —
  in the payload, in headers (e.g. webhook-id, X-GitHub-Delivery,
  x-outpost-event-id), or equivalent. Distinct from any domain
  ID inside the payload (e.g. post.id is not a delivery ID).
  0: none. 1: ID delivered but docs don't identify it as the dedup
  key. 2: clearly documented as the dedup key.

The two criteria are explicitly linked: Cat 4 scores whether the
ID exists in the schema; Cat 3 scores whether the docs teach how
to use it. A platform can ship the ID (Cat 4 = 1) without
documenting it (Cat 3 = 0) — exactly the pattern surfaced by the
customer audit, where Outpost ships x-outpost-event-id on every
delivery but the customer didn't surface this to their integrators.

Net effect: rubric grows by two criteria. Platforms that document
idempotency at a high level (single mention) but don't identify
the dedup ID will now score 1 instead of 2 on the new Cat 3
criterion, surfacing a specific actionable finding.
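
The dedup pattern the new Cat 3 criterion looks for (check ID -> process -> store ID -> return success for duplicates) can be sketched as below. This is a minimal illustration: `handle_delivery` is a hypothetical name, and the in-memory set stands in for whatever durable store a real consumer would use.

```python
from typing import Callable, Set


def handle_delivery(
    delivery_id: str,
    payload: dict,
    seen: Set[str],
    process: Callable[[dict], None],
) -> int:
    """Process a delivery at most once and return the HTTP status to send."""
    # Check ID: a repeat delivery is acknowledged without reprocessing.
    if delivery_id in seen:
        return 200
    # Process first, then store the ID, so a failure in `process`
    # leaves the ID unrecorded and a retry can try again.
    process(payload)
    seen.add(delivery_id)
    return 200
```

The `delivery_id` here is the per-delivery unique ID scored under Cat 4 (a payload field or a header), not a domain ID such as `post.id`.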

* docs: expand HITL acronym on first use across audit skill

HITL is used 16+ times across SKILL.md, rubric.md, methodology.md,
scoring.md, and report-template.md without ever being expanded or
defined. Readers outside AI/ML circles can struggle to parse it.

Expand to "human-in-the-loop (HITL)" on the first occurrence in each
file so the abbreviation has a definition before subsequent uses.
Subsequent uses stay as HITL once defined.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: rename Cat 3 "Documentation quality" to "Implementation guidance"

"Documentation quality" reads as a sweeping judgment on the
platform's docs, but the 5 criteria the category scores
(verification walkthrough, processing & handler guidance,
idempotency guidance, best-practices coverage, accuracy & freshness)
all measure implementation-guidance content for integrators
consuming webhooks. The event catalog and API reference are scored
separately under Cat 4 "Event catalog & schema".

A platform with a comprehensive event catalog but no handler
patterns or signing walkthroughs scores 0% on Cat 3, which reads
confusingly because its webhook docs do exist. The new name
makes it clear that Cat 3 scores integration-implementation
content specifically.

Sweep applied via replace_all to rubric.md (category list and
section heading), scoring.md (weight table), and report-template.md
(scorecard row). 4 files, 4 lines net.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: scope Cat 3 intro to webhooks explicitly

After renaming Cat 3 from "Documentation quality" to "Implementation
guidance", the previous intro ("The webhook section as a developer
reads it, not the marketing page") no longer fit. The contrast with
marketing was meaningful when the name was generic; under the new
name, implementation guidance is obviously not marketing.

The new intro is explicit about scope. Cat 3's 5 criteria are
webhook-specific in practice (HMAC verification, 2xx HTTP handler
patterns, dedup ID delivered with HTTP webhooks). Non-HTTP
destinations (SQS, Pub/Sub, RabbitMQ, etc.) rely on destination-
native SDKs; their integration-guidance equivalents are scored under
Cat 5 (destination-native auth) and Cat 6 (delivery semantics).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: clean up Cat 2, 5, 7, 11 intro lines

Cat 2 Onboarding & first event: drop "verified" from "I received a
verified event" since verification (HMAC) is a webhook-only concept.
For non-HTTP destinations, the event is just received, not verified
in the same sense.

Cat 5 Security & authentication: replace the editorial intro
("The capability most often weak and most consequential") with a
scope description that mirrors the rest of the rubric: HTTP webhooks
(signing, replay protection, secret rotation) and non-HTTP
destinations (destination-native auth). Weight note kept.

Cat 7 Setup surfaces: "webhooks" -> "webhooks and event destinations"
to match the audit's full scope (the category's criteria already
cover both).

Cat 11 Local dev: drop the vague "The program calls this out
explicitly" trailing sentence (unclear what "the program" referenced).
Replace with an explicit scope note: criteria focus on HTTP webhooks
(localhost tunnels and replay); non-HTTP destinations rely on
cloud-provider emulators (LocalStack, GCP Pub/Sub emulator, Azure
Service Bus emulator) as equivalents.

The other 8 categories' intros were already clean or were updated
previously (Cat 3 just landed in 933b724 and 37761ff).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: broaden methodology scope and fix stale Cat 5 example in scoring

Methodology step 3: "Read the webhook docs properly" was webhook-only
in framing but the step's scope covers Categories 3, 4, 5, 6.
Categories 4 (event catalog), 5 (security including non-HTTP
destination auth), and 6 (delivery semantics across all destination
types) cover event destinations beyond webhooks. Broaden the step
title and add destination-type-breadth and per-destination-native-
auth as evidence to capture.

Methodology step 5: "API endpoints for webhook CRUD" and "Terraform
provider and whether it covers webhooks" narrowed Category 7 to
webhooks only. Cat 7's intro now reads "webhooks and event
destinations"; the step now matches: webhook and destination CRUD,
and Terraform coverage of webhooks and destinations.

Scoring Example 1: "Security has 5 criteria" was stale; Cat 5 has
6 criteria (the destination-native-auth criterion was added but
Example 1 was never updated). Examples 2-4 already use 6 criteria.
Example 1 now matches: 6 criteria, score 2/1/1/0/2/2, sum 8, max
12, both roll-ups 67%.
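
The Example 1 arithmetic can be checked with a short sketch. The
function name and the fixed 0-2 scale per criterion are assumptions
for illustration; category weighting from scoring.md is not modelled.

```python
def category_rollup(scores: list[int], max_per_criterion: int = 2) -> tuple[int, int, int]:
    """Return (sum, max, rounded percent) for one category's criterion scores."""
    total = sum(scores)
    maximum = max_per_criterion * len(scores)
    pct = round(100 * total / maximum)
    return total, maximum, pct


# Cat 5 Example 1: six criteria scored 2/1/1/0/2/2.
print(category_rollup([2, 1, 1, 0, 2, 2]))  # (8, 12, 67)
```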

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: report-template summary scope and program-mapping link format

Line 10 summary instruction said "the platform's webhook DX" but the
audit's scope is webhooks AND event destinations (SQS, Pub/Sub,
RabbitMQ, EventBridge, Kafka, Azure Event Grid). A literal reader
might omit non-HTTP destination coverage. Broaden to "webhook and
event-destination DX".

Line 60 referenced "(see program-mapping)" without a file extension
or backticks, reading as a placeholder. Line 43 already references
`rubric.md` with backticks; match that pattern: "(see
`program-mapping.md`)".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: catch remaining Cat 3 references after rename

Two spots survived the original "Documentation quality" sweep because
they used lowercase or paraphrased forms.

SKILL.md line 45: agent-responsibilities list said "documentation
quality" (lowercase) which the title-case sweep missed. Rename to
"implementation guidance" to match the new category name.

program-mapping.md line 16: "category 3/6 documentation gaps" was
ambiguous after the rename. Replace with "category 3 implementation-
guidance and category 6 delivery-semantics gaps" so both category
references are explicit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: scope Summary list to webhook-surface features only

The Summary instruction told the writer to summarize the platform's
webhook DX but did not say what counts as a webhook-surface feature.
Audit agents reading the instruction listed positive platform signals
they noticed (OpenAPI specs, MCP servers, CLIs) without distinguishing
which ones actually apply to the webhook and event-destination
surface.

This produced misleading Summaries where, for example, an OpenAPI 3.1
spec without a `webhooks` block was listed as evidence of a working
webhook surface even though the spec does not carry webhook payload
contracts (which scores 1 under Cat 4 for that exact reason). The
customer reads the listed item as a strength, then later finds the
caveat that excludes it.

The instruction now scopes the list: include only items that
contribute to the webhook and event-destination surface. An OpenAPI
spec without a `webhooks` block, an MCP server without webhook tools,
or a CLI that does not manage webhook configuration are platform
features that do not apply in the Summary; they belong in their
respective category findings, with the scores that reflect their
limitations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: restructure Cat 12 Action layer (combine CLI+MCP, add API access)

Cat 12's Action-layer scoring had three issues:

1. CLI and MCP were scored as separate criteria, requiring each
   surface to exist for full credit. In practice an agent only needs
   one agent-shaped interface beyond the raw API; either suffices.

2. The MCP criterion accepted "an MCP server exists" without
   requiring webhook scope, so a platform-wide MCP that excludes
   webhook management could score 2 even though Cat 12 measures
   webhook agent-readiness. The Ordinal audit hit this tension
   (hosted MCP for the core API but no webhook tools, scored 2).

3. The foundational layer (whether the webhook configuration API
   is publicly callable by an agent) was implicit, scattered across
   Cat 7 API configuration and Cat 4 machine-readable spec. The
   agent-readiness view of the API was not captured as its own
   signal in Cat 12.

The Action layer now has two criteria:

- API access for agents: foundational. Documented public HTTP API
  for webhook configuration. Overlaps with Cat 7 / Cat 4 but
  captures the agent's-eye view distinctly. 0 if dashboard-only or
  undocumented, 1 if SDK-only, 2 if documented HTTP API.

- CLI or MCP for the webhook surface: higher-leverage. CLI or MCP
  (either suffices) covering webhook management with structured
  output / agent-friendly tools. 0 if neither covers webhooks
  (explicitly including platform-wide MCPs without webhook tools),
  1 if partial coverage, 2 if full.
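
A minimal sketch of the two Action-layer anchors, assuming the labels
below (they are hypothetical; the rubric only defines the 0/1/2
anchors) and assuming that when CLI and MCP coverage differ, the
stronger surface decides the score, which is one reading of "either
suffices".

```python
# Hypothetical labels mapped onto the 0/1/2 anchors described above.
API_ACCESS_SCORE = {
    "dashboard_only": 0,
    "undocumented": 0,
    "sdk_only": 1,
    "documented_http_api": 2,
}
COVERAGE_SCORE = {"none": 0, "partial": 1, "full": 2}


def cli_or_mcp_score(cli_webhook_coverage: str, mcp_webhook_coverage: str) -> int:
    """Score the combined criterion; coverage means webhook coverage only.

    A platform-wide MCP without webhook tools counts as "none".
    """
    return max(
        COVERAGE_SCORE[cli_webhook_coverage],
        COVERAGE_SCORE[mcp_webhook_coverage],
    )


# No CLI, and an MCP server that has no webhook tools: scores 0.
print(cli_or_mcp_score("none", "none"))
```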

Methodology step 9 Action sentence updated to walk the new
criteria.

Cat 12 still has 6 criteria total; weight unchanged. Existing
audits that scored MCP at 2 because a platform-wide MCP exists
should re-evaluate under the new combined criterion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: instruct HITL to capture and share a real delivery payload

Audits that only have docs evidence produce conditional recommendations
("in default mode the header is X; in Standard Webhooks mode it's Y").
A single actual delivery payload (request headers + body) lets the
auditor recommend directly: name the specific signature header, dedup
ID, timestamp format, and custom headers that are actually in use.

The Ordinal audit hit this exact case: the audit framed signing
conditionally because HITL had not shared an example delivery. Once
Phil shared a screenshot of a real delivery (Standard Webhooks mode
active; webhook-signature, webhook-timestamp, webhook-id headers
present; x-api-key set via the custom-headers feature), the
recommendation became concrete: "document the webhook-signature
you're already sending" rather than "add a signature scheme".

Two updates to the audit skill:

Roles section: add a "Critical HITL capture: an example delivery
payload" paragraph explaining what to capture and why. Whenever the
human fires a test event or observes a real delivery, they capture
and paste back the full delivery payload (all request headers and
the body) so the auditor can score signing, idempotency, event
schema, and destination-auth criteria against the actual delivery
shape rather than docs alone.

How an audit runs step 3 (the HITL checklist examples): add a third
example to the checklist, phrased as "capture and paste back the
full request payload of one real delivery, including all headers
and the body, so I can name the actual signature header, dedup ID,
and any custom headers in the recommendations".

Future audits should now produce concrete signature/dedupe/auth
recommendations whenever HITL is available, since the checklist
specifically requests the payload capture.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: require verified audience designation with cited signals

The audience declaration drives the audit's N/A logic (Table 2 in
the rubric) but the methodology had it as a brief read-and-judge
step with `developer-platform` as a silent default. In practice
agents either took the default without verification or relied on
HITL Pass 2 to set the designation, leading to misframed audits
when the default didn't match reality.

The Ordinal audit hit this: HITL Pass 2 declared no-code-saas
without site verification; the no-code designation triggered the
Cat 11 audience-N/A logic; later correction to mixed required
re-scoring two criteria. A site-verified audience designation at
audit start would have produced the correct framing from Pass 1.

Three updates:

Methodology step 0: explicit checklist of signals to verify the
designation against (hero copy, nav structure, testimonials,
pricing tiers, API prominence, onboarding CTA framing). Requires
citing at least three signals with quoted marketing copy. `mixed`
listed as a first-class option, with guidance to prefer it when
the platform clearly serves more than one audience. The
`developer-platform` default is allowed only as a Pass-1 fallback
when the homepage cannot be reached; Pass 2 must verify.

SKILL.md "Audience matters" paragraph: `mixed` named as one of
three options (not just a fallback). Adds the verification
requirement and points at the methodology checklist. Notes that
mixed audiences score by judgment per criterion.

report-template Audience header: now requires inline citations of
the signals that informed the designation (e.g. "mixed (primary
marketing teams per hero copy 'X'; secondary agencies via 'Y' nav;
tertiary developers via mid-page API mention)"). The bare
designation alone is no longer sufficient.
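
The three-signal requirement could be checked mechanically along
these lines. This is a sketch only: the `no-code-saas` spelling as an
enum value, the `quote` key, and the function name are assumptions,
not the schema's actual shape.

```python
# Assumed designation values; only `developer-platform` and `mixed`
# appear as literal identifiers in the text above.
ALLOWED_DESIGNATIONS = {"developer-platform", "no-code-saas", "mixed"}
MIN_SIGNALS = 3


def audience_problems(designation: str, signals: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the header passes."""
    problems = []
    if designation not in ALLOWED_DESIGNATIONS:
        problems.append(f"unknown designation: {designation}")
    # Only signals that carry quoted marketing copy count toward the minimum.
    quoted = [s for s in signals if s.get("quote")]
    if len(quoted) < MIN_SIGNALS:
        problems.append(
            f"need at least {MIN_SIGNALS} signals with quoted copy, got {len(quoted)}"
        )
    return problems
```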

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: Cat 3 ingest-at-scale guidance becomes first-class

Cat 3 "Processing & handler guidance" was scored against a generic
anchor ("covers the handler lifecycle"). Platforms that mention
"respond fast" and "process async" in passing could score 2 without
teaching the actual ingest-verify-queue pattern, naming their
response timeout, or pointing integrators at concrete architectures.

This is the criterion most directly tied to Hookdeck Event Gateway's
value prop (and to cloud-native alternatives like AWS EventBridge +
API Gateway, GCP Pub/Sub + serverless function), but the rubric
didn't surface that connection.

Three updates:

rubric.md Cat 3 Processing & handler guidance: criterion text now
spells out the ingest-verify-queue pattern as the production-traffic
contract integrators need: acknowledge quickly with 2xx, verify the
signature, queue work to a background processor so burst traffic
and slow downstream work do not exceed the timeout. 2-anchor now
requires the platform to (a) name the timeout window, (b) explain
the pattern, and (c) point at concrete reference architectures
(Hookdeck Event Gateway, cloud-native EventBridge+API Gateway or
Pub/Sub+serverless function, or queue+worker on the integrator's
own infrastructure). 1-anchor covers partial coverage.
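
For reference, the handler contract described above can be sketched
in a few lines. The signature scheme here (hex HMAC-SHA256 over the
raw body) and the in-process queue are assumptions for illustration;
a real platform's scheme and a durable queue would replace them.

```python
import hashlib
import hmac
import queue

# Stand-in for a durable queue drained by a background worker.
work_queue: "queue.Queue[bytes]" = queue.Queue()


def handle_delivery(raw_body: bytes, signature_hex: str, secret: bytes) -> int:
    """Verify, enqueue, and return an HTTP status without doing slow work."""
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    # Constant-time comparison on bytes avoids timing leaks.
    if not hmac.compare_digest(expected.encode(), signature_hex.encode()):
        return 401
    # Hand off the untouched raw body; processing happens off the request path.
    work_queue.put(raw_body)
    return 200
```

The handler returns as soon as the body is queued, so burst traffic
and slow downstream work do not count against the platform's
response timeout.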

methodology.md step 3 (Read the webhook docs): explicit prompt to
look for the response timeout window, the ingest-verify-queue
pattern, and architecture references. Tactics search-term list adds
"timeout", "respond", "async", "queue", "EventBridge", "Pub/Sub",
"Event Gateway", "ingest".

program-mapping.md: new row mapping the ingest-at-scale gap to
Hookdeck Event Gateway as the integrator's ingest layer (or cloud-
native alternatives EventBridge+API Gateway, Pub/Sub+serverless
function). Distinguished from the existing endpoint-health row:
that one is about reliability for an existing handler, this one is
about teaching integrators the pattern itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: phrasing - 'ingest-at-scale' -> 'ingest reliably'

The pattern matters at any volume, not just at scale. A 5-second
timeout kills a delivery whether the integrator is handling 1 req/sec
or 1000. 'Reliably ingest' captures the goal (don't time out, don't
lose deliveries) better than 'at scale', which implies high volume
specifically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: add PLAN-v2.md for review

Captures the v2 pass: migrate audit format to structured YAML
(primary driver: cloud agent + public website for URL-submitted
audits), consolidate v1's rubric and methodology learnings, and
preserve Ordinal's HITL Pass 2 evidence so it does not need to be
re-collected.

Eight phases sketched: schema design, consolidate v1 learnings,
migrate audit template to YAML, update SKILL.md and methodology,
preserve and port Ordinal HITL evidence, decide downstream
backwards-compat path, re-run Ordinal under v2, cascade to
downstream skill.

Includes a complete inventory of HITL-derived facts to carry
forward (active usage observations, signing and delivery shape
from the captured payload, audience verification, scoring
decisions). Cross-check this list at every phase boundary.

Plan is for review before execution; commit per phase once
execution starts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: PLAN-v2 - all-in on YAML upstream, lockstep downstream cascade

Per direction: upstream skill emits YAML only. No Markdown audit
output, no renderer, no Markdown template, no transitional
backwards-compat phase. The downstream outpost-customer-audit-report
skill is updated in lockstep to consume YAML.

Changes from the previous PLAN-v2 draft:

- Target layout drops renderers/ and assets/report-template.md
- Phase 2 simplifies: produce assets/report-template.yaml and delete
  the Markdown template; no renderer
- Phase 3 SKILL.md update: explicit YAML-only output
- Phase 5 (was "decide on backwards-compatibility") removed; no
  decision to make - downstream cascades in lockstep
- Phase 5 (new, was Phase 6) re-runs Ordinal; produces audit.yaml
  only; v1 audit.md gets archived to customers/ordinal/archive/
- Phase 6 (new, was Phase 7) cascades to downstream skill in
  hookdeck-skills-internal: input becomes YAML, customer report
  stays Markdown (still the customer-facing deliverable)

Customer report format kept as Markdown for now since it is the
customer-facing artifact. Open question for review: if the cloud
agent's website ends up rendering customer reports as well, that
decision can flip to YAML in a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: PLAN-v2 - resolve customer report format decision

Customer report stays Markdown. The cloud agent has no current plan
to render customer reports; the customer-facing artifact is sent or
shared as a file. Decision settled, not an open question.

Open Question 3 ("Customer report format") removed from the open
list and added to a new "Resolved decisions" section at the top,
which lists the choices that v2 execution should not relitigate
(upstream YAML-only, customer Markdown, downstream lockstep).

Remaining open questions renumbered.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: PLAN-v2 - add commit refs, schema sketch, open-question recommendations

Three additions to make PLAN-v2 self-sufficient for a fresh agent
picking up the v2 work cold:

Phase 1 consolidation list: each item now has the v1 commit hash
inline so the rationale is one git show away. Editorial qualifier
rules also annotated as downstream-only with a pointer at the
internal repo's methodology.

Schema sketch (illustrative): inline YAML showing the rough shape of
audit.yaml and hitl-evidence.yaml. Field names, nesting, status
enums, and scoring decision records all present. Marked as a
starting point that Phase 0 refines against the schema linter; not
authoritative.

Open question recommendations: each open question now has a
"Recommendation:" line so a fresh agent has a default to push back
against rather than picking from scratch (schema tooling, YAML lib
and lint config, cloud-agent field reservation, archive location;
re-audit timing already had one). Open questions remain genuine
questions; the recommendations are starting points.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(webhook-dx-audit): schema and lint tooling for v2 audit YAML

PLAN-v2 Phase 0. Adds the JSON Schema (Draft 2020-12, authored in YAML)
that defines the v2 audit format and a Node-based linter that validates
audit or hitl-evidence YAML against it.

- schema/audit.schema.yaml: full audit shape with locked CategoryId and
  CriterionId enums, status taxonomy, dual-score support, embedded HITL
  evidence, and reserved cloud-agent fields.
- schema/hitl-evidence.schema.yaml: companion shape for the standalone
  hitl-evidence pre-load file.
- schema/*.example.yaml: illustrative Stripe-shaped examples that
  validate against the schemas.
- schema/README.md: file layout, status taxonomy, dual-score handling,
  and how to run the linter.
- scripts/lint-audit.mjs + package.json: ajv + js-yaml CLI that
  auto-detects which schema to use and reports JSON-pointer paths on
  failure.

Examples use Stripe (the methodology calibration anchor) rather than any
customer; the public repo carries no customer-identifying content.

Refs: skills/webhook-dx-audit/PLAN-v2.md Phase 0.

* docs: PLAN-v2 - drop added_by_report from upstream schema spec

The flag belongs in the downstream report skill's own schema; upstream
audit stays free of downstream concepts. Aligns the plan with the v2
schema that landed in 7f23b1c.

* docs(webhook-dx-audit): consolidate v1 rubric and methodology learnings

PLAN-v2 Phase 1. Walked rubric, methodology, scoring, and program-mapping
end to end against the eleven enumerated v1 commits; every rule reads
cleanly in isolation. Three categories of edit landed:

- YAML field-name translation. Prose references to v1 Markdown structure
  ("the report's Access frontmatter line", "the report's Access limits",
  the audit's `Audience:` header line) now point at v2 YAML fields
  (`audience.designation`, `audience.signals`, `access_limits`,
  `summary`, `recommendations`).
- Editorial rules tightened. The "Stay factual; no editorial" tactic now
  carries the two specific sub-rules from the downstream methodology
  Section 3: no company-stage commentary and no unanchored qualifiers.
  These apply to upstream audit prose too; lifting them keeps audit-side
  voice consistent with how the customer-facing report reads them.
- Summary scoping rule added as a methodology tactic. The Summary should
  list only platform features that contribute to the webhook and
  event-destination surface; OpenAPI specs without `webhooks` blocks,
  platform-wide MCPs without webhook tools, and CLIs that do not manage
  webhook config belong in their respective category findings, not in
  the summary.

The other Phase 1 items (Cat 3 rename and intro, Cat 12 restructure,
Cat 2/5/7/11 intro cleanups, audience verification at Pass 1, HITL
acronym expansion on first use, methodology steps 3 and 5 broadened to
webhook AND event destinations, Cat 5 six-criteria example correction,
program-mapping reliable-ingestion row) were already in place; verified
without further edits. HITL payload capture lives in SKILL.md and is
Phase 3 territory.

Refs: skills/webhook-dx-audit/PLAN-v2.md Phase 1.

* feat(webhook-dx-audit): migrate audit template to YAML

PLAN-v2 Phase 2. Replaces assets/report-template.md (deleted) with
assets/report-template.yaml: a structural skeleton enumerating all 12
categories and all 54 criterion IDs with placeholder values that lint
clean against schema/audit.schema.yaml. Inline comments explain each
field's purpose, valid values, and the rubric/methodology section that
anchors it.

The template carries the v2-specific guidance directly:
- summary scoping rule (only items contributing to the webhook surface).
- editorial rules (no company-stage commentary, no unanchored qualifiers).
- status taxonomy quick reference (scored / not_supported /
  not_applicable / not_assessed).
- Cat 5 Table 1 reminder and Cat 7/11 Table 2 reminder for N/A logic.
- Cat 12 reminder not to re-score Cat 4 / Cat 8 surfaces.

The lint script now covers the template alongside both example files so
schema drift catches it.

Refs: skills/webhook-dx-audit/PLAN-v2.md Phase 2.

* docs(webhook-dx-audit): SKILL.md to YAML-only flow

PLAN-v2 Phase 3.

Frontmatter: description now states "produces a structured YAML audit
file" and points at schema/audit.schema.yaml so callers understand the
output shape from the trigger. Version bumped to 0.2.0 to match the
schema, template, and example files.

Roles section: HITL captures fill structured fields - delivery payload
to hitl_evidence.delivery_payload_capture, in-product observations as
findings[].criteria[].evidence strings keyed by criterion id, scoring
decisions as hitl_evidence.scoring_decisions records. Explicitly bars
free-form Pass-2 narratives in the summary; the dual-score data lives
in grade.public_scope / grade.provisional_minimum and the closed Pass-2
criteria live in passes.pass_2.closed_criteria.

How an audit runs: step 0 scaffolds the audit YAML from the template
and leads with the default flow (Pass 1 unattended, Pass 2 HITL
prompted by the agent at step 4). Pre-loaded HITL evidence is called
out as the exception. Step 4 collects HITL evidence and writes a
sibling hitl-evidence.yaml so the next re-audit can pre-load it;
when step 0 pre-loaded a companion file, step 4 updates it in place.
Steps 2-6 reference the YAML fields they populate (audit.findings,
audit.scorecard, audit.grade, audit.hitl_evidence, audit.summary,
audit.recommendations) and include a lint step before handoff.

Output and Reference files: structured YAML output described by field
group; new entries for schema/audit.schema.yaml and schema/README.md;
template reference updated to report-template.yaml.

Path conventions left to the caller: no customers/<name>/... prescribed
in upstream prose; the companion file is described as a sibling of the
audit file.

Acceptance: no references to "fill in the Markdown template" remain;
no "written review (Markdown)" framing; SKILL.md reads coherently
against the YAML-only flow.

Refs: skills/webhook-dx-audit/PLAN-v2.md Phase 3.

* docs(webhook-dx-audit): methodology tightenings from v2 Ordinal run

Two refinements surfaced during the v2 Ordinal dress rehearsal
(downstream commit 7621577 on feat/v2-ordinal-audit):

1. Generalize the JS-rendered-nav carve-out. The rubric currently
   notes this only for Cat 12 push-to-agent doc actions, but Cat 1
   findability hits the same issue when the docs site renders its top
   nav in JavaScript (Mintlify, Docusaurus 3+, GitBook, ReadMe).
   Methodology step 1 now instructs the agent to default Not Assessed
   from a plain fetch and verify in a browser during HITL rather than
   scoring 0 from an empty fetch.

2. Tactic: fetch the OpenAPI spec directly when scoring Cat 4
   machine-readable-spec. An LLM-summarized doc read can confuse the
   top-level `webhooks` key (per-event payload contracts, the 3.1
   feature this criterion scores) with a Tag named "Webhooks" that
   groups CRUD endpoints under `/webhooks`. Curl the spec and check
   `len(spec.webhooks)` programmatically; the same applies to AsyncAPI
   presence and per-event JSON Schema files.
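
A sketch of the programmatic check from item 2, operating on an
already-parsed spec (for example the result of `json.load` on the
fetched file). The function names are illustrative and fetching is
left out.

```python
def webhook_contract_count(spec: dict) -> int:
    """Count entries under the top-level OpenAPI 3.1 `webhooks` key."""
    webhooks = spec.get("webhooks")
    return len(webhooks) if isinstance(webhooks, dict) else 0


def has_webhooks_tag_only(spec: dict) -> bool:
    """True when a "Webhooks" tag exists but no payload contracts do."""
    tag_names = {tag.get("name", "").lower() for tag in spec.get("tags", [])}
    return "webhooks" in tag_names and webhook_contract_count(spec) == 0


# A spec that groups CRUD routes under a "Webhooks" tag but declares
# no per-event payload contracts.
crud_only = {
    "openapi": "3.1.0",
    "tags": [{"name": "Webhooks"}],
    "paths": {"/webhooks": {}},
}
print(webhook_contract_count(crud_only), has_webhooks_tag_only(crud_only))  # 0 True
```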

Both edits are methodology-only; rubric anchors, schema, and template
unchanged.

* feat(webhook-dx-audit): require `why` on recommendations + reviewer-artifact rule

Two methodology refinements surfaced during the downstream Outpost-customer
review of the v2 Ordinal report: recommendations consistently lacked an
articulated user-facing benefit, and HITL-captured deliveries muddied
operator-side practice with reviewer-introduced artifacts.

Schema:
- Recommendation gains a required `why` field, separate from `body`. `body`
  describes what to change; `why` names the integrator-side benefit and the
  user-facing pain the gap creates. The split is load-bearing so downstream
  renderers can use each independently and so recommendations read as
  arguments rather than orders.
- schema/audit.schema.example.yaml updated to show `why` on both
  illustrative Stripe recommendations.
- assets/report-template.yaml carries a `why: TBD` placeholder.

methodology.md:
- New "Writing recommendations" section codifies the benefit-not-rule
  framing, anchoring to specific user-facing pain, and the rule against
  duplicating benefit framing in `body`. Calls out the Cat 6
  destination-type-breadth pitfall: phrase the recommendation around adding
  non-HTTP destination types (SQS, Pub/Sub, EventBridge, Kafka, Event Grid)
  with the integrator benefit named, not as "rename your HTTP endpoint" -
  renaming an endpoint does not change what is delivered. Stripe's
  evolution is the cited example: existing webhook product stayed in
  place; new destination types were added alongside.
- New "Distinguishing reviewer artifacts from operator-side practice"
  section: HITL captures often involve reviewer-configured headers, test
  webhooks, and synthetic deliveries. Anything the reviewer set up to
  enable the capture is not evidence of operator behavior. Findings and
  recommendations citing observed deliveries must anchor on operator-
  controlled docs, API surface, or in-product copy; reviewer-set custom
  headers (the borderline case that surfaced this) must not be cited as
  evidence the operator surfaces a feature in practice. Annotate borderline
  HITL records in `audit.hitl_evidence.other_observations` keyed by
  criterion id so reviewer artifacts cannot be mistaken for operator state.

SKILL.md step 6 now points at "Writing recommendations" so the `why`
requirement is discoverable from the workflow walk-through.

Verification: `npm run lint:file` clean on schema/audit.schema.example.yaml
and assets/report-template.yaml against the updated schema.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(webhook-dx-audit): target_scores + depends_on + effort on Recommendation; projected grade

Add three optional Recommendation fields and a projected grade roll-up so
downstream renderers can show current-vs-potential impact, sequence
recommendations by hard dependencies, and surface coarse implementation
effort on each one.

Schema (schema/audit.schema.yaml):

- Recommendation gains three optional fields:
  - `target_scores`: list of {criterion_id, target_score, note?}
    records naming which criteria the recommendation lifts to which
    0/1/2 score. Targets respect the 2-anchor: documenting a single
    existing auth option reaches 1 (one documented option), not 2
    (multiple).
    When multiple recommendations target the same criterion, downstream
    renderers take the max.
  - `depends_on`: list of recommendation IDs that must land before this
    one delivers its full value. Hard dependencies only (e.g. Rec 2's
    verification step references Rec 1's signing documentation). Soft
    sequencing preferences belong in body or summary.
  - `effort`: coarse implementation effort, enum docs|s|m|l. `docs` is
    one page or section with little or no engineering work; `s` is a
    small product change (a button, a new endpoint, a config knob) on
    the order of days; `m` is a new feature surface on the order of
    weeks; `l` is an architectural change on the order of months.

- Two new top-level $defs back the fields:
  - TargetScore: required {criterion_id, target_score}, optional note
  - EffortLevel: enum docs|s|m|l with calibration description

- Grade gains an optional `projected` GradeRollup. Present when at least
  one recommendation has `target_scores`. Computed by taking the max
  target_score across recommendations per criterion (current score
  carries forward for criteria no recommendation touches), rolled up via
  the standard category weighting. Lets downstream renderers display the
  audit as current vs potential.

- ScorecardEntry gains an optional `projected_pct` per category, present
  when `grade.projected` is present. Powers a side-by-side scorecard in
  downstream renderers.
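
One way a downstream renderer might sequence recommendations by
`depends_on` is a topological sort. This is a sketch: the `id` key
and the sample IDs are assumptions, and only `depends_on` is taken
from the schema description above.

```python
from graphlib import TopologicalSorter


def sequence_recommendations(recs: list[dict]) -> list[str]:
    """Order recommendation IDs so every hard dependency comes first."""
    # TopologicalSorter expects {node: set_of_predecessors}.
    graph = {rec["id"]: set(rec.get("depends_on", [])) for rec in recs}
    return list(TopologicalSorter(graph).static_order())


recs = [
    {"id": "rec-2", "depends_on": ["rec-1"]},  # verification step needs signing docs
    {"id": "rec-1"},
]
order = sequence_recommendations(recs)
```

`TopologicalSorter` raises `graphlib.CycleError` on circular
dependencies, which doubles as a sanity check on the audit file.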

Methodology (references/methodology.md):

- "Writing recommendations" gains a "Populating `target_scores`,
  `depends_on`, `effort`" sub-section covering how to choose each
  recommendation's targets honestly (don't over-promise 2 when the
  rubric anchor isn't reachable), how to scope dependencies (hard only),
  and how to calibrate effort (against the EffortLevel enum, judged on
  what the platform team would do, not on the operator's team size).
- "Computing `grade.projected` and `scorecard[].projected_pct`"
  sub-section codifies the projection rule: per criterion, take the max
  target_score across all recommendations targeting it; criteria no
  recommendation touches carry their current score forward; roll up via
  scoring.md; carry N/A criteria the same way.
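
The per-criterion projection rule can be sketched as follows. The
`criterion_id` and `target_score` keys come from the schema
description above; the function name and sample criterion IDs are
hypothetical, and the category-weighted roll-up from scoring.md is
deliberately left out.

```python
def project_scores(current: dict[str, int], recommendations: list[dict]) -> dict[str, int]:
    """Apply the max target_score per criterion; untouched criteria carry forward."""
    best_target: dict[str, int] = {}
    for rec in recommendations:
        for target in rec.get("target_scores", []):
            cid = target["criterion_id"]
            best_target[cid] = max(best_target.get(cid, 0), target["target_score"])
    projected = dict(current)  # start from current scores
    projected.update(best_target)  # overwrite only targeted criteria
    return projected


current = {"signing": 1, "rotation": 0, "replay": 2}
recommendations = [
    {"target_scores": [{"criterion_id": "signing", "target_score": 1}]},
    {"target_scores": [
        {"criterion_id": "signing", "target_score": 2},
        {"criterion_id": "rotation", "target_score": 1},
    ]},
    {},  # a recommendation without target_scores changes nothing
]
print(project_scores(current, recommendations))
# {'signing': 2, 'rotation': 1, 'replay': 2}
```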

SKILL.md step 6 updated to instruct populating the three new fields on
every recommendation that closes a scored gap, and to compute the
projected grade when at least one recommendation has target_scores.

assets/report-template.yaml gains commented-out placeholders for the new
fields with calibration notes.

schema/audit.schema.example.yaml updated: Stripe illustrative audit now
shows `grade.projected` at 95% (A), per-category `projected_pct` on the
two categories the recommendations affect (Cat 5 Security 83 -> 100; Cat
7 Setup 75 -> 100), and the two recommendations carry `target_scores`
and `effort`. The Rec 1 (Terraform provider) example also demonstrates
the `effort: m` calibration for product work; Rec 2 (auth docs framing)
demonstrates `effort: docs`.

Backwards compatibility: all five new fields (target_scores, depends_on,
effort, grade.projected, scorecard[].projected_pct) are optional. Audits
that pre-date this change still lint clean. The projected roll-up only
appears in downstream renderers when target_scores are populated.

Verification: lint clean on schema/audit.schema.example.yaml and
assets/report-template.yaml against the updated schema.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(webhook-dx-audit): clarify effort reflects platform-context cost

Effort ratings on recommendations were calibrated implicitly against a
from-scratch baseline, but the audit reviews a specific operator's
surface and the operator's actual cost depends on what their delivery
backend ships. A recommendation that would be `l` for a platform
building delivery primitives themselves can drop to `m` or `s` when the
underlying capability is shipped by a backend like Outpost, Svix, or
Convoy.

Updated the "Populating `target_scores`, `depends_on`, `effort`" section
in references/methodology.md to make this explicit. The rater consults
`audit.context` when calibrating effort: when the context names a
delivery backend that ships the capability being recommended, rate the
remaining surfacing work (dashboard / API / docs), not the from-scratch
implementation cost. When the audit has no platform context, rate
from-scratch as the safer default and flag the assumption in the audit
`summary` so a downstream skill with platform knowledge can override.

No schema changes. The Stripe example in
schema/audit.schema.example.yaml keeps its `m` rating on the
Terraform-provider recommendation because Stripe builds the provider
themselves; no delivery-backend translation applies there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(webhook-dx-audit): API-endpoint disambiguation; soften destination-type rename rule

Two related methodology refinements surfaced during HITL review of the
Ordinal report.

API-endpoint disambiguation (Tactics section):

The word "endpoint" is overloaded in webhook contexts. It can mean the
integrator's HTTP receiver (the URL they expose to receive webhook
deliveries) or the platform's management API endpoint (the route
integrators call to create / list / delete webhook destinations). A
bare `POST /webhooks` could read as either. Reviewers in this project
hit the ambiguity twice on Rec 3 of the Ordinal v2 report: once on
"rename POST /webhooks to POST /event-destinations" (read as renaming
an integrator-receiver URL, which is nonsensical) and again on "your
current POST /webhooks stays in place" (same ambiguity).

New Tactics rule: whenever the audit or recommendations name an HTTP
route, qualify it with the role it plays. "the destination-creation API
endpoint POST /webhooks"; "the webhook-management API at
/api-reference/webhooks/"; "the integrator's webhook-receiving URL".
A reader who cannot tell at a glance which side of the wire an endpoint
sits on will misread the recommendation.

Destination-type-breadth rename rule (Writing recommendations):

The previous version of this rule said "do not phrase the
recommendation as 'rename POST /webhooks to POST /event-destinations';
renaming an HTTP endpoint does not change what is delivered, and the
framing confuses an API design decision with the underlying capability."

That was overcorrecting. Once the destination-creation API endpoint is
extended to create non-HTTP destinations (SQS, Pub/Sub, EventBridge,
Kafka, Event Grid), the endpoint name `POST /webhooks` arguably
misrepresents what it does, and renaming to `POST /event-destinations`
or `POST /destinations` is a valid API-design refinement that signals
the broader scope to integrators reading the docs. The rule now reads:
lead with the capability addition (not the rename) as the primary
recommendation; surface the API rename as an optional secondary
refinement; recommend keeping the original endpoint as an alias for
backwards compatibility.
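
As a rough illustration of the "keep the original endpoint as an alias"
part of the rule, the sketch below routes both paths to one handler. It
is not taken from any platform's API: `create_destination`, `ROUTES`,
`dispatch`, the `type` and `target` fields, and the default of `http`
when `type` is omitted are all hypothetical names and assumptions chosen
for the example.

```python
# Hypothetical destination types, mirroring the list in the commit message.
DESTINATION_TYPES = {"http", "sqs", "pubsub", "eventbridge", "kafka", "eventgrid"}


def create_destination(payload):
    """Create an event destination (illustrative stand-in, no persistence)."""
    # Assumption: a missing type means a plain HTTP webhook, so callers of
    # the original endpoint keep working without sending new fields.
    kind = payload.get("type", "http")
    if kind not in DESTINATION_TYPES:
        return {"status": 400, "error": f"unsupported destination type: {kind}"}
    return {"status": 201, "type": kind, "target": payload.get("target")}


# The new name is the primary route; the old name is kept as an alias
# that points at the same handler.
ROUTES = {
    ("POST", "/event-destinations"): create_destination,
    ("POST", "/webhooks"): create_destination,
}


def dispatch(method, path, payload):
    """Look up the handler for a management-API request and call it."""
    handler = ROUTES.get((method, path))
    if handler is None:
        return {"status": 404}
    return handler(payload)
```

Because both routes share one handler, an integrator calling the
destination-creation API endpoint `POST /webhooks` gets the same
behaviour as one calling the renamed route, which is the
backwards-compatibility property the rule asks for.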

Together these two rules sharpen the audit and report voice,
specifically for category-6 destination-type-breadth recommendations
and generally for any recommendation that names an API endpoint.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(webhook-dx-audit): "ingest-verify-queue" is a practice, not a pattern; type the implementations

"Ingest-verify-queue" is not a canonical industry-named pattern with
formal references. It is best-practice shorthand for the goal of
acknowledging fast, verifying, and handing off to async processing.
Calling it a "pattern" overclaims; the rubric, methodology, and
program-mapping now call it a "practice".

The four items previously grouped as "concrete implementations of the
pattern" are different categories of thing, and the umbrella implied
each was itself a pattern. Each now carries its own type:

- Hookdeck Event Gateway: a managed solution that ships the practice
  out of the box
- AWS EventBridge + API Gateway: a cloud-native composition
- GCP Pub/Sub + a serverless function: a cloud-native composition
- Queue + worker on the integrator's own infrastructure: a self-hosted
  setup

Only the queue + worker option is genuinely a pattern in the generic
sense; the others are products and cloud compositions.
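
For readers unfamiliar with the shorthand, here is a minimal sketch of
the self-hosted queue + worker option, assuming an in-process queue and
a plain HMAC-SHA256 signature over the raw body. `SECRET`, `sign`,
`ingest`, and `worker` are hypothetical names, and the signature scheme
is a generic stand-in rather than any specific provider's format.

```python
import hashlib
import hmac
import queue
import threading

SECRET = b"example-signing-secret"  # hypothetical per-endpoint secret
events = queue.Queue()              # stand-in for a durable queue
processed = []                      # records what the worker handled


def sign(raw_body):
    """Hex HMAC-SHA256 of the raw request body (generic scheme, assumed)."""
    return hmac.new(SECRET, raw_body, hashlib.sha256).hexdigest()


def ingest(raw_body, signature):
    """Verify, enqueue, and acknowledge; returns an HTTP status code."""
    if not hmac.compare_digest(sign(raw_body), signature):
        return 401                  # reject before any work is queued
    events.put(raw_body)            # hand off to async processing
    return 202                      # acknowledge fast


def worker():
    """Drain the queue; None is a sentinel that stops the loop."""
    while True:
        item = events.get()
        if item is None:
            break
        processed.append(item)      # stand-in for slow business logic
```

The receiver returns as soon as the event is verified and queued, so
slow processing in the worker cannot delay the acknowledgement. A real
deployment would swap the in-process queue for a durable one.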

Rubric Cat 3 criterion updated end to end with the new terminology.
Methodology step 3 and program-mapping Cat 3 row updated to match.

No schema changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: update bundle count to 40 skills (knock-webhooks + orb-webhooks merged since count was last set)

The README claimed "38 skills" in two places (bundle install copy) but
the bundle in .claude-plugin/marketplace.json now lists 40. The two
extra skills (knock-webhooks #64, orb-webhooks #63) were merged to main
after the bundle count was last updated, and the webhook-dx-audit PR
branch picked them up via merge from main without re-syncing the count.

The test plan item from PR #67 calls for "the bundle still totals 38
skills"; that total is now 40. README updated in both occurrences
(lines 111, 120) to match the actual bundle contents.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>