feat: add orb-webhooks skill #63
Merged
Adds a webhook skill for Orb (usage-based billing) with HMAC-SHA256
manual verification over `v1:{X-Orb-Timestamp}:{rawBody}`, plus
runnable Express, Next.js, and FastAPI examples with tests.
https://claude.ai/code/session_01NNTgQRJss1V7gyzzJ9rjnB
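A minimal Python sketch of the manual verification described above, not the code shipped in the skill. It assumes the signature arrives as a hex digest, optionally prefixed with `v1=`, and that the caller has already read the raw request body and the `X-Orb-Timestamp` header value. The function name and the prefix handling are illustrative assumptions.

```python
import hashlib
import hmac


def verify_orb_signature(secret: str, timestamp: str, raw_body: bytes, signature: str) -> bool:
    """Return True if `signature` matches HMAC-SHA256 over v1:{timestamp}:{raw_body}."""
    # Signed payload as described in the PR: v1:{X-Orb-Timestamp}:{rawBody}
    signed_payload = b"v1:" + timestamp.encode("utf-8") + b":" + raw_body
    expected = hmac.new(secret.encode("utf-8"), signed_payload, hashlib.sha256).hexdigest()
    # Assumption: the header value may carry a "v1=" prefix before the hex digest.
    received = signature[3:] if signature.startswith("v1=") else signature
    # Constant-time comparison avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), received.encode("utf-8"))
```

The sketch only checks the digest. A handler would normally also reject timestamps outside a tolerance window, and it must verify against the raw bytes before any JSON parsing.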
…tplace.json
- README.md: add Orb row (alphabetically between OpenClaw and Paddle), linkified to official docs
- providers.yaml: add orb entry with HMAC-SHA256/`v1:{ts}:{body}` scheme notes, common events, summary-webhooks variant, and `orb-billing` SDK declared for both npm and pip so the version-tracker covers it
- .claude-plugin/marketplace.json: add `orb-webhooks` plugin entry (matching the per-skill pattern from PR #62) and append `./skills/orb-webhooks` to the `webhook-skills` bundle
Skill content (skills/orb-webhooks/) landed in the previous commit via the
generator.
https://claude.ai/code/session_01NNTgQRJss1V7gyzzJ9rjnB
leggetter added a commit that referenced this pull request Jun 5, 2026
… merged since count was last set)
The README claimed "38 skills" in two places (bundle install copy) but the bundle in .claude-plugin/marketplace.json now lists 40. The two extra skills (knock-webhooks #64, orb-webhooks #63) were merged to main after the bundle count was last updated, and the webhook-dx-audit PR branch picked them up via merge from main without re-syncing the count.
Test plan item from PR #67 calls for "the bundle still totals 38 skills" - that's now 40. README updated in both occurrences (lines 111, 120) to match the actual bundle contents.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leggetter added a commit that referenced this pull request Jun 5, 2026
* feat: add webhook-dx-audit skill
Adds a meta/audit skill that reviews the developer experience of any
platform sending outbound webhooks and produces a scored review with
prioritized recommendations across signing, retries, event catalog,
observability, local dev, and agent readiness.
This is a different category from the existing repo skills (which
receive, send, or verify webhooks): the audit evaluates how other
platforms expose their webhook DX. README adds a new "Webhook DX
& Audit Skills" section to make the distinction clear, and the
bundle includes the new skill.
* docs: broaden audit skill scope to event destinations
The skill audits webhooks AND event destinations (SQS, RabbitMQ,
Pub/Sub, EventBridge, Kafka), not just webhooks. Surface the
distinction in the README section title and marketplace metadata,
and add an event-destinations keyword for discovery.
* feat(webhook-dx-audit): align with Event Destinations initiative
The scope is webhooks AND event destinations, not just webhooks.
Industry terminology is shifting (Stripe "event destinations" with
direct EventBridge/Event Grid delivery, Shopify "Event
Subscriptions"). Update the rubric and SKILL.md to assess the
broader concept and benchmark against the Event Destinations
initiative (https://eventdestinations.org).
Changes:
- SKILL.md: add scope paragraph referencing the spec and the
terminology shift; clarify that the audit applies regardless of
what the platform calls it.
- rubric.md intro: add the spec reference and a summary of its
required/recommended capabilities.
- rubric category 5 (Security & authentication): split into
webhook-specific signing criteria and a new "Destination-native
auth" criterion covering IAM, service accounts, managed
identities, and SASL/mTLS for non-HTTP destinations. Webhook
criteria become Not assessed for queue-only platforms.
- rubric category 6 (Delivery semantics & reliability): add a
"Destination type breadth" criterion - the spec's central
required capability.
- rubric category 8 (SDKs & verification libraries): clarify that
this stays webhook-focused because webhooks remain the most
common destination and hand-rolled HMAC is where integrators
get burned; non-HTTP destinations use native SDK auth and are
scored under category 5.
- methodology.md: add a step 0 to identify destination types up
front, since that determines which criteria apply.
* fix(webhook-dx-audit): tighten rubric and methodology from Stripe test run
Test-drove the skill against Stripe (Pass 1, public surface only;
84/B). Stripe's three-destination story (webhooks + EventBridge +
Event Grid) surfaced anchor ambiguities and methodology gaps. This
commit applies all 22 ranked fixes from that test, batched because
they cohere and each is small.
rubric.md (10 fixes):
- Cat 4: machine-readable spec anchors name OpenAPI 3.1 webhooks
block, AsyncAPI, and per-event JSON Schema explicitly; a single
polymorphic event envelope scores 1, not 2.
- Cat 5: add explicit "if both webhooks AND native destinations
are offered, score all six criteria" rule at the top.
- Cat 5 destination-auth-options: clarify it scores configurable
bearer/headers/OAuth2/mTLS independently of the signature scheme.
- Cat 6 failure handling & auto-disable: split anchors into the
two distinct gaps (post-retry behavior docs, auto-disable
feature with reactivation).
- Cat 6 failure alerting: limit scoring to push channels; dashboard
widgets count under cat 10 (observability) instead.
- Cat 6 manual replay: anchors recognize partial coverage
(sandbox-only, UI-only, CLI-only) at level 1.
- Cat 7 IaC: split community-vs-official cliff; community provider
with current coverage can score 1, vendor-maintained scores 2.
- Cat 10 latency: spell out the three signals (attempt count,
next-retry time, per-attempt response latency).
- Cat 12 push-to-agent: add a 1 anchor for partial coverage.
- Cat 12 MCP: 1 anchor names "agent SDK or function-calling
toolkit"; 2 requires MCP or a deliberate scoped surface.
methodology.md (5 fixes):
- Read-what-a-human-reads: distinguish evidence collection (any
source) from scoring (HTML page).
- Step 0 destination types: expand search-term list (endpoint,
partner event source, stream, etc.).
- Step 2 specs: name OpenAPI 3.1 webhooks block, AsyncAPI,
per-event JSON Schema as the three things to look for.
- Step 9 agent readiness: define "scoped sensibly" for llms.txt.
- What good looks like: handle calibration circularity when the
audit subject is itself a reference platform (calibrate against
the broader Event Destinations bar instead).
SKILL.md (2 fixes):
- Add evidence-vs-scoring distinction for .md exports.
- Document the Pass-1-only exit path: skip the human checklist,
mark gated criteria Not assessed, proceed to scoring.
scoring.md (2 fixes):
- Add a second worked example with a Not-Assessed exclusion.
- Add the renormalization formula and a worked example for when
a category is fully dropped.
report-template.md (3 fixes):
- Access field examples signal Pass 1 vs Pass 1+2.
- Caption under the scorecard clarifies Overall is weight-adjusted.
- Findings section: always list every criterion, mark unreached
ones Not assessed inline.
program-mapping.md (3 fixes):
- New row: endpoint health (auto-disable, alerting, reactivation)
-> Hookdeck Event Gateway in front of the consumer endpoint.
Addresses the highest-impact cat 6 gap most platforms have.
- New row: OpenAPI lacks webhooks block -> webhook skill in
hookdeck/webhook-skills as an agent-shaped substitute.
- Broaden the existing webhook-skill row to acknowledge it can
also surface cat 3/6 docs gaps to agent consumers, not just cat 12.
Test artifacts (Stripe audit + findings doc) saved in /tmp;
not committed.
* feat(webhook-dx-audit): add workflow/scenario simulation criterion
Investigation of Stripe, Shopify, and Paddle revealed a real
maturity differentiator the rubric was not capturing:
- Paddle ships named "Scenarios" (subscription_creation = 12
events, renewal = 7, etc.) that fire curated lifecycle sequences
in one trigger.
- Stripe has implicit prerequisite chaining (firing
payment_intent.succeeded also fires payment_intent.created)
plus CLI fixtures for scripted multi-step composition.
- Shopify's webhook trigger is explicitly single-event only with
fixed payload, recommending real Shopify actions for end-to-end
tests.
Add a "Workflow / scenario simulation" criterion under category 11
(local dev / testing) as a sibling to test/sandbox parity. Framed
as a maturity differentiator, not a baseline: 0 is acceptable for
most platforms; 1 covers Stripe-style fixtures or implicit chains;
2 covers Paddle-style named lifecycle scenarios.
Update methodology step 8 with search terms (scenario, fixture,
lifecycle, workflow, trigger sequence) and the three platform
patterns as calibration anchors.
* fix(webhook-dx-audit): three-state taxonomy + dual-score aggregation
The rubric was collapsing three different states into one "Not assessed"
label, which caused Pass-1-only grades to inflate (the Stripe re-audit
went from 84/B to 85/A largely because of this). Split the states and
adjust the math so the labels mean what they say.
Three states (rubric.md):
- Not Supported: capability should exist but doesn't. Score 0;
numerator 0, full weight in denominator. (Existing 0 behavior;
the label clarifies intent in evidence.)
- Not Applicable: a logical rule excludes the criterion *as a
concept* (e.g. Cat 5 destination-native auth on a webhook-only
platform — there are no non-HTTP destinations to score auth
for). Drop from both numerator and denominator. Critically NOT
for "the platform should have this but doesn't" cases — those
are Not Supported = 0.
- Not Assessed: should assess but cannot reach right now (HITL
gap, gated dashboard). Treated differently across the two
roll-ups below; signals HITL would lift the score.
Two roll-ups from the same per-criterion data (scoring.md):
- Public-scope grade. "How good are the parts we could see?"
Drops both Not Applicable and Not Assessed from numerator and
denominator. Honest score over what was reachable.
- Provisional minimum. "What's the floor if HITL never runs?"
Drops Not Applicable only. Treats Not Assessed as 0 in
numerator with full weight in denominator. HITL Pass 2 can
only raise this number.
When HITL completes (no Not Assessed criteria remain), the two
scores converge on a single final grade.
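The two roll-ups can be sketched in Python as follows. This is an illustration under stated assumptions: every criterion is scored 0 to 2, category weights are ignored, and the `State`, `Criterion`, `public_scope`, and `provisional_minimum` names are hypothetical rather than taken from scoring.md.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    SCORED = "scored"              # includes Not Supported, which is a plain 0
    NOT_APPLICABLE = "not applicable"
    NOT_ASSESSED = "not assessed"


@dataclass
class Criterion:
    name: str
    state: State
    score: int = 0       # 0, 1, or 2; only read when state is SCORED
    max_score: int = 2


def public_scope(criteria):
    """Drop Not Applicable and Not Assessed from numerator and denominator."""
    kept = [c for c in criteria if c.state is State.SCORED]
    total = sum(c.max_score for c in kept)
    return 100 * sum(c.score for c in kept) / total if total else None


def provisional_minimum(criteria):
    """Drop Not Applicable only; Not Assessed earns 0 but keeps its full weight."""
    kept = [c for c in criteria if c.state is not State.NOT_APPLICABLE]
    total = sum(c.max_score for c in kept)
    earned = sum(c.score for c in kept if c.state is State.SCORED)
    return 100 * earned / total if total else None
```

With one criterion scored 2, one Not Supported (0), one Not Applicable, and one Not Assessed, the public-scope figure is 2/4 and the provisional minimum is 2/6. Once the Not Assessed criterion is scored, both functions see the same inputs and return the same value.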
Report template (assets/report-template.md):
- Scorecard now shows both columns per category and overall.
- Header shows the headline number twice: Public-scope leads
when no HITL is planned; Provisional minimum leads when HITL
is planned (conservative bound the customer can rely on).
- Coverage line under the scorecard counts how many criteria
landed in each state.
- Optional "Context" line in the frontmatter for audits that
are existing-customer deliverables.
- Recommendations template encourages "Concrete change (platform
side) / Hookdeck offering (already available or in path)"
framing for existing customers.
Cat 5 (rubric.md):
- Header now spells out three branches (webhook-only, non-HTTP-
only, multi-destination). Per-criterion N/A clauses encode the
logical rules. Stripe and Shopify both cited as multi-
destination examples (Shopify ships HTTP + EventBridge + Pub/
Sub destinations).
Cat 12 CLI for agents:
- Was incorrectly allowed an N/A escape hatch in the prior draft.
Reverted: a CLI is a recommended capability for any developer
platform; absence is a gap (Not Supported = 0), not a logical
exclusion. The 0 anchor language now makes this explicit.
Other updates:
- SKILL.md scope paragraph reflects Shopify is multi-destination,
not webhook-only. Adds a one-paragraph summary of the three
states + two roll-ups.
- methodology.md adds a "Pick the right label" note explaining
when to use each state and why the arithmetic differs.
No customer-specific content; this work is generic to the skill.
* fix(webhook-dx-audit): correct N/A definition examples (Cat 12 CLI is 0, not N/A)
* feat(webhook-dx-audit): N/A logic table as single source of truth
Add an explicit N/A logic table after the Categories list. Apply
mechanically based on the destination types identified at
methodology step 0; do not re-derive N/A from per-criterion text.
The table lists the four possible step-0 facts and which criteria
become N/A for each. Currently seven criteria (Cat 5 x 6 + Cat 8
x 1) can be N/A, all driven by two boolean facts (offers webhooks?
offers non-HTTP destinations?).
To make the table the single source of truth:
- Stripped the redundant per-criterion "(Not Applicable if X)"
clauses from Cat 5 (5 criteria) and Cat 8 (1 criterion).
- Trimmed the Cat 5 header: removed the three-branch list
(webhook-only / non-HTTP-only / multi-destination) since that's
now encoded in the table. Kept the security-philosophy paragraph
because it explains why the criteria differ by destination type.
- Cat 8 verification helper 0 anchor reframed to acknowledge the
upstream Cat 5 dependency (no signature scheme to verify ->
the helper question is downstream of that gap).
New criteria with N/A conditions should add a row to the table
rather than introducing a new inline clause. The commit message of
any future change should reference the table row affected.
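A mechanical lookup of this kind could look like the Python sketch below. The criterion IDs are hypothetical placeholders, and the split (five webhook-specific Cat 5 criteria plus the Cat 8 verification helper on one side, destination-native auth on the other) is an assumption inferred from the commit messages, not a copy of the table in rubric.md.

```python
# Hypothetical criterion IDs; the real names live in rubric.md.
WEBHOOK_CRITERIA = frozenset({
    "cat5.webhook_criterion_1",
    "cat5.webhook_criterion_2",
    "cat5.webhook_criterion_3",
    "cat5.webhook_criterion_4",
    "cat5.webhook_criterion_5",
    "cat8.verification_helper",
})
NON_HTTP_CRITERIA = frozenset({"cat5.destination_native_auth"})


def not_applicable(offers_webhooks: bool, offers_non_http: bool) -> set:
    """Derive the N/A set from the two step-0 facts alone."""
    na = set()
    if not offers_webhooks:
        na |= WEBHOOK_CRITERIA      # nothing to sign or verify over HTTP
    if not offers_non_http:
        na |= NON_HTTP_CRITERIA     # no queue or stream destination to authenticate to
    return na
```

Because the function reads only the two booleans, the four possible step-0 combinations map to four fixed N/A sets, which is the property the table is meant to guarantee.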
* feat(webhook-dx-audit): access-level table as source of truth for Not Assessed
Add a deterministic table tagging which criteria require account-level
(L1) or active-usage (L2) access to score, alongside the existing N/A
logic table. Pass-1 audits at L0 (public only) now have a mechanical
rule for which criteria become Not Assessed; the agent does not have
to derive it per-criterion.
Three access levels (rubric.md):
- L0: public docs, SDK source, machine specs, llms.txt
- L1: logged-in session; can read dashboard, settings, account-gated
docs
- L2: L1 plus at least one delivered event observed; delivery logs,
retries, alerting visible in practice
How L1 or L2 was obtained does not matter to the rubric. The auditor
may have signed up themselves, used agent-driven signup (e.g. Stripe
Projects, https://projects.dev), or been given access by the
platform's operator. Future-proof: as agent-signup capabilities
mature, more audits can declare L1/L2 without changing the rubric.
~12 criteria are tagged with required access levels (mostly Cat 1,
2, 7, 9, 10, 11). A few are "L1 or L0 if docs are thorough enough" -
those remain agent judgment within a tighter frame.
Other changes:
- Report template's Access line dropped the "customer-provided
access" wording (that was an Outpost-audit context leak) and now
uses the L0/L1/L2 levels directly. A note clarifies that the
means of obtaining access does not matter, only the level
reached.
- methodology.md "What good looks like" adds Stripe Projects
(projects.dev) as an agent-driven provisioning calibration
anchor for Cat 12 Action-layer scoring.
This makes Not Assessed deterministic at the level the framework
can reasonably enforce. The remaining agent judgment is limited to
the few criteria explicitly tagged "L1 or L0 if ..." in the table.
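The access-level rule can be sketched in Python as a lookup. The three levels come from the commit message above; the criterion names and their tags are hypothetical examples, and the "L1 or L0 if ..." judgment cases are left out.

```python
from enum import IntEnum


class Access(IntEnum):
    L0 = 0  # public docs, SDK source, machine specs, llms.txt
    L1 = 1  # logged-in session
    L2 = 2  # L1 plus at least one delivered event observed


# Hypothetical tags; the real table in rubric.md covers ~12 criteria.
REQUIRED_ACCESS = {
    "cat10.delivery_logs": Access.L2,
    "cat7.dashboard_configuration": Access.L1,
    "cat4.event_catalog": Access.L0,
}


def not_assessed(reached: Access) -> list:
    """Criteria that become Not Assessed at the access level the audit reached."""
    return sorted(name for name, need in REQUIRED_ACCESS.items() if need > reached)
```

An audit that declares `Access.L0` gets the L1 and L2 criteria back as Not Assessed, and the list shrinks as the declared level rises.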
* fix(webhook-dx-audit): remove Hookdeck Outpost / Svix from calibration anchors
Hookdeck Outpost and Svix are webhook delivery products platforms
use to send events. Naming them as calibration anchors for
sender-DX scoring was the wrong reference frame: integrators
typically experience the *platform* (its docs, signing scheme,
dashboard), not the delivery infrastructure embedded behind it.
And this skill lives in hookdeck/webhook-skills, so naming Hookdeck
specifically as a benchmark would be a conflict of interest.
Use platforms integrators directly experience and benchmark
against: Stripe as the primary anchor; SendGrid (ECDSA signing),
GitHub (event taxonomy), Twilio (per-attempt status callbacks)
for specific features. The Event Destinations initiative
(eventdestinations.org) sets the broader floor.
Hookdeck Outpost stays in program-mapping.md as a gap-closing
recommendation for the platform side. Hookdeck Event Gateway
tools stay in the "Hookdeck tooling" section of methodology.md
as evidence-gathering aids during the audit (Console test URLs
for inspecting payloads, CLI for receiving on localhost) -
those are ingestion tools for the auditor, not benchmarks.
* fix(webhook-dx-audit): rule for L0 scoring from absence; HITL headroom; modern docs platforms
Three small refinements surfaced by the customer audit re-run.
A1: rubric.md access-level table now explicitly authorizes scoring
from L0 absence-of-documentation. If public docs are completely
silent on a capability tagged L1 or L2, score 0 (Not Supported)
from L0 rather than Not Assessed. The access-level requirement is
for VERIFICATION of a documented capability; confirming
non-existence is an L0 finding. Removes the only judgment call I
had to make by interpretation during the re-run.
A2: report-template.md scorecard now surfaces "HITL headroom: NN
points" prominently between the table and the renormalization
caption. Small headroom means HITL won't materially change the
grade; large headroom means HITL is load-bearing. Easier to see
than the gap in the dual-score columns.
A3: Cat 12 push-to-agent criterion now defaults to Not Assessed
(not 0) for docs hosted on modern platforms (Mintlify 2025+,
Docusaurus 3+, GitBook, ReadMe) where Copy-as-Markdown and
Open-in-X are typically JS-rendered. A non-browser fetch may not
see the buttons; the right call is to defer to HITL rather than
score 0 from rendering blindness.
The first customer audit had to interpret all three of these rules;
the framework now encodes them.
* fix(webhook-dx-audit): six refinements from HITL audit feedback
All surfaced by the HITL Pass 2 of a real customer audit. Each
addresses a real ambiguity or editorial leak in the rubric.
scoring.md grade bands:
- Dropped the editorial "Reading" column entirely. Grade letters
alone; the "band is a headline, not the point" note already
carried the framing. Per audit feedback that "painful or
risky"-style language doesn't belong in audit output.
- Added explicit "do not write qualitative judgments of the grade
into the audit report" line.
- Added boundary-zone note for 28-32 (F/D) and 83-87 (B/A) — these
are sanity-check zones where rounding shifts the band.
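A boundary-zone check can be sketched in a few lines of Python. The zone ranges are the ones named above; treating both ends as inclusive, and the `boundary_zone` name itself, are assumptions.

```python
from typing import Optional

# Zones from the note above; inclusive bounds are an assumption.
BOUNDARY_ZONES = {(28, 32): "F/D", (83, 87): "B/A"}


def boundary_zone(score: float) -> Optional[str]:
    """Return the band boundary a score sits near, or None if it is clear of both."""
    for (low, high), label in BOUNDARY_ZONES.items():
        if low <= score <= high:
            return label
    return None
```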
Cat 4 payload shape guidance:
- Relaxed the 2 anchor. Was "explicit thin-vs-fat rationale, OR
standard envelope like CloudEvents". Now "envelope is consistent
across all event types and documented". CloudEvents alignment
and thin-vs-fat rationale moved to bonus signals worth citing
in evidence but not required for 2. Most platforms with strong
event catalogs don't formally address the meta-framing; the
prior anchor over-penalized them.
Cat 1 free/test access:
- Reworked anchors to handle two underlying questions (does free
tier reach config? are test deliveries free?) as a sliding
scale. 1 covers the partial case (e.g. paid plan required for
config but test deliveries free once configured, the audited
customer's shape) which the prior binary 0/2 anchor missed.
Cat 5 destination auth options:
- Requires auth framing in docs for any score above 0. A platform
shipping an arbitrary header passthrough field without
documenting it as an auth mechanism now correctly scores 0;
previously a strict reading allowed 1 for mere field existence.
Audience scoping (new N/A logic Table 2):
- Two audiences: developer-platform (default, where integrators
are software engineers) and no-code-saas (where integrators are
power users in a UI). For no-code-saas, Cat 7 IaC and the Cat 11
workflow simulation and local-to-production transition criteria
become N/A. A third option, "mixed", defaults to scoring all
criteria unless the platform clearly serves one audience exclusively.
- Audience declared at methodology step 0 and in report
frontmatter alongside Access level.
- Existing destination-type N/A logic becomes Table 1; audience
becomes Table 2.
Methodology audit voice guidance:
- Explicit "stay factual, no editorial" rule. Per-category prose
describes observation; reactions and synthesis go in the summary
and recommendations. Examples cited: don't use "surprising",
"impressive", "disappointing", "painful" in per-criterion or
per-category text.
These changes are derived from real auditor experience; the
customer audit itself stays in /tmp (not committed).
* fix(webhook-dx-audit): Cat 1 reframe - discoverability only, no business model
The previous Cat 1 "Free/test access" criterion conflated two
distinct concerns: (a) business model (is the platform/feature
free to access), and (b) DX (can you test webhooks without
producing real production activity). Per repeated audit feedback,
business model is not a DX question and shouldn't penalize
platforms that offer webhooks behind a paid plan. The testability
question is already covered by Cat 2 Test event / trigger and Cat
11 Test / sandbox parity.
Cat 1 changes:
- Removed "Free/test access" criterion entirely.
- Reframed "Signup friction to webhook config" as "In-product
discoverability of webhook configuration". Explicitly handles
plan-gating: plan-gated features are fine as long as the
configuration surface is visible in product navigation.
- Findability of webhook docs criterion now distinguishes deep-nav
(1) from top-level (2) explicitly.
Cat 1 now has 2 criteria, both focused on discoverability:
1. Can a developer find the webhook docs from the top-level docs
or product nav? (the "docs side" of discovery)
2. From a signed-in account on any tier, can a user discover that
the platform offers webhooks and find where they would be
configured? (the "in-product side" of discovery)
Also added a note clarifying what is NOT scored in Cat 1: pre-
purchase evaluation (business model) and production-data isolation
(covered in Cat 2 and Cat 11).
Updated the access-level table to reflect the criterion name
change and removal. Total rubric criterion count drops by one.
Derived from real auditor experience on a paid-plan-gated platform
where the prior rubric incorrectly penalized the business model.
* feat(webhook-dx-audit): add idempotency criteria (Cat 3 + Cat 4)
Cat 3 Documentation quality:
- Added "Idempotency guidance" criterion. Scores whether the docs
(a) identify the unique delivery ID developers should dedupe on
(a top-level event ID in the payload, a webhook-id-style header,
or equivalent), and (b) explain the high-level dedup pattern
(check ID -> process -> store ID -> return success for
duplicates). 0/1/2.
- Removed "idempotency" from the Best-practices coverage anchor
list since it now has its own criterion. Best-practices now
covers: out-of-order delivery, consumer-side retries, timeouts.
Cat 4 Event catalog & schema:
- Added "Per-event unique ID" criterion. Scores whether the
platform delivers a documented per-event unique delivery ID —
in the payload, in headers (e.g. webhook-id, X-GitHub-Delivery,
x-outpost-event-id), or equivalent. Distinct from any domain
ID inside the payload (e.g. post.id is not a delivery ID).
0: none. 1: ID delivered but docs don't identify it as the dedup
key. 2: clearly documented as the dedup key.
The two criteria are explicitly linked: Cat 4 scores whether the
ID exists in the schema; Cat 3 scores whether the docs teach how
to use it. A platform can ship the ID (Cat 4 = 1) without
documenting it (Cat 3 = 0) — exactly the pattern surfaced by the
customer audit, where Outpost ships x-outpost-event-id on every
delivery but the customer didn't surface this to their integrators.
Net effect: rubric grows by two criteria. Platforms that document
idempotency at a high level (single mention) but don't identify
the dedup ID will now score 1 instead of 2 on the new Cat 3
criterion, surfacing a specific actionable finding.
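The dedup pattern the Cat 3 criterion looks for (check ID -> process -> store ID -> return success for duplicates) can be sketched in Python as below. The in-memory set and the `handle_delivery` name are illustrative assumptions; a real consumer would typically use a durable store with an atomic insert.

```python
def handle_delivery(delivery_id: str, payload: dict, seen: set, process) -> int:
    """Return an HTTP-style status code for one webhook delivery."""
    # 1. Check the delivery ID (not a domain ID inside the payload).
    if delivery_id in seen:
        # 4. Duplicates are acknowledged without reprocessing.
        return 200
    # 2. Process the event.
    process(payload)
    # 3. Store the ID only after processing succeeded, so a failure can be retried.
    seen.add(delivery_id)
    return 200
```

Calling the function twice with the same `delivery_id` runs `process` once and returns success both times.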
* docs: expand HITL acronym on first use across audit skill
HITL is used 16+ times across SKILL.md, rubric.md, methodology.md,
scoring.md, and report-template.md without ever being expanded or
defined. Readers outside AI/ML circles can struggle to parse it.
Expand to "human-in-the-loop (HITL)" on the first occurrence in each
file so the abbreviation has a definition before subsequent uses.
Subsequent uses stay as HITL once defined.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: rename Cat 3 "Documentation quality" to "Implementation guidance"
"Documentation quality" reads as a sweeping judgment on the
platform's docs, but the 5 criteria the category scores
(verification walkthrough, processing & handler guidance,
idempotency guidance, best-practices coverage, accuracy & freshness)
all measure implementation-guidance content for integrators
consuming webhooks. The event catalog and API reference are scored
separately under Cat 4 "Event catalog & schema".
A platform with a comprehensive event catalog but no handler
patterns or signing walkthroughs scores 0% on Cat 3, which reads
confusingly because their webhook docs do exist. The new name
makes it clear that Cat 3 scores integration-implementation
content specifically.
Sweep applied via replace_all to rubric.md (category list and
section heading), scoring.md (weight table), and report-template.md
(scorecard row). 4 files, 4 lines net.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: scope Cat 3 intro to webhooks explicitly
After renaming Cat 3 from "Documentation quality" to "Implementation
guidance", the previous intro ("The webhook section as a developer
reads it, not the marketing page") no longer fit. The contrast with
marketing was meaningful when the name was generic; under the new
name, implementation guidance is obviously not marketing.
The new intro is explicit about scope. Cat 3's 5 criteria are
webhook-specific in practice (HMAC verification, 2xx HTTP handler
patterns, dedup ID delivered with HTTP webhooks). Non-HTTP
destinations (SQS, Pub/Sub, RabbitMQ, etc.) rely on destination-
native SDKs; their integration-guidance equivalents are scored under
Cat 5 (destination-native auth) and Cat 6 (delivery semantics).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: clean up Cat 2, 5, 7, 11 intro lines
Cat 2 Onboarding & first event: drop "verified" from "I received a
verified event" since verification (HMAC) is a webhook-only concept.
For non-HTTP destinations, the event is just received, not verified
in the same sense.
Cat 5 Security & authentication: replace the editorial intro
("The capability most often weak and most consequential") with a
scope description that mirrors the rest of the rubric: HTTP webhooks
(signing, replay protection, secret rotation) and non-HTTP
destinations (destination-native auth). Weight note kept.
Cat 7 Setup surfaces: "webhooks" -> "webhooks and event destinations"
to match the audit's full scope (the category's criteria already
cover both).
Cat 11 Local dev: drop the vague "The program calls this out
explicitly" trailing sentence (unclear what "the program" referenced).
Replace with an explicit scope note: criteria focus on HTTP webhooks
(localhost tunnels and replay); non-HTTP destinations rely on
cloud-provider emulators (LocalStack, GCP Pub/Sub emulator, Azure
Service Bus emulator) as equivalents.
The other 8 categories' intros were already clean or were updated
previously (Cat 3 just landed in 933b724 and 37761ff).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: broaden methodology scope and fix stale Cat 5 example in scoring
Methodology step 3: "Read the webhook docs properly" was webhook-only
in framing but the step's scope covers Categories 3, 4, 5, 6.
Categories 4 (event catalog), 5 (security including non-HTTP
destination auth), and 6 (delivery semantics across all destination
types) cover event destinations beyond webhooks. Broaden the step
title and add destination-type-breadth and per-destination-native-
auth as evidence to capture.
Methodology step 5: "API endpoints for webhook CRUD" and "Terraform
provider and whether it covers webhooks" narrowed Category 7 to
webhooks only. Cat 7's intro now reads "webhooks and event
destinations"; the step now matches: webhook and destination CRUD,
and Terraform coverage of webhooks and destinations.
Scoring Example 1: "Security has 5 criteria" was stale; Cat 5 has
6 criteria (the destination-native-auth criterion was added but
Example 1 was never updated). Examples 2-4 already use 6 criteria.
Example 1 now matches: 6 criteria, score 2/1/1/0/2/2, sum 8, max
12, both roll-ups 67%.
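The corrected Example 1 arithmetic can be checked with a short Python snippet. It assumes each criterion has a maximum of 2 and that none is Not Applicable or Not Assessed, which is why both roll-ups share one figure.

```python
scores = [2, 1, 1, 0, 2, 2]            # the six Cat 5 criteria in Example 1
earned = sum(scores)                   # 8
maximum = 2 * len(scores)              # 12
percent = round(100 * earned / maximum)
print(f"{earned}/{maximum} = {percent}%")
```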
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: report-template summary scope and program-mapping link format
Line 10 summary instruction said "the platform's webhook DX" but the
audit's scope is webhooks AND event destinations (SQS, Pub/Sub,
RabbitMQ, EventBridge, Kafka, Azure Event Grid). A literal reader
might omit non-HTTP destination coverage. Broaden to "webhook and
event-destination DX".
Line 60 referenced "(see program-mapping)" without a file extension
or backticks, reading as a placeholder. Line 43 already references
`rubric.md` with backticks; match that pattern: "(see
`program-mapping.md`)".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: catch remaining Cat 3 references after rename
Two spots survived the original "Documentation quality" sweep because
they used lowercase or paraphrased forms.
SKILL.md line 45: agent-responsibilities list said "documentation
quality" (lowercase) which the title-case sweep missed. Rename to
"implementation guidance" to match the new category name.
program-mapping.md line 16: "category 3/6 documentation gaps" was
ambiguous after the rename. Replace with "category 3 implementation-
guidance and category 6 delivery-semantics gaps" so both category
references are explicit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: scope Summary list to webhook-surface features only
The Summary instruction told the writer to summarize the platform's
webhook DX but did not say what counts as a webhook-surface feature.
Audit agents reading the instruction listed positive platform signals
they noticed (OpenAPI specs, MCP servers, CLIs) without distinguishing
which ones actually apply to the webhook and event-destination
surface.
This produced misleading Summaries where, for example, an OpenAPI 3.1
spec without a `webhooks` block was listed as evidence of a working
webhook surface even though the spec does not carry webhook payload
contracts (which scores 1 under Cat 4 for that exact reason). The
customer reads the listed item as a strength, then later finds the
caveat that excludes it.
The instruction now scopes the list: include only items that
contribute to the webhook and event-destination surface. An OpenAPI
spec without a `webhooks` block, an MCP server without webhook tools,
or a CLI that does not manage webhook configuration are platform
features that do not apply in the Summary; they belong in their
respective category findings, with the scores that reflect their
limitations.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: restructure Cat 12 Action layer (combine CLI+MCP, add API access)
Cat 12's Action-layer scoring had three issues:
1. CLI and MCP were scored as separate criteria, requiring each
surface to exist for full credit. In practice an agent only needs
one agent-shaped interface beyond the raw API; either suffices.
2. The MCP criterion accepted "an MCP server exists" without
requiring webhook scope, so a platform-wide MCP that excludes
webhook management could score 2 even though Cat 12 measures
webhook agent-readiness. The Ordinal audit hit this tension
(hosted MCP for the core API but no webhook tools, scored 2).
3. The foundational layer (whether the webhook configuration API
is publicly callable by an agent) was implicit, scattered across
Cat 7 API configuration and Cat 4 machine-readable spec. The
agent-readiness view of the API was not captured as its own
signal in Cat 12.
The Action layer now has two criteria:
- API access for agents: foundational. Documented public HTTP API
for webhook configuration. Overlaps with Cat 7 / Cat 4 but
captures the agent's-eye view distinctly. 0 if dashboard-only or
undocumented, 1 if SDK-only, 2 if documented HTTP API.
- CLI or MCP for the webhook surface: higher-leverage. CLI or MCP
(either suffices) covering webhook management with structured
output / agent-friendly tools. 0 if neither covers webhooks
(explicitly including platform-wide MCPs without webhook tools),
1 if partial coverage, 2 if full.
Methodology step 9 Action sentence updated to walk the new
criteria.
Cat 12 still has 6 criteria total; weight unchanged. Existing
audits that scored MCP at 2 because a platform-wide MCP exists
should re-evaluate under the new combined criterion.
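The two Action-layer criteria above can be sketched as a pair of scoring functions. This is a minimal illustration, not the rubric's implementation: the function names, the boolean inputs, and the "none" / "partial" / "full" coverage labels are hypothetical; only the 0/1/2 anchors come from the text.

```python
def score_api_access(documented_http_api: bool, sdk_only: bool) -> int:
    """API access for agents: 2 = documented HTTP API, 1 = SDK-only, 0 = neither."""
    if documented_http_api:
        return 2
    if sdk_only:
        return 1
    return 0  # dashboard-only or undocumented


def score_cli_or_mcp(coverages: dict) -> int:
    """CLI or MCP for the webhook surface; either one suffices.

    `coverages` maps a surface name (e.g. "cli", "mcp") to its webhook
    coverage: "none", "partial" or "full". A platform-wide MCP without
    webhook tools counts as "none".
    """
    rank = {"none": 0, "partial": 1, "full": 2}
    # Take the best surface rather than requiring both to exist.
    return max((rank[c] for c in coverages.values()), default=0)
```

Under these assumptions, a platform with a hosted MCP that has no webhook tools and no CLI gives `score_cli_or_mcp({"mcp": "none"})`, which is 0 rather than the 2 the old criterion allowed.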
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: instruct HITL to capture and share a real delivery payload
Audits that only have docs evidence produce conditional recommendations
("in default mode the header is X; in Standard Webhooks mode it's Y").
A single actual delivery payload (request headers + body) lets the
auditor recommend directly: name the specific signature header, dedup
ID, timestamp format, and custom headers that are actually in use.
The Ordinal audit hit this exact case: the audit framed signing
conditionally because HITL had not shared an example delivery. Once
Phil shared a screenshot of a real delivery (Standard Webhooks mode
active; webhook-signature, webhook-timestamp, webhook-id headers
present; x-api-key set via the custom-headers feature), the
recommendation became concrete: "document the webhook-signature
you're already sending" rather than "add a signature scheme".
Two updates to the audit skill:
Roles section: add a "Critical HITL capture: an example delivery
payload" paragraph explaining what to capture and why. Whenever the
human fires a test event or observes a real delivery, they capture
and paste back the full delivery payload (all request headers and
the body) so the auditor can score signing, idempotency, event
schema, and destination-auth criteria against the actual delivery
shape rather than docs alone.
How an audit runs step 3 (the HITL checklist examples): add a third
example to the checklist, phrased as "capture and paste back the
full request payload of one real delivery, including all headers
and the body, so I can name the actual signature header, dedup ID,
and any custom headers in the recommendations".
Future audits should now produce concrete signature/dedupe/auth
recommendations whenever HITL is available, since the checklist
specifically requests the payload capture.
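As a rough illustration of why one captured delivery is enough to name things concretely, the sketch below inspects pasted request headers for the Standard Webhooks header names mentioned above. The function name, the return shape, and the "anything else starting with `x-` is a custom header" heuristic are assumptions made for this example, not part of the skill.

```python
# Header names taken from the Standard Webhooks delivery described in the text.
STANDARD_WEBHOOKS = {
    "signature": "webhook-signature",
    "timestamp": "webhook-timestamp",
    "dedup_id": "webhook-id",
}


def name_delivery_fields(headers: dict) -> dict:
    """Report which signing-related headers a captured delivery carries."""
    lowered = {name.lower() for name in headers}  # header names are case-insensitive
    found = {
        role: (header if header in lowered else None)
        for role, header in STANDARD_WEBHOOKS.items()
    }
    # Heuristic (assumption): treat remaining "x-" headers as custom headers.
    found["custom"] = sorted(h for h in lowered if h.startswith("x-"))
    return found
```

With the headers from the Ordinal capture, the result names `webhook-signature`, `webhook-timestamp`, and `webhook-id` directly and lists `x-api-key` as custom, which is the information the conditional, docs-only framing lacked.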
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: require verified audience designation with cited signals
The audience declaration drives the audit's N/A logic (Table 2 in
the rubric) but the methodology had it as a brief read-and-judge
step with `developer-platform` as a silent default. In practice
agents either took the default without verification or relied on
HITL Pass 2 to set the designation, leading to misframed audits
when the default didn't match reality.
The Ordinal audit hit this: HITL Pass 2 declared no-code-saas
without site verification; the no-code designation triggered the
Cat 11 audience-N/A logic; later correction to mixed required
re-scoring two criteria. A site-verified audience designation at
audit start would have produced the correct framing from Pass 1.
Three updates:
Methodology step 0: explicit checklist of signals to verify the
designation against (hero copy, nav structure, testimonials,
pricing tiers, API prominence, onboarding CTA framing). Requires
citing at least three signals with quoted marketing copy. `mixed`
listed as a first-class option, with guidance to prefer it when
the platform clearly serves more than one audience. The
`developer-platform` default is allowed only as a Pass-1 fallback
when the homepage cannot be reached; Pass 2 must verify.
SKILL.md "Audience matters" paragraph: `mixed` named as one of
three options (not just a fallback). Adds the verification
requirement and points at the methodology checklist. Notes that
mixed audiences score by judgment per criterion.
report-template Audience header: now requires inline citations of
the signals that informed the designation (e.g. "mixed (primary
marketing teams per hero copy 'X'; secondary agencies via 'Y' nav;
tertiary developers via mid-page API mention)"). The bare
designation alone is no longer sufficient.
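The verification requirement above can be expressed as a small check. This is a sketch under stated assumptions: the function name and the `source` / `quote` keys on each signal are hypothetical, and the Pass-1 fallback for an unreachable homepage is deliberately not modeled.

```python
ALLOWED_DESIGNATIONS = {"developer-platform", "no-code-saas", "mixed"}
MIN_SIGNALS = 3  # "at least three signals with quoted marketing copy"


def audience_problems(designation: str, signals: list) -> list:
    """Return a list of human-readable problems; empty means the designation is cited."""
    problems = []
    if designation not in ALLOWED_DESIGNATIONS:
        problems.append(f"unknown designation: {designation!r}")
    # Only signals that carry quoted copy count toward the minimum.
    quoted = [s for s in signals if str(s.get("quote", "")).strip()]
    if len(quoted) < MIN_SIGNALS:
        problems.append(
            f"need {MIN_SIGNALS} signals with quoted copy, found {len(quoted)}"
        )
    return problems
```

A bare `mixed` with no signals fails this check, which matches the rule that the designation alone is no longer sufficient.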
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: Cat 3 ingest-at-scale guidance becomes first-class
Cat 3 "Processing & handler guidance" was scored against a generic
anchor ("covers the handler lifecycle"). Platforms that mention
"respond fast" and "process async" in passing could score 2 without
teaching the actual ingest-verify-queue pattern, naming their
response timeout, or pointing integrators at concrete architectures.
This is the criterion most directly tied to Hookdeck Event Gateway's
value prop (and to cloud-native alternatives like AWS EventBridge +
API Gateway, GCP Pub/Sub + serverless function), but the rubric
didn't surface that connection.
Three updates:
rubric.md Cat 3 Processing & handler guidance: criterion text now
spells out the ingest-verify-queue pattern as the production-traffic
contract integrators need: acknowledge quickly with 2xx, verify the
signature, queue work to a background processor so burst traffic
and slow downstream work do not exceed the timeout. 2-anchor now
requires the platform to (a) name the timeout window, (b) explain
the pattern, and (c) point at concrete reference architectures
(Hookdeck Event Gateway, cloud-native EventBridge+API Gateway or
Pub/Sub+serverless function, or queue+worker on the integrator's
own infrastructure). 1-anchor covers partial coverage.
methodology.md step 3 (Read the webhook docs): explicit prompt to
look for the response timeout window, the ingest-verify-queue
pattern, and architecture references. Tactics search-term list adds
"timeout", "respond", "async", "queue", "EventBridge", "Pub/Sub",
"Event Gateway", "ingest".
program-mapping.md: new row mapping the ingest-at-scale gap to
Hookdeck Event Gateway as the integrator's ingest layer (or cloud-
native alternatives EventBridge+API Gateway, Pub/Sub+serverless
function). Distinguished from the existing endpoint-health row:
that one is about reliability for an existing handler, this one is
about teaching integrators the pattern itself.
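For readers unfamiliar with the ingest-verify-queue flow that the Cat 3 criterion now spells out, here is a minimal queue-plus-worker sketch using only the Python standard library. It assumes a plain HMAC-SHA256 over the raw body with a hex signature, which is a stand-in for whatever scheme a given platform documents; the function names and the 202/401 status codes are choices made for the example.

```python
import hashlib
import hmac
import queue
import threading

work = queue.Queue()  # in-process stand-in for a real queue or broker


def handle_delivery(secret: bytes, signature_hex: str, raw_body: bytes) -> int:
    """Verify, enqueue, and return the HTTP status; no slow work happens here."""
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    # Constant-time comparison on bytes avoids timing leaks and non-ASCII errors.
    if not hmac.compare_digest(expected.encode(), signature_hex.encode()):
        return 401
    work.put(raw_body)  # hand off to the background processor
    return 202          # acknowledge with a 2xx well inside the timeout window


def worker(process) -> None:
    """Drain the queue, calling `process` on each body; `None` stops the loop."""
    while True:
        body = work.get()
        try:
            if body is None:
                return
            process(body)
        finally:
            work.task_done()
```

In use, a thread such as `threading.Thread(target=worker, args=(process_event,), daemon=True)` would run alongside the HTTP server, where `process_event` is the integrator's own (hypothetical) slow handler. The point of the split is that burst traffic and slow downstream work land on the queue rather than on the request that must finish before the platform's timeout.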
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: phrasing - 'ingest-at-scale' -> 'ingest reliably'
The pattern matters at any volume, not just at scale. A 5-second
timeout kills a delivery whether the integrator is handling 1 req/sec
or 1000. 'Reliably ingest' captures the goal (don't time out, don't
lose deliveries) better than 'at scale', which implies high volume
specifically.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: add PLAN-v2.md for review
Captures the v2 pass: migrate audit format to structured YAML
(primary driver: cloud agent + public website for URL-submitted
audits), consolidate v1's rubric and methodology learnings, and
preserve Ordinal's HITL Pass 2 evidence so it does not need to be
re-collected.
Eight phases sketched (Phases 0 through 7): schema design, consolidate v1 learnings,
migrate audit template to YAML, update SKILL.md and methodology,
preserve and port Ordinal HITL evidence, decide downstream
backwards-compat path, re-run Ordinal under v2, cascade to
downstream skill.
Includes a complete inventory of HITL-derived facts to carry
forward (active usage observations, signing and delivery shape
from the captured payload, audience verification, scoring
decisions). Cross-check this list at every phase boundary.
Plan is for review before execution; commit per phase once
execution starts.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: PLAN-v2 - all-in on YAML upstream, lockstep downstream cascade
Per direction: upstream skill emits YAML only. No Markdown audit
output, no renderer, no Markdown template, no transitional
backwards-compat phase. The downstream outpost-customer-audit-report
skill is updated in lockstep to consume YAML.
Changes from the previous PLAN-v2 draft:
- Target layout drops renderers/ and assets/report-template.md
- Phase 2 simplifies: produce assets/report-template.yaml and delete
the Markdown template; no renderer
- Phase 3 SKILL.md update: explicit YAML-only output
- Phase 5 (was "decide on backwards-compatibility") removed; no
decision to make - downstream cascades in lockstep
- Phase 5 (new, was Phase 6) re-runs Ordinal; produces audit.yaml
only; v1 audit.md gets archived to customers/ordinal/archive/
- Phase 6 (new, was Phase 7) cascades to downstream skill in
hookdeck-skills-internal: input becomes YAML, customer report
stays Markdown (still the customer-facing deliverable)
Customer report format kept as Markdown for now since it is the
customer-facing artifact. Open question for review: if the cloud
agent's website ends up rendering customer reports as well, that
decision can flip to YAML in a follow-up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: PLAN-v2 - resolve customer report format decision
Customer report stays Markdown. The cloud agent has no current plan
to render customer reports; the customer-facing artifact is sent or
shared as a file. Decision settled, not an open question.
Open Question 3 ("Customer report format") moved from the open list
to a new "Resolved decisions" section at the top, which lists the
resolved choices that v2 execution should not relitigate
(upstream YAML-only, customer Markdown, downstream lockstep).
Remaining open questions renumbered.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: PLAN-v2 - add commit refs, schema sketch, open-question recommendations
Three additions to make PLAN-v2 self-sufficient for a fresh agent
picking up the v2 work cold:
Phase 1 consolidation list: each item now has the v1 commit hash
inline so the rationale is one git show away. Editorial qualifier
rules also annotated as downstream-only with a pointer at the
internal repo's methodology.
Schema sketch (illustrative): inline YAML showing the rough shape of
audit.yaml and hitl-evidence.yaml. Field names, nesting, status
enums, and scoring decision records all present. Marked as a
starting point that Phase 0 refines against the schema linter; not
authoritative.
Open question recommendations: each open question now has a
"Recommendation:" line so a fresh agent has a default to push back
against rather than picking from scratch (schema tooling, YAML lib
and lint config, cloud-agent field reservation, archive location;
re-audit timing already had one). Open questions remain genuine
questions; the recommendations are starting points.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(webhook-dx-audit): schema and lint tooling for v2 audit YAML
PLAN-v2 Phase 0. Adds the JSON Schema (Draft 2020-12, authored in YAML)
that defines the v2 audit format and a Node-based linter that validates
audit or hitl-evidence YAML against it.
- schema/audit.schema.yaml: full audit shape with locked CategoryId and
CriterionId enums, status taxonomy, dual-score support, embedded HITL
evidence, and reserved cloud-agent fields.
- schema/hitl-evidence.schema.yaml: companion shape for the standalone
hitl-evidence pre-load file.
- schema/*.example.yaml: illustrative Stripe-shaped examples that
validate against the schemas.
- schema/README.md: file layout, status taxonomy, dual-score handling,
and how to run the linter.
- scripts/lint-audit.mjs + package.json: ajv + js-yaml CLI that
auto-detects which schema to use and reports JSON-pointer paths on
failure.
Examples use Stripe (the methodology calibration anchor) rather than any
customer; the public repo carries no customer-identifying content.
Refs: skills/webhook-dx-audit/PLAN-v2.md Phase 0.
* docs: PLAN-v2 - drop added_by_report from upstream schema spec
The flag belongs in the downstream report skill's own schema; upstream
audit stays free of downstream concepts. Aligns the plan with the v2
schema that landed in 7f23b1c.
* docs(webhook-dx-audit): consolidate v1 rubric and methodology learnings
PLAN-v2 Phase 1. Walked rubric, methodology, scoring, and program-mapping
end to end against the eleven enumerated v1 commits; every rule reads
cleanly in isolation. Three categories of edit landed:
- YAML field-name translation. Prose references to v1 Markdown structure
("the report's Access frontmatter line", "the report's Access limits",
the audit's `Audience:` header line) now point at v2 YAML fields
(`audience.designation`, `audience.signals`, `access_limits`,
`summary`, `recommendations`).
- Editorial rules tightened. The "Stay factual; no editorial" tactic now
carries the two specific sub-rules from the downstream methodology
Section 3: no company-stage commentary and no unanchored qualifiers.
These apply to upstream audit prose too; lifting them keeps audit-side
voice consistent with how the customer-facing report reads them.
- Summary scoping rule added as a methodology tactic. The Summary should
list only platform features that contribute to the webhook and
event-destination surface; OpenAPI specs without `webhooks` blocks,
platform-wide MCPs without webhook tools, and CLIs that do not manage
webhook config belong in their respective category findings, not in
the summary.
The other Phase 1 items (Cat 3 rename and intro, Cat 12 restructure,
Cat 2/5/7/11 intro cleanups, audience verification at Pass 1, HITL
acronym expansion on first use, methodology steps 3 and 5 broadened to
webhook AND event destinations, Cat 5 six-criteria example correction,
program-mapping reliable-ingestion row) were already in place; verified
without further edits. HITL payload capture lives in SKILL.md and is
Phase 3 territory.
Refs: skills/webhook-dx-audit/PLAN-v2.md Phase 1.
* feat(webhook-dx-audit): migrate audit template to YAML
PLAN-v2 Phase 2. Replaces assets/report-template.md (deleted) with
assets/report-template.yaml: a structural skeleton enumerating all 12
categories and all 54 criterion IDs with placeholder values that lint
clean against schema/audit.schema.yaml. Inline comments explain each
field's purpose, valid values, and the rubric/methodology section that
anchors it.
The template carries the v2-specific guidance directly:
- summary scoping rule (only items contributing to the webhook surface).
- editorial rules (no company-stage commentary, no unanchored qualifiers).
- status taxonomy quick reference (scored / not_supported /
not_applicable / not_assessed).
- Cat 5 Table 1 reminder and Cat 7/11 Table 2 reminder for N/A logic.
- Cat 12 reminder not to re-score Cat 4 / Cat 8 surfaces.
The lint script now covers the template alongside both example files so
schema drift catches it.
Refs: skills/webhook-dx-audit/PLAN-v2.md Phase 2.
* docs(webhook-dx-audit): SKILL.md to YAML-only flow
PLAN-v2 Phase 3.
Frontmatter: description now states "produces a structured YAML audit
file" and points at schema/audit.schema.yaml so callers understand the
output shape from the trigger. Version bumped to 0.2.0 to match the
schema, template, and example files.
Roles section: HITL captures fill structured fields - delivery payload
to hitl_evidence.delivery_payload_capture, in-product observations as
findings[].criteria[].evidence strings keyed by criterion id, scoring
decisions as hitl_evidence.scoring_decisions records. Explicitly bars
free-form Pass-2 narratives in the summary; the dual-score data lives
in grade.public_scope / grade.provisional_minimum and the closed Pass-2
criteria live in passes.pass_2.closed_criteria.
How an audit runs: step 0 scaffolds the audit YAML from the template
and leads with the default flow (Pass 1 unattended, Pass 2 HITL
prompted by the agent at step 4). Pre-loaded HITL evidence is called
out as the exception. Step 4 collects HITL evidence and writes a
sibling hitl-evidence.yaml so the next re-audit can pre-load it;
when step 0 pre-loaded a companion file, step 4 updates it in place.
Steps 2-6 reference the YAML fields they populate (audit.findings,
audit.scorecard, audit.grade, audit.hitl_evidence, audit.summary,
audit.recommendations) and include a lint step before handoff.
Output and Reference files: structured YAML output described by field
group; new entries for schema/audit.schema.yaml and schema/README.md;
template reference updated to report-template.yaml.
Path conventions left to the caller: no customers/<name>/... prescribed
in upstream prose; the companion file is described as a sibling of the
audit file.
Acceptance: no references to "fill in the Markdown template" remain;
no "written review (Markdown)" framing; SKILL.md reads coherently
against the YAML-only flow.
Refs: skills/webhook-dx-audit/PLAN-v2.md Phase 3.
* docs(webhook-dx-audit): methodology tightenings from v2 Ordinal run
Two refinements surfaced during the v2 Ordinal dress rehearsal
(downstream commit 7621577 on feat/v2-ordinal-audit):
1. Generalize the JS-rendered-nav carve-out. The rubric currently
notes this only for Cat 12 push-to-agent doc actions, but Cat 1
findability hits the same issue when the docs site renders its top
nav in JavaScript (Mintlify, Docusaurus 3+, GitBook, ReadMe).
Methodology step 1 now instructs the agent to default Not Assessed
from a plain fetch and verify in a browser during HITL rather than
scoring 0 from an empty fetch.
2. Tactic: fetch the OpenAPI spec directly when scoring Cat 4
machine-readable-spec. An LLM-summarized doc read can confuse the
top-level `webhooks` key (per-event payload contracts, the 3.1
feature this criterion scores) with a Tag named "Webhooks" that
groups CRUD endpoints under `/webhooks`. Curl the spec and check
`len(spec.webhooks)` programmatically; the same applies to AsyncAPI
presence and per-event JSON Schema files.
Both edits are methodology-only; rubric anchors, schema, and template
unchanged.
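The programmatic check in item 2 might look like the sketch below. It assumes the spec has already been fetched and parsed into a dict (for example with `json.load`); the function name and the returned keys are hypothetical. The only facts relied on are that OpenAPI 3.1 defines a top-level `webhooks` object and that tags live under the top-level `tags` list.

```python
def webhook_contract_signals(spec: dict) -> dict:
    """Separate per-event payload contracts from a tag that merely groups CRUD routes."""
    webhooks = spec.get("webhooks") or {}
    tags = [t.get("name", "") for t in (spec.get("tags") or []) if isinstance(t, dict)]
    paths = spec.get("paths") or {}
    return {
        # What the Cat 4 criterion scores: entries under the top-level key.
        "webhook_contracts": len(webhooks),
        # What an LLM-summarized read can mistake for the above.
        "has_webhooks_tag": any(name.lower() == "webhooks" for name in tags),
        "management_paths": sorted(p for p in paths if p.startswith("/webhooks")),
    }
```

A spec with a "Webhooks" tag and `/webhooks` routes but no top-level `webhooks` key reports `webhook_contracts` as 0, which is the case the tactic exists to catch.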
* feat(webhook-dx-audit): require `why` on recommendations + reviewer-artifact rule
Two methodology refinements surfaced during the downstream Outpost-customer
review of the v2 Ordinal report: recommendations consistently lacked an
articulated user-facing benefit, and HITL-captured deliveries muddied
operator-side practice with reviewer-introduced artifacts.
Schema:
- Recommendation gains a required `why` field, separate from `body`. `body`
describes what to change; `why` names the integrator-side benefit and the
user-facing pain the gap creates. The split is load-bearing so downstream
renderers can use each independently and so recommendations read as
arguments rather than orders.
- schema/audit.schema.example.yaml updated to show `why` on both
illustrative Stripe recommendations.
- assets/report-template.yaml carries a `why: TBD` placeholder.
methodology.md:
- New "Writing recommendations" section codifies the benefit-not-rule
framing, anchoring to specific user-facing pain, and the rule against
duplicating benefit framing in `body`. Calls out the Cat 6
destination-type-breadth pitfall: phrase the recommendation around adding
non-HTTP destination types (SQS, Pub/Sub, EventBridge, Kafka, Event Grid)
with the integrator benefit named, not as "rename your HTTP endpoint" -
renaming an endpoint does not change what is delivered. Stripe's
evolution is the cited example: existing webhook product stayed in
place; new destination types were added alongside.
- New "Distinguishing reviewer artifacts from operator-side practice"
section: HITL captures often involve reviewer-configured headers, test
webhooks, and synthetic deliveries. Anything the reviewer set up to
enable the capture is not evidence of operator behavior. Findings and
recommendations citing observed deliveries must anchor on operator-
controlled docs, API surface, or in-product copy; reviewer-set custom
headers (the borderline case that surfaced this) must not be cited as
evidence the operator surfaces a feature in practice. Annotate borderline
HITL records in `audit.hitl_evidence.other_observations` keyed by
criterion id so reviewer artifacts cannot be mistaken for operator state.
SKILL.md step 6 now points at "Writing recommendations" so the `why`
requirement is discoverable from the workflow walk-through.
Verification: `npm run lint:file` clean on schema/audit.schema.example.yaml
and assets/report-template.yaml against the updated schema.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(webhook-dx-audit): target_scores + depends_on + effort on Recommendation; projected grade
Add three optional Recommendation fields and a projected grade roll-up so
downstream renderers can show current-vs-potential impact, sequence
recommendations by hard dependencies, and surface coarse implementation
effort on each one.
Schema (schema/audit.schema.yaml):
- Recommendation gains three optional fields:
- `target_scores`: list of {criterion_id, target_score, note?}
records naming which criteria the recommendation lifts to which
0/1/2 score. The 2-anchor is honest: documenting a single existing
auth option reaches 1 (one documented option), not 2 (multiple).
When multiple recommendations target the same criterion, downstream
renderers take the max.
- `depends_on`: list of recommendation IDs that must land before this
one delivers its full value. Hard dependencies only (e.g. Rec 2's
verification step references Rec 1's signing documentation). Soft
sequencing preferences belong in body or summary.
- `effort`: coarse implementation effort, enum docs|s|m|l. `docs` is
one page or section with little or no engineering work; `s` is a
small product change (a button, a new endpoint, a config knob) on
the order of days; `m` is a new feature surface on the order of
weeks; `l` is an architectural change on the order of months.
- Two new top-level $defs back the fields:
- TargetScore: required {criterion_id, target_score}, optional note
- EffortLevel: enum docs|s|m|l with calibration description
- Grade gains an optional `projected` GradeRollup. Present when at least
one recommendation has `target_scores`. Computed by taking the max
target_score across recommendations per criterion (current score
carries forward for criteria no recommendation touches), rolled up via
the standard category weighting. Lets downstream renderers display the
audit as current vs potential.
- ScorecardEntry gains an optional `projected_pct` per category, present
when `grade.projected` is present. Powers a side-by-side scorecard in
downstream renderers.
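One way a downstream renderer could sequence recommendations by `depends_on` is a topological sort. The sketch below uses the standard-library `graphlib` module (Python 3.9+); the `id` key on each recommendation and the function name are assumptions, since the text only says that `depends_on` holds recommendation IDs.

```python
from graphlib import TopologicalSorter


def order_recommendations(recommendations: list) -> list:
    """Return recommendation IDs so that every hard dependency comes first.

    Raises graphlib.CycleError if the dependencies form a cycle.
    """
    # TopologicalSorter expects {node: predecessors}, which is exactly
    # what depends_on expresses.
    graph = {rec["id"]: set(rec.get("depends_on", [])) for rec in recommendations}
    return list(TopologicalSorter(graph).static_order())
```

Recommendations without `depends_on` can appear anywhere relative to each other; only the hard dependencies constrain the order. An ID that appears only inside a `depends_on` list still shows up in the output, so a caller would want to validate references first.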
Methodology (references/methodology.md):
- "Writing recommendations" gains a "Populating `target_scores`,
`depends_on`, `effort`" sub-section covering how to choose each
recommendation's targets honestly (don't over-promise 2 when the
rubric anchor isn't reachable), how to scope dependencies (hard only),
and how to calibrate effort (against the EffortLevel enum, judged on
what the platform team would do, not on the operator's team size).
- "Computing `grade.projected` and `scorecard[].projected_pct`"
sub-section codifies the projection rule: per criterion, take the max
target_score across all recommendations targeting it; criteria no
recommendation touches carry their current score forward; roll up via
scoring.md; carry N/A criteria the same way.
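The projection rule can be sketched as follows. The per-criterion step follows the text literally (max `target_score` across recommendations, current score otherwise). The percentage helper is an assumption: it treats a category's percentage as points earned over two points per scored criterion and does not reproduce the category weighting that scoring.md applies to the overall grade. Function names and the use of `None` for N/A criteria are hypothetical.

```python
from typing import Optional


def project_scores(current: dict, recommendations: list) -> dict:
    """Per criterion: max target_score across recommendations, else the current score."""
    best_target = {}
    for rec in recommendations:
        for target in rec.get("target_scores", []):
            cid = target["criterion_id"]
            score = target["target_score"]
            best_target[cid] = max(best_target.get(cid, score), score)
    # Criteria no recommendation touches carry their current value forward,
    # including None for N/A criteria.
    return {cid: best_target.get(cid, score) for cid, score in current.items()}


def category_pct(scores: list) -> Optional[int]:
    """Assumed roll-up for one category: points earned out of 2 per scored criterion."""
    scored = [s for s in scores if s is not None]  # N/A criteria are excluded
    if not scored:
        return None
    return round(100 * sum(scored) / (2 * len(scored)))
```

Note that taking the max across targets, as the text specifies, means a target lower than the current score would lower the projection; an implementation may want to reject such targets at lint time.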
SKILL.md step 6 updated to instruct populating the three new fields on
every recommendation that closes a scored gap, and to compute the
projected grade when at least one recommendation has target_scores.
assets/report-template.yaml gains commented-out placeholders for the new
fields with calibration notes.
schema/audit.schema.example.yaml updated: Stripe illustrative audit now
shows `grade.projected` at 95% (A), per-category `projected_pct` on the
two categories the recommendations affect (Cat 5 Security 83 -> 100; Cat
7 Setup 75 -> 100), and the two recommendations carry `target_scores`
and `effort`. The Rec 1 (Terraform provider) example also demonstrates
the `effort: m` calibration for product work; Rec 2 (auth docs framing)
demonstrates `effort: docs`.
Backwards compatibility: all five new fields (target_scores, depends_on,
effort, grade.projected, scorecard[].projected_pct) are optional. Audits
that pre-date this change still lint clean. The projected roll-up only
appears in downstream renderers when target_scores are populated.
Verification: lint clean on schema/audit.schema.example.yaml and
assets/report-template.yaml against the updated schema.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(webhook-dx-audit): clarify effort reflects platform-context cost
Effort ratings on recommendations were calibrated implicitly against a
from-scratch baseline, but the audit reviews a specific operator's
surface and the operator's actual cost depends on what their delivery
backend ships. A recommendation that would be `l` for a platform
building delivery primitives themselves can drop to `m` or `s` when the
underlying capability is shipped by a backend like Outpost, Svix, or
Convoy.
Updated the "Populating `target_scores`, `depends_on`, `effort`" section
in references/methodology.md to make this explicit. The rater consults
`audit.context` when calibrating effort: when the context names a
delivery backend that ships the capability being recommended, rate the
remaining surfacing work (dashboard / API / docs), not the from-scratch
implementation cost. When the audit has no platform context, rate
from-scratch as the safer default and flag the assumption in the audit
`summary` so a downstream skill with platform knowledge can override.
No schema changes. The Stripe example in schema/audit.schema.example
.yaml keeps its `m` rating on the Terraform-provider recommendation
because Stripe builds the provider themselves; no delivery-backend
translation applies there.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(webhook-dx-audit): API-endpoint disambiguation; soften destination-type rename rule
Two related methodology refinements surfaced during HITL review of the
Ordinal report.
API-endpoint disambiguation (Tactics section):
The word "endpoint" is overloaded in webhook contexts. It can mean the
integrator's HTTP receiver (the URL they expose to receive webhook
deliveries) or the platform's management API endpoint (the route
integrators call to create / list / delete webhook destinations). A
bare `POST /webhooks` could read as either. Reviewers in this project
hit the ambiguity twice on Rec 3 of the Ordinal v2 report: once on
"rename POST /webhooks to POST /event-destinations" (read as renaming
an integrator-receiver URL, which is nonsensical) and again on "your
current POST /webhooks stays in place" (same ambiguity).
New Tactics rule: whenever the audit or recommendations name an HTTP
route, qualify it with the role it plays. "the destination-creation API
endpoint POST /webhooks"; "the webhook-management API at
/api-reference/webhooks/"; "the integrator's webhook-receiving URL".
A reader who cannot tell at a glance which side of the wire an endpoint
sits on will misread the recommendation.
Destination-type-breadth rename rule (Writing recommendations):
The previous version of this rule said "do not phrase the
recommendation as 'rename POST /webhooks to POST /event-destinations';
renaming an HTTP endpoint does not change what is delivered, and the
framing confuses an API design decision with the underlying capability."
That was overcorrecting. Once the destination-creation API endpoint is
extended to create non-HTTP destinations (SQS, Pub/Sub, EventBridge,
Kafka, Event Grid), the endpoint name `POST /webhooks` arguably
misrepresents what it does, and renaming to `POST /event-destinations`
or `POST /destinations` is a valid API-design refinement that signals
the broader scope to integrators reading the docs. The rule now reads:
lead with the capability addition (not the rename) as the primary
recommendation; surface the API rename as an optional secondary
refinement; recommend keeping the original endpoint as an alias for
backwards compatibility.
Together these two rules sharpen the audit and report voice when
recommending category-6 destination-type-breadth changes specifically,
and any other recommendation that names an API endpoint generally.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(webhook-dx-audit): "ingest-verify-queue" is a practice, not a pattern; type the implementations
"Ingest-verify-queue" is not a canonical industry-named pattern with
formal references. It is best-practice shorthand for the goal of
acknowledging fast, verifying, and handing off to async processing.
Calling it a "pattern" overclaims; the rubric, methodology, and
program-mapping now call it a "practice".
The four items previously grouped as "concrete implementations of the
pattern" are different categories of thing, and the umbrella implied
each was itself a pattern. Each now carries its own type:
- Hookdeck Event Gateway: a managed solution that ships the practice
out of the box
- AWS EventBridge + API Gateway: a cloud-native composition
- GCP Pub/Sub + a serverless function: a cloud-native composition
- Queue + worker on the integrator's own infrastructure: a self-hosted
setup
Only the queue + worker option is genuinely a pattern in the generic
sense; the others are products and cloud compositions.
Rubric Cat 3 criterion updated end to end with the new terminology.
Methodology step 3 and program-mapping Cat 3 row updated to match.
No schema changes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: update bundle count to 40 skills (knock-webhooks + orb-webhooks merged since count was last set)
The README claimed "38 skills" in two places (bundle install copy) but
the bundle in .claude-plugin/marketplace.json now lists 40. The two
extra skills (knock-webhooks #64, orb-webhooks #63) were merged to main
after the bundle count was last updated, and the webhook-dx-audit PR
branch picked them up via merge from main without re-syncing the count.
Test plan item from PR #67 calls for "the bundle still totals 38
skills" - that's now 40. README updated in both occurrences (lines 111,
120) to match the actual bundle contents.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Adds a complete `orb-webhooks` provider skill for Orb (usage-based billing). HMAC-SHA256 manual verification with the unusual signed-content format `v1:{X-Orb-Timestamp}:{rawBody}` (literal `v1` prefix, ISO-8601 timestamp, colon separators).
What's included
- `skills/orb-webhooks/SKILL.md`: entry point with frontmatter and the verification core
- `skills/orb-webhooks/references/`: overview (event taxonomy + summary-webhooks variant), setup (dashboard config + per-endpoint secret), verification (signature algorithm, gotchas, idempotency recommendation)
- `skills/orb-webhooks/examples/`: Express, Next.js App Router, FastAPI handlers with tests
- `providers.yaml`, `README.md`, `.claude-plugin/marketplace.json` (both as a standalone plugin and added to the `webhook-skills` bundle)
Notes
- `X-Orb-Signature: v1=<hex>` carries the HMAC; `X-Orb-Timestamp: <ISO-8601>` carries the timestamp separately.
- `v1:{X-Orb-Timestamp}:{raw-body}`: literal `v1`, colon, ISO timestamp (as a string, not a Unix epoch), colon, raw body bytes. Pass the raw request body; don't `JSON.parse` and re-serialize.
- [...] `X-Orb-Timestamp` and recommends consumers pick a threshold. The skill recommends a 5-minute window in handlers plus event-id idempotency for at-least-once delivery safety.
- [...] (`customer.created`, `customer.credit_balance_dropped`), subscriptions (`subscription.created/.started/.ended/.plan_changed/.edited/.usage_exceeded`), invoices (`invoice.issued/.payment_succeeded/.payment_failed/.edited`), data exports (`data_exports.transfer_success`).
- `orb-billing` on both npm and PyPI (same package name on both). Neither SDK exposes a Stripe-style `unwrap()` / `constructEvent()` helper at the time of writing; manual HMAC verification is the canonical path. The SDK is declared in `providers.yaml`'s `sdks` field so future review runs will catch stale pins.
Test plan
- `cd skills/orb-webhooks/examples/express && npm install && npm test`
- `cd skills/orb-webhooks/examples/nextjs && npm install && npm test`
- `cd skills/orb-webhooks/examples/fastapi && python3 -m venv venv && source venv/bin/activate && pip install -r requirements.txt && pytest test_webhook.py -v`
- [...] (`"v1=" + HMAC-SHA256(secret, "v1:" + iso_ts + ":" + body).hexdigest()`)
- `webhook-skills` marketplace bundle now lists 38 skill paths (37 → 38)
Generation details
- `./scripts/generate-skills.sh generate orb --config providers.yaml --model claude-opus-4-7`
- `npx hookdeck-cli listen 3000 orb --path /webhooks/orb`
- https://claude.ai/code/session_01NNTgQRJss1V7gyzzJ9rjnB
Generated by Claude Code
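The verification scheme described in the Notes can be sketched in a few lines of standard-library Python. This is an illustration, not the code shipped in the skill's examples: the function names are hypothetical, and the timestamp handling assumes the ISO-8601 value ends in `Z` or an explicit offset (a value with no offset is treated as UTC).

```python
import hashlib
import hmac
from datetime import datetime, timedelta, timezone

TOLERANCE = timedelta(minutes=5)  # the window the skill recommends


def sign(secret: str, timestamp: str, raw_body: bytes) -> str:
    """Build the X-Orb-Signature value: "v1=" + hex HMAC-SHA256 of v1:{ts}:{body}."""
    signed = b"v1:" + timestamp.encode() + b":" + raw_body
    return "v1=" + hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()


def verify(secret: str, signature: str, timestamp: str, raw_body: bytes,
           now: datetime = None) -> bool:
    """Check the signature first, then the timestamp freshness."""
    expected = sign(secret, timestamp, raw_body)
    # Constant-time comparison over the raw, unparsed body.
    if not hmac.compare_digest(expected.encode(), signature.encode()):
        return False
    try:
        # Assumption: normalise a trailing "Z" so older Pythons can parse it.
        sent = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    except ValueError:
        return False
    if sent.tzinfo is None:
        sent = sent.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    now = now or datetime.now(timezone.utc)
    return abs(now - sent) <= TOLERANCE
```

A handler would call `verify` with the values of the `X-Orb-Signature` and `X-Orb-Timestamp` headers and the raw request bytes, then deduplicate on the event id before processing, since delivery is at-least-once.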