fix: address scrapfly review feedback + integrate into README and providers.yaml

leggetter · leggetter · commit 96f2c4bcb1a8 · 2026-05-11T23:30:17.000+01:00
- Correct X-Scrapfly-Webhook-Env header values (test/live, not production)
- Document actual payload envelope (context.webhook, context.job)
- Warn that payload echoes the signing secret at context.webhook.secret
- Distinguish default vs opt-in crawler events
- Add retry schedule and 100-failure auto-disable behavior
- Note paid-plan requirement (FREE plan has webhook queue size 0)
- Read scrape URL from payload.result.url (not the webhook context overlay)
- Add Scrapfly row to README Provider Skills table
- Add scrapfly entry to providers.yaml (docs URLs, notes, testScenario)

https://claude.ai/code/session_01NNTgQRJss1V7gyzzJ9rjnB
diff --git a/README.md b/README.md
@@ -53,6 +53,7 @@ Skills for receiving and verifying webhooks from specific providers. Each includ
 | Postmark | [`postmark-webhooks`](skills/postmark-webhooks/) | Authenticate Postmark webhooks (Basic Auth/Token), handle email delivery, bounce, open, click, and spam events |
 | Replicate | [`replicate-webhooks`](skills/replicate-webhooks/) | Verify Replicate webhook signatures, handle ML prediction lifecycle events |
 | Resend | [`resend-webhooks`](skills/resend-webhooks/) | Verify Resend webhook signatures, handle email delivery and bounce events |
+| Scrapfly | [`scrapfly-webhooks`](skills/scrapfly-webhooks/) | Verify Scrapfly webhook signatures (HMAC-SHA256, uppercase/lowercase hex), dispatch scrape, extraction, and screenshot jobs |
 | SendGrid | [`sendgrid-webhooks`](skills/sendgrid-webhooks/) | Verify SendGrid webhook signatures (ECDSA), handle email delivery events |
 | Shopify | [`shopify-webhooks`](skills/shopify-webhooks/) | Verify Shopify HMAC signatures, handle order and product webhook events |
 | Slack | [`slack-webhooks`](skills/slack-webhooks/) | Verify Slack Events API signatures (HMAC-SHA256, `X-Slack-Signature`), handle message, app_mention, and reaction events |
diff --git a/providers.yaml b/providers.yaml
@@ -465,6 +465,61 @@ providers:
         - Bounce
         - Delivery
 
+  - name: scrapfly
+    displayName: Scrapfly
+    docs:
+      scrape_webhook: https://scrapfly.io/docs/scrape-api/webhook
+      extraction_webhook: https://scrapfly.io/docs/extraction-api/webhook
+      screenshot_webhook: https://scrapfly.io/docs/screenshot-api/webhook
+      scrape_getting_started: https://scrapfly.io/docs/scrape-api/getting-started
+      extraction_getting_started: https://scrapfly.io/docs/extraction-api/getting-started
+      screenshot_getting_started: https://scrapfly.io/docs/screenshot-api/getting-started
+    notes: >
+      Web-scraping API platform with three products that share a single async-job +
+      webhook system: Scrape API, Extraction API, Screenshot API. One webhook URL
+      registered in the dashboard (https://scrapfly.io/dashboard/webhook) receives
+      deliveries from all three products. PAID PLAN REQUIRED (any paid tier).
+
+      No API exists for creating/updating/deleting webhooks programmatically. The
+      destination URL CANNOT be passed per-call. Instead, each API call references
+      an already-registered webhook by name via the `webhook_name` query parameter
+      (e.g. `…/scrape?…&webhook_name=samples-capture`).
+
+      Signature verification: HMAC-SHA256 over the RAW request body bytes (do not
+      JSON.parse and re-stringify — that changes the byte sequence). Compare against
+      either `X-Scrapfly-Webhook-Signature` (uppercase hex) or
+      `X-Scrapfly-Webhook-Signature-Lowercase` (lowercase hex) using constant-time
+      equality. The secret is per-webhook, displayed in the dashboard alongside the
+      webhook configuration (NOT the account API key).
+
+      Dispatch by `X-Scrapfly-Webhook-Resource-Type` header (one of `scrape`,
+      `extraction`, `screenshot`). Other headers: `X-Scrapfly-Webhook-Job-Id` (UUID,
+      use as idempotency key for at-least-once delivery), `X-Scrapfly-Webhook-Env`
+      (`test`|`live`), `X-Scrapfly-Webhook-Project`, `X-Scrapfly-Webhook-Name`,
+      `X-Scrapfly-Webhook-Id`, optional `X-Scrapfly-Log-Uuid`/`X-Scrapfly-Log-Url`.
+
+      No timestamp/replay envelope (unlike Stripe). Recommend idempotency by job-id;
+      do NOT invent a `t=…` window.
+
+      Payload = the full response body of the corresponding API plus a `context`
+      overlay: `context.webhook` (`{ name, secret, consecutive_failed_count, … }`;
+      WARNING for handlers: the `secret` field carries the signing secret, so never
+      log or echo it) and `context.job` (`{ uuid, … }`). Product-specific shapes
+      are documented in the getting-started pages above.
+
+      Delivery: retry 30s → 1min → 5min → 30min → 1h → 1d. A webhook is DISABLED
+      after 100 consecutive failures — handlers should return 2xx fast and surface
+      errors out-of-band.
+
+      No official SDK construct for verification (plain HMAC is correct). Do NOT
+      pull in a third-party HMAC library; use the stdlib (`crypto.createHmac` in
+      Node, `hmac` / `hashlib` in Python).
+    testScenario:
+      events:
+        - scrape
+        - extraction
+        - screenshot
+
   - name: sendgrid
     displayName: SendGrid
     docs:
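Review note on the `providers.yaml` hunk above: the verification rules in the `notes` field (HMAC-SHA256 over the raw body bytes, either hex-case header, constant-time comparison) can be sketched with the Python stdlib as follows. This is a minimal sketch, not code from this commit. The function name `verify_scrapfly_signature` is hypothetical, and encoding the secret as UTF-8 is an assumption.

```python
import hashlib
import hmac


def verify_scrapfly_signature(raw_body: bytes, signature_header: str, secret: str) -> bool:
    """Return True if signature_header matches HMAC-SHA256(secret, raw_body).

    Hypothetical helper. It accepts the value of either signature header,
    because the notes describe them as the same digest in different hex case.
    """
    if not signature_header:
        return False
    # Assumption: the dashboard secret is used as UTF-8 bytes for the HMAC key.
    expected = hmac.new(secret.encode("utf-8"), raw_body, hashlib.sha256).hexdigest()
    # hexdigest() is lowercase, so normalise the header before comparing.
    received = signature_header.strip().lower()
    # Compare as bytes: compare_digest is constant-time and accepts any bytes.
    return hmac.compare_digest(expected.encode("ascii"), received.encode("utf-8"))
```

The body must be the bytes exactly as received (for example `await request.body()` in FastAPI); parsing and re-serialising the JSON first would change the digest.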
diff --git a/skills/scrapfly-webhooks/SKILL.md b/skills/scrapfly-webhooks/SKILL.md
@@ -95,7 +95,9 @@ app.post('/webhooks/scrapfly',
     // Route by resource type for scrape / extraction / screenshot APIs
     switch (resourceType) {
       case 'scrape':
-        console.log('Scrape result:', payload.result?.status_code, payload.context?.url);
+        // Scrape API places the fetched URL at result.url; the webhook overlay
+        // only adds `webhook` and `job` sub-objects to payload.context.
+        console.log('Scrape result:', payload.result?.status_code, payload.result?.url);
         break;
       case 'extraction':
         console.log('Extraction result:', payload.result?.data);
@@ -177,7 +179,7 @@ Crawler API webhooks carry an `event` string in the body (also exposed as `X-Scr
 | `X-Scrapfly-Webhook-Name` | Name of the configured webhook |
 | `X-Scrapfly-Webhook-Resource-Type` | `scrape`, `extraction`, or `screenshot` |
 | `X-Scrapfly-Webhook-Job-Id` | Unique job identifier (use for reconciliation) |
-| `X-Scrapfly-Webhook-Env` | Environment (e.g. `production`) |
+| `X-Scrapfly-Webhook-Env` | Environment (`test` or `live`) |
 | `X-Scrapfly-Webhook-Project` | Project name |
 | `X-Scrapfly-Crawl-Event-Name` | Crawler API event name (e.g. `crawler_finished`) |
 
diff --git a/skills/scrapfly-webhooks/examples/express/src/index.js b/skills/scrapfly-webhooks/examples/express/src/index.js
@@ -72,8 +72,10 @@ app.post('/webhooks/scrapfly',
     // Route by resource type for the Scrape / Extraction / Screenshot APIs.
     switch (resourceType) {
       case 'scrape':
+        // Scrape API places the fetched URL at result.url (see scrapfly.io/docs/scrape-api/getting-started).
+        // The webhook overlay only adds `webhook` and `job` sub-objects to payload.context.
         console.log('Scrape result:', {
-          url: payload?.context?.url,
+          url: payload?.result?.url,
           status: payload?.result?.status_code,
         });
         // TODO: Persist HTML / extracted fields, enqueue parsing, ...
diff --git a/skills/scrapfly-webhooks/examples/express/test/webhook.test.js b/skills/scrapfly-webhooks/examples/express/test/webhook.test.js
@@ -96,8 +96,15 @@ describe('Scrapfly Webhook Endpoint', () => {
 
     it('returns 200 for a valid scrape webhook', async () => {
       const body = JSON.stringify({
-        context: { url: 'https://web-scraping.dev/products' },
-        result: { status_code: 200, content: '<html></html>' },
+        result: {
+          url: 'https://web-scraping.dev/products',
+          status_code: 200,
+          content: '<html></html>',
+        },
+        context: {
+          webhook: { name: 'my-webhook', secret: 'test_scrapfly_signing_secret', consecutive_failed_count: 0 },
+          job: { uuid: '550e8400-e29b-41d4-a716-446655440000' },
+        },
       });
       const sig = generateScrapflySignature(Buffer.from(body), secret);
 
diff --git a/skills/scrapfly-webhooks/examples/fastapi/main.py b/skills/scrapfly-webhooks/examples/fastapi/main.py
@@ -85,9 +85,10 @@ async def scrapfly_webhook(
     resource_type = x_scrapfly_webhook_resource_type
 
     if resource_type == "scrape":
+        # Scrape API places the fetched URL at result.url. The webhook overlay
+        # only adds `webhook` and `job` sub-objects to payload["context"].
         result = payload.get("result", {})
-        context = payload.get("context", {})
-        print(f"Scrape result: url={context.get('url')} status={result.get('status_code')}")
+        print(f"Scrape result: url={result.get('url')} status={result.get('status_code')}")
         # TODO: Persist HTML / extracted fields, enqueue parsing
     elif resource_type == "extraction":
         print(f"Extraction result: {payload.get('result', {}).get('data')}")
diff --git a/skills/scrapfly-webhooks/examples/fastapi/test_webhook.py b/skills/scrapfly-webhooks/examples/fastapi/test_webhook.py
@@ -74,8 +74,19 @@ def test_tampered_body(self, client, secret):
     def test_valid_scrape_webhook(self, client, secret):
         body = json.dumps(
             {
-                "context": {"url": "https://web-scraping.dev/products"},
-                "result": {"status_code": 200, "content": "<html></html>"},
+                "result": {
+                    "url": "https://web-scraping.dev/products",
+                    "status_code": 200,
+                    "content": "<html></html>",
+                },
+                "context": {
+                    "webhook": {
+                        "name": "my-webhook",
+                        "secret": secret,
+                        "consecutive_failed_count": 0,
+                    },
+                    "job": {"uuid": "550e8400-e29b-41d4-a716-446655440000"},
+                },
             }
         ).encode("utf-8")
         sig = generate_scrapfly_signature(body, secret)
diff --git a/skills/scrapfly-webhooks/examples/nextjs/app/webhooks/scrapfly/route.ts b/skills/scrapfly-webhooks/examples/nextjs/app/webhooks/scrapfly/route.ts
@@ -75,8 +75,10 @@ export async function POST(request: NextRequest) {
 
   switch (resourceType) {
     case 'scrape':
+      // Scrape API places the fetched URL at result.url. The webhook overlay
+      // only adds `webhook` and `job` sub-objects to payload.context.
       console.log('Scrape result:', {
-        url: payload?.context?.url,
+        url: payload?.result?.url,
         status: payload?.result?.status_code,
       });
       break;
diff --git a/skills/scrapfly-webhooks/examples/nextjs/test/webhook.test.ts b/skills/scrapfly-webhooks/examples/nextjs/test/webhook.test.ts
@@ -53,8 +53,11 @@ describe('Scrapfly Webhook Endpoint (Next.js)', () => {
 
   it('returns 200 for a valid scrape webhook', async () => {
     const body = JSON.stringify({
-      context: { url: 'https://web-scraping.dev/products' },
-      result: { status_code: 200 },
+      result: { url: 'https://web-scraping.dev/products', status_code: 200 },
+      context: {
+        webhook: { name: 'my-webhook', secret, consecutive_failed_count: 0 },
+        job: { uuid: '550e8400-e29b-41d4-a716-446655440000' },
+      },
     });
     const sig = generateScrapflySignature(body, secret);
 
diff --git a/skills/scrapfly-webhooks/references/overview.md b/skills/scrapfly-webhooks/references/overview.md
@@ -16,22 +16,50 @@ The `X-Scrapfly-Webhook-Resource-Type` header tells you which product the delive
 | `extraction` | An async Extraction API job finishes | Persist structured data, enqueue follow-up enrichment |
 | `screenshot` | An async Screenshot API job finishes | Store image URL, notify users, generate thumbnails |
 
-The body of a `scrape` / `extraction` / `screenshot` webhook is the same JSON envelope you'd get from the synchronous API call, with extra webhook context (`webhook_name`, `webhook_uuid`, `job_uuid`).
+The body of a `scrape` / `extraction` / `screenshot` webhook is the full JSON response of the corresponding synchronous API call with a `context` overlay added:
+
+```json
+{
+  "...api_response": "...",
+  "context": {
+    "...api_context": "...",
+    "webhook": {
+      "name": "my-webhook",
+      "secret": "<signing secret — DO NOT log>",
+      "consecutive_failed_count": 0
+    },
+    "job": {
+      "uuid": "550e8400-e29b-41d4-a716-446655440000"
+    }
+  }
+}
+```
+
+The webhook overlay always carries:
+
+- `context.webhook.name` — webhook name configured in the dashboard
+- `context.webhook.secret` — the signing secret (**never log or echo this field**)
+- `context.webhook.consecutive_failed_count` — current consecutive-failure count
+- `context.job.uuid` — job UUID (same value as `X-Scrapfly-Webhook-Job-Id`)
+
+Product-specific fields (such as `result.content`, `result.data`, `result.screenshot_url`, or the API's own `context.url`) come from the underlying API response — see the [Scrape](https://scrapfly.io/docs/scrape-api/getting-started), [Extraction](https://scrapfly.io/docs/extraction-api/getting-started), and [Screenshot](https://scrapfly.io/docs/screenshot-api/getting-started) getting-started pages for shapes.
 
 ## Crawler Events
 
 The Crawler API is a separate product that delivers **lifecycle events** rather than a single result. Each event has an `event` field in the body (and an `X-Scrapfly-Crawl-Event-Name` header):
 
-| Event | Triggered When |
-|-------|----------------|
-| `crawler_started` | Crawl job started |
-| `crawler_url_visited` | A URL was fetched successfully |
-| `crawler_url_discovered` | A new URL was added to the queue |
-| `crawler_url_skipped` | A URL was skipped (deduped, filtered) |
-| `crawler_url_failed` | A URL fetch failed |
-| `crawler_stopped` | The crawl stopped (budget/limit reached) |
-| `crawler_cancelled` | The crawl was cancelled |
-| `crawler_finished` | The crawl ran to completion |
+| Event | Default? | Triggered When |
+|-------|----------|----------------|
+| `crawler_started` | Yes | Crawl job started |
+| `crawler_stopped` | Yes | The crawl stopped (budget/limit reached) |
+| `crawler_cancelled` | Yes | The crawl was cancelled |
+| `crawler_finished` | Yes | The crawl ran to completion |
+| `crawler_url_visited` | Opt-in | A URL was fetched successfully |
+| `crawler_url_discovered` | Opt-in | A new URL was added to the queue |
+| `crawler_url_skipped` | Opt-in | A URL was skipped (deduped, filtered) |
+| `crawler_url_failed` | Opt-in | A URL fetch failed |
+
+By default, Scrapfly delivers only the four lifecycle events: `crawler_started`, `crawler_stopped`, `crawler_cancelled`, `crawler_finished`. The per-URL events (`crawler_url_visited`, `crawler_url_discovered`, `crawler_url_skipped`, `crawler_url_failed`) are high-volume and must be enabled explicitly via the `webhook_events` parameter when submitting the crawl job.
 
 Example Crawler payload:
 
@@ -62,11 +90,32 @@ Example Crawler payload:
 | `X-Scrapfly-Webhook-Name` | Name of the webhook configured in the dashboard |
 | `X-Scrapfly-Webhook-Resource-Type` | `scrape`, `extraction`, or `screenshot` |
 | `X-Scrapfly-Webhook-Job-Id` | Job UUID returned at enqueue time — reconciliation key |
-| `X-Scrapfly-Webhook-Env` | Environment label (e.g. `production`) |
+| `X-Scrapfly-Webhook-Env` | Environment label (`test` or `live`) |
 | `X-Scrapfly-Webhook-Project` | Project name |
 | `X-Scrapfly-Crawl-Event-Name` | Crawler API event name (e.g. `crawler_finished`) |
 | `X-Scrapfly-Log-Uuid` / `X-Scrapfly-Log-Url` | Pointers to the Scrapfly log entry for the delivery |
 
+## Delivery & Retries
+
+Scrapfly delivery is **at-least-once**. Use `X-Scrapfly-Webhook-Job-Id` as your idempotency key — duplicates carry the same job UUID.
+
+Retry schedule on non-2xx responses (or timeout):
+
+| Attempt | Delay after previous |
+|---------|----------------------|
+| 1 | initial delivery |
+| 2 | 30 s |
+| 3 | 1 min |
+| 4 | 5 min |
+| 5 | 30 min |
+| 6 | 1 h |
+| 7 | 1 d |
+
+After **100 consecutive failures** Scrapfly automatically **disables** the webhook — no further deliveries are attempted until you re-enable it in the dashboard. Because of this, handlers should:
+
+- Return 2xx as soon as the signature is verified and the job is enqueued.
+- Surface processing errors out-of-band (logs, alerts, dead-letter queue) rather than 5xx-ing back to Scrapfly.
+
 ## Full Event Reference
 
 - [Scrape API webhook](https://scrapfly.io/docs/scrape-api/webhook)
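Review note on the "Delivery & Retries" section added above: at-least-once delivery with the job UUID as idempotency key could look roughly like the sketch below. The function `accept_delivery`, the in-memory `seen` set, and the `enqueue` callback are all hypothetical; a real deployment would presumably use a shared store (database or cache) instead of process memory.

```python
from typing import Callable, Set


def accept_delivery(job_id: str, seen: Set[str], enqueue: Callable[[str], None]) -> bool:
    """Enqueue a job at most once; return True only for a first-time delivery.

    job_id is the value of the X-Scrapfly-Webhook-Job-Id header.
    """
    if job_id in seen:
        # Duplicate delivery: the caller should still answer 2xx, but skip work.
        return False
    # If enqueue raises, job_id is not recorded, so a later retry can succeed.
    enqueue(job_id)
    seen.add(job_id)
    return True
```

In a handler, the return value only decides whether work is scheduled; the response stays 2xx in both cases so that duplicates do not count toward the 100-failure limit.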
diff --git a/skills/scrapfly-webhooks/references/setup.md b/skills/scrapfly-webhooks/references/setup.md
@@ -3,6 +3,7 @@
 ## Prerequisites
 
 - A Scrapfly account ([sign up](https://scrapfly.io))
+- A **paid Scrapfly plan**. Webhooks are not available on the FREE plan — its webhook queue size is 0, so no deliveries are ever dispatched even after configuration. Any paid tier enables delivery.
 - A publicly reachable webhook endpoint URL (use [Hookdeck CLI](https://hookdeck.com/docs/cli) for local development)
 
 ## Create a Webhook in the Scrapfly Dashboard
diff --git a/skills/scrapfly-webhooks/references/verification.md b/skills/scrapfly-webhooks/references/verification.md
@@ -72,6 +72,19 @@ Notes:
 - Use `await request.body()` in FastAPI to get `bytes`. Do not call `await request.json()` before verifying.
 - `hmac.compare_digest` is the documented constant-time comparator.
 
+## Security: Do Not Log the Raw Payload
+
+Scrapfly echoes the webhook signing secret in the body at `context.webhook.secret`. This is unusual compared to other providers and easy to miss.
+
+- **Never** log the raw payload, dump it to stdout in production, or forward it to third-party tools (Sentry, Datadog, Slack, etc.) without redacting `context.webhook.secret` first.
+- If you persist webhooks for replay/debugging, strip or redact `context.webhook.secret` before storage.
+- Anyone with the secret can forge valid signatures for your endpoint.
+
+```javascript
+// Redact before logging / forwarding
+const safe = { ...payload, context: { ...payload.context, webhook: { ...payload.context?.webhook, secret: '[REDACTED]' } } };
+```
+
 ## Common Gotchas
 
 - **Parsed JSON breaks signatures.** Verify against the exact bytes Scrapfly sent. In Express, mount `express.raw({ type: '*/*' })` on the webhook route (not `express.json`). In Next.js App Router, read with `await request.text()`. In FastAPI, use `await request.body()`.
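Review note on the "Do Not Log the Raw Payload" section in this hunk: the FastAPI example has no counterpart to the JavaScript redaction one-liner. A Python equivalent might look like the sketch below. The helper name `redact_webhook_secret` is hypothetical, and the payload shape is assumed to match the `context.webhook.secret` layout documented in `overview.md`.

```python
import copy
from typing import Any, Dict


def redact_webhook_secret(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Return a deep copy of payload with context.webhook.secret masked.

    The original payload is left untouched, so it can still be processed
    after the redacted copy has been logged or forwarded.
    """
    safe = copy.deepcopy(payload)
    context = safe.get("context")
    webhook = context.get("webhook") if isinstance(context, dict) else None
    if isinstance(webhook, dict) and "secret" in webhook:
        webhook["secret"] = "[REDACTED]"
    return safe
```

Unlike the JavaScript spread version, this sketch does not add a `secret` key when the payload has none, and it leaves payloads without a `context` object unchanged.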