diff --git a/skills/scrapfly-webhooks/SKILL.md b/skills/scrapfly-webhooks/SKILL.md index 7bb3266..93b5d55 100644 --- a/skills/scrapfly-webhooks/SKILL.md +++ b/skills/scrapfly-webhooks/SKILL.md @@ -23,6 +23,10 @@ metadata: - How do I handle Crawler API webhook events (`crawler_started`, `crawler_finished`, ...)? - Why is my Scrapfly webhook signature verification failing? +## Prerequisites + +- **A paid Scrapfly plan.** Webhooks are not available on the FREE plan — its webhook queue size is 0, so no deliveries are ever dispatched even after configuration. The dashboard hides the webhook UI on the free tier. Any paid tier enables delivery. See [`references/setup.md`](references/setup.md) for the full plan-detection checklist. + ## How Scrapfly Webhooks Work Scrapfly uses HMAC-SHA256 with **uppercase hex** encoding over the **raw request body**. There is no SDK for webhook verification — implementations follow Scrapfly's documented algorithm. @@ -35,6 +39,8 @@ Key facts: - **No timestamp / replay window**: Scrapfly does not include a timestamp header; treat the signature as authenticity-only. - **Secret**: Use the value from the Scrapfly dashboard exactly as shown. Do not trim or base64-decode it. - **Routing**: Use `X-Scrapfly-Webhook-Resource-Type` (`scrape`, `extraction`, `screenshot`) to dispatch when one endpoint serves multiple products. Crawler events also carry `X-Scrapfly-Crawl-Event-Name` and an `event` field in the body. +- **Content-Type is whatever you configured in the dashboard, not what the body actually is.** Scrapfly's webhook config has a Content-Type dropdown (`application/json` or `application/msgpack`) and sends the chosen value on every delivery — but it doesn't change what's in the body for image deliveries. Screenshot API deliveries carry raw image bytes (JPEG/PNG/WebP/GIF) regardless of the configured Content-Type, so the header is unreliable for that resource type. **Dispatch on `X-Scrapfly-Webhook-Resource-Type`, not on `Content-Type`, and parse only after dispatching.** HMAC verification works fine over any body — only the parse step needs to know whether it's a JSON, msgpack, or binary body. This skill's example handlers assume the dashboard is configured to `application/json`; if you pick msgpack, swap `JSON.parse` / `json.loads` for a msgpack decoder. +- **Hookdeck Event Gateway alternative**: If you're already routing webhooks through Hookdeck (the [hookdeck-event-gateway](https://github.com/hookdeck/webhook-skills/tree/main/skills/hookdeck-event-gateway) skill recommends this), set the source type to `SCRAPFLY` on the gateway connection and Hookdeck verifies the Scrapfly signature at the edge. Your handler then only needs to verify Hookdeck's signature, not Scrapfly's directly. ## Essential Code (USE THIS) @@ -87,12 +93,23 @@ app.post('/webhooks/scrapfly', return res.status(401).send('Invalid signature'); } - // Parse only after verifying - const payload = JSON.parse(req.body.toString()); - console.log(`Scrapfly ${resourceType} webhook (job ${jobId}, id ${webhookId})`); - // Route by resource type for scrape / extraction / screenshot APIs + // CRITICAL: dispatch BEFORE JSON.parse — Screenshot API deliveries carry + // raw image bytes (JPEG/PNG/WebP/GIF) regardless of the Content-Type you + // configured in the Scrapfly dashboard. Content-Type is whatever you + // picked (application/json by default; application/msgpack is also an + // option). JSON.parse on a binary body throws after the signature + // has already verified. + if (resourceType === 'screenshot') { + console.log(`Screenshot received: ${req.body.length} bytes (binary)`); + // req.body is the raw image. Persist it to storage and return 200. + return res.status(200).send('OK'); + } + + // Remaining resource types deliver JSON payloads. + const payload = JSON.parse(req.body.toString()); + switch (resourceType) { case 'scrape': // Scrape API places the fetched URL at result.url; the webhook overlay's @@ -100,10 +117,9 @@ app.post('/webhooks/scrapfly', console.log('Scrape result:', payload.result?.status_code, payload.result?.url); break; case 'extraction': - console.log('Extraction result:', payload.result?.data); - break; - case 'screenshot': - console.log('Screenshot result:', payload.result?.screenshot_url); + // Extraction body shape: { content_type, data: {...}, context: {...} }. + // Extracted fields live at payload.data, NOT payload.result.data. + console.log('Extraction result:', payload.content_type, payload.data); break; default: // Crawler API uses event names in the body diff --git a/skills/scrapfly-webhooks/examples/express/src/index.js b/skills/scrapfly-webhooks/examples/express/src/index.js index c82217f..ee45820 100644 --- a/skills/scrapfly-webhooks/examples/express/src/index.js +++ b/skills/scrapfly-webhooks/examples/express/src/index.js @@ -59,6 +59,30 @@ app.post('/webhooks/scrapfly', return res.status(401).send('Invalid signature'); } + console.log(`Scrapfly webhook (id=${webhookId} resource=${resourceType} job=${jobId})`); + + // Dispatch BEFORE JSON parsing — screenshot deliveries are raw image + // bytes (JPEG / PNG / WebP / GIF) regardless of the Content-Type you + // configured in the Scrapfly dashboard (json or msgpack — both apply + // verbatim to scrape/extraction deliveries, but screenshot bodies are + // binary either way). Trying to JSON.parse a binary body throws after + // verification has already succeeded. + if (resourceType === 'screenshot') { + console.log(`Screenshot received: ${req.body.length} bytes (binary, ${jobId})`); + // req.body is a Buffer of the rendered image. Persist it to storage + // (S3, disk, ImgIX, etc.) keyed on jobId / webhookId. Do not call + // JSON.parse here. The synchronous Screenshot API response exposes + // metadata in headers like X-Scrapfly-Screenshot-Url, but the + // webhook delivery only carries the image bytes — fetch the metadata + // from the synchronous response or the Scrapfly dashboard if needed. + return res.status(200).send('OK'); + } + + // Remaining resource types are serialised per your dashboard's + // Content-Type setting. This handler assumes `application/json`; if + // you configured `application/msgpack`, swap JSON.parse for a msgpack + // decoder (e.g. `@msgpack/msgpack`). + let payload; try { payload = JSON.parse(req.body.toString('utf8')); @@ -67,9 +91,7 @@ app.post('/webhooks/scrapfly', return res.status(400).send('Invalid JSON payload'); } - console.log(`Scrapfly webhook (id=${webhookId} resource=${resourceType} job=${jobId})`); - - // Route by resource type for the Scrape / Extraction / Screenshot APIs. + // Route by resource type for the Scrape / Extraction APIs. switch (resourceType) { case 'scrape': // Scrape API places the fetched URL at result.url (see scrapfly.io/docs/scrape-api/getting-started). @@ -82,15 +104,16 @@ app.post('/webhooks/scrapfly', break; case 'extraction': - console.log('Extraction result:', payload?.result?.data); + // Extraction body shape (from a real capture): + // { content_type, data: { ... }, context: { webhook, job } } + // The extracted fields live at payload.data, NOT payload.result.data. + console.log('Extraction result:', { + content_type: payload?.content_type, + data: payload?.data, + }); // TODO: Save structured data, trigger downstream enrichment break; - case 'screenshot': - console.log('Screenshot result URL:', payload?.result?.screenshot_url); - // TODO: Store image, generate thumbnail, notify user - break; - default: { // Crawler API uses lifecycle events in the body and an // X-Scrapfly-Crawl-Event-Name header. diff --git a/skills/scrapfly-webhooks/examples/express/test/webhook.test.js b/skills/scrapfly-webhooks/examples/express/test/webhook.test.js index 08be77d..b648734 100644 --- a/skills/scrapfly-webhooks/examples/express/test/webhook.test.js +++ b/skills/scrapfly-webhooks/examples/express/test/webhook.test.js @@ -121,8 +121,13 @@ describe('Scrapfly Webhook Endpoint', () => { expect(res.text).toBe('OK'); }); - it('returns 200 for a valid extraction webhook', async () => { - const body = JSON.stringify({ result: { data: { title: 'Test' } } }); + it('returns 200 for a valid extraction webhook (data lives at payload.data, not payload.result.data)', async () => { + // Real extraction body shape from a live capture: + // { content_type, data: {...}, context: {...} } + const body = JSON.stringify({ + content_type: 'application/json', + data: { price: '$19.99', product_name: 'Widget', stock: 'In stock' }, + }); const sig = generateScrapflySignature(Buffer.from(body), secret); const res = await request(app) @@ -135,22 +140,50 @@ describe('Scrapfly Webhook Endpoint', () => { expect(res.status).toBe(200); }); - it('returns 200 for a valid screenshot webhook', async () => { - const body = JSON.stringify({ - result: { screenshot_url: 'https://scrapfly.io/screenshots/abc.png' }, - }); - const sig = generateScrapflySignature(Buffer.from(body), secret); + it('returns 200 for a screenshot webhook with a BINARY body (not JSON)', async () => { + // Scrapfly Screenshot deliveries carry raw image bytes (JPEG/PNG/WebP/GIF), + // NOT JSON — even though Scrapfly lies in the Content-Type header by + // sending `application/json` for screenshot payloads (upstream quirk). + // The handler must dispatch on the resource type header BEFORE JSON.parse, + // otherwise the parse blows up after signature verification has already + // succeeded. + // + // Test note: we use `application/octet-stream` here so supertest doesn't + // auto-JSON-stringify the Buffer. The handler ignores Content-Type and + // reads the raw body via express.raw({ type: '*/*' }), so this is + // equivalent to the real on-wire scenario as far as the handler's + // dispatch logic is concerned. + // + // Minimal 1×1 JPEG (12 bytes — JFIF/SOI header is enough; we never + // decode the image, just verify the handler tolerates binary). + const binaryBody = Buffer.from([ + 0xff, 0xd8, 0xff, 0xe0, 0x00, 0x10, 0x4a, 0x46, 0x49, 0x46, 0x00, 0x01, + ]); + const sig = generateScrapflySignature(binaryBody, secret); const res = await request(app) .post('/webhooks/scrapfly') - .set('Content-Type', 'application/json') + .set('Content-Type', 'application/octet-stream') .set('X-Scrapfly-Webhook-Signature', sig) .set('X-Scrapfly-Webhook-Resource-Type', 'screenshot') - .send(body); + .send(binaryBody); expect(res.status).toBe(200); }); + it('rejects a screenshot delivery with an invalid signature', async () => { + const binaryBody = Buffer.from([0xff, 0xd8, 0xff, 0xe0, 0x00, 0x10]); + + const res = await request(app) + .post('/webhooks/scrapfly') + .set('Content-Type', 'application/octet-stream') + .set('X-Scrapfly-Webhook-Signature', 'AABBCCDD') + .set('X-Scrapfly-Webhook-Resource-Type', 'screenshot') + .send(binaryBody); + + expect(res.status).toBe(401); + }); + const crawlerEvents = [ 'crawler_started', 'crawler_url_visited', diff --git a/skills/scrapfly-webhooks/examples/fastapi/main.py b/skills/scrapfly-webhooks/examples/fastapi/main.py index 9fc93b6..c3680e2 100644 --- a/skills/scrapfly-webhooks/examples/fastapi/main.py +++ b/skills/scrapfly-webhooks/examples/fastapi/main.py @@ -71,12 +71,6 @@ async def scrapfly_webhook( print("ERROR: Scrapfly webhook signature verification failed") raise HTTPException(status_code=401, detail="Invalid signature") - try: - payload = json.loads(raw_body.decode("utf-8")) - except json.JSONDecodeError as exc: - print(f"ERROR: Failed to parse Scrapfly webhook payload: {exc}") - raise HTTPException(status_code=400, detail="Invalid JSON payload") - print( f"Scrapfly webhook (id={x_scrapfly_webhook_id} " f"resource={x_scrapfly_webhook_resource_type} job={x_scrapfly_webhook_job_id})" @@ -84,6 +78,35 @@ async def scrapfly_webhook( resource_type = x_scrapfly_webhook_resource_type + # Dispatch BEFORE JSON parsing — screenshot deliveries are raw image + # bytes (JPEG / PNG / WebP / GIF) regardless of the Content-Type you + # configured in the Scrapfly dashboard (json or msgpack — both apply + # verbatim to scrape/extraction deliveries, but screenshot bodies are + # binary either way). Trying to json.loads a binary body raises + # JSONDecodeError after the signature has already verified. + if resource_type == "screenshot": + print( + f"Screenshot received: {len(raw_body)} bytes " + f"(binary, {x_scrapfly_webhook_job_id})" + ) + # raw_body is the rendered image bytes. Persist it to storage + # (S3, disk, etc.) keyed on job_id / webhook_id. Do not call + # json.loads here. The synchronous Screenshot API response + # exposes metadata in headers like X-Scrapfly-Screenshot-Url; + # the webhook delivery only carries the image bytes. + return {"received": True} + + # Remaining resource types are serialised per your dashboard's + # Content-Type setting. This handler assumes `application/json`; if + # you configured `application/msgpack`, swap json.loads for a msgpack + # decoder (e.g. the `msgpack` package). + + try: + payload = json.loads(raw_body.decode("utf-8")) + except json.JSONDecodeError as exc: + print(f"ERROR: Failed to parse Scrapfly webhook payload: {exc}") + raise HTTPException(status_code=400, detail="Invalid JSON payload") + if resource_type == "scrape": # Scrape API places the fetched URL at result.url. The webhook overlay's # payload["context"] only carries `webhook` and `job` sub-objects. @@ -91,11 +114,14 @@ async def scrapfly_webhook( print(f"Scrape result: url={result.get('url')} status={result.get('status_code')}") # TODO: Persist HTML / extracted fields, enqueue parsing elif resource_type == "extraction": - print(f"Extraction result: {payload.get('result', {}).get('data')}") + # Extraction body shape (from a real capture): + # { content_type, data: { ... }, context: { webhook, job } } + # The extracted fields live at payload["data"], NOT payload["result"]["data"]. + print( + f"Extraction result: content_type={payload.get('content_type')} " + f"data={payload.get('data')}" + ) # TODO: Save structured data, trigger enrichment - elif resource_type == "screenshot": - print(f"Screenshot URL: {payload.get('result', {}).get('screenshot_url')}") - # TODO: Store image, generate thumbnail else: # Crawler API uses lifecycle events in the body. event = x_scrapfly_crawl_event_name or payload.get("event") diff --git a/skills/scrapfly-webhooks/examples/fastapi/test_webhook.py b/skills/scrapfly-webhooks/examples/fastapi/test_webhook.py index 20271cd..5042605 100644 --- a/skills/scrapfly-webhooks/examples/fastapi/test_webhook.py +++ b/skills/scrapfly-webhooks/examples/fastapi/test_webhook.py @@ -105,12 +105,19 @@ def test_valid_scrape_webhook(self, client, secret): assert response.status_code == 200 assert response.json() == {"received": True} - @pytest.mark.parametrize( - "resource_type", - ["scrape", "extraction", "screenshot"], - ) - def test_resource_types(self, client, secret, resource_type): - body = json.dumps({"result": {"status_code": 200}}).encode("utf-8") + def test_extraction_payload_data_at_top_level(self, client, secret): + """Real extraction body shape: { content_type, data: {...} } — data lives + at payload.data, NOT payload.result.data.""" + body = json.dumps( + { + "content_type": "application/json", + "data": { + "price": "$19.99", + "product_name": "Widget", + "stock": "In stock", + }, + } + ).encode("utf-8") sig = generate_scrapfly_signature(body, secret) response = client.post( @@ -119,11 +126,48 @@ def test_resource_types(self, client, secret, resource_type): headers={ "Content-Type": "application/json", "X-Scrapfly-Webhook-Signature": sig, - "X-Scrapfly-Webhook-Resource-Type": resource_type, + "X-Scrapfly-Webhook-Resource-Type": "extraction", + }, + ) + assert response.status_code == 200 + + def test_screenshot_binary_body(self, client, secret): + """Scrapfly Screenshot deliveries carry raw image bytes (JPEG / PNG / + WebP / GIF), NOT JSON — even though Content-Type says application/json + (upstream Scrapfly quirk). The handler must dispatch on the resource + type header BEFORE json.loads to avoid raising JSONDecodeError after + signature verification has already succeeded.""" + binary_body = bytes( + [0xFF, 0xD8, 0xFF, 0xE0, 0x00, 0x10, 0x4A, 0x46, 0x49, 0x46, 0x00, 0x01] + ) + sig = generate_scrapfly_signature(binary_body, secret) + + response = client.post( + "/webhooks/scrapfly", + content=binary_body, + headers={ + # Scrapfly lies here — header says JSON but the body is binary. + "Content-Type": "application/json", + "X-Scrapfly-Webhook-Signature": sig, + "X-Scrapfly-Webhook-Resource-Type": "screenshot", }, ) assert response.status_code == 200 + def test_screenshot_rejects_invalid_signature(self, client, secret): + binary_body = bytes([0xFF, 0xD8, 0xFF, 0xE0]) + + response = client.post( + "/webhooks/scrapfly", + content=binary_body, + headers={ + "Content-Type": "application/json", + "X-Scrapfly-Webhook-Signature": "AABBCCDD", + "X-Scrapfly-Webhook-Resource-Type": "screenshot", + }, + ) + assert response.status_code == 401 + @pytest.mark.parametrize( "event", [ diff --git a/skills/scrapfly-webhooks/examples/nextjs/app/webhooks/scrapfly/route.ts b/skills/scrapfly-webhooks/examples/nextjs/app/webhooks/scrapfly/route.ts index e30b4f1..ed215e4 100644 --- a/skills/scrapfly-webhooks/examples/nextjs/app/webhooks/scrapfly/route.ts +++ b/skills/scrapfly-webhooks/examples/nextjs/app/webhooks/scrapfly/route.ts @@ -11,7 +11,7 @@ import { createHmac, timingSafeEqual } from 'crypto'; * Header: X-Scrapfly-Webhook-Signature (uppercase hex) */ function verifyScrapflySignature( - rawBody: string, + rawBody: Buffer, signatureHeader: string | null, secret: string ): boolean { @@ -39,9 +39,10 @@ function verifyScrapflySignature( } export async function POST(request: NextRequest) { - // Read raw body as text — JSON.parse + re-stringify would change the bytes - // and break the signature. - const rawBody = await request.text(); + // Read raw body as bytes — screenshot deliveries are raw image bytes + // (JPEG / PNG / WebP / GIF), not text. Using request.text() would + // decode the bytes as UTF-8 and corrupt the body for verification. + const rawBody = Buffer.from(await request.arrayBuffer()); const signature = request.headers.get('x-scrapfly-webhook-signature'); const resourceType = request.headers.get('x-scrapfly-webhook-resource-type'); @@ -63,16 +64,37 @@ export async function POST(request: NextRequest) { return NextResponse.json({ error: 'Invalid signature' }, { status: 401 }); } + console.log(`Scrapfly webhook (id=${webhookId} resource=${resourceType} job=${jobId})`); + + // Dispatch BEFORE JSON parsing — screenshot deliveries are raw image + // bytes (JPEG / PNG / WebP / GIF) regardless of the Content-Type you + // configured in the Scrapfly dashboard (json or msgpack — both apply + // verbatim to scrape/extraction deliveries, but screenshot bodies are + // binary either way). Trying to JSON.parse a binary body throws after + // verification has already succeeded. + if (resourceType === 'screenshot') { + console.log(`Screenshot received: ${rawBody.length} bytes (binary, ${jobId})`); + // rawBody is the raw image bytes. Persist it to storage keyed on + // jobId / webhookId. Do not call JSON.parse here. The synchronous + // Screenshot API response exposes metadata in headers like + // X-Scrapfly-Screenshot-Url; the webhook delivery only carries the + // image bytes. + return NextResponse.json({ received: true }); + } + + // Remaining resource types are serialised per your dashboard's + // Content-Type setting. This handler assumes `application/json`; if + // you configured `application/msgpack`, swap JSON.parse for a msgpack + // decoder (e.g. `@msgpack/msgpack`). + let payload: any; try { - payload = JSON.parse(rawBody); + payload = JSON.parse(rawBody.toString('utf8')); } catch (err) { console.error('Failed to parse Scrapfly webhook payload:', err); return NextResponse.json({ error: 'Invalid JSON payload' }, { status: 400 }); } - console.log(`Scrapfly webhook (id=${webhookId} resource=${resourceType} job=${jobId})`); - switch (resourceType) { case 'scrape': // Scrape API places the fetched URL at result.url. The webhook overlay's @@ -84,11 +106,13 @@ export async function POST(request: NextRequest) { break; case 'extraction': - console.log('Extraction result:', payload?.result?.data); - break; - - case 'screenshot': - console.log('Screenshot result URL:', payload?.result?.screenshot_url); + // Extraction body shape (from a real capture): + // { content_type, data: { ... }, context: { webhook, job } } + // The extracted fields live at payload.data, NOT payload.result.data. + console.log('Extraction result:', { + content_type: payload?.content_type, + data: payload?.data, + }); break; default: { diff --git a/skills/scrapfly-webhooks/examples/nextjs/test/webhook.test.ts b/skills/scrapfly-webhooks/examples/nextjs/test/webhook.test.ts index 7cc77a8..46fe409 100644 --- a/skills/scrapfly-webhooks/examples/nextjs/test/webhook.test.ts +++ b/skills/scrapfly-webhooks/examples/nextjs/test/webhook.test.ts @@ -5,11 +5,11 @@ import { POST } from '../app/webhooks/scrapfly/route'; const secret = process.env.SCRAPFLY_WEBHOOK_SECRET!; -function generateScrapflySignature(rawBody: string, secret: string): string { +function generateScrapflySignature(rawBody: string | Buffer | Uint8Array, secret: string): string { return createHmac('sha256', secret).update(rawBody).digest('hex').toUpperCase(); } -function makeRequest(body: string, headers: Record = {}): NextRequest { +function makeRequest(body: string | Buffer | Uint8Array, headers: Record = {}): NextRequest { return new NextRequest('http://localhost:3000/webhooks/scrapfly', { method: 'POST', headers: { 'Content-Type': 'application/json', ...headers }, @@ -72,17 +72,36 @@ describe('Scrapfly Webhook Endpoint (Next.js)', () => { expect(await res.json()).toEqual({ received: true }); }); - it.each([ - 'scrape', - 'extraction', - 'screenshot', - ])('returns 200 for resource type %s', async (resourceType) => { - const body = JSON.stringify({ result: { status_code: 200 } }); + it('returns 200 for a valid extraction webhook (data lives at payload.data, not payload.result.data)', async () => { + // Real extraction body shape from a live capture: + // { content_type, data: {...}, context: {...} } + const body = JSON.stringify({ + content_type: 'application/json', + data: { price: '$19.99', product_name: 'Widget', stock: 'In stock' }, + }); const sig = generateScrapflySignature(body, secret); const res = await POST(makeRequest(body, { 'X-Scrapfly-Webhook-Signature': sig, - 'X-Scrapfly-Webhook-Resource-Type': resourceType, + 'X-Scrapfly-Webhook-Resource-Type': 'extraction', + })); + + expect(res.status).toBe(200); + }); + + it('returns 200 for a screenshot webhook with a BINARY body (not JSON)', async () => { + // Scrapfly Screenshot deliveries carry raw image bytes (JPEG/PNG/WebP/GIF), + // NOT JSON — even though Content-Type says application/json (upstream + // Scrapfly quirk). The handler must dispatch on the resource type header + // BEFORE JSON.parse to avoid blowing up after signature verification. + const binaryBody = Buffer.from([ + 0xff, 0xd8, 0xff, 0xe0, 0x00, 0x10, 0x4a, 0x46, 0x49, 0x46, 0x00, 0x01, + ]); + const sig = generateScrapflySignature(binaryBody, secret); + + const res = await POST(makeRequest(binaryBody, { + 'X-Scrapfly-Webhook-Signature': sig, + 'X-Scrapfly-Webhook-Resource-Type': 'screenshot', })); expect(res.status).toBe(200); diff --git a/skills/scrapfly-webhooks/references/setup.md b/skills/scrapfly-webhooks/references/setup.md index 62e98b0..4808cef 100644 --- a/skills/scrapfly-webhooks/references/setup.md +++ b/skills/scrapfly-webhooks/references/setup.md @@ -14,7 +14,8 @@ 4. Fill in: - **Name** — A short identifier. You will pass this as `webhook_name=` on API calls. Names are scoped per project + environment. - **URL** — Your endpoint, e.g. `https://your-app.example.com/webhooks/scrapfly`. - - (Optional) Environment / project scoping. + - **Content Type** — Pick `application/json` (the default; matches this skill's example handlers) or `application/msgpack`. Scrapfly sends this value verbatim on every delivery. It applies to Scrape and Extraction bodies; Screenshot deliveries are raw image bytes either way (see [references/verification.md](verification.md) for details). If you pick `application/msgpack`, swap the JSON parser in the scrape/extraction branches of your handler for a msgpack decoder. + - (Optional) **Concurrency Limit** and environment / project scoping. 5. Save the webhook. Scrapfly will display a **signing secret** — copy it. The dashboard is the only place this secret is shown. ## Configure the Signing Secret in Your App diff --git a/skills/scrapfly-webhooks/references/verification.md b/skills/scrapfly-webhooks/references/verification.md index e8994d7..8b87594 100644 --- a/skills/scrapfly-webhooks/references/verification.md +++ b/skills/scrapfly-webhooks/references/verification.md @@ -87,13 +87,23 @@ const safe = { ...payload, context: { ...payload.context, webhook: { ...payload. ## Common Gotchas -- **Parsed JSON breaks signatures.** Verify against the exact bytes Scrapfly sent. In Express, mount `express.raw({ type: '*/*' })` on the webhook route (not `express.json`). In Next.js App Router, read with `await request.text()`. In FastAPI, use `await request.body()`. +- **Parsed JSON breaks signatures.** Verify against the exact bytes Scrapfly sent. In Express, mount `express.raw({ type: '*/*' })` on the webhook route (not `express.json`). In Next.js App Router, read with `await request.arrayBuffer()` and wrap with `Buffer.from(...)` — do **not** use `await request.text()`, which UTF-8-decodes the bytes (corrupts binary screenshot bodies). In FastAPI, use `await request.body()`. +- **Screenshot deliveries are binary regardless of the Content-Type you configured.** Scrapfly's webhook config has a Content-Type dropdown (`application/json` or `application/msgpack`); whichever you pick is sent on every delivery as the `Content-Type` header. For Scrape and Extraction API deliveries the body shape matches what you configured (JSON or msgpack). For Screenshot API deliveries the body is raw image bytes (JPEG / PNG / WebP / GIF) no matter what's in the header — an upstream Scrapfly quirk: the configured Content-Type is sent verbatim, but the screenshot delivery is the binary image and never actually a serialised object. **Dispatch on `X-Scrapfly-Webhook-Resource-Type`, not on `Content-Type`, and only parse the body after dispatching.** Verification works fine over any body — only the parse step needs to know what to expect. The example handlers in this skill follow that pattern: signature check → resource-type dispatch → JSON parse only for scrape / extraction / crawler. +- **JSON vs msgpack parsing.** The example handlers in this skill assume the Scrapfly dashboard is configured to send `application/json`. If you configured `application/msgpack` instead, swap the parse step in the scrape / extraction / crawler branches for a msgpack decoder (e.g. [`@msgpack/msgpack`](https://www.npmjs.com/package/@msgpack/msgpack) on Node, [`msgpack`](https://pypi.org/project/msgpack/) on Python). The signature check, the binary screenshot path, and the rest of the handler don't change. - **Case of the hex digest.** Scrapfly's primary header is uppercase, but the `-Lowercase` variant exists for a reason. Always normalise both sides before comparing (the snippets above use `.toUpperCase()` / `.upper()`). - **Header casing in HTTP frameworks.** HTTP header names are case-insensitive. Express lowercases everything; Next.js's `headers.get(...)` is also case-insensitive. Read `x-scrapfly-webhook-signature`. - **No timestamp tolerance.** Don't reject for old timestamps — there isn't one. If you need replay protection, dedupe on `X-Scrapfly-Webhook-Id`. - **Secret format.** Use the dashboard string verbatim. There is no `whsec_` prefix to strip and no base64 decode step. - **Body encoding.** The HMAC is over bytes, not text. Avoid any middleware that transforms encoding (gzip middleware, BOM strippers, etc.) on the route. +## Alternative: Verify at the Gateway with Hookdeck + +If your handlers sit behind Hookdeck Event Gateway, you can offload Scrapfly signature verification to the gateway and have your handler verify only Hookdeck's signature downstream. Hookdeck has a built-in `SCRAPFLY` source type that knows the exact algorithm (uppercase hex HMAC-SHA256, dual-case headers, raw-body) and will reject invalid deliveries at the edge. + +Set up by creating a Hookdeck source with `--type SCRAPFLY` and the same signing secret you configured in Scrapfly, then have your handler verify `x-hookdeck-signature` instead of `X-Scrapfly-Webhook-Signature`. See the [hookdeck-event-gateway](https://github.com/hookdeck/webhook-skills/tree/main/skills/hookdeck-event-gateway) and [hookdeck-event-gateway-webhooks](https://github.com/hookdeck/webhook-skills/tree/main/skills/hookdeck-event-gateway-webhooks) skills for the downstream verification pattern. + +A known caveat (May 2026): if your Scrapfly dashboard is configured to send `application/json` (the default), Hookdeck rejects Scrapfly screenshot deliveries with `UNPARSABLE_JSON` because the configured Content-Type doesn't match the binary screenshot body. This is upstream Scrapfly behaviour: the Content-Type dropdown applies to every delivery, but screenshot bodies are binary regardless. Pending resolution, route screenshot deliveries directly to your handler without the gateway preset, or configure the webhook to send `application/msgpack` if you can use a parser that tolerates binary bodies on the screenshot path. + ## Debugging Verification Failures 1. **Log both signatures side-by-side** (the computed expected and the received header) — they should be identical, byte for byte, after normalising case.