Commit 96f2c4b

fix: address scrapfly review feedback + integrate into README and providers.yaml
- Correct X-Scrapfly-Webhook-Env header values (test/live, not production)
- Document actual payload envelope (context.webhook, context.job)
- Warn that payload echoes the signing secret at context.webhook.secret
- Distinguish default vs opt-in crawler events
- Add retry schedule and 100-failure auto-disable behavior
- Note paid-plan requirement (FREE plan has webhook queue size 0)
- Read scrape URL from payload.result.url (not the webhook context overlay)
- Add Scrapfly row to README Provider Skills table
- Add scrapfly entry to providers.yaml (docs URLs, notes, testScenario)

https://claude.ai/code/session_01NNTgQRJss1V7gyzzJ9rjnB
1 parent ceb1a32 commit 96f2c4b

12 files changed

Lines changed: 171 additions & 24 deletions

README.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -53,6 +53,7 @@ Skills for receiving and verifying webhooks from specific providers. Each includ
5353
| Postmark | [`postmark-webhooks`](skills/postmark-webhooks/) | Authenticate Postmark webhooks (Basic Auth/Token), handle email delivery, bounce, open, click, and spam events |
5454
| Replicate | [`replicate-webhooks`](skills/replicate-webhooks/) | Verify Replicate webhook signatures, handle ML prediction lifecycle events |
5555
| Resend | [`resend-webhooks`](skills/resend-webhooks/) | Verify Resend webhook signatures, handle email delivery and bounce events |
56+
| Scrapfly | [`scrapfly-webhooks`](skills/scrapfly-webhooks/) | Verify Scrapfly webhook signatures (HMAC-SHA256, uppercase/lowercase hex), dispatch scrape, extraction, and screenshot jobs |
5657
| SendGrid | [`sendgrid-webhooks`](skills/sendgrid-webhooks/) | Verify SendGrid webhook signatures (ECDSA), handle email delivery events |
5758
| Shopify | [`shopify-webhooks`](skills/shopify-webhooks/) | Verify Shopify HMAC signatures, handle order and product webhook events |
5859
| Slack | [`slack-webhooks`](skills/slack-webhooks/) | Verify Slack Events API signatures (HMAC-SHA256, `X-Slack-Signature`), handle message, app_mention, and reaction events |

providers.yaml

Lines changed: 55 additions & 0 deletions
@@ -465,6 +465,61 @@ providers:
465465
- Bounce
466466
- Delivery
467467

468+
- name: scrapfly
469+
displayName: Scrapfly
470+
docs:
471+
scrape_webhook: https://scrapfly.io/docs/scrape-api/webhook
472+
extraction_webhook: https://scrapfly.io/docs/extraction-api/webhook
473+
screenshot_webhook: https://scrapfly.io/docs/screenshot-api/webhook
474+
scrape_getting_started: https://scrapfly.io/docs/scrape-api/getting-started
475+
extraction_getting_started: https://scrapfly.io/docs/extraction-api/getting-started
476+
screenshot_getting_started: https://scrapfly.io/docs/screenshot-api/getting-started
477+
notes: >
478+
Web-scraping API platform with three products that share a single async-job +
479+
webhook system: Scrape API, Extraction API, Screenshot API. One webhook URL
480+
registered in the dashboard (https://scrapfly.io/dashboard/webhook) receives
481+
deliveries from all three products. PAID PLAN REQUIRED (first paid tier).
482+
483+
No API exists for creating/updating/deleting webhooks programmatically. The
484+
destination URL CANNOT be passed per-call. Instead, each API call references
485+
an already-registered webhook by name via the `webhook_name` query parameter
486+
(e.g. `…/scrape?…&webhook_name=samples-capture`).
487+
488+
Signature verification: HMAC-SHA256 over the RAW request body bytes (do not
489+
JSON.parse and re-stringify — that changes the byte sequence). Compare against
490+
either `X-Scrapfly-Webhook-Signature` (uppercase hex) or
491+
`X-Scrapfly-Webhook-Signature-Lowercase` (lowercase hex) using constant-time
492+
equality. The secret is per-webhook, displayed in the dashboard alongside the
493+
webhook configuration (NOT the account API key).
494+
495+
Dispatch by `X-Scrapfly-Webhook-Resource-Type` header (one of `scrape`,
496+
`extraction`, `screenshot`). Other headers: `X-Scrapfly-Webhook-Job-Id` (UUID,
497+
use as idempotency key for at-least-once delivery), `X-Scrapfly-Webhook-Env`
498+
(`test`|`live`), `X-Scrapfly-Webhook-Project`, `X-Scrapfly-Webhook-Name`,
499+
`X-Scrapfly-Webhook-Id`, optional `X-Scrapfly-Log-Uuid`/`X-Scrapfly-Log-Url`.
500+
501+
No timestamp/replay envelope (unlike Stripe). Recommend idempotency by job-id;
502+
do NOT invent a `t=…` window.
503+
504+
Payload = the full response body of the corresponding API plus a `context`
505+
overlay: `context.webhook` (`{ name, secret, consecutive_failed_count, … }` —
506+
WARN handlers: `secret` field exposes the signing secret in the payload, do
507+
not log or echo) and `context.job` (`{ uuid, … }`). Product-specific shapes
508+
documented in the getting-started pages above.
509+
510+
Delivery: retry 30s → 1min → 5min → 30min → 1h → 1d. A webhook is DISABLED
511+
after 100 consecutive failures — handlers should return 2xx fast and surface
512+
errors out-of-band.
513+
514+
No official SDK construct for verification (plain HMAC is correct). Do NOT
515+
pull in a third-party HMAC library; use the stdlib (`crypto.createHmac` in
516+
Node, `hmac` / `hashlib` in Python).
517+
testScenario:
518+
events:
519+
- scrape
520+
- extraction
521+
- screenshot
522+
468523
- name: sendgrid
469524
displayName: SendGrid
470525
docs:
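
The verification rule described in the `providers.yaml` notes above (HMAC-SHA256 over the raw body bytes, compared in constant time against either hex header) can be sketched with the Python standard library. This is a minimal illustration and not the repository's implementation: the function name `verify_scrapfly_signature` is hypothetical, and using the secret's UTF-8 bytes as the HMAC key is an assumption.

```python
import hashlib
import hmac


def verify_scrapfly_signature(raw_body: bytes, headers: dict, secret: str) -> bool:
    """Return True if a signature header matches HMAC-SHA256 of the raw body.

    Assumption: the per-webhook secret is used as UTF-8 bytes for the HMAC key.
    """
    digest = hmac.new(secret.encode("utf-8"), raw_body, hashlib.sha256).hexdigest()

    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    upper_sig = lowered.get("x-scrapfly-webhook-signature")
    lower_sig = lowered.get("x-scrapfly-webhook-signature-lowercase")

    # Compare as bytes with constant-time equality; never use == on signatures.
    if upper_sig is not None:
        return hmac.compare_digest(digest.upper().encode(), upper_sig.encode())
    if lower_sig is not None:
        return hmac.compare_digest(digest.encode(), lower_sig.encode())
    return False
```

The function takes the body as `bytes` on purpose: it should be handed the bytes exactly as received, before any JSON parsing, since re-serialising the payload changes the byte sequence and invalidates the signature.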

skills/scrapfly-webhooks/SKILL.md

Lines changed: 4 additions & 2 deletions
@@ -95,7 +95,9 @@ app.post('/webhooks/scrapfly',
9595
// Route by resource type for scrape / extraction / screenshot APIs
9696
switch (resourceType) {
9797
case 'scrape':
98-
console.log('Scrape result:', payload.result?.status_code, payload.context?.url);
98+
// Scrape API places the fetched URL at result.url; the webhook overlay's
99+
// context only carries `webhook` and `job` sub-objects.
100+
console.log('Scrape result:', payload.result?.status_code, payload.result?.url);
99101
break;
100102
case 'extraction':
101103
console.log('Extraction result:', payload.result?.data);
@@ -177,7 +179,7 @@ Crawler API webhooks carry an `event` string in the body (also exposed as `X-Scr
177179
| `X-Scrapfly-Webhook-Name` | Name of the configured webhook |
178180
| `X-Scrapfly-Webhook-Resource-Type` | `scrape`, `extraction`, or `screenshot` |
179181
| `X-Scrapfly-Webhook-Job-Id` | Unique job identifier (use for reconciliation) |
180-
| `X-Scrapfly-Webhook-Env` | Environment (e.g. `production`) |
182+
| `X-Scrapfly-Webhook-Env` | Environment (`test` or `live`) |
181183
| `X-Scrapfly-Webhook-Project` | Project name |
182184
| `X-Scrapfly-Crawl-Event-Name` | Crawler API event name (e.g. `crawler_finished`) |
183185
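
Because the payload echoes the signing secret at `context.webhook.secret`, a handler that logs payloads may want to redact that field first. The following is a rough sketch under that assumption; the helper names `redact_webhook_secret` and `summarize_scrape` are hypothetical, and the payload shape is taken from the test fixtures in this commit rather than from an official schema.

```python
import copy


def redact_webhook_secret(payload: dict) -> dict:
    """Return a deep copy of the payload with the signing secret masked."""
    safe = copy.deepcopy(payload)
    context = safe.get("context")
    webhook = context.get("webhook") if isinstance(context, dict) else None
    if isinstance(webhook, dict) and "secret" in webhook:
        webhook["secret"] = "[REDACTED]"
    return safe


def summarize_scrape(payload: dict) -> dict:
    """Pick the fields worth logging for a `scrape` delivery.

    The fetched URL is read from result.url, as in the handlers in this commit.
    """
    result = payload.get("result") or {}
    job = (payload.get("context") or {}).get("job") or {}
    return {
        "url": result.get("url"),
        "status": result.get("status_code"),
        "job_uuid": job.get("uuid"),
    }
```

Logging `summarize_scrape(payload)` instead of the whole payload avoids touching the secret at all; `redact_webhook_secret` is only needed when the full body must be stored or printed.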

skills/scrapfly-webhooks/examples/express/src/index.js

Lines changed: 3 additions & 1 deletion
@@ -72,8 +72,10 @@ app.post('/webhooks/scrapfly',
7272
// Route by resource type for the Scrape / Extraction / Screenshot APIs.
7373
switch (resourceType) {
7474
case 'scrape':
75+
// Scrape API places the fetched URL at result.url (see scrapfly.io/docs/scrape-api/getting-started).
76+
// The webhook overlay's payload.context only carries `webhook` and `job` sub-objects.
7577
console.log('Scrape result:', {
76-
url: payload?.context?.url,
78+
url: payload?.result?.url,
7779
status: payload?.result?.status_code,
7880
});
7981
// TODO: Persist HTML / extracted fields, enqueue parsing, ...

skills/scrapfly-webhooks/examples/express/test/webhook.test.js

Lines changed: 9 additions & 2 deletions
@@ -96,8 +96,15 @@ describe('Scrapfly Webhook Endpoint', () => {
9696

9797
it('returns 200 for a valid scrape webhook', async () => {
9898
const body = JSON.stringify({
99-
context: { url: 'https://web-scraping.dev/products' },
100-
result: { status_code: 200, content: '<html></html>' },
99+
result: {
100+
url: 'https://web-scraping.dev/products',
101+
status_code: 200,
102+
content: '<html></html>',
103+
},
104+
context: {
105+
webhook: { name: 'my-webhook', secret: 'test_scrapfly_signing_secret', consecutive_failed_count: 0 },
106+
job: { uuid: '550e8400-e29b-41d4-a716-446655440000' },
107+
},
101108
});
102109
const sig = generateScrapflySignature(Buffer.from(body), secret);
103110

skills/scrapfly-webhooks/examples/fastapi/main.py

Lines changed: 3 additions & 2 deletions
@@ -85,9 +85,10 @@ async def scrapfly_webhook(
8585
resource_type = x_scrapfly_webhook_resource_type
8686

8787
if resource_type == "scrape":
88+
# Scrape API places the fetched URL at result.url. The webhook overlay's
89+
# payload["context"] only carries `webhook` and `job` sub-objects.
8890
result = payload.get("result", {})
89-
context = payload.get("context", {})
90-
print(f"Scrape result: url={context.get('url')} status={result.get('status_code')}")
91+
print(f"Scrape result: url={result.get('url')} status={result.get('status_code')}")
9192
# TODO: Persist HTML / extracted fields, enqueue parsing
9293
elif resource_type == "extraction":
9394
print(f"Extraction result: {payload.get('result', {}).get('data')}")

skills/scrapfly-webhooks/examples/fastapi/test_webhook.py

Lines changed: 13 additions & 2 deletions
@@ -74,8 +74,19 @@ def test_tampered_body(self, client, secret):
7474
def test_valid_scrape_webhook(self, client, secret):
7575
body = json.dumps(
7676
{
77-
"context": {"url": "https://web-scraping.dev/products"},
78-
"result": {"status_code": 200, "content": "<html></html>"},
77+
"result": {
78+
"url": "https://web-scraping.dev/products",
79+
"status_code": 200,
80+
"content": "<html></html>",
81+
},
82+
"context": {
83+
"webhook": {
84+
"name": "my-webhook",
85+
"secret": secret,
86+
"consecutive_failed_count": 0,
87+
},
88+
"job": {"uuid": "550e8400-e29b-41d4-a716-446655440000"},
89+
},
7990
}
8091
).encode("utf-8")
8192
sig = generate_scrapfly_signature(body, secret)

skills/scrapfly-webhooks/examples/nextjs/app/webhooks/scrapfly/route.ts

Lines changed: 3 additions & 1 deletion
@@ -75,8 +75,10 @@ export async function POST(request: NextRequest) {
7575

7676
switch (resourceType) {
7777
case 'scrape':
78+
// Scrape API places the fetched URL at result.url. The webhook overlay's
79+
// payload.context only carries `webhook` and `job` sub-objects.
7880
console.log('Scrape result:', {
79-
url: payload?.context?.url,
81+
url: payload?.result?.url,
8082
status: payload?.result?.status_code,
8183
});
8284
break;

skills/scrapfly-webhooks/examples/nextjs/test/webhook.test.ts

Lines changed: 5 additions & 2 deletions
@@ -53,8 +53,11 @@ describe('Scrapfly Webhook Endpoint (Next.js)', () => {
5353

5454
it('returns 200 for a valid scrape webhook', async () => {
5555
const body = JSON.stringify({
56-
context: { url: 'https://web-scraping.dev/products' },
57-
result: { status_code: 200 },
56+
result: { url: 'https://web-scraping.dev/products', status_code: 200 },
57+
context: {
58+
webhook: { name: 'my-webhook', secret, consecutive_failed_count: 0 },
59+
job: { uuid: '550e8400-e29b-41d4-a716-446655440000' },
60+
},
5861
});
5962
const sig = generateScrapflySignature(body, secret);
6063

skills/scrapfly-webhooks/references/overview.md

Lines changed: 61 additions & 12 deletions
@@ -16,22 +16,50 @@ The `X-Scrapfly-Webhook-Resource-Type` header tells you which product the delive
1616
| `extraction` | An async Extraction API job finishes | Persist structured data, enqueue follow-up enrichment |
1717
| `screenshot` | An async Screenshot API job finishes | Store image URL, notify users, generate thumbnails |
1818

19-
The body of a `scrape` / `extraction` / `screenshot` webhook is the same JSON envelope you'd get from the synchronous API call, with extra webhook context (`webhook_name`, `webhook_uuid`, `job_uuid`).
19+
The body of a `scrape` / `extraction` / `screenshot` webhook is the full JSON response of the corresponding synchronous API call with a `context` overlay added:
20+
21+
```json
22+
{
23+
"...api_response": "...",
24+
"context": {
25+
"...api_context": "...",
26+
"webhook": {
27+
"name": "my-webhook",
28+
"secret": "<signing secret — DO NOT log>",
29+
"consecutive_failed_count": 0
30+
},
31+
"job": {
32+
"uuid": "550e8400-e29b-41d4-a716-446655440000"
33+
}
34+
}
35+
}
36+
```
37+
38+
The webhook overlay always carries:
39+
40+
- `context.webhook.name` — webhook name configured in the dashboard
41+
- `context.webhook.secret` — the signing secret (**never log or echo this field**)
42+
- `context.webhook.consecutive_failed_count` — current consecutive-failure count
43+
- `context.job.uuid` — job UUID (same value as `X-Scrapfly-Webhook-Job-Id`)
44+
45+
Product-specific fields (such as `result.content`, `result.data`, `result.screenshot_url`, or the API's own `context.url`) come from the underlying API response — see the [Scrape](https://scrapfly.io/docs/scrape-api/getting-started), [Extraction](https://scrapfly.io/docs/extraction-api/getting-started), and [Screenshot](https://scrapfly.io/docs/screenshot-api/getting-started) getting-started pages for shapes.
2046

2147
## Crawler Events
2248

2349
The Crawler API is a separate product that delivers **lifecycle events** rather than a single result. Each event has an `event` field in the body (and an `X-Scrapfly-Crawl-Event-Name` header):
2450

25-
| Event | Triggered When |
26-
|-------|----------------|
27-
| `crawler_started` | Crawl job started |
28-
| `crawler_url_visited` | A URL was fetched successfully |
29-
| `crawler_url_discovered` | A new URL was added to the queue |
30-
| `crawler_url_skipped` | A URL was skipped (deduped, filtered) |
31-
| `crawler_url_failed` | A URL fetch failed |
32-
| `crawler_stopped` | The crawl stopped (budget/limit reached) |
33-
| `crawler_cancelled` | The crawl was cancelled |
34-
| `crawler_finished` | The crawl ran to completion |
51+
| Event | Default? | Triggered When |
52+
|-------|----------|----------------|
53+
| `crawler_started` | Yes | Crawl job started |
54+
| `crawler_stopped` | Yes | The crawl stopped (budget/limit reached) |
55+
| `crawler_cancelled` | Yes | The crawl was cancelled |
56+
| `crawler_finished` | Yes | The crawl ran to completion |
57+
| `crawler_url_visited` | Opt-in | A URL was fetched successfully |
58+
| `crawler_url_discovered` | Opt-in | A new URL was added to the queue |
59+
| `crawler_url_skipped` | Opt-in | A URL was skipped (deduped, filtered) |
60+
| `crawler_url_failed` | Opt-in | A URL fetch failed |
61+
62+
By default Scrapfly only delivers the four lifecycle events: `crawler_started`, `crawler_stopped`, `crawler_cancelled`, `crawler_finished`. The per-URL events (`crawler_url_visited`, `crawler_url_discovered`, `crawler_url_skipped`, `crawler_url_failed`) are high-volume and must be enabled explicitly via the `webhook_events` parameter when submitting the crawl job.
3563

3664
Example Crawler payload:
3765

@@ -62,11 +90,32 @@ Example Crawler payload:
6290
| `X-Scrapfly-Webhook-Name` | Name of the webhook configured in the dashboard |
6391
| `X-Scrapfly-Webhook-Resource-Type` | `scrape`, `extraction`, or `screenshot` |
6492
| `X-Scrapfly-Webhook-Job-Id` | Job UUID returned at enqueue time — reconciliation key |
65-
| `X-Scrapfly-Webhook-Env` | Environment label (e.g. `production`) |
93+
| `X-Scrapfly-Webhook-Env` | Environment label (`test` or `live`) |
6694
| `X-Scrapfly-Webhook-Project` | Project name |
6795
| `X-Scrapfly-Crawl-Event-Name` | Crawler API event name (e.g. `crawler_finished`) |
6896
| `X-Scrapfly-Log-Uuid` / `X-Scrapfly-Log-Url` | Pointers to the Scrapfly log entry for the delivery |
6997

98+
## Delivery & Retries
99+
100+
Scrapfly delivery is **at-least-once**. Use `X-Scrapfly-Webhook-Job-Id` as your idempotency key — duplicates carry the same job UUID.
101+
102+
Retry schedule on non-2xx responses (or timeout):
103+
104+
| Attempt | Delay after previous |
105+
|---------|----------------------|
106+
| 1 | initial delivery |
107+
| 2 | 30 s |
108+
| 3 | 1 min |
109+
| 4 | 5 min |
110+
| 5 | 30 min |
111+
| 6 | 1 h |
112+
| 7 | 1 d |
113+
114+
After **100 consecutive failures** Scrapfly automatically **disables** the webhook — no further deliveries are attempted until you re-enable it in the dashboard. Because of this, handlers should:
115+
116+
- Return 2xx as soon as the signature is verified and the job is enqueued.
117+
- Surface processing errors out-of-band (logs, alerts, dead-letter queue) rather than 5xx-ing back to Scrapfly.
118+
70119
## Full Event Reference
71120

72121
- [Scrape API webhook](https://scrapfly.io/docs/scrape-api/webhook)
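
The at-least-once delivery and retry schedule documented in the Delivery & Retries section of this diff can be illustrated with a short sketch. It assumes the delays in the table are measured from the previous attempt, and it uses an in-memory set for deduplication purely for illustration; `JobDeduplicator` and `retry_offsets` are hypothetical names, and a real deployment would likely need a shared store such as a database.

```python
from datetime import timedelta
from itertools import accumulate

# Delays between consecutive attempts, copied from the retry table.
RETRY_DELAYS = [
    timedelta(seconds=30),
    timedelta(minutes=1),
    timedelta(minutes=5),
    timedelta(minutes=30),
    timedelta(hours=1),
    timedelta(days=1),
]


def retry_offsets() -> list:
    """Elapsed time from the initial delivery to each retry attempt."""
    return list(accumulate(RETRY_DELAYS))


class JobDeduplicator:
    """Tracks job UUIDs so that duplicate deliveries are processed only once."""

    def __init__(self) -> None:
        self._seen = set()

    def first_delivery(self, job_id: str) -> bool:
        """Return True the first time a job id is seen, False on duplicates."""
        if job_id in self._seen:
            return False
        self._seen.add(job_id)
        return True
```

A handler would call `first_delivery` with the value of `X-Scrapfly-Webhook-Job-Id` and still return 2xx for duplicates, so that a redelivered job does not count toward the consecutive-failure limit. Under the stated reading of the table, the final retry lands 92,190 seconds (about 25.6 hours) after the initial delivery.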
