skills/scrapfly-webhooks/references/overview.md
+61 -12 (61 additions & 12 deletions)
@@ -16,22 +16,50 @@ The `X-Scrapfly-Webhook-Resource-Type` header tells you which product the delive
16
16
|`extraction`| An async Extraction API job finishes | Persist structured data, enqueue follow-up enrichment |
17
17
|`screenshot`| An async Screenshot API job finishes | Store image URL, notify users, generate thumbnails |
18
18
19
-
The body of a `scrape` / `extraction` / `screenshot` webhook is the same JSON envelope you'd get from the synchronous API call, with extra webhook context (`webhook_name`, `webhook_uuid`, `job_uuid`).
19
+
The body of a `scrape` / `extraction` / `screenshot` webhook is the full JSON response of the corresponding synchronous API call with a `context` overlay added:
20
+
21
+
```json
22
+
{
23
+
"...api_response": "...",
24
+
"context": {
25
+
"...api_context": "...",
26
+
"webhook": {
27
+
"name": "my-webhook",
28
+
"secret": "<signing secret — DO NOT log>",
29
+
"consecutive_failed_count": 0
30
+
},
31
+
"job": {
32
+
"uuid": "550e8400-e29b-41d4-a716-446655440000"
33
+
}
34
+
}
35
+
}
36
+
```
37
+
38
+
The webhook overlay always carries:
39
+
40
+
- `context.webhook.name`: webhook name configured in the dashboard
41
+
- `context.webhook.secret`: the signing secret (**never log or echo this field**)
42
+
- `context.webhook.consecutive_failed_count`: current consecutive-failure count
43
+
- `context.job.uuid`: job UUID (same value as `X-Scrapfly-Webhook-Job-Id`)
44
+
45
+
Product-specific fields (such as `result.content`, `result.data`, `result.screenshot_url`, or the API's own `context.url`) come from the underlying API response — see the [Scrape](https://scrapfly.io/docs/scrape-api/getting-started), [Extraction](https://scrapfly.io/docs/extraction-api/getting-started), and [Screenshot](https://scrapfly.io/docs/screenshot-api/getting-started) getting-started pages for shapes.
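
Because the overlay carries the signing secret inside the body, any code path that logs or stores the raw payload should scrub it first. The following is a minimal Python sketch, assuming the body has already been parsed into a `dict`; the helper name `redact_webhook_secret` and the `[REDACTED]` placeholder are illustrative and not part of any Scrapfly SDK.

```python
import copy

REDACTED = "[REDACTED]"


def redact_webhook_secret(payload: dict) -> dict:
    """Return a deep copy of a parsed webhook body that is safe to log."""
    safe = copy.deepcopy(payload)  # never mutate the caller's payload
    context = safe.get("context")
    if isinstance(context, dict):
        webhook = context.get("webhook")
        if isinstance(webhook, dict) and "secret" in webhook:
            webhook["secret"] = REDACTED
    return safe


# Sample body shaped like the overlay documented above (values are made up).
sample = {
    "context": {
        "webhook": {
            "name": "my-webhook",
            "secret": "not-a-real-secret",
            "consecutive_failed_count": 0,
        },
        "job": {"uuid": "550e8400-e29b-41d4-a716-446655440000"},
    }
}

loggable = redact_webhook_secret(sample)
print(loggable["context"]["webhook"])
```

Working on a deep copy keeps the original payload intact for signature-dependent code while guaranteeing that the logged version never contains the secret.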
20
46
21
47
## Crawler Events
22
48
23
49
The Crawler API is a separate product that delivers **lifecycle events** rather than a single result. Each event has an `event` field in the body (and an `X-Scrapfly-Crawl-Event-Name` header):
24
50
25
-
| Event | Triggered When |
26
-
|-------|----------------|
27
-
|`crawler_started`| Crawl job started |
28
-
|`crawler_url_visited`| A URL was fetched successfully |
29
-
|`crawler_url_discovered`| A new URL was added to the queue |
30
-
|`crawler_url_skipped`| A URL was skipped (deduped, filtered) |
31
-
|`crawler_url_failed`| A URL fetch failed |
32
-
|`crawler_stopped`| The crawl stopped (budget/limit reached) |
33
-
|`crawler_cancelled`| The crawl was cancelled |
34
-
|`crawler_finished`| The crawl ran to completion |
51
+
| Event | Default? | Triggered When |
52
+
|-------|----------|----------------|
53
+
|`crawler_started`| Yes | Crawl job started |
54
+
|`crawler_stopped`| Yes | The crawl stopped (budget/limit reached) |
55
+
|`crawler_cancelled`| Yes | The crawl was cancelled |
56
+
|`crawler_finished`| Yes | The crawl ran to completion |
57
+
|`crawler_url_visited`| Opt-in | A URL was fetched successfully |
58
+
|`crawler_url_discovered`| Opt-in | A new URL was added to the queue |
59
+
|`crawler_url_skipped`| Opt-in | A URL was skipped (deduped, filtered) |
60
+
|`crawler_url_failed`| Opt-in | A URL fetch failed |
61
+
62
+
By default, Scrapfly delivers only the four lifecycle events: `crawler_started`, `crawler_stopped`, `crawler_cancelled`, and `crawler_finished`. The per-URL events (`crawler_url_visited`, `crawler_url_discovered`, `crawler_url_skipped`, `crawler_url_failed`) are high-volume and must be enabled explicitly via the `webhook_events` parameter when submitting the crawl job.
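
On the receiving side, the lifecycle/per-URL split can be used to route events to different code paths, for example sending the high-volume per-URL events to a cheaper handler. This is a rough Python sketch that relies only on the `event` field named above; the function `classify_crawler_event` and its return labels are hypothetical.

```python
# Event names taken from the table above.
LIFECYCLE_EVENTS = frozenset({
    "crawler_started",
    "crawler_stopped",
    "crawler_cancelled",
    "crawler_finished",
})
PER_URL_EVENTS = frozenset({
    "crawler_url_visited",
    "crawler_url_discovered",
    "crawler_url_skipped",
    "crawler_url_failed",
})


def classify_crawler_event(body: dict) -> str:
    """Label a parsed Crawler webhook body by the kind of event it carries."""
    name = body.get("event")
    if name in LIFECYCLE_EVENTS:
        return "lifecycle"  # delivered by default
    if name in PER_URL_EVENTS:
        return "per_url"  # only delivered when opted in
    return "unknown"  # unrecognized name: log it rather than crash


print(classify_crawler_event({"event": "crawler_finished"}))
print(classify_crawler_event({"event": "crawler_url_visited"}))
```

Returning `"unknown"` instead of raising keeps the handler tolerant of event names it does not recognize.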
35
63
36
64
Example Crawler payload:
37
65
@@ -62,11 +90,32 @@ Example Crawler payload:
62
90
|`X-Scrapfly-Webhook-Name`| Name of the webhook configured in the dashboard |
63
91
|`X-Scrapfly-Webhook-Resource-Type`|`scrape`, `extraction`, or `screenshot`|
64
92
|`X-Scrapfly-Webhook-Job-Id`| Job UUID returned at enqueue time — reconciliation key |
|`X-Scrapfly-Webhook-Env`| Environment label (`test` or `live`) |
66
94
|`X-Scrapfly-Webhook-Project`| Project name |
67
95
|`X-Scrapfly-Crawl-Event-Name`| Crawler API event name (e.g. `crawler_finished`) |
68
96
|`X-Scrapfly-Log-Uuid` / `X-Scrapfly-Log-Url`| Pointers to the Scrapfly log entry for the delivery |
69
97
98
+
## Delivery & Retries
99
+
100
+
Scrapfly delivery is **at-least-once**. Use `X-Scrapfly-Webhook-Job-Id` as your idempotency key — duplicates carry the same job UUID.
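
At-least-once delivery means the handler has to tolerate seeing the same job twice. A minimal deduplication sketch in Python follows, assuming an in-memory `set` as the store; a real deployment would more likely use a durable store with an atomic insert. The function name `is_duplicate_delivery` is hypothetical.

```python
JOB_ID_HEADER = "x-scrapfly-webhook-job-id"


def is_duplicate_delivery(headers: dict, seen_job_ids: set) -> bool:
    """Return True if this job UUID was already recorded, else record it."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {key.lower(): value for key, value in headers.items()}
    job_id = normalized.get(JOB_ID_HEADER)
    if job_id is None:
        raise ValueError("missing X-Scrapfly-Webhook-Job-Id header")
    if job_id in seen_job_ids:
        return True
    seen_job_ids.add(job_id)
    return False


seen = set()
delivery = {"X-Scrapfly-Webhook-Job-Id": "550e8400-e29b-41d4-a716-446655440000"}
print(is_duplicate_delivery(delivery, seen))  # first delivery
print(is_duplicate_delivery(delivery, seen))  # retry of the same job
```

A duplicate should still be answered with a 2xx response so that Scrapfly stops retrying it.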
101
+
102
+
Retry schedule on non-2xx responses (or timeout):
103
+
104
+
| Attempt | Delay after previous |
105
+
|---------|----------------------|
106
+
| 1 | initial delivery |
107
+
| 2 | 30 s |
108
+
| 3 | 1 min |
109
+
| 4 | 5 min |
110
+
| 5 | 30 min |
111
+
| 6 | 1 h |
112
+
| 7 | 1 d |
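
To see how long a single delivery can stay in flight, the delays in the table can be summed. The Python sketch below assumes each delay is measured from the previous attempt (as the column header suggests) and ignores the time spent waiting on responses or timeouts.

```python
from datetime import timedelta
from itertools import accumulate

# Delay before each attempt, copied from the table above.
RETRY_DELAYS = [
    timedelta(0),           # attempt 1: initial delivery
    timedelta(seconds=30),  # attempt 2
    timedelta(minutes=1),   # attempt 3
    timedelta(minutes=5),   # attempt 4
    timedelta(minutes=30),  # attempt 5
    timedelta(hours=1),     # attempt 6
    timedelta(days=1),      # attempt 7
]

# Running total: time elapsed since the initial delivery at each attempt.
offsets = list(accumulate(RETRY_DELAYS))

for attempt, offset in enumerate(offsets, start=1):
    print(f"attempt {attempt}: {offset} after initial delivery")
```

Under those assumptions the seventh attempt lands about 1 day, 1 hour, and 36.5 minutes after the first, which is the window an idempotency record needs to cover at minimum.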
113
+
114
+
After **100 consecutive failures** Scrapfly automatically **disables** the webhook — no further deliveries are attempted until you re-enable it in the dashboard. Because of this, handlers should:
115
+
116
+
- Return 2xx as soon as the signature is verified and the job is enqueued.
117
+
- Surface processing errors out-of-band (logs, alerts, dead-letter queue) rather than 5xx-ing back to Scrapfly.
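
The two recommendations above amount to an acknowledge-then-process pattern. Below is a framework-agnostic Python sketch of that pattern. All names are hypothetical: `verify_signature` stands in for whatever signature check your integration uses (its scheme is not covered in this section), and returning 401 for a failed check is a design assumption, not documented Scrapfly behavior.

```python
import queue
from typing import Callable, List, Mapping, Tuple

Delivery = Tuple[dict, bytes]


def handle_delivery(
    raw_body: bytes,
    headers: Mapping[str, str],
    verify_signature: Callable[[bytes, Mapping[str, str]], bool],
    work_queue: "queue.Queue[Delivery]",
) -> int:
    """Verify, enqueue, and return the HTTP status to send back."""
    if not verify_signature(raw_body, headers):
        return 401  # assumption: reject requests that fail verification
    work_queue.put((dict(headers), raw_body))
    return 200  # acknowledge before any slow processing happens


def drain(
    work_queue: "queue.Queue[Delivery]",
    process: Callable[[dict, bytes], None],
    dead_letters: List[Tuple[dict, bytes, str]],
) -> None:
    """Process queued deliveries; failures go to a dead-letter list."""
    while True:
        try:
            headers, raw_body = work_queue.get_nowait()
        except queue.Empty:
            return
        try:
            process(headers, raw_body)
        except Exception as exc:  # surfaced out-of-band, never as a 5xx
            dead_letters.append((headers, raw_body, repr(exc)))


def failing_processor(headers: dict, raw_body: bytes) -> None:
    raise RuntimeError("downstream unavailable")


jobs: "queue.Queue[Delivery]" = queue.Queue()
failed: List[Tuple[dict, bytes, str]] = []
status = handle_delivery(b"{}", {"X-Scrapfly-Webhook-Job-Id": "abc"}, lambda b, h: True, jobs)
drain(jobs, failing_processor, failed)
print(status, len(failed))
```

Here the processing failure ends up in `failed` while the sender still receives a 2xx, so a broken downstream system does not push the webhook toward the 100-failure cutoff.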
118
+
70
119
## Full Event Reference
71
120
72
121
- [Scrape API webhook](https://scrapfly.io/docs/scrape-api/webhook)