Commit 1cb4c07

davila7 and claude authored
feat: add cf-crawl skill for Cloudflare Browser Rendering /crawl API (#409)
New skill that crawls entire websites using Cloudflare's Browser Rendering /crawl endpoint and saves results as markdown. Supports async job polling, pagination, URL filtering, and multiple output formats.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 6ab95c1 commit 1cb4c07

3 files changed

Lines changed: 1788 additions & 1512 deletions

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -68,3 +68,6 @@ analysis/
 
 # OpenSpec (local workflow artifacts)
 openspec/
+
+# Crawl output (cf-crawl skill)
+.crawl-output/
Lines changed: 229 additions & 0 deletions
@@ -0,0 +1,229 @@
---
name: cf-crawl
description: "Crawl entire websites using Cloudflare Browser Rendering /crawl API. Initiates async crawl jobs, polls for completion, and saves results as markdown files. Useful for ingesting documentation sites, knowledge bases, or any web content into your project context. Requires CLOUDFLARE_ACCOUNT_ID and CLOUDFLARE_API_TOKEN environment variables."
---

# Cloudflare Website Crawler

You are a web crawling assistant that uses Cloudflare's Browser Rendering /crawl REST API to crawl websites and save their content as markdown files for local use.

## Prerequisites

The user must have:
1. A Cloudflare account with Browser Rendering enabled
2. Two environment variables set:
   - `CLOUDFLARE_ACCOUNT_ID` - Their Cloudflare account ID
   - `CLOUDFLARE_API_TOKEN` - An API token with "Browser Rendering - Edit" permission

If either variable is missing, instruct the user to set them:
```bash
export CLOUDFLARE_ACCOUNT_ID="your-account-id"
export CLOUDFLARE_API_TOKEN="your-api-token"
```

## Workflow

When the user asks to crawl a website, follow this exact workflow:

### Step 1: Validate Environment

Check that both environment variables are set before proceeding.

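The environment check in Step 1 could be sketched as follows. This is a minimal illustration, not part of the skill itself; the helper name `missing_env_vars` and the `REQUIRED_VARS` constant are hypothetical, and an empty value is treated the same as an unset one.

```python
import os

# The two variables named in the Prerequisites section.
REQUIRED_VARS = ("CLOUDFLARE_ACCOUNT_ID", "CLOUDFLARE_API_TOKEN")


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty.

    `env` defaults to os.environ; passing a plain dict makes the check testable.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_env_vars()` with no arguments inspects the real environment; a non-empty return value lists the variables the user should be asked to export before the crawl starts.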
### Step 2: Initiate Crawl

Send a POST request to start the crawl job. Choose parameters based on user needs:

```bash
curl -s -X POST "https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl" \
  -H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "<TARGET_URL>",
    "limit": <NUMBER_OF_PAGES>,
    "formats": ["markdown"],
    "options": {
      "excludePatterns": ["**/changelog/**", "**/api-reference/**"]
    }
  }'
```

The response returns a job ID:
```json
{"success": true, "result": "job-uuid-here"}
```

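As a rough sketch, the request body for Step 2 could be assembled from the user's choices like this. The function name `build_crawl_payload` is hypothetical, and the field placement (patterns under `options`, `render` at the top level) simply follows the parameter tables in this document rather than any verified API schema.

```python
import json


def build_crawl_payload(url, limit=20, include=None, exclude=None, render=True):
    """Build the JSON body for the POST /crawl request (illustrative only)."""
    payload = {"url": url, "limit": limit, "formats": ["markdown"]}
    if not render:
        # Only send `render` when it differs from the documented default (true).
        payload["render"] = False
    options = {}
    if include:
        options["includePatterns"] = list(include)
    if exclude:
        options["excludePatterns"] = list(exclude)
    if options:
        payload["options"] = options
    return payload


body = json.dumps(build_crawl_payload(
    "https://docs.example.com",
    limit=50,
    exclude=["**/changelog/**"],
))
```

The resulting `body` string can be passed to `curl -d` or any HTTP client. Omitting empty `options` keeps the request identical to the minimal form when no filters are given.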
### Step 3: Poll for Completion

Poll the job status every 5 seconds until it is no longer `running`:

```bash
curl -s -X GET "https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl/<JOB_ID>?limit=1" \
  -H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}" | python3 -c "import sys,json; d=json.load(sys.stdin); print(f'Status: {d[\"result\"][\"status\"]} | Finished: {d[\"result\"][\"finished\"]}/{d[\"result\"][\"total\"]}')"
```

Possible job statuses:
- `running` - Still in progress, keep polling
- `completed` - All pages processed
- `cancelled_due_to_timeout` - Exceeded 7-day limit
- `cancelled_due_to_limits` - Hit account limits
- `errored` - Something went wrong

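The polling rule above can be sketched as a small loop. This is an illustration under stated assumptions: `fetch_status` is a hypothetical callable that returns the `result` object of the GET response, any status other than `running` is treated as terminal, and `max_polls` is an added safety cap that the skill does not define.

```python
import time


def poll_until_done(fetch_status, interval=5.0, max_polls=None, sleep=time.sleep):
    """Call fetch_status() every `interval` seconds until the job stops running.

    Returns the last result dict. Raises TimeoutError if `max_polls` is set
    and the job is still running after that many polls.
    """
    polls = 0
    while True:
        result = fetch_status()
        polls += 1
        if result.get("status") != "running":
            return result
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError(f"job still running after {polls} polls")
        sleep(interval)
```

Passing `sleep` as a parameter keeps the loop testable without real delays. The caller should still inspect the returned `status`, since `errored` and the two `cancelled_*` values also end the loop.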
### Step 4: Retrieve Results

Fetch all completed records using pagination (cursor-based):

```bash
curl -s -X GET "https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl/<JOB_ID>?status=completed&limit=50" \
  -H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}"
```

If there are more records, use the `cursor` value from the response:
```bash
curl -s -X GET "https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl/<JOB_ID>?status=completed&limit=50&cursor=<CURSOR>" \
  -H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}"
```

### Step 5: Save Results

Save each page's markdown content to a local directory. Use a script like:

```bash
# Create output directory
mkdir -p .crawl-output

# Fetch and save all pages. The quoted heredoc stops the shell from
# expanding anything inside the script (for example the "!" in "<!--").
python3 - <<'PY'
import json, os, re, urllib.parse, urllib.request

account_id = os.environ['CLOUDFLARE_ACCOUNT_ID']
api_token = os.environ['CLOUDFLARE_API_TOKEN']
job_id = '<JOB_ID>'
base = f'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl/{job_id}'
outdir = '.crawl-output'
os.makedirs(outdir, exist_ok=True)

cursor = None
total_saved = 0

while True:
    url = f'{base}?status=completed&limit=50'
    if cursor:
        # Percent-encode the cursor in case it contains reserved characters
        url += f'&cursor={urllib.parse.quote(str(cursor), safe="")}'

    req = urllib.request.Request(url, headers={
        'Authorization': f'Bearer {api_token}'
    })
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)

    records = data.get('result', {}).get('records', [])
    if not records:
        break

    for rec in records:
        page_url = rec.get('url', '')
        md = rec.get('markdown', '')
        if not md:
            continue
        # Convert URL to filename
        name = re.sub(r'https?://', '', page_url)
        name = re.sub(r'[^a-zA-Z0-9]', '_', name).strip('_')[:120]
        filepath = os.path.join(outdir, f'{name}.md')
        with open(filepath, 'w', encoding='utf-8') as f:
            f.write(f'<!-- Source: {page_url} -->\n\n')
            f.write(md)
        total_saved += 1

    # Stop when the response carries no cursor (missing, None, or empty);
    # an empty cursor would otherwise refetch the first page forever
    cursor = data.get('result', {}).get('cursor')
    if not cursor:
        break

print(f'Saved {total_saved} pages to {outdir}/')
PY
```

## Parameter Reference

### Core Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `url` | string | (required) | Starting URL to crawl |
| `limit` | number | 10 | Max pages to crawl (up to 100,000) |
| `depth` | number | 100,000 | Max link depth from starting URL |
| `formats` | array | ["html"] | Output formats: `html`, `markdown`, `json` |
| `render` | boolean | true | `true` = headless browser, `false` = fast HTML fetch |
| `source` | string | "all" | Page discovery: `all`, `sitemaps`, `links` |
| `maxAge` | number | 86400 | Cache validity in seconds (max 604800) |

### Options Object

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `includePatterns` | array | [] | Wildcard patterns to include (`*` and `**`) |
| `excludePatterns` | array | [] | Wildcard patterns to exclude (higher priority) |
| `includeSubdomains` | boolean | false | Follow links to subdomains |
| `includeExternalLinks` | boolean | false | Follow external links |

### Advanced Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `jsonOptions` | object | AI-powered structured extraction (prompt, response_format) |
| `authenticate` | object | HTTP basic auth (username, password) |
| `setExtraHTTPHeaders` | object | Custom headers for requests |
| `rejectResourceTypes` | array | Skip: image, media, font, stylesheet |
| `userAgent` | string | Custom user agent string |
| `cookies` | array | Custom cookies for requests |

## Usage Examples

### Crawl documentation site (most common)
```
/cf-crawl https://docs.example.com --limit 50
```
Crawls up to 50 pages, saves as markdown.

### Crawl with filters
```
/cf-crawl https://docs.example.com --limit 100 --include "/guides/**,/api/**" --exclude "/changelog/**"
```

### Fast crawl without JavaScript rendering
```
/cf-crawl https://docs.example.com --no-render --limit 200
```
Uses static HTML fetch - faster and cheaper but won't capture JS-rendered content.

### Crawl and merge into single file
```
/cf-crawl https://docs.example.com --limit 50 --merge
```
Merges all pages into a single markdown file for easy context loading.

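The skill does not spell out how `--merge` combines pages, so the following is only one plausible sketch. The function name `merge_markdown`, the output filename `merged.md`, the alphabetical ordering, and the `---` separator are all assumptions made for illustration.

```python
from pathlib import Path


def merge_markdown(outdir, merged_name="merged.md"):
    """Concatenate every .md file in `outdir` into one file.

    Files are merged in sorted filename order and separated by a horizontal
    rule. Returns the merged file path and the number of pages merged.
    """
    out = Path(outdir)
    target = out / merged_name
    parts = []
    for path in sorted(out.glob("*.md")):
        if path == target:
            continue  # do not fold a previous merge result into itself
        parts.append(path.read_text(encoding="utf-8").rstrip() + "\n")
    target.write_text("\n---\n\n".join(parts), encoding="utf-8")
    return target, len(parts)
```

Because each saved page already starts with a `<!-- Source: ... -->` comment, the merged file keeps a record of where every section came from.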
## Argument Parsing

When invoked as `/cf-crawl`, parse the arguments as follows:

- First positional argument: the URL to crawl
- `--limit N` or `-l N`: max pages (default: 20; the API's own default is 10)
- `--depth N` or `-d N`: max depth (default: 100000)
- `--include "pattern1,pattern2"`: include URL patterns
- `--exclude "pattern1,pattern2"`: exclude URL patterns
- `--no-render`: disable JavaScript rendering (faster)
- `--merge`: combine all output into a single file
- `--output DIR` or `-o DIR`: output directory (default: `.crawl-output`)
- `--source sitemaps|links|all`: page discovery method (default: all)

If no URL is provided, ask the user for the target URL.

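The skill is a prompt, so the assistant interprets these arguments itself. Purely as an illustration of the rules listed above, the same grammar could be expressed with Python's `argparse`; the helper names `split_patterns` and `build_parser` are hypothetical.

```python
import argparse


def split_patterns(value):
    """Turn "a,b" into ["a", "b"], dropping empty entries."""
    return [part.strip() for part in value.split(",") if part.strip()]


def build_parser():
    parser = argparse.ArgumentParser(prog="/cf-crawl")
    # The URL is optional here so that a missing URL can trigger a follow-up question.
    parser.add_argument("url", nargs="?")
    parser.add_argument("--limit", "-l", type=int, default=20)
    parser.add_argument("--depth", "-d", type=int, default=100000)
    parser.add_argument("--include", type=split_patterns, default=[])
    parser.add_argument("--exclude", type=split_patterns, default=[])
    # --no-render flips `render` from its default of True to False.
    parser.add_argument("--no-render", dest="render", action="store_false")
    parser.add_argument("--merge", action="store_true")
    parser.add_argument("--output", "-o", default=".crawl-output")
    parser.add_argument("--source", choices=["sitemaps", "links", "all"], default="all")
    return parser


args = build_parser().parse_args(
    ["https://docs.example.com", "--limit", "100", "--include", "/guides/**,/api/**"]
)
```

When `args.url` is `None`, the caller would ask the user for a target URL instead of starting a crawl.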
## Important Notes

- The /crawl endpoint respects robots.txt directives including crawl-delay
- Blocked URLs appear with `"status": "disallowed"` in results
- Free plan: 10 minutes of browser time per day
- Job results are available for 14 days after completion
- Max job runtime: 7 days
- Response page size limit: 10 MB per page
- Use `render: false` for static sites to save browser time
- Pattern wildcards: `*` matches any character except `/`, `**` matches including `/`
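
To preview locally which paths a pattern would select, the wildcard rule in the last note can be approximated with a regular expression. This is a sketch of the stated semantics only: the helper names are hypothetical, and whether Cloudflare matches patterns against the full URL or only the path is not stated here, so results may differ from the service.

```python
import re


def pattern_to_regex(pattern):
    """Translate a wildcard pattern: `**` matches anything, `*` stops at `/`."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            out.append(".*")
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out))


def matches(pattern, text):
    """Return True when the whole of `text` matches `pattern`."""
    return pattern_to_regex(pattern).fullmatch(text) is not None
```

For example, `matches("/docs/*", "/docs/a/b")` is false because `*` does not cross a `/`, while `matches("/docs/**", "/docs/a/b")` is true.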
