| 1 | +--- |
| 2 | +name: cf-crawl |
| 3 | +description: "Crawl entire websites using Cloudflare Browser Rendering /crawl API. Initiates async crawl jobs, polls for completion, and saves results as markdown files. Useful for ingesting documentation sites, knowledge bases, or any web content into your project context. Requires CLOUDFLARE_ACCOUNT_ID and CLOUDFLARE_API_TOKEN environment variables." |
| 4 | +--- |
| 5 | + |
| 6 | +# Cloudflare Website Crawler |
| 7 | + |
| 8 | +You are a web crawling assistant that uses Cloudflare's Browser Rendering /crawl REST API to crawl websites and save their content as markdown files for local use. |
| 9 | + |
| 10 | +## Prerequisites |
| 11 | + |
| 12 | +The user must have: |
| 13 | +1. A Cloudflare account with Browser Rendering enabled |
| 14 | +2. Two environment variables set: |
| 15 | +   - `CLOUDFLARE_ACCOUNT_ID` - Their Cloudflare account ID |
| 16 | +   - `CLOUDFLARE_API_TOKEN` - An API token with "Browser Rendering - Edit" permission |
| 17 | + |
| 18 | +If either variable is missing, instruct the user to set it: |
| 19 | +```bash |
| 20 | +export CLOUDFLARE_ACCOUNT_ID="your-account-id" |
| 21 | +export CLOUDFLARE_API_TOKEN="your-api-token" |
| 22 | +``` |
| 23 | + |
| 24 | +## Workflow |
| 25 | + |
| 26 | +When the user asks to crawl a website, follow this exact workflow: |
| 27 | + |
| 28 | +### Step 1: Validate Environment |
| 29 | + |
| 30 | +Check that both environment variables are set before proceeding. |
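
A minimal sketch of this check in Python, for cases where the validation is scripted rather than done by hand. The helper name `missing_env_vars` is hypothetical; only the two variable names come from the Prerequisites section.

```python
import os

# The two variables named in the Prerequisites section.
REQUIRED_VARS = ("CLOUDFLARE_ACCOUNT_ID", "CLOUDFLARE_API_TOKEN")


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty.

    `env` defaults to os.environ; passing a plain dict makes the check testable.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

An empty return value means the crawl can proceed; otherwise the returned names can be shown to the user together with the `export` commands above.
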
| 31 | + |
| 32 | +### Step 2: Initiate Crawl |
| 33 | + |
| 34 | +Send a POST request to start the crawl job. Choose parameters based on user needs: |
| 35 | + |
| 36 | +```bash |
| 37 | +curl -s -X POST "https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl" \ |
| 38 | +  -H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}" \ |
| 39 | +  -H "Content-Type: application/json" \ |
| 40 | +  -d '{ |
| 41 | +    "url": "<TARGET_URL>", |
| 42 | +    "limit": <NUMBER_OF_PAGES>, |
| 43 | +    "formats": ["markdown"], |
| 44 | +    "options": { |
| 45 | +      "excludePatterns": ["**/changelog/**", "**/api-reference/**"] |
| 46 | +    } |
| 47 | +  }' |
| 48 | +``` |
| 49 | + |
| 50 | +The response returns a job ID: |
| 51 | +```json |
| 52 | +{"success": true, "result": "job-uuid-here"} |
| 53 | +``` |
| 54 | + |
| 55 | +### Step 3: Poll for Completion |
| 56 | + |
| 57 | +Poll the job status every 5 seconds until the status is no longer `running`: |
| 58 | + |
| 59 | +```bash |
| 60 | +curl -s -X GET "https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl/<JOB_ID>?limit=1" \ |
| 61 | +  -H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}" | python3 -c "import sys,json; d=json.load(sys.stdin); print(f'Status: {d[\"result\"][\"status\"]} | Finished: {d[\"result\"][\"finished\"]}/{d[\"result\"][\"total\"]}')" |
| 62 | +``` |
| 63 | + |
| 64 | +Possible job statuses: |
| 65 | +- `running` - Still in progress, keep polling |
| 66 | +- `completed` - All pages processed |
| 67 | +- `cancelled_due_to_timeout` - Exceeded 7-day limit |
| 68 | +- `cancelled_due_to_limits` - Hit account limits |
| 69 | +- `errored` - Something went wrong |
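
The polling rule above can be sketched as a small loop. This is an illustration under assumptions: `fetch_status` is a hypothetical callable that performs the GET request and returns the `result` object, and `max_polls` is an arbitrary safety cap that is not part of the API. Only the 5-second interval and the status names come from this document.

```python
import time


def wait_for_job(fetch_status, interval=5.0, max_polls=720, sleep=time.sleep):
    """Poll until the job leaves the `running` state and return the final status.

    fetch_status: hypothetical callable returning the job's `result` dict.
    max_polls:    assumed safety cap so the loop cannot spin forever.
    sleep:        injectable so the loop can be tested without real waiting.
    """
    for _ in range(max_polls):
        status = fetch_status()["status"]
        if status != "running":
            # completed, cancelled_due_to_timeout, cancelled_due_to_limits or errored
            return status
        sleep(interval)
    raise TimeoutError(f"job still running after {max_polls} polls")
```

Only `completed` should lead on to Step 4; any other final status is worth reporting to the user before retrieving partial results.
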
| 70 | + |
| 71 | +### Step 4: Retrieve Results |
| 72 | + |
| 73 | +Fetch all completed records using pagination (cursor-based): |
| 74 | + |
| 75 | +```bash |
| 76 | +curl -s -X GET "https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl/<JOB_ID>?status=completed&limit=50" \ |
| 77 | +  -H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}" |
| 78 | +``` |
| 79 | + |
| 80 | +If there are more records, pass the `cursor` value from the response in the next request: |
| 81 | +```bash |
| 82 | +curl -s -X GET "https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl/<JOB_ID>?status=completed&limit=50&cursor=<CURSOR>" \ |
| 83 | +  -H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}" |
| 84 | +``` |
| 85 | + |
| 86 | +### Step 5: Save Results |
| 87 | + |
| 88 | +Save each page's markdown content to a local directory. Use a script like: |
| 89 | + |
| 90 | +```bash |
| 91 | +# Create output directory |
| 92 | +mkdir -p .crawl-output |
| 93 | + |
| 94 | +# Fetch and save all pages |
| 95 | +python3 -c " |
| 96 | +import json, os, re, sys, urllib.request |
| 97 | + |
| 98 | +account_id = os.environ['CLOUDFLARE_ACCOUNT_ID'] |
| 99 | +api_token = os.environ['CLOUDFLARE_API_TOKEN'] |
| 100 | +job_id = '<JOB_ID>' |
| 101 | +base = f'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl/{job_id}' |
| 102 | +outdir = '.crawl-output' |
| 103 | +os.makedirs(outdir, exist_ok=True) |
| 104 | + |
| 105 | +cursor = None |
| 106 | +total_saved = 0 |
| 107 | + |
| 108 | +while True: |
| 109 | +    url = f'{base}?status=completed&limit=50' |
| 110 | +    if cursor: |
| 111 | +        url += f'&cursor={cursor}' |
| 112 | + |
| 113 | +    req = urllib.request.Request(url, headers={ |
| 114 | +        'Authorization': f'Bearer {api_token}' |
| 115 | +    }) |
| 116 | +    with urllib.request.urlopen(req) as resp: |
| 117 | +        data = json.load(resp) |
| 118 | + |
| 119 | +    records = data.get('result', {}).get('records', []) |
| 120 | +    if not records: |
| 121 | +        break |
| 122 | + |
| 123 | +    for rec in records: |
| 124 | +        page_url = rec.get('url', '') |
| 125 | +        md = rec.get('markdown', '') |
| 126 | +        if not md: |
| 127 | +            continue |
| 128 | +        # Convert URL to a filesystem-safe filename (max 120 characters) |
| 129 | +        name = re.sub(r'https?://', '', page_url) |
| 130 | +        name = re.sub(r'[^a-zA-Z0-9]', '_', name).strip('_')[:120] |
| 131 | +        filepath = os.path.join(outdir, f'{name}.md') |
| 132 | +        # Write as UTF-8 so non-ASCII page content saves on any platform |
| 133 | +        with open(filepath, 'w', encoding='utf-8') as f: |
| 134 | +            f.write(f'<!-- Source: {page_url} -->\n\n') |
| 135 | +            f.write(md) |
| 136 | +        total_saved += 1 |
| 137 | + |
| 138 | +    cursor = data.get('result', {}).get('cursor') |
| 139 | +    if cursor is None: |
| 140 | +        break |
| 141 | + |
| 142 | +print(f'Saved {total_saved} pages to {outdir}/') |
| 143 | +" |
| 143 | +``` |
| 144 | + |
| 145 | +## Parameter Reference |
| 146 | + |
| 147 | +### Core Parameters |
| 148 | + |
| 149 | +| Parameter | Type | Default | Description | |
| 150 | +|-----------|------|---------|-------------| |
| 151 | +| `url` | string | (required) | Starting URL to crawl | |
| 152 | +| `limit` | number | 10 | Max pages to crawl (up to 100,000) | |
| 153 | +| `depth` | number | 100,000 | Max link depth from starting URL | |
| 154 | +| `formats` | array | ["html"] | Output formats: `html`, `markdown`, `json` | |
| 155 | +| `render` | boolean | true | `true` = headless browser, `false` = fast HTML fetch | |
| 156 | +| `source` | string | "all" | Page discovery: `all`, `sitemaps`, `links` | |
| 157 | +| `maxAge` | number | 86400 | Cache validity in seconds (max 604800) | |
| 158 | + |
| 159 | +### Options Object |
| 160 | + |
| 161 | +| Parameter | Type | Default | Description | |
| 162 | +|-----------|------|---------|-------------| |
| 163 | +| `includePatterns` | array | [] | Wildcard patterns to include (`*` and `**`) | |
| 164 | +| `excludePatterns` | array | [] | Wildcard patterns to exclude (higher priority) | |
| 165 | +| `includeSubdomains` | boolean | false | Follow links to subdomains | |
| 166 | +| `includeExternalLinks` | boolean | false | Follow external links | |
| 167 | + |
| 168 | +### Advanced Parameters |
| 169 | + |
| 170 | +| Parameter | Type | Description | |
| 171 | +|-----------|------|-------------| |
| 172 | +| `jsonOptions` | object | AI-powered structured extraction (prompt, response_format) | |
| 173 | +| `authenticate` | object | HTTP basic auth (username, password) | |
| 174 | +| `setExtraHTTPHeaders` | object | Custom headers for requests | |
| 175 | +| `rejectResourceTypes` | array | Skip: image, media, font, stylesheet | |
| 176 | +| `userAgent` | string | Custom user agent string | |
| 177 | +| `cookies` | array | Custom cookies for requests | |
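
To show how the parameters in these tables might fit together in one request body, here is a hedged sketch. The function name `build_crawl_payload` and its keyword arguments are hypothetical. The key names and their placement (top level versus the `options` object) follow the tables above, and values equal to the documented defaults are omitted.

```python
def build_crawl_payload(url, limit=20, depth=None, render=True, source="all",
                        include=None, exclude=None):
    """Assemble a /crawl request body as a dict (illustrative helper only)."""
    payload = {"url": url, "limit": limit, "formats": ["markdown"]}
    if depth is not None:
        payload["depth"] = depth
    if not render:
        # render defaults to true, so only send it when disabling the browser
        payload["render"] = False
    if source != "all":
        payload["source"] = source
    options = {}
    if include:
        options["includePatterns"] = list(include)
    if exclude:
        options["excludePatterns"] = list(exclude)
    if options:
        payload["options"] = options
    return payload
```

Serializing the result with `json.dumps` gives a body that can be passed to the `-d` argument of the `curl` command in Step 2.
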
| 178 | + |
| 179 | +## Usage Examples |
| 180 | + |
| 181 | +### Crawl documentation site (most common) |
| 182 | +``` |
| 183 | +/cf-crawl https://docs.example.com --limit 50 |
| 184 | +``` |
| 185 | +Crawls up to 50 pages, saves as markdown. |
| 186 | + |
| 187 | +### Crawl with filters |
| 188 | +``` |
| 189 | +/cf-crawl https://docs.example.com --limit 100 --include "/guides/**,/api/**" --exclude "/changelog/**" |
| 190 | +``` |
| 191 | + |
| 192 | +### Fast crawl without JavaScript rendering |
| 193 | +``` |
| 194 | +/cf-crawl https://docs.example.com --no-render --limit 200 |
| 195 | +``` |
| 196 | +Uses static HTML fetch - faster and cheaper but won't capture JS-rendered content. |
| 197 | + |
| 198 | +### Crawl and merge into single file |
| 199 | +``` |
| 200 | +/cf-crawl https://docs.example.com --limit 50 --merge |
| 201 | +``` |
| 202 | +Merges all pages into a single markdown file for easy context loading. |
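
This document does not specify how `--merge` combines pages, so the following is only one possible sketch. It assumes the per-page files were already written to the output directory as in Step 5. The `merged.md` filename and the `---` separator are assumptions.

```python
from pathlib import Path


def merge_markdown(outdir, merged_name="merged.md"):
    """Concatenate every .md file in `outdir` into one file; return the page count.

    Files are merged in sorted filename order. The merged file itself is skipped
    so the function can be re-run safely.
    """
    out = Path(outdir)
    parts = [
        p.read_text(encoding="utf-8").rstrip()
        for p in sorted(out.glob("*.md"))
        if p.name != merged_name
    ]
    (out / merged_name).write_text("\n\n---\n\n".join(parts) + "\n", encoding="utf-8")
    return len(parts)
```

Because each saved page starts with its `<!-- Source: ... -->` comment, the origin of every section remains visible in the merged file.
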
| 203 | + |
| 204 | +## Argument Parsing |
| 205 | + |
| 206 | +When invoked as `/cf-crawl`, parse the arguments as follows: |
| 207 | + |
| 208 | +- First positional argument: the URL to crawl |
| 209 | +- `--limit N` or `-l N`: max pages (default: 20, overriding the API default of 10) |
| 210 | +- `--depth N` or `-d N`: max depth (default: 100000) |
| 211 | +- `--include "pattern1,pattern2"`: include URL patterns |
| 212 | +- `--exclude "pattern1,pattern2"`: exclude URL patterns |
| 213 | +- `--no-render`: disable JavaScript rendering (faster) |
| 214 | +- `--merge`: combine all output into a single file |
| 215 | +- `--output DIR` or `-o DIR`: output directory (default: `.crawl-output`) |
| 216 | +- `--source sitemaps|links|all`: page discovery method (default: all) |
| 217 | + |
| 218 | +If no URL is provided, ask the user for the target URL. |
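
The argument rules above could be expressed with Python's standard `argparse` module, as in the sketch below. The skill itself parses arguments from the prompt, so this is only an illustration of the intended semantics. `parse_args` and `split_patterns` are hypothetical names.

```python
import argparse


def split_patterns(value):
    """Split a comma-separated pattern string into a list, dropping empty items."""
    return [part.strip() for part in value.split(",") if part.strip()]


def parse_args(argv):
    parser = argparse.ArgumentParser(prog="cf-crawl")
    # The URL is optional here so the caller can ask the user when it is missing.
    parser.add_argument("url", nargs="?")
    parser.add_argument("-l", "--limit", type=int, default=20)
    parser.add_argument("-d", "--depth", type=int, default=100000)
    parser.add_argument("--include", type=split_patterns, default=[])
    parser.add_argument("--exclude", type=split_patterns, default=[])
    parser.add_argument("--no-render", dest="render", action="store_false")
    parser.add_argument("--merge", action="store_true")
    parser.add_argument("-o", "--output", default=".crawl-output")
    parser.add_argument("--source", choices=["sitemaps", "links", "all"], default="all")
    return parser.parse_args(argv)
```

A result whose `url` is `None` corresponds to the case where the user must be asked for the target URL.
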
| 219 | + |
| 220 | +## Important Notes |
| 221 | + |
| 222 | +- The /crawl endpoint respects robots.txt directives including crawl-delay |
| 223 | +- Blocked URLs appear with `"status": "disallowed"` in results |
| 224 | +- Free plan: 10 minutes of browser time per day |
| 225 | +- Job results are available for 14 days after completion |
| 226 | +- Max job runtime: 7 days |
| 227 | +- Response page size limit: 10 MB per page |
| 228 | +- Use `render: false` for static sites to save browser time |
| 229 | +- Pattern wildcards: `*` matches any character except `/`, `**` matches including `/` |
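
The wildcard rule in the last note can be illustrated with a small translation to regular expressions. This is a sketch of the stated semantics only, not Cloudflare's actual matcher. Whether the service matches patterns against the full URL or just the path is not stated here, so the example assumes the pattern must match the whole input string. `wildcard_to_regex` and `matches` are hypothetical names.

```python
import re


def wildcard_to_regex(pattern):
    """Translate `*` (anything except `/`) and `**` (anything) into a regex."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")       # `**` may cross path separators
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")    # `*` stops at the next `/`
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def matches(pattern, text):
    """Return True when the whole of `text` matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(text) is not None
```

Under these assumptions `/guides/*` matches `/guides/intro` but not `/guides/a/b`, while `/guides/**` matches both.
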