Commit 5e96df2

Merge pull request #11 from brightdata/feat/scraper-self-healing
feat(scraper): add self-healing `scraper heal` command
2 parents a671610 + e502789 commit 5e96df2

4 files changed

Lines changed: 1369 additions & 4 deletions

README.md

Lines changed: 103 additions & 0 deletions
@@ -27,6 +27,8 @@
| `brightdata discover` | AI-powered web discovery - find and rank results by intent with optional full-page content |
| `brightdata scraper create` | Build a Bright Data scraper from a natural-language description using AI |
| `brightdata scraper run` | Run a Bright Data scraper on a URL and return the data |
| `brightdata scraper heal` | Fix an existing scraper in place via AI self-healing (stops at an approval gate) |
| `brightdata scraper approve` | Approve (or reject) a self-healing fix that is awaiting approval |
| `brightdata pipelines` | Extract structured data from 40+ platforms (Amazon, LinkedIn, TikTok…) |
| `brightdata browser` | Control a real browser via Bright Data's Scraping Browser — navigate, snapshot, click, type, and more |
| `brightdata zones` | List and inspect your Bright Data proxy zones |
@@ -50,6 +52,8 @@
- [discover](#discover)
- [scraper create](#scraper-create)
- [scraper run](#scraper-run)
- [scraper heal](#scraper-heal)
- [scraper approve](#scraper-approve)
- [pipelines](#pipelines)
- [browser](#browser)
- [status](#status)
@@ -475,6 +479,105 @@ brightdata scraper run c_mp3tuab31lswoxvpws --input-file urls.json

---

### `scraper heal`

Fix an existing scraper **in place** when it ran but returned wrong, empty, or partial data. The `collector_id` stays the same — the scraper is improved, not replaced. This is the maintenance twin of `scraper create`: it triggers Bright Data's AI self-healing flow (`POST /dca/collectors/{id}/refactor_template`), then polls progress.

```bash
brightdata scraper heal <collector_id> "<prompt>" [options]
```

**You are the detector.** The CLI never decides on its own that a scraper is broken — you inspect the run output and decide. The `<prompt>` is required (max 1000 chars); name exactly what is wrong and what the correct output should be. Vague prompts produce vague heals.
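
Because the prompt is capped at 1000 characters, a wrapper script can check its length locally before spending a heal call. The following is a minimal sketch, not part of the CLI: `check_heal_prompt` is a hypothetical helper, and it assumes the limit counts characters the same way the shell's `${#var}` does.

```bash
# Hypothetical helper (not part of the brightdata CLI): reject an empty or
# over-long prompt before calling `scraper heal`.
# Assumption: the documented 1000-char limit counts characters the same way
# the shell's ${#var} expansion does.
check_heal_prompt() {
  local prompt=$1
  if [ -z "$prompt" ]; then
    echo "prompt is required" >&2
    return 1
  fi
  if [ "${#prompt}" -gt 1000 ]; then
    echo "prompt is ${#prompt} chars; the limit is 1000" >&2
    return 1
  fi
  return 0
}
```

One way to use it is to chain the check with the real command: `check_heal_prompt "$PROMPT" && brightdata scraper heal "$COLLECTOR_ID" "$PROMPT"`.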

| Flag | Description |
|---|---|
| `--url <url>` | Verify target woven into the success `next_step` hint (not sent to the heal call) |
| `--auto-approve` | When the heal hits the approval gate, approve it automatically and poll through to `done` (default: stop and let you review) |
| `--timeout <seconds>` | Polling timeout (default: `600`) |
| `--max-retries <n>` | Max retries on the AI-Flow concurrent-job-cap `429` (default: `4`) |
| `--no-retry` | Fail immediately on `429` instead of waiting through the cap |
| `-o, --output <path>` | Write output to file |
| `--json` / `--pretty` | JSON output (raw / indented) |
| `--legacy-output` | Emit the bare AI-progress payload instead of the envelope |
| `--timing` | Show request timing |
| `-k, --api-key <key>` | Override API key |

**The approval gate**

Self-healing is human-in-the-loop. Without `--auto-approve`, `heal` runs the fix and then **stops at an approval gate** rather than committing it, exiting `0` with a `status: "awaiting_approval"` envelope:

```json
{
  "collector_id": "c_mp3tuab31lswoxvpws",
  "status": "awaiting_approval",
  "prompt": "Price returns null — the selector moved …",
  "preview_result": [ { "title": "", "price": { "value": 51.77, "currency": "GBP" } } ],
  "diff_summary": "proposed template has 1 step(s) — review at view_url",
  "view_url": "https://brightdata.com/cp/scrapers/c_mp3tuab31lswoxvpws",
  "next_step": "bdata scraper approve c_mp3tuab31lswoxvpws --url https://example.com/product/1"
}
```

`preview_result` shows the sample rows the fixed scraper would produce — review them, then run the `next_step` (`scraper approve`) to commit. `awaiting_approval` is **not** a failure; it means the fix is ready and waiting for your decision. A failed heal (`429` cap exhausted, timeout, terminal `failed`) is **non-destructive** — the existing scraper is unchanged and still works as before.
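
A script that consumes the envelope can branch on `status` instead of on the exit code, since `awaiting_approval` also exits `0`. The sketch below is one possible mapping, not CLI behavior: `heal_next_action` and its action names are hypothetical, and the status values are the ones this README mentions (`failed` is assumed to appear as a `status` value).

```bash
# Hypothetical helper: map an envelope "status" to what a script should do
# next. Status names come from this README; anything else is reported as
# unknown with a non-zero return code.
heal_next_action() {
  case "$1" in
    "awaiting_approval") echo "review" ;;  # fix is ready; inspect preview_result
    "done")              echo "verify" ;;  # fix committed; re-run the scraper
    "rejected")          echo "reheal" ;;  # fix discarded; heal with a sharper prompt
    "failed")            echo "retry"  ;;  # non-destructive; old scraper unchanged
    *)                   echo "unknown"; return 1 ;;
  esac
}
```

If `jq` is available (an assumption; the CLI does not require it), `heal_next_action "$(jq -r '.status' heal.json)"` reads the status from a file written with `-o heal.json`.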

**Examples**

```bash
# Heal a scraper, stop at the gate, and get a ready-to-run verify command back
brightdata scraper heal c_mp3tuab31lswoxvpws \
  "The price field returns null — the selector moved into a span with \
data-testid. Capture price and currency again." \
  --url https://example.com/product/1 --pretty -o heal.json

# Fully autonomous: heal and approve in one command (no manual review)
brightdata scraper heal c_mp3tuab31lswoxvpws \
  "Reviews stopped extracting after the page redesign" --auto-approve
```

---

### `scraper approve`

Commit (or reject) a self-healing fix that `scraper heal` left **awaiting approval**. Calls `POST /dca/collectors/{id}/resume_automation_job`, then polls the refactor job to `done`.

```bash
brightdata scraper approve <collector_id> [options]
```

| Flag | Description |
|---|---|
| `--reject` | Reject the proposed fix instead of approving it |
| `--url <url>` | Verify target woven into the success `next_step` hint |
| `--timeout <seconds>` | Polling timeout (default: `600`) |
| `-o, --output <path>` | Write output to file |
| `--json` / `--pretty` | JSON output (raw / indented) |
| `--legacy-output` | Emit the bare AI-progress payload instead of the envelope |
| `--timing` | Show request timing |
| `-k, --api-key <key>` | Override API key |

On success the job advances to `status: "done"` and the envelope hands back a `next_step` = `scraper run <id> <url>` so you can verify the committed fix. `--reject` discards the proposed fix (`status: "rejected"`) — re-run `scraper heal` with a sharper prompt to try again. If a heal needs multiple approvals, `approve` may stop at `awaiting_approval` again — just run it once more.
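
When a heal needs more than one approval, the "run it once more" step can be scripted as a bounded loop. This is a rough sketch under stated assumptions rather than documented behavior: `approve_until_settled` is a hypothetical wrapper, `jq` is assumed to be installed, `--json -o <path>` is assumed to write an envelope with a top-level `status` field, and `approve` is assumed to exit `0` when it stops at the gate again (as `heal` does).

```bash
# Hypothetical wrapper: re-run `scraper approve` while the envelope still
# reports "awaiting_approval", for at most max_rounds attempts.
# Assumptions: jq is installed; --json -o writes an envelope with a top-level
# "status"; approve exits 0 when it stops at the gate again.
approve_until_settled() {
  local collector_id=$1
  local max_rounds=${2:-3}
  local out=${3:-approve.json}
  local round=1 status
  while [ "$round" -le "$max_rounds" ]; do
    brightdata scraper approve "$collector_id" --json -o "$out" || return 1
    status=$(jq -r '.status' "$out")
    case "$status" in
      "awaiting_approval") round=$((round + 1)) ;;   # another gate; approve again
      "done")              echo "$status"; return 0 ;;
      *)                   echo "$status"; return 1 ;; # e.g. rejected or failed
    esac
  done
  echo "awaiting_approval"   # still gated after max_rounds attempts
  return 2
}
```

Calling `approve_until_settled c_mp3tuab31lswoxvpws 3` prints the final status. The cap keeps an unattended script from approving indefinitely, and each round skips the manual review of `preview_result`, so it fits the same situations as `--auto-approve`.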

**The self-healing loop**

```bash
# 1. Run and inspect the data
brightdata scraper run c_mp3tuab31lswoxvpws https://example.com/product/1 --json -o out.json

# 2. If the data is wrong, heal (stops at the approval gate)
brightdata scraper heal c_mp3tuab31lswoxvpws \
  "Price returns null — the selector moved; capture price + currency." \
  --url https://example.com/product/1 --pretty -o heal.json

# 3. Review heal.json's preview_result, then approve
brightdata scraper approve c_mp3tuab31lswoxvpws \
  --url https://example.com/product/1 --pretty -o approve.json

# 4. Verify the committed fix
brightdata scraper run c_mp3tuab31lswoxvpws https://example.com/product/1 --pretty
```

---

### `pipelines`

Extract structured data from 40+ platforms using Bright Data's Web Scraper API. Triggers an async collection job, polls until ready, and returns results.
