|
1 | | -# 🧙 ScrapeWizard – MVP1 |
| 1 | +# 🧙 ScrapeWizard |
2 | 2 |
|
3 | | -**AI-Assisted Scraper Builder for Developers** |
| 3 | +**Agentic Web Scraper Builder & Self-Healing Automation Studio** |
4 | 4 |
|
5 | | -ScrapeWizard is an **AI-powered scraper generator** designed to help developers build reliable Playwright scrapers in minutes. It follows a clear principle: **“AI helps you BUILD scrapers – it does NOT run them.”** |
| 5 | +ScrapeWizard is a professional, developer-first platform for building, running, and maintaining reliable web scrapers. By combining high-fidelity browser recording with an offline, multi-tier self-healing engine, ScrapeWizard ensures your scrapers survive target site markup changes and structural mutations without manual code updates. |
6 | 6 |
|
7 | | -## 🟢 What ScrapeWizard Is Today (v1.2.0) |
| 7 | +> [!IMPORTANT] |
| 8 | +> **Key Philosophy:** AI helps you *build* and *heal* scrapers—it does *not* execute arbitrary LLM calls during hot paths, ensuring high performance, zero runtime LLM costs, and 100% deterministic scraper execution. |
8 | 9 |
|
9 | | -ScrapeWizard MVP1 is a professional developer tool for rapidly generating maintainable, standalone scrapers. |
| 10 | +--- |
| 11 | + |
| 12 | +## 🚀 Key Features |
10 | 13 |
|
11 | | -### Core Capabilities: |
12 | | -- ✅ **Interactive CLI Builder**: Guided process from URL to code. |
13 | | -- ✅ **AI Analysis**: Automatic structure, pattern, and field detection. |
14 | | -- ✅ **Multiple LLM Support**: Choose between OpenAI, Anthropic, OpenRouter, or Local (Ollama) providers. |
15 | | -- ✅ **AI Cost Transparency**: Real-time token tracking and cost estimation for every build. |
16 | | -- ✅ **Smart Assessment**: Pre-flight checks for anti-bot measures. |
17 | | -- **Unified Decision Gates (v1.1)**: Critical checkpoints where the user owns the "WHAT" while the AI handles the "HOW": |
18 | | - - **Gate 1: Output Format**: Choose CSV, Excel, or JSON upfront. |
19 | | - - **Gate 2: Pagination Scope**: Define scrape depth (Single page, 5-page limit, or all pages). |
20 | | - - **Gate 3: Data Quality Firewall**: Monitors extraction results; if missing data is detected, it triggers a recovery loop. |
21 | | -- **Interactive Recovery**: Never get stuck. If a run fails, choose: |
22 | | - - 🩺 **Auto-Repair**: AI fixes specific selectors for missing fields. |
23 | | - - 🔄 **Full Retry**: Re-generate the entire strategy from scratch. |
24 | | -- **Scraper Runtime Contract (SRC)**: AI implementation of specific classes only. Infrastructure (Browser, Pagination loop, I/O) is owned by the ScrapeWizard SDK, eliminating hallucinations. |
25 | | -- **Dynamic Waiting**: Automatic handling of hydration delays via `smart_wait()`. |
26 | | -- **Hardening & Portability**: Content-based hashing for deduplication and detailed debug logs indicating exactly why any items were skipped. |
| 14 | +* **⚡ ScrapeWizard Studio Dashboard:** A premium, local-first web interface built with React, Tailwind, React Query, and Zustand. Monitor scrape jobs, view run histories, inspect visual diff crops, and review healed steps. |
| 15 | +* **🩺 Multi-Tier Offline Self-Healing:** When site markup breaks, our local engine attempts to heal the broken locator automatically using 5 deterministic similarity tiers (attributes, tag structure, geometry, and parent-child hierarchy) before resorting to AI repairs. |
| 16 | +* **📹 High-Fidelity Recorder:** Interactive recording page featuring full support for frames/iframes, multi-page flows, and automated password-masking (secrets are masked at capture time inside step files and logs). |
| 17 | +* **📊 Unified Decision Gates:** Control scraper behavior upfront through interactive steps: |
| 18 | + * *Gate 1: Output Format* — Export results directly to CSV, Excel (XLSX), or JSON. |
| 19 | + * *Gate 2: Pagination Scope* — Control traversal depth (Single page, page limits, or complete crawls). |
| 20 | + * *Gate 3: Data Quality Firewall* — Monitors extraction output and triggers local self-healing or LLM repair loops if fields become empty. |
| 21 | +* **📦 Zero-Dependency Execution:** Easily packaged as a standard Python wheel containing the bundled frontend. No Node.js runtime is required for end-users. |
| 22 | + |
| 23 | +--- |
27 | 24 |
|
28 | | -## Installation |
| 25 | +## 🛠️ Installation |
29 | 26 |
|
30 | 27 | ```bash |
31 | | -pip install -r requirements.txt |
| 28 | +# Install the ScrapeWizard package |
| 29 | +pip install scrapewizard |
| 30 | + |
| 31 | +# Install Playwright browser dependencies |
32 | 32 | playwright install chromium |
33 | 33 |
|
34 | | -# Linux Only: Install system dependencies |
| 34 | +# Linux/CI environments only: |
35 | 35 | playwright install-deps |
36 | | - |
37 | | -# Optional: only needed if you use the Anthropic provider |
38 | | -pip install anthropic |
39 | 36 | ``` |
40 | 37 |
|
41 | | -## Commands & Examples |
| 38 | +--- |
42 | 39 |
|
43 | | -### 1. `login` - Secure API Key Storage |
44 | | -Store your AI providers' API key safely in your system's keyring. No plain text storage. |
45 | | -```bash |
46 | | -scrapewizard login "sk-or-v1-xyz..." |
47 | | -``` |
| 40 | +## 💻 CLI Commands |
48 | 41 |
|
49 | | -### 2. `setup` - Configuration |
50 | | -Initial setup to configure your LLM provider and default model. |
| 42 | +### 1. `start` - Launch ScrapeWizard Studio |
| 43 | +Boots up the FastAPI backend, initializes the database, and launches the React frontend dashboard in your default browser. |
51 | 44 | ```bash |
52 | | -scrapewizard setup |
| 45 | +scrapewizard start --port 8000 |
53 | 46 | ``` |
54 | 47 |
|
55 | | -### 3. `build` - Create a Scraper |
56 | | -The main command to start a new scraping project. |
57 | | - |
58 | | -**Zero-Click Mode (Default - "Just Works"):** |
| 48 | +### 2. `login` - Secure Provider Keys |
| 49 | +Saves your LLM provider keys (OpenAI, Anthropic, OpenRouter, or Ollama) in your system's keyring, so they are never stored in plain text.
59 | 50 | ```bash |
60 | | -# Provide URL - ScrapeWizard guides you through simplified format and pagination gates |
61 | | -scrapewizard build --url "https://books.toscrape.com" |
| 51 | +scrapewizard login "sk-or-v1-xyz..." |
62 | 52 | ``` |
63 | 53 |
|
64 | | -**Ad-hoc AI Override:** |
| 54 | +### 3. `setup` - Configure Global Defaults |
| 55 | +Configures default LLM providers, active models, and workspace settings. |
65 | 56 | ```bash |
66 | | -# Specify provider and model for a single build session |
67 | | -scrapewizard build --url "https://books.toscrape.com" \ |
68 | | - --ai-provider anthropic \ |
69 | | - --ai-model claude-3-5-sonnet-20240620 |
| 57 | +scrapewizard setup |
70 | 58 | ``` |
71 | 59 |
|
72 | | -**Interactive Mode (Custom Control):** |
| 60 | +### 4. `build` - Generate a Scraper |
| 61 | +Starts a new scraping project from a target URL. |
73 | 62 | ```bash |
74 | | -# Ask me "One Smart Question" about fields or format |
75 | | -scrapewizard build --url "https://books.toscrape.com" --interactive |
76 | | -``` |
| 63 | +# Guided build using default settings |
| 64 | +scrapewizard build --url "https://books.toscrape.com" |
77 | 65 |
|
78 | | -**Expert Mode (Full Technical Output):** |
79 | | -```bash |
80 | | -# Shows debug logs, state transitions, LLM warnings, and repair loops |
| 66 | +# Expert Mode: Shows debug logs, database states, and raw model logs |
81 | 67 | scrapewizard build --url "https://books.toscrape.com" --expert |
| 68 | + |
| 69 | +# Interactive Mode: Ask smart clarification questions about formatting/fields |
| 70 | +scrapewizard build --url "https://books.toscrape.com" --interactive |
82 | 71 | ``` |
83 | 72 |
|
84 | | -### 4. `list` - View Projects |
85 | | -List all previously created scraper projects. |
| 73 | +### 5. `list` - View Local Projects |
| 74 | +Lists all active scraping projects, URLs, execution states, and last modified times. |
86 | 75 | ```bash |
87 | 76 | scrapewizard list |
88 | 77 | ``` |
89 | 78 |
|
90 | | -### 5. `resume` - Continue Work |
91 | | -Resume a project that was stopped or failed. |
| 79 | +### 6. `resume` - Continue Scraper Builder |
| 80 | +Resumes a guided build or scraper generation run that was interrupted.
92 | 81 | ```bash |
93 | | -scrapewizard resume "PROJECT_ID" |
| 82 | +scrapewizard resume "<PROJECT_ID>" |
94 | 83 | ``` |
95 | 84 |
|
96 | | -### 6. `doctor` - Health Check |
97 | | -Verify your environment, dependencies, and LLM connectivity. |
| 85 | +### 7. `doctor` - Environment Diagnostics |
| 86 | +Checks Python/OS versions, configuration files, Playwright installations, and validates LLM connection health. |
98 | 87 | ```bash |
99 | 88 | scrapewizard doctor |
100 | 89 | ``` |
101 | 90 |
|
102 | | -### 7. `clean` - Cleanup |
103 | | -Remove temporary files or old projects to save space. |
| 91 | +### 8. `clean` - Cleanup Temporary Workspace |
| 92 | +Purges cached test runs and deleted project files to free up disk space. |
104 | 93 | ```bash |
105 | 94 | scrapewizard clean |
106 | 95 | ``` |
107 | 96 |
|
108 | | -### 8. `version` - Version Info |
109 | | -Check the current version of ScrapeWizard. |
110 | | -```bash |
111 | | -scrapewizard version |
112 | | -``` |
113 | | - |
114 | 97 | --- |
115 | 98 |
|
116 | | -## ⚙️ Configuration |
| 99 | +## ⚙️ The Self-Healing Hierarchy (Tiers 0-6)
117 | 100 |
|
118 | | -### Global Config |
119 | | -Stored in `~/.scrapewizard/config.json`. Managed via the `setup` command. |
| 101 | +When a web element mutates (e.g. classes renamed, layout shifted, attributes altered), the ScrapeWizard engine steps through a deterministic self-healing hierarchy to re-identify the element offline:
120 | 102 |
|
121 | | -### Local Config Overrides |
122 | | -You can now override global settings (model, provider, etc.) on a per-project basis using a `.scrapewizardrc` file in your project root. |
| 103 | +1. **Tier 0 (Direct Match):** Evaluates the primary selector. |
| 104 | +2. **Tier 1 (Selector Ladder):** Tries fallback CSS selectors recorded during fingerprinting. |
| 105 | +3. **Tier 2 (Attribute & Text Score):** Computes text content and property matching similarity. |
| 106 | +4. **Tier 3 (Structural Matching):** Evaluates parent/sibling tag relationships. |
| 107 | +5. **Tier 4 (Geometry & Visuals):** Compares coordinates, dimensions, and visual bounds. |
| 108 | +6. **Tier 5 (Navigation Context):** Analyzes step sequence history to infer the correct element. |
| 109 | +7. **Tier 6 (LLM Recovery - Opt-in):** Triggers only if offline tiers fail to find a match above the confidence margin. |
123 | 110 |
|
124 | | -```json |
125 | | -{ |
126 | | - "model": "gpt-4-local-override", |
127 | | - "provider": "openai" |
128 | | -} |
129 | | -``` |
| 111 | +> [!TIP] |
| 112 | +> To prevent wrong-element matches, the self-healing system only accepts a top match whose score beats every other candidate by a margin of at least 0.10. Heals are persisted only if the full re-run passes.
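The tier cascade and the 0.10 margin rule described in this section can be illustrated with a minimal Python sketch. This is not ScrapeWizard's actual implementation: `Candidate`, `pick_with_margin`, `heal`, and the tier callables are hypothetical names introduced here, and accepting a lone candidate without a margin check is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

MARGIN = 0.10  # margin value quoted in the README tip


@dataclass(frozen=True)
class Candidate:
    """Hypothetical scored match for a broken locator."""
    element_id: str
    score: float


def pick_with_margin(candidates: Sequence[Candidate],
                     margin: float = MARGIN) -> Optional[Candidate]:
    """Return the top candidate only if it leads the runner-up by `margin`."""
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    if not ranked:
        return None
    if len(ranked) == 1:
        # Assumption: a single candidate has no rival, so it is accepted.
        return ranked[0]
    if ranked[0].score - ranked[1].score >= margin:
        return ranked[0]
    return None  # ambiguous: too close to the second-best candidate


# Each tier is a (name, scorer) pair; scorers are hypothetical callables.
Tier = Tuple[str, Callable[[], List[Candidate]]]


def heal(tiers: Sequence[Tier],
         margin: float = MARGIN) -> Optional[Tuple[str, Candidate]]:
    """Try tiers in order; stop at the first one with an unambiguous winner."""
    for name, score_candidates in tiers:
        winner = pick_with_margin(score_candidates(), margin)
        if winner is not None:
            return name, winner
    return None  # all offline tiers failed; an opt-in LLM tier could run here
```

In this sketch an ambiguous tier (for example scores of 0.82 and 0.80) is skipped and the next tier is tried, which mirrors the idea that a heal is rejected rather than guessed when two elements look equally plausible.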
130 | 113 |
|
131 | | -## 🏗️ Project Output |
| 114 | +--- |
132 | 115 |
|
133 | | -Projects are saved in `~/scrapewizard_projects/`. |
134 | | -Each project contains a self-contained `output/` folder: |
135 | | -- `generated_scraper.py`: The ScrapeWizard Scraper Plugin (subclasses `BaseScraper`). |
136 | | -- `storage_state.json`: Full session state (Cookies + LocalStorage) for manual bypass/login. |
137 | | -- `data.json` / `data.csv` / `data.xlsx`: Your scraped records (cleaned and filtered). |
138 | | -- `analysis_snapshot.json`: The raw DOM analysis used by the AI. |
139 | | -- `llm_logs/`: Raw AI responses for deep debugging and transparency. |
| 116 | +## 🏗️ Project Output Structure |
140 | 117 |
|
141 | | -## Golden Test Suite |
142 | | -To verify the system integrity, run the automated golden tests: |
143 | | -```bash |
144 | | -python tests/golden_sites/books.py |
145 | | -``` |
| 118 | +Each project is saved in `~/.scrapewizard/projects/<PROJECT_ID>/`, which contains:
| 119 | +* `generated_scraper.py` — The final executable scraper plugin subclassing `BaseScraper`. |
| 120 | +* `storage_state.json` — Cookies and local storage snapshot to bypass logins. |
| 121 | +* `data.json` / `data.csv` — Scraped structured datasets. |
| 122 | +* `analysis_snapshot.json` — Pre-flight DOM audit. |
| 123 | +* `llm_logs/` — Trace of raw AI prompts and responses for debug audit. |
146 | 124 |
|
147 | | -## 🔭 Project Direction |
| 125 | +--- |
148 | 126 |
|
149 | | -ScrapeWizard is evolving from a CLI scraper builder into a **local-first UI/UX test automation |
150 | | -platform** (record once → self-healing tests → admin portal), built on the same engine. |
151 | | -The CLI scraper documented above remains the current, working product. |
| 127 | +## 🧪 Golden Test Suite |
| 128 | +Verify local setup and self-healing rate by running: |
| 129 | +```bash |
| 130 | +python3 -m pytest tests/ -v --ignore=tests/golden_sites |
| 131 | +``` |
152 | 132 |
|
153 | | -- **[PLATFORM_PLAN.md](PLATFORM_PLAN.md)** — the full roadmap and architecture (source of truth) |
154 | | -- **[BUILD_GUIDE.md](BUILD_GUIDE.md)** — step-by-step how-to for building each stage |
155 | | -- **[FRONTEND_PLAN.md](FRONTEND_PLAN.md)** — detailed spec for the application (the GUI/portal) |
156 | | -- **[APP_BUILD_STEPS.md](APP_BUILD_STEPS.md)** — step-by-step build order: backend API + SQLite, then frontend slices |
157 | | -- **[MARKET_READY_PLAN.md](MARKET_READY_PLAN.md)** — final-mile plan: fix all audited issues, reach market standard, package & deploy |
| 133 | +--- |
158 | 134 |
|
159 | | -## License |
160 | | -MIT |
| 135 | +## 📄 License |
| 136 | +MIT License |