Commit f556a27

Mouseback authored and committed
feat: implement self-healing engine and scrapewizard studio UI
1 parent 06271e5 commit f556a27

35 files changed

Lines changed: 2289 additions & 1166 deletions

.github/workflows/ci.yml

Lines changed: 26 additions & 0 deletions
@@ -63,3 +63,29 @@ jobs:
           ))
           Orchestrator(d)
           print("fresh-install check passed")
+
+  frontend:
+    name: Frontend Lint & Build
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Set up Node.js
+        uses: actions/setup-node@v4
+        with:
+          node-version: '20'
+          cache: 'npm'
+          cache-dependency-path: studio/frontend/package-lock.json
+
+      - name: Install dependencies
+        working-directory: studio/frontend
+        run: npm ci
+
+      - name: Run lint
+        working-directory: studio/frontend
+        run: npm run lint
+
+      - name: Run build
+        working-directory: studio/frontend
+        run: npm run build
+

.github/workflows/release.yml

Lines changed: 15 additions & 0 deletions
@@ -21,6 +21,21 @@ jobs:
         with:
           python-version: '3.12'
 
+      - name: Set up Node.js
+        uses: actions/setup-node@v4
+        with:
+          node-version: '20'
+          cache: 'npm'
+          cache-dependency-path: studio/frontend/package-lock.json
+
+      - name: Install frontend dependencies
+        working-directory: studio/frontend
+        run: npm ci
+
+      - name: Build frontend
+        working-directory: studio/frontend
+        run: npm run build
+
       - name: Install build tools
         run: |
           python -m pip install --upgrade pip

README.md

Lines changed: 80 additions & 104 deletions
@@ -1,160 +1,136 @@
-# 🧙 ScrapeWizard – MVP1
+# 🧙 ScrapeWizard
 
-**AI-Assisted Scraper Builder for Developers**
+**Agentic Web Scraper Builder & Self-Healing Automation Studio**
 
-ScrapeWizard is an **AI-powered scraper generator** designed to help developers build reliable Playwright scrapers in minutes. It follows a clear principle: **“AI helps you BUILD scrapers – it does NOT run them.”**
+ScrapeWizard is a professional, developer-first platform for building, running, and maintaining reliable web scrapers. By combining high-fidelity browser recording with an offline, multi-tier self-healing engine, ScrapeWizard ensures your scrapers survive target site markup changes and structural mutations without manual code updates.
 
-## 🟢 What ScrapeWizard Is Today (v1.2.0)
+> [!IMPORTANT]
+> **Key Philosophy:** AI helps you *build* and *heal* scrapers—it does *not* execute arbitrary LLM calls during hot paths, ensuring high performance, zero runtime LLM costs, and 100% deterministic scraper execution.
 
-ScrapeWizard MVP1 is a professional developer tool for rapidly generating maintainable, standalone scrapers.
+---
+
+## 🚀 Key Features
 
-### Core Capabilities:
-- **Interactive CLI Builder**: Guided process from URL to code.
-- **AI Analysis**: Automatic structure, pattern, and field detection.
-- **Multiple LLM Support**: Choose between OpenAI, Anthropic, OpenRouter, or Local (Ollama) providers.
-- **AI Cost Transparency**: Real-time token tracking and cost estimation for every build.
-- **Smart Assessment**: Pre-flight checks for anti-bot measures.
-- **Unified Decision Gates (v1.1)**: Critical checkpoints where the user owns the "WHAT" while the AI handles the "HOW":
-  - **Gate 1: Output Format**: Choose CSV, Excel, or JSON upfront.
-  - **Gate 2: Pagination Scope**: Define scrape depth (Single page, 5-page limit, or all pages).
-  - **Gate 3: Data Quality Firewall**: Monitors extraction results; if missing data is detected, it triggers a recovery loop.
-- **Interactive Recovery**: Never get stuck. If a run fails, choose:
-  - 🩺 **Auto-Repair**: AI fixes specific selectors for missing fields.
-  - 🔄 **Full Retry**: Re-generate the entire strategy from scratch.
-- **Scraper Runtime Contract (SRC)**: AI implementation of specific classes only. Infrastructure (Browser, Pagination loop, I/O) is owned by the ScrapeWizard SDK, eliminating hallucinations.
-- **Dynamic Waiting**: Automatic handling of hydration delays via `smart_wait()`.
-- **Hardening & Portability**: Content-based hashing for deduplication and detailed debug logs indicating exactly why any items were skipped.
+* **⚡ ScrapeWizard Studio Dashboard:** A premium, local-first web interface built with React, Tailwind, React Query, and Zustand. Monitor scrape jobs, view run histories, inspect visual diff crops, and review healed steps.
+* **🩺 Multi-Tier Offline Self-Healing:** When site markup breaks, our local engine attempts to heal the broken locator automatically using 5 deterministic similarity tiers (attributes, tag structure, geometry, and parent-child hierarchy) before resorting to AI repairs.
+* **📹 High-Fidelity Recorder:** Interactive recording page featuring full support for frames/iframes, multi-page flows, and automated password-masking (secrets are masked at capture time inside step files and logs).
+* **📊 Unified Decision Gates:** Control scraper behavior upfront through interactive steps:
+  * *Gate 1: Output Format* — Export results directly to CSV, Excel (XLSX), or JSON.
+  * *Gate 2: Pagination Scope* — Control traversal depth (Single page, page limits, or complete crawls).
+  * *Gate 3: Data Quality Firewall* — Monitors extraction output and triggers local self-healing or LLM repair loops if fields become empty.
+* **📦 Zero-Dependency Execution:** Easily packaged as a standard Python wheel containing the bundled frontend. No Node.js runtime is required for end-users.
+
+---
 
-## Installation
+## 🛠️ Installation
 
 ```bash
-pip install -r requirements.txt
+# Install the ScrapeWizard package
+pip install scrapewizard
+
+# Install Playwright browser dependencies
 playwright install chromium
 
-# Linux Only: Install system dependencies
+# Linux/CI environments only:
 playwright install-deps
-
-# Optional: only needed if you use the Anthropic provider
-pip install anthropic
 ```
 
-## Commands & Examples
+---
 
-### 1. `login` - Secure API Key Storage
-Store your AI providers' API key safely in your system's keyring. No plain text storage.
-```bash
-scrapewizard login "sk-or-v1-xyz..."
-```
+## 💻 CLI Commands
 
-### 2. `setup` - Configuration
-Initial setup to configure your LLM provider and default model.
+### 1. `start` - Launch ScrapeWizard Studio
+Boots up the FastAPI backend, initializes the database, and launches the React frontend dashboard in your default browser.
 ```bash
-scrapewizard setup
+scrapewizard start --port 8000
 ```
 
-### 3. `build` - Create a Scraper
-The main command to start a new scraping project.
-
-**Zero-Click Mode (Default - "Just Works"):**
+### 2. `login` - Secure Provider Keys
+Securely saves your LLM provider keys (OpenAI, Anthropic, OpenRouter, or Ollama) using your system's secure keyring.
 ```bash
-# Provide URL - ScrapeWizard guides you through simplified format and pagination gates
-scrapewizard build --url "https://books.toscrape.com"
+scrapewizard login "sk-or-v1-xyz..."
 ```
 
-**Ad-hoc AI Override:**
+### 3. `setup` - Configure Global Defaults
+Configures default LLM providers, active models, and workspace settings.
 ```bash
-# Specify provider and model for a single build session
-scrapewizard build --url "https://books.toscrape.com" \
-  --ai-provider anthropic \
-  --ai-model claude-3-5-sonnet-20240620
+scrapewizard setup
 ```
 
-**Interactive Mode (Custom Control):**
+### 4. `build` - Generate a Scraper
+Starts a new scraping project from a target URL.
 ```bash
-# Ask me "One Smart Question" about fields or format
-scrapewizard build --url "https://books.toscrape.com" --interactive
-```
+# Guided build using default settings
+scrapewizard build --url "https://books.toscrape.com"
 
-**Expert Mode (Full Technical Output):**
-```bash
-# Shows debug logs, state transitions, LLM warnings, and repair loops
+# Expert Mode: Shows debug logs, database states, and raw model logs
 scrapewizard build --url "https://books.toscrape.com" --expert
+
+# Interactive Mode: Ask smart clarification questions about formatting/fields
+scrapewizard build --url "https://books.toscrape.com" --interactive
 ```
 
-### 4. `list` - View Projects
-List all previously created scraper projects.
+### 5. `list` - View Local Projects
+Lists all active scraping projects, URLs, execution states, and last modified times.
 ```bash
 scrapewizard list
 ```
 
-### 5. `resume` - Continue Work
-Resume a project that was stopped or failed.
+### 6. `resume` - Continue Scraper Builder
+Resumes a guide or scraper generation run that was interrupted.
 ```bash
-scrapewizard resume "PROJECT_ID"
+scrapewizard resume "<PROJECT_ID>"
 ```
 
-### 6. `doctor` - Health Check
-Verify your environment, dependencies, and LLM connectivity.
+### 7. `doctor` - Environment Diagnostics
+Checks Python/OS versions, configuration files, Playwright installations, and validates LLM connection health.
 ```bash
 scrapewizard doctor
 ```
 
-### 7. `clean` - Cleanup
-Remove temporary files or old projects to save space.
+### 8. `clean` - Cleanup Temporary Workspace
+Purges cached test runs and deleted project files to free up disk space.
 ```bash
 scrapewizard clean
 ```
 
-### 8. `version` - Version Info
-Check the current version of ScrapeWizard.
-```bash
-scrapewizard version
-```
-
 ---
 
-## ⚙️ Configuration
+## ⚙️ The Self-Healing Hierarchy (Tiers 0-5)
 
-### Global Config
-Stored in `~/.scrapewizard/config.json`. Managed via the `setup` command.
+When a web element mutates (e.g. classes renamed, layout shifted, attributes altered), the ScrapeWizard engine steps through a deterministic self-healing hierarchy to re-identify the element offline:
 
-### Local Config Overrides
-You can now override global settings (model, provider, etc.) on a per-project basis using a `.scrapewizardrc` file in your project root.
+1. **Tier 0 (Direct Match):** Evaluates the primary selector.
+2. **Tier 1 (Selector Ladder):** Tries fallback CSS selectors recorded during fingerprinting.
+3. **Tier 2 (Attribute & Text Score):** Computes text content and property matching similarity.
+4. **Tier 3 (Structural Matching):** Evaluates parent/sibling tag relationships.
+5. **Tier 4 (Geometry & Visuals):** Compares coordinates, dimensions, and visual bounds.
+6. **Tier 5 (Navigation Context):** Analyzes step sequence history to infer the correct element.
+7. **Tier 6 (LLM Recovery - Opt-in):** Triggers only if offline tiers fail to find a match above the confidence margin.
 
-```json
-{
-  "model": "gpt-4-local-override",
-  "provider": "openai"
-}
-```
+> [!TIP]
+> To prevent wrong-element matches, the self-healing system requires a strict scoring margin threshold (0.10) between the top match and secondary candidates. Heals are only persisted if the full re-run passes green.
 
-## 🏗️ Project Output
+---
 
-Projects are saved in `~/scrapewizard_projects/`.
-Each project contains a self-contained `output/` folder:
-- `generated_scraper.py`: The ScrapeWizard Scraper Plugin (subclasses `BaseScraper`).
-- `storage_state.json`: Full session state (Cookies + LocalStorage) for manual bypass/login.
-- `data.json` / `data.csv` / `data.xlsx`: Your scraped records (cleaned and filtered).
-- `analysis_snapshot.json`: The raw DOM analysis used by the AI.
-- `llm_logs/`: Raw AI responses for deep debugging and transparency.
+## 🏗️ Project Output Structure
 
-## Golden Test Suite
-To verify the system integrity, run the automated golden tests:
-```bash
-python tests/golden_sites/books.py
-```
+Every project created is saved in `~/.scrapewizard/projects/<PROJECT_ID>/` containing:
+* `generated_scraper.py` — The final executable scraper plugin subclassing `BaseScraper`.
+* `storage_state.json` — Cookies and local storage snapshot to bypass logins.
+* `data.json` / `data.csv` — Scraped structured datasets.
+* `analysis_snapshot.json` — Pre-flight DOM audit.
+* `llm_logs/` — Trace of raw AI prompts and responses for debug audit.
 
-## 🔭 Project Direction
+---
 
-ScrapeWizard is evolving from a CLI scraper builder into a **local-first UI/UX test automation
-platform** (record once → self-healing tests → admin portal), built on the same engine.
-The CLI scraper documented above remains the current, working product.
+## 🧪 Golden Test Suite
+Verify local setup and self-healing rate by running:
+```bash
+python3 -m pytest tests/ -v --ignore=tests/golden_sites
+```
 
-- **[PLATFORM_PLAN.md](PLATFORM_PLAN.md)** — the full roadmap and architecture (source of truth)
-- **[BUILD_GUIDE.md](BUILD_GUIDE.md)** — step-by-step how-to for building each stage
-- **[FRONTEND_PLAN.md](FRONTEND_PLAN.md)** — detailed spec for the application (the GUI/portal)
-- **[APP_BUILD_STEPS.md](APP_BUILD_STEPS.md)** — step-by-step build order: backend API + SQLite, then frontend slices
-- **[MARKET_READY_PLAN.md](MARKET_READY_PLAN.md)** — final-mile plan: fix all audited issues, reach market standard, package & deploy
+---
 
-## License
-MIT
+## 📄 License
+MIT License
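
Reviewer note on the README's self-healing section: the tier cascade and the 0.10 scoring margin can be pictured with a small sketch. Only the margin value and the "try tiers in order" idea come from the README text in this diff; the `Candidate` type, `pick_candidate`, and `heal` are hypothetical names invented for illustration, and the engine's real scoring code is not shown in this commit view.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

MARGIN = 0.10  # margin quoted in the README tip; everything else here is assumed


@dataclass(frozen=True)
class Candidate:
    selector: str  # hypothetical: a locator proposed by one tier
    score: float   # hypothetical: similarity in [0, 1]


def pick_candidate(
    candidates: Sequence[Candidate], margin: float = MARGIN
) -> Optional[Candidate]:
    """Return the top candidate only if it beats the runner-up by `margin`."""
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    if not ranked:
        return None
    if len(ranked) == 1:
        return ranked[0]
    best, runner_up = ranked[0], ranked[1]
    # An ambiguous result (two near-equal matches) is treated as no match.
    return best if best.score - runner_up.score >= margin else None


def heal(
    tiers: Sequence[Callable[[], Sequence[Candidate]]], margin: float = MARGIN
) -> Optional[Tuple[int, Candidate]]:
    """Walk the tiers in order; stop at the first one with a clear winner."""
    for tier_number, find_candidates in enumerate(tiers):
        winner = pick_candidate(find_candidates(), margin)
        if winner is not None:
            return tier_number, winner
    return None  # the caller may then fall back to the opt-in LLM tier
```

Rejecting ambiguous results rather than picking the top score is what keeps a near-tie from silently binding the step to the wrong element; per the README, a heal would additionally be persisted only after the full re-run passes.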

pyproject.toml

Lines changed: 22 additions & 2 deletions
@@ -31,12 +31,32 @@ dependencies = [
     "openpyxl>=3.1.0",
     "python-json-logger>=2.0.0",
     "yaspin>=2.0.0",
-    "sqlmodel"
+    "sqlmodel",
+    "fastapi",
+    "uvicorn",
+    "python-dotenv",
+    "aiohttp",
+    "pillow"
 ]
 
 [project.scripts]
 scrapewizard = "scrapewizard.cli.main:app"
 
 [tool.setuptools.packages.find]
 where = ["."]
-include = ["scrapewizard*", "scrapewizard_runtime*"]
+include = ["scrapewizard*", "scrapewizard_runtime*", "studio", "studio.backend", "studio.bridge", "studio.shared"]
+
+[tool.setuptools.package-data]
+studio = [
+    "frontend/dist/**/*",
+    "frontend/dist/*",
+    "shared/*.json",
+    "recordings/*.jsonl",
+    "bridge/*.js"
+]
+
+[tool.setuptools.exclude-package-data]
+studio = [
+    "frontend/node_modules/**/*",
+    "frontend/node_modules/*"
+]
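
Reviewer note on the packaging change: one way to check that the `package-data` and `exclude-package-data` globs behave as intended is to inspect the built wheel, which is an ordinary zip archive. The sketch below is an assumption-laden illustration, not part of the commit: `audit_wheel_names` and `audit_wheel` are hypothetical helpers, and the expected `studio/frontend/dist/` prefix is inferred from the patterns in this hunk.

```python
import zipfile
from typing import Iterable, List


def audit_wheel_names(names: Iterable[str]) -> List[str]:
    """Return a list of problems found in a wheel's file listing."""
    names = list(names)
    problems = []
    # The package-data globs should pull the built frontend into the wheel.
    if not any(n.startswith("studio/frontend/dist/") for n in names):
        problems.append("no bundled frontend under studio/frontend/dist/")
    # The exclude-package-data globs should keep node_modules out.
    if any("/node_modules/" in n for n in names):
        problems.append("node_modules leaked into the wheel")
    return problems


def audit_wheel(path: str) -> List[str]:
    """Open a built .whl file and audit its contents."""
    with zipfile.ZipFile(path) as wheel:
        return audit_wheel_names(wheel.namelist())
```

Running `audit_wheel` on the wheel produced by the release workflow (after the `npm run build` step) and failing the job on a non-empty result would catch a release that ships without the frontend.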

requirements.txt

Lines changed: 8 additions & 0 deletions
@@ -14,3 +14,11 @@ pandas>=2.0.0
 openpyxl>=3.1.0
 python-json-logger>=2.0.0
 yaspin>=2.0.0
+pytest-asyncio>=0.21.0
+pytest-mock>=3.12.0
+sqlmodel
+fastapi
+uvicorn
+python-dotenv
+aiohttp
+pillow

scrapewizard/engine/fingerprint.py

Lines changed: 5 additions & 2 deletions
@@ -212,9 +212,12 @@ async def capture_from_page(page, element_handle, screenshot_path: Optional[str]
         log(f"Element screenshot failed: {e}", level="warning")
 
     # Navigation context
+    frame_url = page.url
+    page_obj = getattr(page, "page", page)
+    page_title = await page_obj.title() if page_obj else ""
     navigation_data = {
-        "url": page.url,
-        "title": await page.title()
+        "url": frame_url,
+        "title": page_title
     }
 
     data = {
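
Reviewer note on the `getattr(page, "page", page)` change: in Playwright, a `Frame` exposes a `page` property pointing at its owning page, while a `Page` has no such attribute, so the fallback resolves to the page itself. The change presumably lets `capture_from_page` accept either object, recording the frame's own URL but the top-level page title. The sketch below reproduces that logic with stand-in classes; `FakePage`, `FakeFrame`, and `navigation_context` are hypothetical and exist only so the example runs without a browser.

```python
import asyncio


class FakePage:
    """Stand-in for a Playwright Page: has url and title(), no .page."""
    url = "https://example.com/checkout"

    async def title(self) -> str:
        return "Checkout"


class FakeFrame:
    """Stand-in for a Playwright Frame: has its own url and a .page owner."""
    url = "https://pay.example.com/widget"

    def __init__(self, page: FakePage) -> None:
        self.page = page


async def navigation_context(page_or_frame) -> dict:
    # The URL comes from whatever object was passed in (frame or page).
    frame_url = page_or_frame.url
    # Frames resolve to their owning page; pages resolve to themselves.
    page_obj = getattr(page_or_frame, "page", page_or_frame)
    page_title = await page_obj.title() if page_obj else ""
    return {"url": frame_url, "title": page_title}


top_level = asyncio.run(navigation_context(FakePage()))
in_iframe = asyncio.run(navigation_context(FakeFrame(FakePage())))
```

With these stand-ins, the iframe case keeps the widget URL while still reporting the parent page's title, which matches what the patched lines store in `navigation_data`.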
