🧙 ScrapeWizard

The Local-First, Self-Healing Web Scraper Builder & UI/UX Test Automation Studio

ScrapeWizard is a professional, developer-first toolkit for building, executing, and maintaining reliable web automation workflows. By combining high-fidelity browser recording with an offline, multi-tier self-healing engine, ScrapeWizard ensures your scrapers and test suites survive target site markup changes, class renames, and structural mutations without manual script updates.

Important

Key Philosophy: AI is an optional enhancer to help you name steps and recover from catastrophic layout shifts. It is never on the runtime hot-path. If target pages haven't mutated, runtime AI cost is $0.00, ensuring high performance, zero runtime LLM costs, and 100% deterministic scraper/test execution.

🚀 Two Products, One Unified Engine

Built atop a shared core that tracks deep element fingerprints (tag names, semantic attributes, structural relationships, geometry, and navigation history), ScrapeWizard supports two major developer use-cases:

📦 Product A: Scraper Studio: Build high-performance data pipelines that export target pages to CSV, Excel (XLSX), or JSON with zero-click configuration.
🧪 Product B: UI/UX Test Automation: Record workflows once to generate standard Playwright + pytest suites. Run them headless in CI with automatic checks for accessibility (a11y), visual regressions (visual diffs), console errors, and network failures.

⚡ Key Features

🖥️ ScrapeWizard Studio Dashboard: A premium, local-first web dashboard built with FastAPI and React. Monitor execution queues, visualize run histories step-by-step, review accessibility violations, inspect visual diff crops, and approve or reject healed locators.
🩺 Multi-Tier Offline Self-Healing (Tiers 0-5): When page markup changes, our local engine attempts to locate the element automatically using 5 deterministic similarity tiers (attributes, tag structure, geometry, and parent-child hierarchy) with zero LLM/API calls.
📹 High-Fidelity Flow Recorder: Launches an interactive headed browser context to capture user interactions (clicks, text input, navigation, scroll) along with element fingerprints. Featuring full support for multi-page flows and automatic masking of password inputs.
🔬 Isolated Sandbox Runner: Executes flows in clean Playwright contexts, collecting visual screen diffs, console warnings, and network error signals.
♿ Automated Accessibility (a11y) Audits: Injects axe-core dynamically during runtime sandbox executions to find markup, color contrast, and ARIA violations per step.
📦 Zero Lock-in Pytest Export: Export flows directly to standalone Python scripts. The generated files are completely independent of the platform and can run in any standard CI environment.
🔑 Keyring Security: Securely stores LLM provider API keys (OpenAI, Anthropic, OpenRouter, and Ollama) using the system's secure keyring.

🛠️ Installation & Setup

# 1. Install ScrapeWizard and its dependencies
pip install scrapewizard

# 2. Install Playwright browser engines
playwright install chromium

# Note: On Linux/CI systems, you may also need:
playwright install-deps

🚦 Getting Started in 60 Seconds

1. Record a Workflow

Launch a headed browser to record user interactions on a page and capture detailed element fingerprints:

scrapewizard record --url "https://books.toscrape.com" --output login_flow.json

2. Run Quality Checks & Sandbox

Execute the recorded workflow headless to check console logs, network errors, accessibility violations, and visual regressions:

scrapewizard test login_flow.json

3. Build a Scraper Project

Generate a programmatic scraper script from a target URL with guided options:

scrapewizard build --url "https://books.toscrape.com"

4. Launch the Web Studio

Open the local FastAPI web dashboard to manage your tests, runs, and configurations:

scrapewizard start --port 8000

💻 CLI Commands Reference

1. `start` - Launch Web Studio Dashboard

Boots up the FastAPI backend and opens the React web dashboard in your default browser.

scrapewizard start [--port PORT] [--no-open]

2. `record` - Record User Interactions

Opens a headed browser to capture user events and element fingerprints, saving them to a JSON file.

scrapewizard record --url URL [--output OUTPUT_JSON] [--screenshots SCREENSHOT_DIR]

3. `test` - Run Sandbox Quality Checks

Runs a headless sandbox execution of the recorded flow, collecting quality signals (console, network, a11y, visual diff).

scrapewizard test FLOW_JSON [--artifacts ARTIFACT_DIR] [--headed]

4. `build` - Generate Scraper

Builds a new scraping project from a URL.

# Standard guided scraper builder
scrapewizard build --url URL

# Expert Mode: Shows debug logs, database states, and raw model logs
scrapewizard build --url URL --expert

# Interactive Mode: Prompts smart questions about target fields and formats
scrapewizard build --url URL --interactive

5. `setup` - Configure Global Settings

Interactively configures default LLM providers, active models, active proxies, and settings.

scrapewizard setup [--provider PROVIDER] [--api-key KEY] [--model MODEL] [--use-proxy]

6. `login` - Secure Provider Keys

Saves your LLM provider API keys securely in the system keyring.

scrapewizard login "sk-..."

7. `list` - View Local Projects

Lists all active scraper projects, target URLs, states, and modification times.

scrapewizard list

8. `resume` - Continue Scraper Builder

Resumes an interrupted scraper construction or guided tour session.

scrapewizard resume PROJECT_ID

9. `doctor` - Environment Diagnostics

Checks Python version, configuration files, Playwright installation, projects directory, and LLM connection health.

scrapewizard doctor

10. `clean` - Cleanup Temporary Workspace

Purges cached test runs, build logs, and deleted project files to free up disk space.

scrapewizard clean [--force]

11. `version` - Show Version

Prints the installed version of ScrapeWizard.

scrapewizard version

⚙️ The Self-Healing Hierarchy (Tiers 0-6)

When a web element mutates (e.g. classes renamed, layout shifted, attributes altered), the ScrapeWizard engine steps through a deterministic self-healing hierarchy to re-identify the element offline:

Tier 0 (Direct Match): Evaluates the primary selector.
Tier 1 (Selector Ladder): Tries fallback CSS selectors recorded during fingerprinting.
Tier 2 (Attribute & Text Score): Computes similarity score of attribute overlap and normalized inner text.
Tier 3 (Structural Matching): Evaluates parent/sibling tag relationships and sibling offsets.
Tier 4 (Geometry & Visuals): Compares relative viewport coordinates (x/y percentages) and dimensions.
Tier 5 (History & Navigation): Checks past successful element resolutions from historical runs.
Tier 6 (LLM Recovery - Opt-in): Triggers only if all offline tiers fail. Sends a compact DOM snippet to the LLM to locate the element, verifying the proposed selector by re-running the step.

Tip

To prevent wrong-element matches (false positives), the self-healing system requires a strict scoring margin threshold between the top match and secondary candidates. Heals are only persisted if the full re-run completes successfully.

🏗️ Workspace Directory Structure

All global configurations and local scraping/testing projects are stored locally on your machine:

Global Configuration: Saved in ~/.scrapewizard/
- config.json — Active LLM provider, default model, and settings.
- proxy.json — Configured proxies.
Scraper Projects Root: Saved in ~/scrapewizard_projects/
- Contains individual <PROJECT_ID>/ directories with:
  - session.json — Project execution state and metadata.
  - generated_scraper.py — The final Python scraper script.
  - llm_logs/ — Prompts and raw completion text for auditing.
  - output/ — Extracted datasets (JSON, CSV, XLSX).
Test Baselines & Runs:
- ~/.scrapewizard/baselines/ — Baseline screenshots for visual regression tests.
- Run artifacts (screenshots, visual diffs, and test report logs) are saved in the configured output directories.

🧪 Running Unit Tests

Verify the local installation and self-healing efficacy by executing:

python3 -m pytest tests/ -v --ignore=tests/golden_sites

📄 License

MIT License

Name		Name	Last commit message	Last commit date
Latest commit History 26 Commits
.github/workflows		.github/workflows
scrapewizard		scrapewizard
scrapewizard_runtime		scrapewizard_runtime
studio		studio
tests		tests
.gitignore		.gitignore
ADMIN_DASHBOARD_PLAN.md		ADMIN_DASHBOARD_PLAN.md
APP_BUILD_STEPS.md		APP_BUILD_STEPS.md
BUILD_GUIDE.md		BUILD_GUIDE.md
FRONTEND_PLAN.md		FRONTEND_PLAN.md
IMPLEMENTATION_PLAN.md		IMPLEMENTATION_PLAN.md
LICENSE		LICENSE
MARKET_READY_PLAN.md		MARKET_READY_PLAN.md
PLATFORM_PLAN.md		PLATFORM_PLAN.md
README.md		README.md
ROADMAP.md		ROADMAP.md
learn.md		learn.md
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🧙 ScrapeWizard

🚀 Two Products, One Unified Engine

⚡ Key Features

🛠️ Installation & Setup

🚦 Getting Started in 60 Seconds

1. Record a Workflow

2. Run Quality Checks & Sandbox

3. Build a Scraper Project

4. Launch the Web Studio

💻 CLI Commands Reference

1. `start` - Launch Web Studio Dashboard

2. `record` - Record User Interactions

3. `test` - Run Sandbox Quality Checks

4. `build` - Generate Scraper

5. `setup` - Configure Global Settings

6. `login` - Secure Provider Keys

7. `list` - View Local Projects

8. `resume` - Continue Scraper Builder

9. `doctor` - Environment Diagnostics

10. `clean` - Cleanup Temporary Workspace

11. `version` - Show Version

⚙️ The Self-Healing Hierarchy (Tiers 0-6)

🏗️ Workspace Directory Structure

🧪 Running Unit Tests

📄 License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

🧙 ScrapeWizard

🚀 Two Products, One Unified Engine

⚡ Key Features

🛠️ Installation & Setup

🚦 Getting Started in 60 Seconds

1. Record a Workflow

2. Run Quality Checks & Sandbox

3. Build a Scraper Project

4. Launch the Web Studio

💻 CLI Commands Reference

1. start - Launch Web Studio Dashboard

2. record - Record User Interactions

3. test - Run Sandbox Quality Checks

4. build - Generate Scraper

5. setup - Configure Global Settings

6. login - Secure Provider Keys

7. list - View Local Projects

8. resume - Continue Scraper Builder

9. doctor - Environment Diagnostics

10. clean - Cleanup Temporary Workspace

11. version - Show Version

⚙️ The Self-Healing Hierarchy (Tiers 0-6)

🏗️ Workspace Directory Structure

🧪 Running Unit Tests

📄 License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

1. `start` - Launch Web Studio Dashboard

2. `record` - Record User Interactions

3. `test` - Run Sandbox Quality Checks

4. `build` - Generate Scraper

5. `setup` - Configure Global Settings

6. `login` - Secure Provider Keys

7. `list` - View Local Projects

8. `resume` - Continue Scraper Builder

9. `doctor` - Environment Diagnostics

10. `clean` - Cleanup Temporary Workspace

11. `version` - Show Version

Packages