The Local-First, Self-Healing Web Scraper Builder & UI/UX Test Automation Studio
ScrapeWizard is a professional, developer-first toolkit for building, executing, and maintaining reliable web automation workflows. By combining high-fidelity browser recording with an offline, multi-tier self-healing engine, ScrapeWizard ensures your scrapers and test suites survive target site markup changes, class renames, and structural mutations without manual script updates.
Important
Key Philosophy: AI is an optional enhancer that helps you name steps and recover from catastrophic layout shifts. It is never on the runtime hot path. When target pages have not changed, no LLM calls are made at runtime, so the AI cost is $0.00 and scraper/test execution stays fast and deterministic.
Built atop a shared core that tracks deep element fingerprints (tag names, semantic attributes, structural relationships, geometry, and navigation history), ScrapeWizard supports two major developer use-cases:
- Product A: Scraper Studio: Build high-performance data pipelines that export target pages to CSV, Excel (XLSX), or JSON with zero-click configuration.
- Product B: UI/UX Test Automation: Record workflows once to generate standard Playwright + pytest suites. Run them headless in CI with automatic checks for accessibility (a11y), visual regressions (visual diffs), console errors, and network failures.
- ScrapeWizard Studio Dashboard: A premium, local-first web dashboard built with FastAPI and React. Monitor execution queues, visualize run histories step-by-step, review accessibility violations, inspect visual diff crops, and approve or reject healed locators.
- Multi-Tier Offline Self-Healing (Tiers 0-5): When page markup changes, the local engine tries to re-locate the element automatically by stepping through six deterministic tiers (direct match, selector fallbacks, attribute and text similarity, tag structure, geometry, and resolution history) with zero LLM/API calls.
- High-Fidelity Flow Recorder: Launches an interactive headed browser context to capture user interactions (clicks, text input, navigation, scroll) along with element fingerprints. It supports multi-page flows and automatically masks password inputs.
- Isolated Sandbox Runner: Executes flows in clean Playwright contexts, collecting visual screen diffs, console warnings, and network error signals.
- Automated Accessibility (a11y) Audits: Injects axe-core dynamically during runtime sandbox executions to find markup, color contrast, and ARIA violations per step.
- Zero Lock-in Pytest Export: Export flows directly to standalone Python scripts. The generated files are completely independent of the platform and can run in any standard CI environment.
- Keyring Security: Securely stores LLM provider API keys (OpenAI, Anthropic, OpenRouter, and Ollama) using the system's secure keyring.
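The fingerprint fields described above (tag names, semantic attributes, structural relationships, geometry) and the recorder's password masking can be pictured with a small data model. This is only a sketch: the class names `ElementFingerprint` and `RecordedStep`, every field name, and the `***` mask value are assumptions made for illustration, not ScrapeWizard's actual schema.

```python
from dataclasses import dataclass, field, asdict
from typing import Optional

MASK = "***"  # hypothetical placeholder stored instead of a typed password


@dataclass
class ElementFingerprint:
    """Illustrative fingerprint; field names are assumptions."""
    tag: str
    attributes: dict = field(default_factory=dict)  # e.g. id, name, type, aria-*
    text: str = ""
    parent_tag: Optional[str] = None   # structural relationship
    sibling_index: int = 0             # position among siblings
    x_pct: float = 0.0                 # geometry as viewport fractions
    y_pct: float = 0.0


@dataclass
class RecordedStep:
    action: str                        # "click", "fill", "navigate", "scroll"
    fingerprint: ElementFingerprint
    value: Optional[str] = None        # typed text, if any


def record_step(action: str, fingerprint: ElementFingerprint,
                value: Optional[str] = None) -> RecordedStep:
    """Build a step, masking the value when the target is a password input."""
    is_password = (
        fingerprint.tag == "input"
        and fingerprint.attributes.get("type", "").lower() == "password"
    )
    if is_password and value is not None:
        value = MASK
    return RecordedStep(action, fingerprint, value)
```

Calling `asdict(step)` on a recorded step yields a plain dictionary that can be written to a flow file with `json.dump`.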
# 1. Install ScrapeWizard and its dependencies
pip install scrapewizard
# 2. Install Playwright browser engines
playwright install chromium
# Note: On Linux/CI systems, you may also need:
playwright install-deps
Launch a headed browser to record user interactions on a page and capture detailed element fingerprints:
scrapewizard record --url "https://books.toscrape.com" --output login_flow.json
Execute the recorded workflow headless to check console logs, network errors, accessibility violations, and visual regressions:
scrapewizard test login_flow.json
Generate a programmatic scraper script from a target URL with guided options:
scrapewizard build --url "https://books.toscrape.com"
Open the local FastAPI web dashboard to manage your tests, runs, and configurations:
scrapewizard start --port 8000
Boots up the FastAPI backend and opens the React web dashboard in your default browser.
scrapewizard start [--port PORT] [--no-open]
Opens a headed browser to capture user events and element fingerprints, saving them to a JSON file.
scrapewizard record --url URL [--output OUTPUT_JSON] [--screenshots SCREENSHOT_DIR]
Runs a headless sandbox execution of the recorded flow, collecting quality signals (console, network, a11y, visual diff).
scrapewizard test FLOW_JSON [--artifacts ARTIFACT_DIR] [--headed]
Builds a new scraping project from a URL.
# Standard guided scraper builder
scrapewizard build --url URL
# Expert Mode: Shows debug logs, database states, and raw model logs
scrapewizard build --url URL --expert
# Interactive Mode: Prompts smart questions about target fields and formats
scrapewizard build --url URL --interactive
Interactively configures default LLM providers, active models, active proxies, and settings.
scrapewizard setup [--provider PROVIDER] [--api-key KEY] [--model MODEL] [--use-proxy]
Saves your LLM provider API keys securely in the system keyring.
scrapewizard login "sk-..."
Lists all active scraper projects, target URLs, states, and modification times.
scrapewizard list
Resumes an interrupted scraper construction or guided tour session.
scrapewizard resume PROJECT_ID
Checks Python version, configuration files, Playwright installation, projects directory, and LLM connection health.
scrapewizard doctor
Purges cached test runs, build logs, and deleted project files to free up disk space.
scrapewizard clean [--force]
Prints the installed version of ScrapeWizard.
scrapewizard version
When a web element mutates (e.g. classes renamed, layout shifted, attributes altered), the ScrapeWizard engine steps through a deterministic self-healing hierarchy to re-identify the element offline:
- Tier 0 (Direct Match): Evaluates the primary selector.
- Tier 1 (Selector Ladder): Tries fallback CSS selectors recorded during fingerprinting.
- Tier 2 (Attribute & Text Score): Computes a similarity score from attribute overlap and normalized inner text.
- Tier 3 (Structural Matching): Evaluates parent/sibling tag relationships and sibling offsets.
- Tier 4 (Geometry & Visuals): Compares relative viewport coordinates (x/y percentages) and dimensions.
- Tier 5 (History & Navigation): Checks past successful element resolutions from historical runs.
- Tier 6 (LLM Recovery - Opt-in): Triggers only if all offline tiers fail. Sends a compact DOM snippet to the LLM to locate the element, verifying the proposed selector by re-running the step.
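The tier ordering above can be sketched as a cascade of resolver functions that stops at the first tier returning a match. Only Tiers 0 to 2 are shown. The dictionary layout (`selector`, `fallback_selectors`, `attrs`, `matches`), the Jaccard similarity measure, and the 0.5 threshold are all assumptions chosen for illustration and are not taken from ScrapeWizard's engine.

```python
from typing import Optional, Sequence


def jaccard(a: set, b: set) -> float:
    """Overlap of two sets, 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def tier0_direct(fp: dict, candidates: Sequence[dict]) -> Optional[dict]:
    # Tier 0: does the primary selector still match some element?
    return next((c for c in candidates if fp["selector"] in c["matches"]), None)


def tier1_ladder(fp: dict, candidates: Sequence[dict]) -> Optional[dict]:
    # Tier 1: try each recorded fallback selector in order.
    for selector in fp["fallback_selectors"]:
        for c in candidates:
            if selector in c["matches"]:
                return c
    return None


def tier2_attributes(fp: dict, candidates: Sequence[dict],
                     threshold: float = 0.5) -> Optional[dict]:
    # Tier 2: pick the candidate whose attributes overlap most with the fingerprint.
    want = set(fp["attrs"].items())
    best, best_score = None, 0.0
    for c in candidates:
        score = jaccard(want, set(c["attrs"].items()))
        if score > best_score:
            best, best_score = c, score
    return best if best_score >= threshold else None


TIERS = [(0, tier0_direct), (1, tier1_ladder), (2, tier2_attributes)]


def heal(fp: dict, candidates: Sequence[dict], tiers=TIERS):
    """Return (tier_number, element) for the first tier that resolves, else (None, None)."""
    for number, resolver in tiers:
        match = resolver(fp, candidates)
        if match is not None:
            return number, match
    return None, None
```

Because each tier is a plain function with the same signature, later tiers (structure, geometry, history) could be appended to the list without changing `heal`.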
Tip
To prevent wrong-element matches (false positives), the self-healing system accepts a candidate only when its score beats the runner-up by a strict margin. A heal is persisted only if the full re-run completes successfully.
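A margin check of this kind might look like the sketch below. The function name `pick_with_margin` and the default values `min_score=0.6` and `min_margin=0.15` are hypothetical; the source does not state the thresholds ScrapeWizard uses.

```python
from typing import Optional


def pick_with_margin(scores: dict, min_score: float = 0.6,
                     min_margin: float = 0.15) -> Optional[str]:
    """Return the best candidate id, or None when the match is ambiguous.

    `scores` maps a candidate id to its similarity score (higher is better).
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return None
    top_id, top_score = ranked[0]
    if top_score < min_score:
        return None  # even the best candidate is too weak
    if len(ranked) > 1 and top_score - ranked[1][1] < min_margin:
        return None  # runner-up is too close: refuse rather than guess
    return top_id
```

Returning `None` for a near-tie trades a few missed heals for fewer silent wrong-element matches, which is usually the safer failure mode for both scrapers and tests.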
All global configuration and all scraping/testing projects are stored locally on your machine:
- Global Configuration: Saved in ~/.scrapewizard/
  - config.json: Active LLM provider, default model, and settings.
  - proxy.json: Configured proxies.
- Scraper Projects Root: Saved in ~/scrapewizard_projects/
  - Contains individual <PROJECT_ID>/ directories with:
    - session.json: Project execution state and metadata.
    - generated_scraper.py: The final Python scraper script.
    - llm_logs/: Prompts and raw completion text for auditing.
    - output/: Extracted datasets (JSON, CSV, XLSX).
- Test Baselines & Runs:
  - ~/.scrapewizard/baselines/: Baseline screenshots for visual regression tests.
  - Run artifacts (screenshots, visual diffs, and test report logs) are saved in the configured output directories.
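Because everything lives in ordinary files, the layout above can be inspected with a few lines of standard-library Python. The sketch below relies only on the paths listed above. The helper names and the `home` parameter (which lets you point at a different home directory) are hypothetical, and no assumption is made about the keys inside config.json.

```python
import json
from pathlib import Path
from typing import Optional


def load_global_config(home: Optional[Path] = None) -> dict:
    """Read ~/.scrapewizard/config.json, returning {} if it does not exist."""
    path = (home or Path.home()) / ".scrapewizard" / "config.json"
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def list_project_outputs(home: Optional[Path] = None) -> dict:
    """Map each project ID to the sorted file names found in its output/ directory."""
    root = (home or Path.home()) / "scrapewizard_projects"
    if not root.is_dir():
        return {}
    outputs = {}
    for project_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        output_dir = project_dir / "output"
        files = sorted(f.name for f in output_dir.iterdir()) if output_dir.is_dir() else []
        outputs[project_dir.name] = files
    return outputs
```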
Verify the local installation and self-healing efficacy by executing:
python3 -m pytest tests/ -v --ignore=tests/golden_sites
MIT License