Skip to content

pras-ops/ScrapeWizard

Repository files navigation

πŸ§™ ScrapeWizard

The Local-First, Self-Healing Web Scraper Builder & UI/UX Test Automation Studio

ScrapeWizard is a professional, developer-first toolkit for building, executing, and maintaining reliable web automation workflows. By combining high-fidelity browser recording with an offline, multi-tier self-healing engine, ScrapeWizard ensures your scrapers and test suites survive target site markup changes, class renames, and structural mutations without manual script updates.

Important

Key Philosophy: AI is an optional enhancer to help you name steps and recover from catastrophic layout shifts. It is never on the runtime hot-path. If target pages haven't mutated, runtime AI cost is $0.00, ensuring high performance, zero runtime LLM costs, and 100% deterministic scraper/test execution.


πŸš€ Two Products, One Unified Engine

Built atop a shared core that tracks deep element fingerprints (tag names, semantic attributes, structural relationships, geometry, and navigation history), ScrapeWizard supports two major developer use-cases:

  1. πŸ“¦ Product A: Scraper Studio: Build high-performance data pipelines that export target pages to CSV, Excel (XLSX), or JSON with zero-click configuration.
  2. πŸ§ͺ Product B: UI/UX Test Automation: Record workflows once to generate standard Playwright + pytest suites. Run them headless in CI with automatic checks for accessibility (a11y), visual regressions (visual diffs), console errors, and network failures.

⚑ Key Features

  • πŸ–₯️ ScrapeWizard Studio Dashboard: A premium, local-first web dashboard built with FastAPI and React. Monitor execution queues, visualize run histories step-by-step, review accessibility violations, inspect visual diff crops, and approve or reject healed locators.
  • 🩺 Multi-Tier Offline Self-Healing (Tiers 0-5): When page markup changes, our local engine attempts to locate the element automatically using 5 deterministic similarity tiers (attributes, tag structure, geometry, and parent-child hierarchy) with zero LLM/API calls.
  • πŸ“Ή High-Fidelity Flow Recorder: Launches an interactive headed browser context to capture user interactions (clicks, text input, navigation, scroll) along with element fingerprints. Featuring full support for multi-page flows and automatic masking of password inputs.
  • πŸ”¬ Isolated Sandbox Runner: Executes flows in clean Playwright contexts, collecting visual screen diffs, console warnings, and network error signals.
  • β™Ώ Automated Accessibility (a11y) Audits: Injects axe-core dynamically during runtime sandbox executions to find markup, color contrast, and ARIA violations per step.
  • πŸ“¦ Zero Lock-in Pytest Export: Export flows directly to standalone Python scripts. The generated files are completely independent of the platform and can run in any standard CI environment.
  • πŸ”‘ Keyring Security: Securely stores LLM provider API keys (OpenAI, Anthropic, OpenRouter, and Ollama) using the system's secure keyring.

πŸ› οΈ Installation & Setup

# 1. Install ScrapeWizard and its dependencies
pip install scrapewizard

# 2. Install Playwright browser engines
playwright install chromium

# Note: On Linux/CI systems, you may also need:
playwright install-deps

🚦 Getting Started in 60 Seconds

1. Record a Workflow

Launch a headed browser to record user interactions on a page and capture detailed element fingerprints:

scrapewizard record --url "https://books.toscrape.com" --output login_flow.json

2. Run Quality Checks & Sandbox

Execute the recorded workflow headless to check console logs, network errors, accessibility violations, and visual regressions:

scrapewizard test login_flow.json

3. Build a Scraper Project

Generate a programmatic scraper script from a target URL with guided options:

scrapewizard build --url "https://books.toscrape.com"

4. Launch the Web Studio

Open the local FastAPI web dashboard to manage your tests, runs, and configurations:

scrapewizard start --port 8000

πŸ’» CLI Commands Reference

1. start - Launch Web Studio Dashboard

Boots up the FastAPI backend and opens the React web dashboard in your default browser.

scrapewizard start [--port PORT] [--no-open]

2. record - Record User Interactions

Opens a headed browser to capture user events and element fingerprints, saving them to a JSON file.

scrapewizard record --url URL [--output OUTPUT_JSON] [--screenshots SCREENSHOT_DIR]

3. test - Run Sandbox Quality Checks

Runs a headless sandbox execution of the recorded flow, collecting quality signals (console, network, a11y, visual diff).

scrapewizard test FLOW_JSON [--artifacts ARTIFACT_DIR] [--headed]

4. build - Generate Scraper

Builds a new scraping project from a URL.

# Standard guided scraper builder
scrapewizard build --url URL

# Expert Mode: Shows debug logs, database states, and raw model logs
scrapewizard build --url URL --expert

# Interactive Mode: Prompts smart questions about target fields and formats
scrapewizard build --url URL --interactive

5. setup - Configure Global Settings

Interactively configures default LLM providers, active models, active proxies, and settings.

scrapewizard setup [--provider PROVIDER] [--api-key KEY] [--model MODEL] [--use-proxy]

6. login - Secure Provider Keys

Saves your LLM provider API keys securely in the system keyring.

scrapewizard login "sk-..."

7. list - View Local Projects

Lists all active scraper projects, target URLs, states, and modification times.

scrapewizard list

8. resume - Continue Scraper Builder

Resumes an interrupted scraper construction or guided tour session.

scrapewizard resume PROJECT_ID

9. doctor - Environment Diagnostics

Checks Python version, configuration files, Playwright installation, projects directory, and LLM connection health.

scrapewizard doctor

10. clean - Cleanup Temporary Workspace

Purges cached test runs, build logs, and deleted project files to free up disk space.

scrapewizard clean [--force]

11. version - Show Version

Prints the installed version of ScrapeWizard.

scrapewizard version

βš™οΈ The Self-Healing Hierarchy (Tiers 0-6)

When a web element mutates (e.g. classes renamed, layout shifted, attributes altered), the ScrapeWizard engine steps through a deterministic self-healing hierarchy to re-identify the element offline:

  1. Tier 0 (Direct Match): Evaluates the primary selector.
  2. Tier 1 (Selector Ladder): Tries fallback CSS selectors recorded during fingerprinting.
  3. Tier 2 (Attribute & Text Score): Computes similarity score of attribute overlap and normalized inner text.
  4. Tier 3 (Structural Matching): Evaluates parent/sibling tag relationships and sibling offsets.
  5. Tier 4 (Geometry & Visuals): Compares relative viewport coordinates (x/y percentages) and dimensions.
  6. Tier 5 (History & Navigation): Checks past successful element resolutions from historical runs.
  7. Tier 6 (LLM Recovery - Opt-in): Triggers only if all offline tiers fail. Sends a compact DOM snippet to the LLM to locate the element, verifying the proposed selector by re-running the step.

Tip

To prevent wrong-element matches (false positives), the self-healing system requires a strict scoring margin threshold between the top match and secondary candidates. Heals are only persisted if the full re-run completes successfully.


πŸ—οΈ Workspace Directory Structure

All global configurations and local scraping/testing projects are stored locally on your machine:

  • Global Configuration: Saved in ~/.scrapewizard/
    • config.json β€” Active LLM provider, default model, and settings.
    • proxy.json β€” Configured proxies.
  • Scraper Projects Root: Saved in ~/scrapewizard_projects/
    • Contains individual <PROJECT_ID>/ directories with:
      • session.json β€” Project execution state and metadata.
      • generated_scraper.py β€” The final Python scraper script.
      • llm_logs/ β€” Prompts and raw completion text for auditing.
      • output/ β€” Extracted datasets (JSON, CSV, XLSX).
  • Test Baselines & Runs:
    • ~/.scrapewizard/baselines/ β€” Baseline screenshots for visual regression tests.
    • Run artifacts (screenshots, visual diffs, and test report logs) are saved in the configured output directories.

πŸ§ͺ Running Unit Tests

Verify the local installation and self-healing efficacy by executing:

python3 -m pytest tests/ -v --ignore=tests/golden_sites

πŸ“„ License

MIT License

About

AI-powered CLI that analyzes websites and generates real Playwright web scrapers with built-in data quality checks

Topics

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors