diff --git a/.agent/README.md b/.agent/README.md
index ac61a57..569717f 100644
--- a/.agent/README.md
+++ b/.agent/README.md
@@ -12,7 +12,7 @@ Complete system architecture documentation including:
 - **Technology Stack** - Python 3.10+, FastMCP, httpx dependencies
 - **Project Structure** - File organization and key files
 - **Core Architecture** - MCP design, server architecture, patterns
-- **MCP Tools** - All 5 tools (markdownify, smartscraper, searchscraper, smartcrawler_initiate, smartcrawler_fetch_results)
+- **MCP Tools** - API v2 tools (markdownify, scrape, smartscraper, searchscraper, crawl, credits, history, monitor, …)
 - **API Integration** - ScrapeGraphAI API endpoints and credit system
 - **Deployment** - Smithery, Claude Desktop, Cursor, Docker setup
 - **Recent Updates** - SmartCrawler integration and latest features
@@ -95,7 +95,7 @@ Complete Model Context Protocol integration documentation:
 **...available tools and their parameters:**
 - Read: [Project Architecture - MCP Tools](./system/project_architecture.md#mcp-tools)
-- Quick reference: 5 tools (markdownify, smartscraper, searchscraper, smartcrawler_initiate, smartcrawler_fetch_results)
+- Quick reference: see README "Available Tools" table (v2: + scrape, crawl_stop/resume, credits, sgai_history, monitor_*; removed sitemap, agentic_scrapper, *\_status tools)
 
 **...error handling:**
 - Read: [MCP Protocol - Error Handling](./system/mcp_protocol.md#error-handling)
@@ -134,6 +134,7 @@ npx @modelcontextprotocol/inspector scrapegraph-mcp
 
 **Manual Testing (stdio):**
 ```bash
 echo '{"jsonrpc":"2.0","method":"tools/call","params":{"name":"markdownify","arguments":{"website_url":"https://scrapegraphai.com"}},"id":1}' | scrapegraph-mcp
+# (v2: same tool name; backend calls POST /scrape)
 ```
 
 **Integration Testing (Claude Desktop):**
@@ -174,13 +175,14 @@ echo '{"jsonrpc":"2.0","method":"tools/list","id":1}' | docker run -i -e SGAI_AP
 
 Quick reference to all MCP tools:
 
-| Tool | Parameters | Purpose | Credits | Async |
-|------|------------|---------|---------|-------|
-| `markdownify` | `website_url` | Convert webpage to markdown | 2 | No |
-| `smartscraper` | `user_prompt`, `website_url`, `number_of_scrolls?`, `markdown_only?` | AI-powered data extraction | 10+ | No |
-| `searchscraper` | `user_prompt`, `num_results?`, `number_of_scrolls?`, `time_range?` | AI-powered web search | Variable | No |
-| `smartcrawler_initiate` | `url`, `prompt?`, `extraction_mode`, `depth?`, `max_pages?`, `same_domain_only?` | Start multi-page crawl | 100+ | Yes (returns request_id) |
-| `smartcrawler_fetch_results` | `request_id` | Get crawl results | N/A | No (polls status) |
+| Tool | Notes |
+|------|--------|
+| `markdownify` / `scrape` | POST /scrape (v2) |
+| `smartscraper` | POST /extract; URL only |
+| `searchscraper` | POST /search; num_results 3–20 |
+| `smartcrawler_*`, `crawl_stop`, `crawl_resume` | POST/GET /crawl |
+| `credits`, `sgai_history` | GET /credits, /history |
+| `monitor_*` | /monitor namespace |
 
 For detailed tool documentation, see [Project Architecture - MCP Tools](./system/project_architecture.md#mcp-tools).
 
@@ -376,8 +378,11 @@ npx @modelcontextprotocol/inspector scrapegraph-mcp
 
 ## 📅 Changelog
 
+### April 2026
+- ✅ Migrated MCP client and tools to **API v2** ([scrapegraph-py#82](https://github.com/ScrapeGraphAI/scrapegraph-py/pull/82)): base `https://api.scrapegraphai.com/api/v2`, Bearer + SGAI-APIKEY, new crawl/monitor/credits/history tools; removed sitemap, agentic_scrapper, status polling tools.
+
 ### January 2026
-- ✅ Added `time_range` parameter to SearchScraper for filtering results by recency
+- ✅ Added `time_range` parameter to SearchScraper for filtering results by recency (v1-era; **ignored on API v2**)
 - ✅ Supported time ranges: `past_hour`, `past_24_hours`, `past_week`, `past_month`, `past_year`
 - ✅ Documentation updated to reflect SDK changes (scrapegraph-py#77, scrapegraph-js#2)

diff --git a/.agent/system/project_architecture.md b/.agent/system/project_architecture.md
index ea1fb1d..b8a1857 100644
--- a/.agent/system/project_architecture.md
+++ b/.agent/system/project_architecture.md
@@ -1,7 +1,7 @@
 # ScrapeGraph MCP Server - Project Architecture
 
-**Last Updated:** January 2026
-**Version:** 1.0.0
+**Last Updated:** April 2026
+**Version:** 2.0.0
 
 ## Table of Contents
 - [System Overview](#system-overview)
@@ -19,11 +19,12 @@
 
 The ScrapeGraph MCP Server is a production-ready [Model Context Protocol](https://modelcontextprotocol.io/introduction) (MCP) server that provides seamless integration between AI assistants (like Claude, Cursor, etc.) and the [ScrapeGraphAI API](https://scrapegraphai.com). This server enables language models to leverage advanced AI-powered web scraping capabilities with enterprise-grade reliability.
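As a rough sketch of the v2 request headers this integration sends (mirroring the `ScapeGraphClient` setup later in this diff; the helper name is illustrative, not part of the server):

```python
# Illustrative sketch only: the v2 client sends both standard Bearer auth
# and the legacy SGAI-APIKEY header, plus an SDK version tag.
def v2_headers(api_key: str, version: str = "2.0.0") -> dict:
    return {
        "Authorization": f"Bearer {api_key}",
        "SGAI-APIKEY": api_key,
        "Content-Type": "application/json",
        "accept": "application/json",
        "X-SDK-Version": f"scrapegraph-mcp@{version}",
    }
```

Presumably both auth headers are sent so that either credential path is accepted during the v1-to-v2 transition.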
-**Key Capabilities:**
-- **Markdownify** - Convert webpages to clean, structured markdown
-- **SmartScraper** - AI-powered structured data extraction from webpages
-- **SearchScraper** - AI-powered web searches with structured results
-- **SmartCrawler** - Intelligent multi-page web crawling with AI extraction or markdown conversion
+**Key Capabilities (API v2):**
+- **Scrape** (`markdownify`, `scrape`) – POST `/api/v2/scrape`
+- **Extract** (`smartscraper`) – POST `/api/v2/extract` (URL-only)
+- **Search** (`searchscraper`) – POST `/api/v2/search`
+- **Crawl** – POST/GET `/api/v2/crawl` (+ stop/resume); markdown/html crawl only
+- **Monitor, credits, history** – `/api/v2/monitor`, `/credits`, `/history`
 
 **Purpose:**
 - Bridge AI assistants (Claude, Cursor, etc.) with web scraping capabilities
@@ -129,7 +130,7 @@ AI Assistant (Claude/Cursor)
     ↓ (stdio via MCP)
 FastMCP Server (this project)
     ↓ (HTTPS API calls)
-ScrapeGraphAI API (https://api.scrapegraphai.com/v1)
+ScrapeGraphAI API (default https://api.scrapegraphai.com/api/v2)
     ↓ (web scraping)
 Target Websites
 ```
@@ -139,10 +140,10 @@
 The server follows a simple, single-file architecture:
 
 **`ScapeGraphClient` Class:**
-- HTTP client wrapper for ScrapeGraphAI API
-- Base URL: `https://api.scrapegraphai.com/v1`
-- API key authentication via `SGAI-APIKEY` header
-- Methods: `markdownify()`, `smartscraper()`, `searchscraper()`, `smartcrawler_initiate()`, `smartcrawler_fetch_results()`
+- HTTP client wrapper for ScrapeGraphAI API v2 ([scrapegraph-py#82](https://github.com/ScrapeGraphAI/scrapegraph-py/pull/82))
+- Base URL: `https://api.scrapegraphai.com/api/v2` (override with env `SCRAPEGRAPH_API_BASE_URL`)
+- Auth: `Authorization: Bearer`, `SGAI-APIKEY`, `X-SDK-Version: scrapegraph-mcp@2.0.0`
+- v2 methods include `scrape_v2`, `extract`, `search_api`, `crawl_*`, `monitor_*`, `credits`, `history`, plus compatibility wrappers used by MCP tools
 
 **FastMCP Server:**
 - Created with
`FastMCP("ScapeGraph API MCP Server")`
@@ -185,7 +186,9 @@ The server follows a simple, single-file architecture:
 
 ## MCP Tools
 
-The server exposes 5 tools to AI assistants:
+The server exposes 16 `@mcp.tool()` handlers (see the repository `README.md` for the full table). The detailed subsections below still use **v1-style endpoint names** in several places; treat them as illustrative and prefer the v2 mapping in **API Integration**.
+
+**v2 tool names:** `markdownify`, `scrape`, `smartscraper`, `searchscraper`, `smartcrawler_initiate`, `smartcrawler_fetch_results`, `crawl_stop`, `crawl_resume`, `credits`, `sgai_history`, `monitor_create`, `monitor_list`, `monitor_get`, `monitor_pause`, `monitor_resume`, `monitor_delete`.
 
 ### 1. `markdownify(website_url: str)`
 
@@ -388,21 +391,29 @@ If status is "completed":
 
 ### ScrapeGraphAI API
 
-**Base URL:** `https://api.scrapegraphai.com/v1`
+**Base URL:** `https://api.scrapegraphai.com/api/v2` (configurable via `SCRAPEGRAPH_API_BASE_URL`)
 
 **Authentication:**
-- Header: `SGAI-APIKEY: your-api-key`
+- Headers: `Authorization: Bearer <api-key>`, `SGAI-APIKEY: <api-key>`
 - Obtain API key from: [ScrapeGraph Dashboard](https://dashboard.scrapegraphai.com)
 
-**Endpoints Used:**
-
-| Endpoint | Method | Tool |
-|----------|--------|------|
-| `/v1/markdownify` | POST | `markdownify()` |
-| `/v1/smartscraper` | POST | `smartscraper()` |
-| `/v1/searchscraper` | POST | `searchscraper()` |
-| `/v1/crawl` | POST | `smartcrawler_initiate()` |
-| `/v1/crawl/{request_id}` | GET | `smartcrawler_fetch_results()` |
+**Endpoints used (v2):**
+
+| Endpoint | Method | MCP tools (typical) |
+|----------|--------|---------------------|
+| `/scrape` | POST | `markdownify`, `scrape` |
+| `/extract` | POST | `smartscraper` |
+| `/search` | POST | `searchscraper` |
+| `/crawl` | POST | `smartcrawler_initiate` |
+| `/crawl/{id}` | GET | `smartcrawler_fetch_results` |
+| `/crawl/{id}/stop` | POST | `crawl_stop` |
+| `/crawl/{id}/resume` | POST | `crawl_resume` |
+| `/credits` |
GET | `credits` | +| `/history` | GET | `sgai_history` | +| `/monitor` | POST, GET | `monitor_create`, `monitor_list` | +| `/monitor/{id}` | GET, DELETE | `monitor_get`, `monitor_delete` | +| `/monitor/{id}/pause` | POST | `monitor_pause` | +| `/monitor/{id}/resume` | POST | `monitor_resume` | **Request Format:** ```json diff --git a/README.md b/README.md index 5efefb1..2946d10 100644 --- a/README.md +++ b/README.md @@ -26,18 +26,21 @@ A production-ready [Model Context Protocol](https://modelcontextprotocol.io/intr - [Technology Stack](#technology-stack) - [License](#license) +## API v2 + +This MCP server targets **ScrapeGraph API v2** (`https://api.scrapegraphai.com/api/v2`), aligned with +[scrapegraph-py PR #82](https://github.com/ScrapeGraphAI/scrapegraph-py/pull/82). Auth sends both +`Authorization: Bearer` and `SGAI-APIKEY`. Override the base URL with **`SCRAPEGRAPH_API_BASE_URL`** if needed. + ## Key Features -- **8 Powerful Tools**: From simple markdown conversion to complex multi-page crawling and agentic workflows -- **AI-Powered Extraction**: Intelligently extract structured data using natural language prompts -- **Multi-Page Crawling**: SmartCrawler supports asynchronous crawling with configurable depth and page limits -- **Infinite Scroll Support**: Handle dynamic content loading with configurable scroll counts -- **JavaScript Rendering**: Full support for JavaScript-heavy websites -- **Flexible Output Formats**: Get results as markdown, structured JSON, or custom schemas -- **Easy Integration**: Works seamlessly with Claude Desktop, Cursor, and any MCP-compatible client -- **Enterprise-Ready**: Robust error handling, timeout management, and production-tested reliability -- **Simple Deployment**: One-command installation via Smithery or manual setup -- **Comprehensive Documentation**: Detailed developer docs in `.agent/` folder +- **Scrape & extract**: `markdownify` / `scrape` (POST /scrape), `smartscraper` (POST /extract, URL only) +- **Search**: 
`searchscraper` (POST /search; `num_results` clamped 3–20)
+- **Crawl**: Async multi-page crawl in **markdown** or **html** only; `crawl_stop` / `crawl_resume`
+- **Monitors**: Scheduled jobs via `monitor_create`, `monitor_list`, `monitor_get`, pause/resume/delete
+- **Account**: `credits`, `sgai_history`
+- **Easy integration**: Claude Desktop, Cursor, Smithery, HTTP transport
+- **Developer docs**: `.agent/` folder
 
 ## Quick Start
 
@@ -62,112 +65,20 @@ That's it! The server is now available to your AI assistant.
 
 ## Available Tools
 
-The server provides **8 enterprise-ready tools** for AI-powered web scraping:
-
-### Core Scraping Tools
-
-#### 1. `markdownify`
-Transform any webpage into clean, structured markdown format.
-
-```python
-markdownify(website_url: str)
-```
-- **Credits**: 2 per request
-- **Use case**: Quick webpage content extraction in markdown
-
-#### 2. `smartscraper`
-Leverage AI to extract structured data from any webpage with support for infinite scrolling.
-
-```python
-smartscraper(
-    user_prompt: str,
-    website_url: str,
-    number_of_scrolls: int = None,
-    markdown_only: bool = None
-)
-```
-- **Credits**: 10+ (base) + variable based on scrolling
-- **Use case**: AI-powered data extraction with custom prompts
-
-#### 3. `searchscraper`
-Execute AI-powered web searches with structured, actionable results.
-
-```python
-searchscraper(
-    user_prompt: str,
-    num_results: int = None,
-    number_of_scrolls: int = None,
-    time_range: str = None  # Filter by: past_hour, past_24_hours, past_week, past_month, past_year
-)
-```
-- **Credits**: Variable (3-20 websites × 10 credits)
-- **Use case**: Multi-source research and data aggregation
-- **Time filtering**: Use `time_range` to filter results by recency (e.g., `"past_week"` for recent results)
-
-### Advanced Scraping Tools
-
-#### 4. `scrape`
-Basic scraping endpoint to fetch page content with optional heavy JavaScript rendering.
- -```python -scrape(website_url: str, render_heavy_js: bool = None) -``` -- **Use case**: Simple page content fetching with JS rendering support - -#### 5. `sitemap` -Extract sitemap URLs and structure for any website. - -```python -sitemap(website_url: str) -``` -- **Use case**: Website structure analysis and URL discovery - -### Multi-Page Crawling - -#### 6. `smartcrawler_initiate` -Initiate intelligent multi-page web crawling (asynchronous operation). - -```python -smartcrawler_initiate( - url: str, - prompt: str = None, - extraction_mode: str = "ai", - depth: int = None, - max_pages: int = None, - same_domain_only: bool = None -) -``` -- **AI Extraction Mode**: 10 credits per page - extracts structured data -- **Markdown Mode**: 2 credits per page - converts to markdown -- **Returns**: `request_id` for polling -- **Use case**: Large-scale website crawling and data extraction - -#### 7. `smartcrawler_fetch_results` -Retrieve results from asynchronous crawling operations. - -```python -smartcrawler_fetch_results(request_id: str) -``` -- **Returns**: Status and results when crawling is complete -- **Use case**: Poll for crawl completion and retrieve results - -### Intelligent Agent-Based Scraping - -#### 8. `agentic_scrapper` -Run advanced agentic scraping workflows with customizable steps and structured output schemas. 
-
-```python
-agentic_scrapper(
-    url: str,
-    user_prompt: str = None,
-    output_schema: dict = None,
-    steps: list = None,
-    ai_extraction: bool = None,
-    persistent_session: bool = None,
-    timeout_seconds: float = None
-)
-```
-- **Use case**: Complex multi-step workflows with custom schemas and persistent sessions
+| Tool | Role |
+|------|------|
+| `markdownify` | POST /scrape (markdown) |
+| `scrape` | POST /scrape (`output_format`: markdown, html, screenshot, branding) |
+| `smartscraper` | POST /extract (requires `website_url`; no inline HTML/markdown body on v2) |
+| `searchscraper` | POST /search (`num_results` 3–20; `time_range` / `number_of_scrolls` ignored on v2) |
+| `smartcrawler_initiate` | POST /crawl – `extraction_mode` **`markdown`** or **`html`** (default markdown). No AI crawl across pages. |
+| `smartcrawler_fetch_results` | GET /crawl/:id |
+| `crawl_stop`, `crawl_resume` | POST /crawl/:id/stop \| resume |
+| `credits` | GET /credits |
+| `sgai_history` | GET /history |
+| `monitor_create`, `monitor_list`, `monitor_get`, `monitor_pause`, `monitor_resume`, `monitor_delete` | /monitor API |
+
+**Removed vs older MCP releases:** `sitemap`, `agentic_scrapper`, `markdownify_status`, `smartscraper_status` (no v2 endpoints).
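The request-shaping rules summarized in the tools table above can be sketched as standalone helpers. The helper names are illustrative; the clamp bounds and per-format body shapes mirror the `search_api` and `scrape_v2` methods in this diff's `server.py`:

```python
from typing import Any, Dict, Optional

# Illustrative helpers, not part of the server: they mirror the v2
# request shaping described in the tools table above.

def clamp_num_results(n: Optional[int], default: int = 5) -> int:
    """POST /search accepts num_results only in the 3-20 range."""
    return max(3, min(20, default if n is None else n))

def scrape_body(url: str, output_format: str = "markdown",
                full_page: bool = False) -> Dict[str, Any]:
    """Build a POST /scrape body keyed by the requested output format."""
    if output_format == "markdown":
        return {"url": url, "markdown": {"mode": "normal"}}
    if output_format == "html":
        return {"url": url, "html": {"mode": "normal"}}
    if output_format == "screenshot":
        return {"url": url, "screenshot": {"full_page": full_page}}
    if output_format == "branding":
        return {"url": url, "branding": {}}
    raise ValueError(f"invalid output_format {output_format!r}")
```

Out-of-range `num_results` values are clamped rather than rejected, so callers never see a validation error for that parameter.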
## Setup Instructions @@ -482,7 +393,7 @@ root_agent = LlmAgent( - Adjust based on your use case (crawling operations may need even longer timeouts) **Tool Filtering:** -- By default, all 8 tools are exposed to the agent +- By default, all registered MCP tools are exposed to the agent (see [Available Tools](#available-tools)) - Use `tool_filter` to limit which tools are available: ```python tool_filter=['markdownify', 'smartscraper', 'searchscraper'] @@ -520,20 +431,18 @@ The server enables sophisticated queries across various scraping scenarios: - **SearchScraper**: "Research and summarize recent developments in AI-powered web scraping" - **SearchScraper**: "Search for the top 5 articles about machine learning frameworks and extract key insights" - **SearchScraper**: "Find recent news about GPT-4 and provide a structured summary" -- **SearchScraper with time_range**: "Search for AI news from the past week only" (uses `time_range="past_week"`) +- **SearchScraper**: v2 does not apply `time_range`; phrase queries to bias recency in natural language instead -### Website Analysis -- **Sitemap**: "Extract the complete sitemap structure from the ScrapeGraph website" -- **Sitemap**: "Discover all URLs on this blog site" +### Website analysis +- Use **`smartcrawler_initiate`** (markdown/html) plus **`smartcrawler_fetch_results`** to map and capture multi-page content; there is no separate **sitemap** tool on v2. 
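The initiate-then-poll crawl flow recommended above can be sketched as a small loop. The `fetch` callable stands in for a `smartcrawler_fetch_results` (GET /crawl/{id}) call; only the `"completed"` status is documented in this repo, so treating `"failed"` as terminal is an assumption:

```python
import time
from typing import Any, Callable, Dict

# Illustrative polling loop, not part of the server.
def poll_crawl(fetch: Callable[[str], Dict[str, Any]], request_id: str,
               interval_s: float = 5.0, max_attempts: int = 60) -> Dict[str, Any]:
    """Repeatedly fetch a crawl job's status until it settles or we give up."""
    for _ in range(max_attempts):
        result = fetch(request_id)
        if result.get("status") in ("completed", "failed"):
            return result
        time.sleep(interval_s)
    raise TimeoutError(f"crawl {request_id} still pending after {max_attempts} polls")
```

Injecting `fetch` keeps the loop testable without a live API key; in practice it would wrap the MCP `smartcrawler_fetch_results` tool.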
-### Multi-Page Crawling -- **SmartCrawler (AI mode)**: "Crawl the entire documentation site and extract all API endpoints with descriptions" -- **SmartCrawler (Markdown mode)**: "Convert all pages in the blog to markdown up to 2 levels deep" -- **SmartCrawler**: "Extract all product information from an e-commerce site, maximum 100 pages, same domain only" +### Multi-page crawling +- **SmartCrawler (markdown/html)**: "Crawl the blog in markdown mode and poll until complete" +- For structured fields per page, run **`smartscraper`** on individual URLs (or **`monitor_create`** on a schedule) -### Advanced Agentic Scraping -- **Agentic Scraper**: "Navigate through a multi-step authentication form and extract user dashboard data" -- **Agentic Scraper with schema**: "Follow pagination links and compile a dataset with schema: {title, author, date, content}" +### Monitors and account +- **Monitor**: "Run this extract prompt on https://example.com every day at 9am" (`monitor_create` with cron) +- **Credits / history**: `credits`, `sgai_history` - **Agentic Scraper**: "Execute a complex workflow: login, navigate to reports, download data, and extract summary statistics" ## Error Handling diff --git a/pyproject.toml b/pyproject.toml index be301bd..344e3ce 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -1,6 +1,6 @@ [project] name = "scrapegraph-mcp" -version = "1.0.1" +version = "2.0.0" description = "MCP server for ScapeGraph API integration" license = {text = "MIT"} readme = "README.md" @@ -52,6 +52,10 @@ line-length = 100 target-version = "py312" select = ["E", "F", "I", "B", "W"] +[tool.ruff.lint.per-file-ignores] +# MCP tool docstrings and embedded guides exceed 100 cols by design. 
+"src/scrapegraph_mcp/server.py" = ["E501"] + [tool.mypy] python_version = "3.12" warn_return_any = true diff --git a/server.json b/server.json index 2229881..c591969 100644 --- a/server.json +++ b/server.json @@ -6,12 +6,12 @@ "url": "https://github.com/ScrapeGraphAI/scrapegraph-mcp", "source": "github" }, - "version": "1.0.1", + "version": "2.0.0", "packages": [ { "registryType": "pypi", "identifier": "scrapegraph-mcp", - "version": "1.0.1", + "version": "2.0.0", "transport": { "type": "stdio" }, @@ -22,6 +22,13 @@ "format": "string", "isSecret": true, "name": "SGAI_API_KEY" + }, + { + "description": "Override API base URL (default https://api.scrapegraphai.com/api/v2)", + "isRequired": false, + "format": "string", + "isSecret": false, + "name": "SCRAPEGRAPH_API_BASE_URL" } ] } diff --git a/src/scrapegraph_mcp/server.py b/src/scrapegraph_mcp/server.py index 2d881d2..0fde832 100644 --- a/src/scrapegraph_mcp/server.py +++ b/src/scrapegraph_mcp/server.py @@ -1,16 +1,19 @@ #!/usr/bin/env python3 """ -MCP server for ScapeGraph API integration. - -This server exposes methods to use ScapeGraph's AI-powered web scraping services: -- markdownify: Convert any webpage into clean, formatted markdown -- smartscraper: Extract structured data from any webpage using AI -- searchscraper: Perform AI-powered web searches with structured results -- smartcrawler_initiate: Initiate intelligent multi-page web crawling with AI extraction or markdown conversion -- smartcrawler_fetch_results: Retrieve results from asynchronous crawling operations -- scrape: Fetch raw page content with optional JavaScript rendering -- sitemap: Extract and discover complete website structure -- agentic_scrapper: Execute complex multi-step web scraping workflows +MCP server for ScapeGraph API integration (API v2). 
+ +Aligned with scrapegraph-py v2 ([ScrapeGraphAI/scrapegraph-py#82](https://github.com/ScrapeGraphAI/scrapegraph-py/pull/82)): +- markdownify: Page content via POST /scrape (markdown by default) +- smartscraper: Structured extraction via POST /extract (URL input only) +- searchscraper: Web search via POST /search +- smartcrawler_initiate / smartcrawler_fetch_results: Async crawl via /crawl (markdown or html only) +- crawl_stop / crawl_resume: Control running crawl jobs +- scrape: Format-specific fetch (markdown, html, screenshot, branding) +- credits / sgai_history: Account usage and request history +- monitor_*: Scheduled extraction jobs (replaces legacy scheduled jobs) + +Removed on v2 (no API equivalent): sitemap, agentic_scrapper, markdownify_status, smartscraper_status. +Optional base URL override: SCRAPEGRAPH_API_BASE_URL (default https://api.scrapegraphai.com/api/v2). ## Parameter Validation and Error Handling @@ -55,40 +58,119 @@ import json import logging import os -from typing import Any, Dict, Optional, List, Union, Annotated, Literal +from typing import Annotated, Any, Dict, List, Literal, Optional, Union import httpx from fastmcp import Context, FastMCP +from pydantic import AliasChoices, BaseModel, Field from smithery.decorators import smithery -from pydantic import BaseModel, Field, AliasChoices +from starlette.requests import Request +from starlette.responses import JSONResponse # Configure logging logging.basicConfig( - level=logging.INFO, - format='%(asctime)s - %(name)s - %(levelname)s - %(message)s' + level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s" ) logger = logging.getLogger(__name__) +MCP_SERVER_VERSION = "2.0.0" +DEFAULT_API_BASE_URL = "https://api.scrapegraphai.com/api/v2" -class ScapeGraphClient: - """Client for interacting with the ScapeGraph API.""" - BASE_URL = "https://api.scrapegraphai.com/v1" +def _api_base_url() -> str: + return os.environ.get("SCRAPEGRAPH_API_BASE_URL", 
DEFAULT_API_BASE_URL).rstrip("/") - def __init__(self, api_key: str): - """ - Initialize the ScapeGraph API client. - Args: - api_key: API key for ScapeGraph API - """ +class ScapeGraphClient: + """HTTP client for ScrapeGraphAI API v2 (see scrapegraph-py PR #82).""" + + def __init__(self, api_key: str, base_url: Optional[str] = None) -> None: self.api_key = api_key + self.base_url = (base_url or _api_base_url()).rstrip("/") self.headers = { + "Authorization": f"Bearer {api_key}", "SGAI-APIKEY": api_key, - "Content-Type": "application/json" + "Content-Type": "application/json", + "accept": "application/json", + "X-SDK-Version": f"scrapegraph-mcp@{MCP_SERVER_VERSION}", } self.client = httpx.Client(timeout=httpx.Timeout(120.0)) + def _parse_response(self, response: httpx.Response) -> Dict[str, Any]: + if response.status_code >= 400: + raise Exception(f"Error {response.status_code}: {response.text}") + if not response.content: + return {"ok": True} + try: + out = response.json() + if isinstance(out, dict): + return out + return {"result": out} + except json.JSONDecodeError: + return {"raw": response.text} + + def _request( + self, + method: str, + path: str, + *, + json_body: Optional[Dict[str, Any]] = None, + params: Optional[Dict[str, Any]] = None, + ) -> Dict[str, Any]: + url = f"{self.base_url}{path}" + r = self.client.request(method, url, headers=self.headers, json=json_body, params=params) + return self._parse_response(r) + + def _fetch_config( + self, + *, + headers: Optional[Dict[str, str]] = None, + stealth: Optional[bool] = None, + mock: Optional[bool] = None, + scrolls: Optional[int] = None, + render_js: Optional[bool] = None, + wait_ms: Optional[int] = None, + ) -> Optional[Dict[str, Any]]: + cfg: Dict[str, Any] = {} + if headers is not None: + cfg["headers"] = headers + if stealth is not None: + cfg["stealth"] = stealth + if mock is not None: + cfg["mock"] = mock + if scrolls is not None: + cfg["scrolls"] = scrolls + if render_js is not None: + 
cfg["render_js"] = render_js + if wait_ms is not None: + cfg["wait_ms"] = wait_ms + return cfg or None + + def scrape_v2( + self, + website_url: str, + output_format: str = "markdown", + *, + fetch_config_dict: Optional[Dict[str, Any]] = None, + screenshot_full_page: bool = False, + ) -> Dict[str, Any]: + fmt = output_format.lower() + body: Dict[str, Any] = {"url": website_url} + if fmt == "markdown": + body["markdown"] = {"mode": "normal"} + elif fmt == "html": + body["html"] = {"mode": "normal"} + elif fmt == "screenshot": + body["screenshot"] = {"full_page": screenshot_full_page} + elif fmt == "branding": + body["branding"] = {} + else: + raise ValueError( + f"Invalid output_format {output_format!r}; use markdown, html, screenshot, or branding" + ) + if fetch_config_dict: + body["fetch_config"] = fetch_config_dict + return self._request("POST", "/scrape", json_body=body) def markdownify( self, @@ -96,178 +178,81 @@ def markdownify( headers: Optional[Dict[str, str]] = None, stealth: Optional[bool] = None, stream: Optional[bool] = None, - mock: Optional[bool] = None + mock: Optional[bool] = None, ) -> Dict[str, Any]: - """ - Convert a webpage into clean, formatted markdown. 
- - Args: - website_url: URL of the webpage to convert - headers: HTTP headers to include in the request (optional) - stealth: Enable stealth mode to avoid bot detection (optional) - stream: Enable streaming response for real-time updates (optional) - mock: Return mock data for testing purposes (optional) - - Returns: - Dictionary containing the markdown result - """ - url = f"{self.BASE_URL}/markdownify" - data = {"website_url": website_url} - - if headers is not None: - data["headers"] = headers - if stealth is not None: - data["stealth"] = stealth if stream is not None: - data["stream"] = stream - if mock is not None: - data["mock"] = mock - - response = self.client.post(url, headers=self.headers, json=data) - - if response.status_code != 200: - error_msg = f"Error {response.status_code}: {response.text}" - raise Exception(error_msg) - - return response.json() - - def markdownify_status(self, request_id: str) -> Dict[str, Any]: - """ - Get the status of a markdownify request. + logger.warning("stream is not supported on API v2 /scrape; ignoring") + fc = self._fetch_config(headers=headers, stealth=stealth, mock=mock) + return self.scrape_v2(website_url, "markdown", fetch_config_dict=fc) - Args: - request_id: The request ID to check status for - - Returns: - Dictionary containing the request status and results - """ - url = f"{self.BASE_URL}/markdownify/{request_id}" - response = self.client.get(url, headers=self.headers) - - if response.status_code != 200: - error_msg = f"Error {response.status_code}: {response.text}" - raise Exception(error_msg) - - return response.json() + def extract( + self, + user_prompt: str, + website_url: str, + output_schema: Optional[Dict[str, Any]] = None, + fetch_config_dict: Optional[Dict[str, Any]] = None, + ) -> Dict[str, Any]: + body: Dict[str, Any] = {"url": website_url, "prompt": user_prompt} + if output_schema is not None: + body["output_schema"] = output_schema + if fetch_config_dict: + body["fetch_config"] = fetch_config_dict 
+ return self._request("POST", "/extract", json_body=body) def smartscraper( self, user_prompt: str, - website_url: str = None, - website_html: str = None, - website_markdown: str = None, - output_schema: Dict[str, Any] = None, - number_of_scrolls: int = None, - total_pages: int = None, - render_heavy_js: bool = None, - stealth: bool = None + website_url: Optional[str] = None, + website_html: Optional[str] = None, + website_markdown: Optional[str] = None, + output_schema: Optional[Dict[str, Any]] = None, + number_of_scrolls: Optional[int] = None, + total_pages: Optional[int] = None, + render_heavy_js: Optional[bool] = None, + stealth: Optional[bool] = None, ) -> Dict[str, Any]: - """ - Extract structured data from a webpage using AI. - - Args: - user_prompt: Instructions for what data to extract - website_url: URL of the webpage to scrape (mutually exclusive with website_html and website_markdown) - website_html: HTML content to process locally (mutually exclusive with website_url and website_markdown, max 2MB) - website_markdown: Markdown content to process locally (mutually exclusive with website_url and website_html, max 2MB) - output_schema: JSON schema defining expected output structure (optional) - number_of_scrolls: Number of infinite scrolls to perform (0-50, default 0) - total_pages: Number of pages to process for pagination (1-100, default 1) - render_heavy_js: Enable heavy JavaScript rendering for dynamic pages (default false) - stealth: Enable stealth mode to avoid bot detection (default false) - - Returns: - Dictionary containing the extracted data - """ - url = f"{self.BASE_URL}/smartscraper" - data = {"user_prompt": user_prompt} - - # Add input source (mutually exclusive) - if website_url is not None: - data["website_url"] = website_url - elif website_html is not None: - data["website_html"] = website_html - elif website_markdown is not None: - data["website_markdown"] = website_markdown - else: - raise ValueError("Must provide one of: website_url, 
website_html, or website_markdown") + if website_html is not None or website_markdown is not None: + raise ValueError( + "API v2 extract supports URL input only; website_html and website_markdown " + "are not supported." + ) + if website_url is None: + raise ValueError("Must provide website_url") + if total_pages is not None and total_pages != 1: + raise ValueError( + "total_pages is not supported for extract on API v2; omit it or use 1, or use " + "smartcrawler_initiate for multi-page markdown/html crawl." + ) + fc = self._fetch_config( + stealth=stealth, scrolls=number_of_scrolls, render_js=render_heavy_js + ) + return self.extract(user_prompt, website_url, output_schema, fc) - # Add optional parameters + def search_api( + self, + query: str, + num_results: Optional[int] = None, + output_schema: Optional[Dict[str, Any]] = None, + ) -> Dict[str, Any]: + n = 5 if num_results is None else num_results + n = max(3, min(20, n)) + body: Dict[str, Any] = {"query": query, "num_results": n} if output_schema is not None: - data["output_schema"] = output_schema - if number_of_scrolls is not None: - data["number_of_scrolls"] = number_of_scrolls - if total_pages is not None: - data["total_pages"] = total_pages - if render_heavy_js is not None: - data["render_heavy_js"] = render_heavy_js - if stealth is not None: - data["stealth"] = stealth - - response = self.client.post(url, headers=self.headers, json=data) - - if response.status_code != 200: - error_msg = f"Error {response.status_code}: {response.text}" - raise Exception(error_msg) - - return response.json() - - def smartscraper_status(self, request_id: str) -> Dict[str, Any]: - """ - Get the status of a smartscraper request. 
-
-        Args:
-            request_id: The request ID to check status for
-
-        Returns:
-            Dictionary containing the request status and results
-        """
-        url = f"{self.BASE_URL}/smartscraper/{request_id}"
-        response = self.client.get(url, headers=self.headers)
-
-        if response.status_code != 200:
-            error_msg = f"Error {response.status_code}: {response.text}"
-            raise Exception(error_msg)
-
-        return response.json()
-
-    def searchscraper(self, user_prompt: str, num_results: int = None, number_of_scrolls: int = None, time_range: str = None) -> Dict[str, Any]:
-        """
-        Perform AI-powered web searches with structured results.
-
-        Args:
-            user_prompt: Search query or instructions
-            num_results: Number of websites to search (optional, default: 3 websites = 30 credits)
-            number_of_scrolls: Number of infinite scrolls to perform on each website (optional)
-            time_range: Filter results by time range (optional). Valid values: past_hour, past_24_hours, past_week, past_month, past_year
-
-        Returns:
-            Dictionary containing search results and reference URLs
-        """
-        url = f"{self.BASE_URL}/searchscraper"
-        data = {
-            "user_prompt": user_prompt
-        }
+            body["output_schema"] = output_schema
+        return self._request("POST", "/search", json_body=body)
-
-        # Add num_results to the request if provided
-        if num_results is not None:
-            data["num_results"] = num_results
-
-        # Add number_of_scrolls to the request if provided
-        if number_of_scrolls is not None:
-            data["number_of_scrolls"] = number_of_scrolls
-
-        # Add time_range to the request if provided
+    def searchscraper(
+        self,
+        user_prompt: str,
+        num_results: Optional[int] = None,
+        number_of_scrolls: Optional[int] = None,
+        time_range: Optional[str] = None,
+    ) -> Dict[str, Any]:
         if time_range is not None:
-            data["time_range"] = time_range
-
-        response = self.client.post(url, headers=self.headers, json=data)
-
-        if response.status_code != 200:
-            error_msg = f"Error {response.status_code}: {response.text}"
-            raise Exception(error_msg)
-
-        return response.json()
+            logger.warning("time_range is not supported on API v2 /search; ignoring")
+        if number_of_scrolls is not None:
+            logger.warning("number_of_scrolls is not supported on API v2 /search; ignoring")
+        return self.search_api(user_prompt, num_results=num_results)

     def scrape(
         self,
@@ -275,186 +260,152 @@ def scrape(
         render_heavy_js: Optional[bool] = None,
         mock: Optional[bool] = None,
         stealth: Optional[bool] = None,
-        stream: Optional[bool] = None
+        stream: Optional[bool] = None,
+        output_format: str = "markdown",
+        screenshot_full_page: bool = False,
     ) -> Dict[str, Any]:
-        """
-        Basic scrape endpoint to fetch page content.
-
-        Args:
-            website_url: URL to scrape
-            render_heavy_js: Whether to render heavy JS (optional)
-            mock: Return mock data for testing purposes (optional)
-            stealth: Enable stealth mode to avoid bot detection (optional)
-            stream: Enable streaming response for real-time updates (optional)
-
-        Returns:
-            Dictionary containing the scraped result
-        """
-        url = f"{self.BASE_URL}/scrape"
-        payload: Dict[str, Any] = {"website_url": website_url}
-        if render_heavy_js is not None:
-            payload["render_heavy_js"] = render_heavy_js
-        if mock is not None:
-            payload["mock"] = mock
-        if stealth is not None:
-            payload["stealth"] = stealth
-        if stream is not None:
-            payload["stream"] = stream
-
-        response = self.client.post(url, headers=self.headers, json=payload)
-        response.raise_for_status()
-        return response.json()
-
-    def sitemap(self, website_url: str, stream: Optional[bool] = None) -> Dict[str, Any]:
-        """
-        Extract sitemap for a given website.
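The new `searchscraper` wrapper above keeps the legacy tool signature while warning about and dropping parameters the v2 API no longer accepts. A standalone sketch of that warn-and-delegate pattern — the `SearchShim` class and its method names are illustrative stand-ins, not the real client:

```python
import logging
from typing import Any, Dict, Optional

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("compat-shim")


class SearchShim:
    """Minimal stand-in for the v2 client: legacy kwargs are accepted,
    warned about, and dropped before delegating to the v2 call."""

    def search_api(self, user_prompt: str, num_results: Optional[int] = None) -> Dict[str, Any]:
        # Stand-in for POST /search; records the request body instead of calling the API.
        body: Dict[str, Any] = {"user_prompt": user_prompt}
        if num_results is not None:
            body["num_results"] = num_results
        return {"status": "queued", "request": body}

    def searchscraper(
        self,
        user_prompt: str,
        num_results: Optional[int] = None,
        number_of_scrolls: Optional[int] = None,
        time_range: Optional[str] = None,
    ) -> Dict[str, Any]:
        # Same shape as the diff's shim: warn on unsupported v1 params, then delegate.
        if time_range is not None:
            logger.warning("time_range is not supported on API v2 /search; ignoring")
        if number_of_scrolls is not None:
            logger.warning("number_of_scrolls is not supported on API v2 /search; ignoring")
        return self.search_api(user_prompt, num_results=num_results)


shim = SearchShim()
result = shim.searchscraper("latest Python news", num_results=5, time_range="past_week")
print(result["request"])  # time_range was dropped; only v2 fields survive
```

This keeps old callers working without silently changing their results: the unsupported argument is logged once and discarded rather than forwarded to an endpoint that would reject it.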
-
-        Args:
-            website_url: Base website URL
-            stream: Enable streaming response for real-time updates (optional)
-
-        Returns:
-            Dictionary containing sitemap URLs/structure
-        """
-        url = f"{self.BASE_URL}/sitemap"
-        payload: Dict[str, Any] = {"website_url": website_url}
         if stream is not None:
-            payload["stream"] = stream
-
-        response = self.client.post(url, headers=self.headers, json=payload)
-        response.raise_for_status()
-        return response.json()
+            logger.warning("stream is not supported on API v2 /scrape; ignoring")
+        fc = self._fetch_config(stealth=stealth, mock=mock, render_js=render_heavy_js)
+        return self.scrape_v2(
+            website_url,
+            output_format,
+            fetch_config_dict=fc,
+            screenshot_full_page=screenshot_full_page,
+        )

-    def agentic_scrapper(
+    def crawl_start(
         self,
         url: str,
-        user_prompt: Optional[str] = None,
-        output_schema: Optional[Dict[str, Any]] = None,
-        steps: Optional[List[str]] = None,
-        ai_extraction: Optional[bool] = None,
-        persistent_session: Optional[bool] = None,
-        timeout_seconds: Optional[float] = None,
+        *,
+        depth: int = 2,
+        max_pages: int = 10,
+        crawl_format: str = "markdown",
+        include_patterns: Optional[List[str]] = None,
+        exclude_patterns: Optional[List[str]] = None,
+        fetch_config_dict: Optional[Dict[str, Any]] = None,
     ) -> Dict[str, Any]:
-        """
-        Run the Agentic Scraper workflow (no live session/browser interaction).
-
-        Args:
-            url: Target website URL
-            user_prompt: Instructions for what to do/extract (optional)
-            output_schema: Desired structured output schema (optional)
-            steps: High-level steps/instructions for the agent (optional)
-            ai_extraction: Whether to enable AI extraction mode (optional)
-            persistent_session: Whether to keep session alive between steps (optional)
-            timeout_seconds: Per-request timeout override in seconds (optional)
-        """
-        endpoint = f"{self.BASE_URL}/agentic-scrapper"
-        payload: Dict[str, Any] = {"url": url}
-        if user_prompt is not None:
-            payload["user_prompt"] = user_prompt
-        if output_schema is not None:
-            payload["output_schema"] = output_schema
-        if steps is not None:
-            payload["steps"] = steps
-        if ai_extraction is not None:
-            payload["ai_extraction"] = ai_extraction
-        if persistent_session is not None:
-            payload["persistent_session"] = persistent_session
-
-        if timeout_seconds is not None:
-            response = self.client.post(endpoint, headers=self.headers, json=payload, timeout=timeout_seconds)
-        else:
-            response = self.client.post(endpoint, headers=self.headers, json=payload)
-        response.raise_for_status()
-        return response.json()
+        cf = crawl_format.lower()
+        if cf not in ("markdown", "html"):
+            raise ValueError("crawl_format must be 'markdown' or 'html'")
+        body: Dict[str, Any] = {
+            "url": url,
+            "depth": depth,
+            "max_pages": max_pages,
+            "format": cf,
+        }
+        if include_patterns is not None:
+            body["include_patterns"] = include_patterns
+        if exclude_patterns is not None:
+            body["exclude_patterns"] = exclude_patterns
+        if fetch_config_dict:
+            body["fetch_config"] = fetch_config_dict
+        return self._request("POST", "/crawl", json_body=body)

     def smartcrawler_initiate(
-        self,
-        url: str,
-        prompt: str = None,
-        extraction_mode: str = "ai",
-        depth: int = None,
-        max_pages: int = None,
-        same_domain_only: bool = None
+        self,
+        url: str,
+        prompt: Optional[str] = None,
+        extraction_mode: str = "markdown",
+        depth: Optional[int] = None,
+        max_pages: Optional[int] = None,
+        same_domain_only: Optional[bool] = None,
+        include_patterns: Optional[List[str]] = None,
+        exclude_patterns: Optional[List[str]] = None,
     ) -> Dict[str, Any]:
-        """
-        Initiate a SmartCrawler request for multi-page web crawling.
-
-        SmartCrawler supports two modes:
-        - AI Extraction Mode (10 credits per page): Extracts structured data based on your prompt
-        - Markdown Conversion Mode (2 credits per page): Converts pages to clean markdown
-
-        Smartcrawler takes some time to process the request and returns the request id.
-        Use smartcrawler_fetch_results to get the results of the request.
-        You have to keep polling the smartcrawler_fetch_results until the request is complete.
-        The request is complete when the status is "completed".
-
-        Args:
-            url: Starting URL to crawl
-            prompt: AI prompt for data extraction (required for AI mode)
-            extraction_mode: "ai" for AI extraction or "markdown" for markdown conversion (default: "ai")
-            depth: Maximum link traversal depth (optional)
-            max_pages: Maximum number of pages to crawl (optional)
-            same_domain_only: Whether to crawl only within the same domain (optional)
-
-        Returns:
-            Dictionary containing the request ID for async processing
-        """
-        endpoint = f"{self.BASE_URL}/crawl"
-        data = {
-            "url": url
-        }
-
-        # Handle extraction mode
-        if extraction_mode == "markdown":
-            data["markdown_only"] = True
-        elif extraction_mode == "ai":
-            if prompt is None:
-                raise ValueError("prompt is required when extraction_mode is 'ai'")
-            data["prompt"] = prompt
-        else:
-            raise ValueError(f"Invalid extraction_mode: {extraction_mode}. Must be 'ai' or 'markdown'")
-        if depth is not None:
-            data["depth"] = depth
-        if max_pages is not None:
-            data["max_pages"] = max_pages
+        if extraction_mode == "ai":
+            raise ValueError(
+                "API v2 crawl does not support AI extraction across pages. Use smartscraper "
+                "(extract) for a single URL, or monitor_create for scheduled jobs. Use "
+                "extraction_mode 'markdown' or 'html' for multi-page crawl."
+            )
+        if extraction_mode not in ("markdown", "html"):
+            raise ValueError("extraction_mode must be 'markdown', 'html', or (deprecated) 'ai'")
+        if prompt is not None and extraction_mode in ("markdown", "html"):
+            logger.warning("prompt is ignored for markdown/html crawl on API v2")
         if same_domain_only is not None:
-            data["same_domain_only"] = same_domain_only
+            logger.warning(
+                "same_domain_only is not supported on API v2 crawl; use include_patterns / "
+                "exclude_patterns"
+            )
+        d = 2 if depth is None else depth
+        mp = 10 if max_pages is None else max_pages
+        if d < 1 or d > 10:
+            raise ValueError("depth must be between 1 and 10")
+        if mp < 1 or mp > 100:
+            raise ValueError("max_pages must be between 1 and 100")
+        return self.crawl_start(
+            url,
+            depth=d,
+            max_pages=mp,
+            crawl_format="markdown" if extraction_mode == "markdown" else "html",
+            include_patterns=include_patterns,
+            exclude_patterns=exclude_patterns,
+        )

-        response = self.client.post(endpoint, headers=self.headers, json=data)
+    def smartcrawler_fetch_results(self, request_id: str) -> Dict[str, Any]:
+        return self._request("GET", f"/crawl/{request_id}")

-        if response.status_code != 200:
-            error_msg = f"Error {response.status_code}: {response.text}"
-            raise Exception(error_msg)
+    def crawl_stop(self, crawl_id: str) -> Dict[str, Any]:
+        return self._request("POST", f"/crawl/{crawl_id}/stop")

-        return response.json()
+    def crawl_resume(self, crawl_id: str) -> Dict[str, Any]:
+        return self._request("POST", f"/crawl/{crawl_id}/resume")

-    def smartcrawler_fetch_results(self, request_id: str) -> Dict[str, Any]:
-        """
-        Fetch the results of a SmartCrawler operation.
+    def credits(self) -> Dict[str, Any]:
+        return self._request("GET", "/credits")

-        Args:
-            request_id: The request ID returned by smartcrawler_initiate
+    def history(
+        self,
+        endpoint: Optional[str] = None,
+        status_filter: Optional[str] = None,
+        limit: Optional[int] = None,
+        offset: Optional[int] = None,
+    ) -> Dict[str, Any]:
+        params = {
+            k: v
+            for k, v in {
+                "endpoint": endpoint,
+                "status": status_filter,
+                "limit": limit,
+                "offset": offset,
+            }.items()
+            if v is not None
+        }
+        return self._request("GET", "/history", params=params or None)

-        Returns:
-            Dictionary containing the crawled data (structured extraction or markdown)
-            and metadata about processed pages
+    def monitor_create(
+        self,
+        name: str,
+        url: str,
+        prompt: str,
+        cron: str,
+        output_schema: Optional[Dict[str, Any]] = None,
+    ) -> Dict[str, Any]:
+        body: Dict[str, Any] = {
+            "name": name,
+            "url": url,
+            "prompt": prompt,
+            "cron": cron,
+        }
+        if output_schema is not None:
+            body["output_schema"] = output_schema
+        return self._request("POST", "/monitor", json_body=body)

-        Note:
-            It takes some time to process the request and returns the results.
-            Meanwhile it returns the status of the request.
-            You have to keep polling the smartcrawler_fetch_results until the request is complete.
-            The request is complete when the status is "completed". and you get results
-            Keep polling the smartcrawler_fetch_results until the request is complete.
-        """
-        endpoint = f"{self.BASE_URL}/crawl/{request_id}"
-
-        response = self.client.get(endpoint, headers=self.headers)
+    def monitor_list(self) -> Dict[str, Any]:
+        return self._request("GET", "/monitor")

-        if response.status_code != 200:
-            error_msg = f"Error {response.status_code}: {response.text}"
-            raise Exception(error_msg)
+    def monitor_get(self, monitor_id: str) -> Dict[str, Any]:
+        return self._request("GET", f"/monitor/{monitor_id}")

-        return response.json()
+    def monitor_pause(self, monitor_id: str) -> Dict[str, Any]:
+        return self._request("POST", f"/monitor/{monitor_id}/pause")
+
+    def monitor_resume(self, monitor_id: str) -> Dict[str, Any]:
+        return self._request("POST", f"/monitor/{monitor_id}/resume")
+
+    def monitor_delete(self, monitor_id: str) -> Dict[str, Any]:
+        return self._request("DELETE", f"/monitor/{monitor_id}")

     def close(self) -> None:
         """Close the HTTP client."""
@@ -495,20 +446,20 @@ def get_api_key(ctx: Context) -> str:
     # Try HTTP header first (for remote/Render deployments)
     try:
         headers = get_http_headers()
-        api_key = headers.get('x-api-key')
+        api_key = headers.get("x-api-key")
         if api_key:
             logger.info("API key retrieved from X-API-Key header")
-            return api_key
+            return str(api_key)
     except LookupError:
         # Not in HTTP context, try session config (Smithery/stdio mode)
         pass

     # Try session config (for Smithery/stdio deployments)
-    if hasattr(ctx, 'session_config') and ctx.session_config is not None:
-        api_key = getattr(ctx.session_config, 'scrapegraph_api_key', None)
+    if hasattr(ctx, "session_config") and ctx.session_config is not None:
+        api_key = getattr(ctx.session_config, "scrapegraph_api_key", None)
         if api_key:
             logger.info("API key retrieved from session config")
-            return api_key
+            return str(api_key)

     logger.error("No API key found in header or session config")
     raise ValueError(
@@ -524,9 +475,8 @@ def get_api_key(ctx: Context) -> str:
 # Health check endpoint for remote deployments (Render, etc.)
 @mcp.custom_route("/health", methods=["GET"])
-async def health_check(request):
+async def health_check(_request: Request) -> JSONResponse:
     """Health check endpoint for container orchestration and load balancers."""
-    from starlette.responses import JSONResponse
     return JSONResponse({"status": "healthy", "service": "scrapegraph-mcp"})


@@ -535,81 +485,29 @@ async def health_check(request):
 def web_scraping_guide() -> str:
     """
     A comprehensive guide to using ScapeGraph's web scraping tools effectively.
-    
+
     This prompt provides examples and best practices for each tool in the ScapeGraph MCP server.
     """
-    return """# ScapeGraph Web Scraping Guide
-
-## Available Tools Overview
-
-### 1. **markdownify** - Convert webpages to clean markdown
-**Use case**: Get clean, readable content from any webpage
-**Example**:
-- Input: `https://docs.python.org/3/tutorial/`
-- Output: Clean markdown of the Python tutorial
-
-### 2. **smartscraper** - AI-powered data extraction
-**Use case**: Extract specific structured data using natural language prompts
-**Examples**:
-- "Extract all product names and prices from this e-commerce page"
-- "Get contact information including email, phone, and address"
-- "Find all article titles, authors, and publication dates"
-
-### 3. **searchscraper** - AI web search with extraction
-**Use case**: Search the web and extract structured information
-**Examples**:
-- "Find the latest AI research papers and their abstracts"
-- "Search for Python web scraping tutorials with ratings"
-- "Get current cryptocurrency prices and market caps"
-
-### 4. **smartcrawler_initiate** - Multi-page intelligent crawling
-**Use case**: Crawl multiple pages with AI extraction or markdown conversion
-**Modes**:
-- AI Mode (10 credits/page): Extract structured data
-- Markdown Mode (2 credits/page): Convert to markdown
-**Example**: Crawl a documentation site to extract all API endpoints
-
-### 5. **smartcrawler_fetch_results** - Get crawling results
-**Use case**: Retrieve results from initiated crawling operations
-**Note**: Keep polling until status is "completed"
-
-### 6. **scrape** - Basic page content fetching
-**Use case**: Get raw page content with optional JavaScript rendering
-**Example**: Fetch content from dynamic pages that require JS
-
-### 7. **sitemap** - Extract website structure
-**Use case**: Get all URLs and structure of a website
-**Example**: Map out a website's architecture before crawling
-
-### 8. **agentic_scrapper** - AI-powered automated scraping
-**Use case**: Complex multi-step scraping with AI automation
-**Example**: Navigate through forms, click buttons, extract data
-
-## Best Practices
-
-1. **Start Simple**: Use `markdownify` or `scrape` for basic content
-2. **Be Specific**: Provide detailed prompts for better AI extraction
-3. **Use Crawling Wisely**: Set appropriate limits for `max_pages` and `depth`
-4. **Monitor Credits**: AI extraction uses more credits than markdown conversion
-5. **Handle Async**: Use `smartcrawler_fetch_results` to poll for completion
-
-## Common Workflows
-
-### Extract Product Information
-1. Use `smartscraper` with prompt: "Extract product name, price, description, and availability"
-2. For multiple pages: Use `smartcrawler_initiate` in AI mode
-
-### Research and Analysis
-1. Use `searchscraper` to find relevant pages
-2. Use `smartscraper` on specific pages for detailed extraction
-
-### Site Documentation
-1. Use `sitemap` to discover all pages
-2. Use `smartcrawler_initiate` in markdown mode to convert all pages
-
-### Complex Navigation
-1. Use `agentic_scrapper` for sites requiring interaction
-2. Provide step-by-step instructions in the `steps` parameter
+    return """# ScapeGraph Web Scraping Guide (API v2)
+
+See [scrapegraph-py#82](https://github.com/ScrapeGraphAI/scrapegraph-py/pull/82) for the upstream SDK migration.
+
+## Core tools
+- **markdownify** — `POST /scrape` (markdown output)
+- **scrape** — `POST /scrape` (markdown, html, screenshot, branding)
+- **smartscraper** — `POST /extract` (URL + prompt; no inline HTML/markdown source on v2)
+- **searchscraper** — `POST /search` (query + num_results 3–20)
+- **smartcrawler_initiate** / **smartcrawler_fetch_results** — `POST/GET /crawl` (markdown or html crawl only; no per-page AI crawl on v2)
+- **crawl_stop** / **crawl_resume** — control a running job
+- **credits** — `GET /credits`
+- **sgai_history** — `GET /history`
+- **monitor_*** — scheduled jobs (`POST/GET/DELETE /monitor`, pause/resume)
+
+## Best practices
+1. Use **markdownify** or **scrape** before **smartscraper** when you only need readable text.
+2. Multi-page **AI** extraction: run **smartscraper** per URL, or use **monitor_create** on a schedule.
+3. Poll **smartcrawler_fetch_results** until the crawl finishes.
+4. Override the API host with the env var **SCRAPEGRAPH_API_BASE_URL** if needed (defaults to the production v2 base URL).
 """


@@ -617,78 +515,49 @@ def web_scraping_guide() -> str:
 def quick_start_examples() -> str:
     """
     Quick start examples for common ScapeGraph use cases.
-    
+
     Ready-to-use examples for immediate productivity.
""" - return """# ScapeGraph Quick Start Examples - -## πŸš€ Ready-to-Use Examples + return """# ScapeGraph Quick Start (API v2) -### Extract E-commerce Product Data +### Extract structured data (single URL) ``` Tool: smartscraper -URL: https://example-shop.com/products/laptop -Prompt: "Extract product name, price, specifications, customer rating, and availability status" +website_url: https://example.com/product/1 +user_prompt: "Extract name, price, and availability" ``` -### Convert Documentation to Markdown +### Markdown snapshot ``` Tool: markdownify -URL: https://docs.example.com/api-reference +website_url: https://docs.example.com ``` -### Research Latest News +### Search ``` Tool: searchscraper -Prompt: "Find latest news about artificial intelligence breakthroughs in 2024" +user_prompt: "Latest Python 3.12 release highlights" num_results: 5 ``` -### Crawl Entire Blog for Articles +### Multi-page crawl (markdown/html only) ``` Tool: smartcrawler_initiate -URL: https://blog.example.com -Prompt: "Extract article title, author, publication date, and summary" -extraction_mode: "ai" -max_pages: 20 -``` - -### Get Website Structure -``` -Tool: sitemap -URL: https://example.com +url: https://blog.example.com +extraction_mode: "markdown" +max_pages: 15 +depth: 2 ``` +Then poll `smartcrawler_fetch_results` with the returned `id`. -### Extract Contact Information +### Credits and history ``` -Tool: smartscraper -URL: https://company.example.com/contact -Prompt: "Find all contact methods: email addresses, phone numbers, physical address, and social media links" +Tool: credits +Tool: sgai_history +limit: 10 ``` -### Automated Form Navigation -``` -Tool: agentic_scrapper -URL: https://example.com/search -user_prompt: "Navigate to the search page, enter 'web scraping tools', and extract the top 5 results" -steps: ["Find search box", "Enter search term", "Submit form", "Extract results"] -``` - -## πŸ’‘ Pro Tips - -1. 
**For Dynamic Content**: Use `render_heavy_js: true` with the `scrape` tool -2. **For Large Sites**: Start with `sitemap` to understand structure -3. **For Async Operations**: Always poll `smartcrawler_fetch_results` until complete -4. **For Complex Sites**: Use `agentic_scrapper` with detailed step instructions -5. **For Cost Efficiency**: Use markdown mode for content conversion, AI mode for data extraction - -## πŸ”§ Configuration - -Set your API key via: -- Environment variable: `SGAI_API_KEY=your_key_here` -- MCP configuration: `scrapegraph_api_key: "your_key_here"` - -No configuration required - the server works with environment variables! +Auth: `SGAI_API_KEY` or MCP `scrapegraphApiKey`. Optional: `SCRAPEGRAPH_API_BASE_URL`. """ @@ -697,44 +566,22 @@ def quick_start_examples() -> str: def api_status() -> str: """ Current status and capabilities of the ScapeGraph API server. - + Provides real-time information about available tools, credit usage, and server health. """ - return """# ScapeGraph API Status - -## Server Information -- **Status**: βœ… Online and Ready -- **Version**: 1.0.0 -- **Base URL**: https://api.scrapegraphai.com/v1 - -## Available Tools -1. **markdownify** - Convert webpages to markdown (2 credits/page) -2. **smartscraper** - AI data extraction (10 credits/page) -3. **searchscraper** - AI web search (30 credits for 3 websites) -4. **smartcrawler** - Multi-page crawling (2-10 credits/page) -5. **scrape** - Basic page fetching (1 credit/page) -6. **sitemap** - Website structure extraction (1 credit) -7. 
**agentic_scrapper** - AI automation (variable credits) - -## Credit Costs -- **Markdown Conversion**: 2 credits per page -- **AI Extraction**: 10 credits per page -- **Web Search**: 10 credits per website (default 3 websites) -- **Basic Scraping**: 1 credit per page -- **Sitemap**: 1 credit per request - -## Configuration -- **API Key**: Required (set via SGAI_API_KEY env var or config) -- **Timeout**: 120 seconds default (configurable) -- **Rate Limits**: Applied per API key - -## Best Practices -- Use markdown mode for content conversion (cheaper) -- Use AI mode for structured data extraction -- Set appropriate limits for crawling operations -- Monitor credit usage for cost optimization - -Last Updated: $(date) + return """# ScapeGraph API Status (MCP v2) + +- **MCP package version**: 2.0.0 (matches [scrapegraph-py#82](https://github.com/ScrapeGraphAI/scrapegraph-py/pull/82) API surface) +- **Default API base**: `https://api.scrapegraphai.com/api/v2` (override with `SCRAPEGRAPH_API_BASE_URL`) +- **Auth headers**: `Authorization: Bearer`, `SGAI-APIKEY`, `X-SDK-Version: scrapegraph-mcp@2.0.0` + +## Tools +markdownify, scrape, smartscraper, searchscraper, smartcrawler_initiate, smartcrawler_fetch_results, crawl_stop, crawl_resume, credits, sgai_history, monitor_create, monitor_list, monitor_get, monitor_pause, monitor_resume, monitor_delete + +## Removed vs legacy MCP +sitemap, agentic_scrapper, markdownify_status, smartscraper_status β€” not available on API v2. + +Credit costs are determined by the ScrapeGraphAI API; use **credits** to check balance. """ @@ -742,7 +589,7 @@ def api_status() -> str: def common_use_cases() -> str: """ Common use cases and example implementations for ScapeGraph tools. - + Real-world examples with expected inputs and outputs. """ return """# ScapeGraph Common Use Cases @@ -848,11 +695,16 @@ def common_use_cases() -> str: def parameter_reference_guide() -> str: """ Comprehensive parameter reference guide for all ScapeGraph MCP tools. 
-    
+
     Complete documentation of every parameter with examples, constraints, and best practices.
     """
     return """# ScapeGraph MCP Parameter Reference Guide

+> **API v2 note:** This document still contains legacy v1-era tool names and parameters in places.
+> Trust the live tool schemas in the MCP client and the module docstring in `server.py` for v2.
+> New tools: `credits`, `sgai_history`, `crawl_stop`, `crawl_resume`, `monitor_*`. Removed: `sitemap`,
+> `agentic_scrapper`, `markdownify_status`, `smartscraper_status`.
+
 ## 📋 Complete Parameter Documentation

 This guide provides comprehensive documentation for every parameter across all ScapeGraph MCP tools. Use this as your definitive reference for understanding parameter behavior, constraints, and best practices.

@@ -862,12 +714,12 @@ def parameter_reference_guide() -> str:
 ## 🔧 Common Parameters

 ### URL Parameters
-**Used in**: markdownify, smartscraper, searchscraper, smartcrawler_initiate, scrape, sitemap, agentic_scrapper
+**Used in**: markdownify, smartscraper, searchscraper, smartcrawler_initiate, scrape, monitor_*, and related v2 tools

 #### `website_url` / `url`
 - **Type**: `str` (required)
 - **Format**: Must include protocol (http:// or https://)
-- **Examples**: 
+- **Examples**:
   - ✅ `https://example.com/page`
   - ✅ `https://docs.python.org/3/tutorial/`
   - ❌ `example.com` (missing protocol)

@@ -1287,11 +1139,13 @@ def parameter_reference_guide() -> str:
 def tool_comparison_guide() -> str:
     """
     Detailed comparison of ScapeGraph tools to help choose the right tool for each task.
-    
+
     Decision matrix and feature comparison across all available tools.
     """
     return """# ScapeGraph Tools Comparison Guide

+> **API v2:** Some rows reference removed tools (`sitemap`, `agentic_scrapper`). See `scrapegraph://api/status`.
+
 ## 🎯 Quick Decision Matrix

 | Need | Recommended Tool | Alternative | Credits |
@@ -1439,7 +1293,7 @@ def markdownify(
     headers: Optional[Dict[str, str]] = None,
     stealth: Optional[bool] = None,
     stream: Optional[bool] = None,
-    mock: Optional[bool] = None
+    mock: Optional[bool] = None,
 ) -> Dict[str, Any]:
     """
     Convert a webpage into clean, formatted markdown.
@@ -1512,69 +1366,12 @@ def markdownify(
         api_key = get_api_key(ctx)
         client = ScapeGraphClient(api_key)
         return client.markdownify(
-            website_url=website_url,
-            headers=headers,
-            stealth=stealth,
-            stream=stream,
-            mock=mock
+            website_url=website_url, headers=headers, stealth=stealth, stream=stream, mock=mock
         )
     except Exception as e:
         return {"error": str(e)}


-# Add tool for markdownify status
-@mcp.tool(annotations={"readOnlyHint": True, "destructiveHint": False, "idempotentHint": True})
-def markdownify_status(request_id: str, ctx: Context) -> Dict[str, Any]:
-    """
-    Get the status and results of a markdownify conversion request.
-
-    This tool retrieves the status of a previously initiated markdown conversion using the request_id.
-    Use this when you need to check the status or retrieve results of an asynchronous markdownify operation.
-    Read-only operation with no side effects.
-
-    Args:
-        request_id (str): The unique request identifier returned by a previous markdownify call.
-            - Format: UUID string (e.g., "123e4567-e89b-12d3-a456-426614174000")
-            - Used to track and retrieve specific conversion results
-            - Each markdownify operation may return a request_id for status checking
-            - Examples:
-                * "7f3d8a9c-1234-5678-9abc-def012345678"
-                * "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
-
-    Returns:
-        Dictionary containing:
-        - request_id: The request identifier
-        - status: Current status of the conversion ("queued", "processing", "completed", "failed")
-        - result: The converted markdown content (when status is "completed")
-        - website_url: The URL that was converted
-        - error: Error message if status is "failed" (empty string otherwise)
-        - processing_time: Time taken for the conversion (when completed)
-        - credits_used: Number of credits consumed
-
-    Raises:
-        ValueError: If request_id is malformed or invalid
-        HTTPError: If the request cannot be found (404) or server error occurs
-
-    Use Cases:
-        - Checking the status of long-running markdown conversions
-        - Retrieving results from asynchronous markdownify operations
-        - Monitoring conversion progress for large or complex pages
-        - Verifying completion before proceeding with next steps
-
-    Note:
-        - Some markdownify operations may complete synchronously and not require status checks
-        - If status is "processing" or "queued", poll this endpoint again after a delay
-        - Once status is "completed", the result field will contain the markdown content
-        - Failed requests will have status "failed" and an error message in the error field
-    """
-    try:
-        api_key = get_api_key(ctx)
-        client = ScapeGraphClient(api_key)
-        return client.markdownify_status(request_id=request_id)
-    except Exception as e:
-        return {"error": str(e)}
-
-
 # Add tool for smartscraper
 @mcp.tool(annotations={"readOnlyHint": True, "destructiveHint": False, "idempotentHint": True})
 def smartscraper(
@@ -1583,20 +1380,20 @@ def smartscraper(
     website_url: Optional[str] = None,
     website_html: Optional[str] = None,
     website_markdown: Optional[str] = None,
-    output_schema: Optional[Annotated[Union[str, Dict[str, Any]], Field(
-        default=None,
-        description="JSON schema dict or JSON string defining the expected output structure",
-        json_schema_extra={
-            "oneOf": [
-                {"type": "string"},
-                {"type": "object"}
-            ]
-        }
-    )]] = None,
+    output_schema: Optional[
+        Annotated[
+            Union[str, Dict[str, Any]],
+            Field(
+                default=None,
+                description="JSON schema dict or JSON string defining the expected output structure",
+                json_schema_extra={"oneOf": [{"type": "string"}, {"type": "object"}]},
+            ),
+        ]
+    ] = None,
     number_of_scrolls: Optional[int] = None,
     total_pages: Optional[int] = None,
     render_heavy_js: Optional[bool] = None,
-    stealth: Optional[bool] = None
+    stealth: Optional[bool] = None,
 ) -> Dict[str, Any]:
     """
     Extract structured data from a webpage, HTML, or markdown using AI-powered extraction.
@@ -1750,69 +1547,12 @@ def smartscraper(
             number_of_scrolls=number_of_scrolls,
             total_pages=total_pages,
             render_heavy_js=render_heavy_js,
-            stealth=stealth
+            stealth=stealth,
         )
     except Exception as e:
         return {"error": str(e)}


-# Add tool for smartscraper status
-@mcp.tool(annotations={"readOnlyHint": True, "destructiveHint": False, "idempotentHint": True})
-def smartscraper_status(request_id: str, ctx: Context) -> Dict[str, Any]:
-    """
-    Get the status and results of a smartscraper extraction request.
-
-    This tool retrieves the status of a previously initiated AI-powered data extraction using the request_id.
-    Use this when you need to check the status or retrieve results of an asynchronous smartscraper operation.
-    Read-only operation with no side effects.
-
-    Args:
-        request_id (str): The unique request identifier returned by a previous smartscraper call.
- - Format: UUID string (e.g., "123e4567-e89b-12d3-a456-426614174000") - - Used to track and retrieve specific extraction results - - Each smartscraper operation may return a request_id for status checking - - Examples: - * "7f3d8a9c-1234-5678-9abc-def012345678" - * "a1b2c3d4-e5f6-7890-abcd-ef1234567890" - - Returns: - Dictionary containing: - - request_id: The request identifier - - status: Current status of the extraction ("queued", "processing", "completed", "failed") - - result: The extracted structured data (when status is "completed") - - website_url: The URL that was scraped (if applicable) - - user_prompt: The original extraction prompt - - error: Error message if status is "failed" (empty string otherwise) - - processing_time: Time taken for the extraction (when completed) - - credits_used: Number of credits consumed - - pages_processed: Number of pages analyzed - - Raises: - ValueError: If request_id is malformed or invalid - HTTPError: If the request cannot be found (404) or server error occurs - - Use Cases: - - Checking the status of long-running data extractions - - Retrieving results from asynchronous smartscraper operations - - Monitoring extraction progress for complex or multi-page scraping - - Verifying completion before proceeding with next steps - - Handling extraction errors and retries - - Note: - - Some smartscraper operations may complete synchronously and not require status checks - - If status is "processing" or "queued", poll this endpoint again after a delay - - Once status is "completed", the result field will contain the extracted structured data - - Failed requests will have status "failed" and an error message in the error field - - The extracted data format depends on the output_schema provided in the original request - """ - try: - api_key = get_api_key(ctx) - client = ScapeGraphClient(api_key) - return client.smartscraper_status(request_id=request_id) - except Exception as e: - return {"error": str(e)} - - # Add tool for 
searchscraper @mcp.tool(annotations={"readOnlyHint": True, "destructiveHint": False, "idempotentHint": False}) def searchscraper( @@ -1820,7 +1560,9 @@ def searchscraper( ctx: Context, num_results: Optional[int] = None, number_of_scrolls: Optional[int] = None, - time_range: Optional[Literal["past_hour", "past_24_hours", "past_week", "past_month", "past_year"]] = None + time_range: Optional[ + Literal["past_hour", "past_24_hours", "past_week", "past_month", "past_year"] + ] = None, ) -> Dict[str, Any]: """ Perform AI-powered web searches with structured data extraction. @@ -1846,10 +1588,8 @@ def searchscraper( * Mention timeframes or filters (e.g., "latest", "2024", "top 10") * Specify data types needed (prices, dates, ratings, etc.) - num_results (Optional[int]): Number of websites to search and extract data from. - - Default: 3 websites (costs 30 credits total) - - Range: 1-20 websites (recommended to stay under 10 for cost efficiency) - - Each website costs 10 credits, so total cost = num_results Γ— 10 + num_results (Optional[int]): Number of search results (API v2 allows 3–20; values are clamped). + - Default: 5 (SDK v2 default) - Examples: * 1: Quick single-source lookup (10 credits) * 3: Standard research (30 credits) - good balance of coverage and cost @@ -1857,32 +1597,10 @@ def searchscraper( * 10: Extensive analysis (100 credits) - Note: More results provide broader coverage but increase costs and processing time - number_of_scrolls (Optional[int]): Number of infinite scrolls per searched webpage. 
-            - Default: 0 (no scrolling on search result pages)
-            - Range: 0-10 scrolls per page
-            - Useful when search results point to pages with dynamic content loading
-            - Each scroll waits for content to load before continuing
-            - Examples:
-              * 0: Static content pages, news articles, documentation
-              * 2: Social media pages, product listings with lazy loading
-              * 5: Extensive feeds, long-form content with infinite scroll
-            - Note: Increases processing time significantly (adds 5-10 seconds per scroll per page)
-
-        time_range (Optional[str]): Filter search results by time range.
-            - Default: None (no time filter applied)
-            - Valid values:
-              * "past_hour": Results from the last hour
-              * "past_24_hours": Results from the last 24 hours
-              * "past_week": Results from the last 7 days
-              * "past_month": Results from the last 30 days
-              * "past_year": Results from the last 365 days
-            - Examples:
-              * Use "past_hour" for breaking news or real-time updates
-              * Use "past_24_hours" for recent developments
-              * Use "past_week" for current events and trending topics
-              * Use "past_month" for recent but not immediate information
-              * Use "past_year" for relatively recent content
-            - Note: Useful for finding recent information or filtering out outdated content
+        number_of_scrolls (Optional[int]): **Not supported on API v2** — ignored.
+
+        time_range (Optional[str]): **Not supported on API v2** — this parameter is ignored.
+            The v2 /search endpoint does not accept time_range; omit it or expect no effect.
     Returns:
         Dictionary containing:
@@ -1921,140 +1639,30 @@ def smartcrawler_initiate(
     url: str,
     ctx: Context,
     prompt: Optional[str] = None,
-    extraction_mode: str = "ai",
+    extraction_mode: str = "markdown",
     depth: Optional[int] = None,
     max_pages: Optional[int] = None,
-    same_domain_only: Optional[bool] = None
+    same_domain_only: Optional[bool] = None,
+    include_patterns: Optional[List[str]] = None,
+    exclude_patterns: Optional[List[str]] = None,
 ) -> Dict[str, Any]:
     """
-    Initiate an asynchronous multi-page web crawling operation with AI extraction or markdown conversion.
+    Start an asynchronous multi-page crawl (API v2 POST /crawl).
 
-    This tool starts an intelligent crawler that discovers and processes multiple pages from a starting URL.
-    Choose between AI Extraction Mode (10 credits/page) for structured data or Markdown Mode (2 credits/page)
-    for content conversion. The operation is asynchronous - use smartcrawler_fetch_results to retrieve results.
-    Creates a new crawl request (non-idempotent, non-read-only).
+    API v2 supports **markdown** or **html** output per page only. Per-page AI extraction during crawl
+    is not available; use ``smartscraper`` for a single URL or ``monitor_create`` for scheduled extraction.
 
-    SmartCrawler supports two modes:
-    - AI Extraction Mode: Extracts structured data based on your prompt from every crawled page
-    - Markdown Conversion Mode: Converts each page to clean markdown format
+    Poll ``smartcrawler_fetch_results`` with the returned job ``id`` until the crawl completes.
 
     Args:
-        url (str): The starting URL to begin crawling from.
-            - Must include protocol (http:// or https://)
-            - The crawler will discover and process linked pages from this starting point
-            - Should be a page with links to other pages you want to crawl
-            - Examples:
-              * https://docs.example.com (documentation site root)
-              * https://blog.company.com (blog homepage)
-              * https://example.com/products (product category page)
-              * https://news.site.com/category/tech (news section)
-            - Best practices:
-              * Use homepage or main category pages as starting points
-              * Ensure the starting page has links to content you want to crawl
-              * Consider site structure when choosing the starting URL
-
-        prompt (Optional[str]): AI prompt for data extraction.
-            - REQUIRED when extraction_mode is 'ai'
-            - Ignored when extraction_mode is 'markdown'
-            - Describes what data to extract from each crawled page
-            - Applied consistently across all discovered pages
-            - Examples:
-              * "Extract API endpoint name, method, parameters, and description"
-              * "Get article title, author, publication date, and summary"
-              * "Find product name, price, description, and availability"
-              * "Extract job title, company, location, salary, and requirements"
-            - Tips for better results:
-              * Be specific about fields you want from each page
-              * Consider that different pages may have different content structures
-              * Use general terms that apply across multiple page types
-
-        extraction_mode (str): Extraction mode for processing crawled pages.
-            - Default: "ai"
-            - Options:
-              * "ai": AI-powered structured data extraction (10 credits per page)
-                - Uses the prompt to extract specific data from each page
-                - Returns structured JSON data
-                - More expensive but provides targeted information
-                - Best for: Data collection, research, structured analysis
-              * "markdown": Simple markdown conversion (2 credits per page)
-                - Converts each page to clean markdown format
-                - No AI processing, just content conversion
-                - More cost-effective for content archival
-                - Best for: Documentation backup, content migration, reading
-            - Cost comparison:
-              * AI mode: 50 pages = 500 credits
-              * Markdown mode: 50 pages = 100 credits
-
-        depth (Optional[int]): Maximum depth of link traversal from the starting URL.
-            - Default: unlimited (will follow links until max_pages or no more links)
-            - Depth levels:
-              * 0: Only the starting URL (no link following)
-              * 1: Starting URL + pages directly linked from it
-              * 2: Starting URL + direct links + links from those pages
-              * 3+: Continues following links to specified depth
-            - Examples:
-              * 1: Crawl blog homepage + all blog posts
-              * 2: Crawl docs homepage + category pages + individual doc pages
-              * 3: Deep crawling for comprehensive site coverage
-            - Considerations:
-              * Higher depth can lead to exponential page growth
-              * Use with max_pages to control scope and cost
-              * Consider site structure when setting depth
-
-        max_pages (Optional[int]): Maximum number of pages to crawl in total.
-            - Default: unlimited (will crawl until no more links or depth limit)
-            - Recommended ranges:
-              * 10-20: Testing and small sites
-              * 50-100: Medium sites and focused crawling
-              * 200-500: Large sites and comprehensive analysis
-              * 1000+: Enterprise-level crawling (high cost)
-            - Cost implications:
-              * AI mode: max_pages × 10 credits
-              * Markdown mode: max_pages × 2 credits
-            - Examples:
-              * 10: Quick site sampling (20-100 credits)
-              * 50: Standard documentation crawl (100-500 credits)
-              * 200: Comprehensive site analysis (400-2000 credits)
-            - Note: Crawler stops when this limit is reached, regardless of remaining links
-
-        same_domain_only (Optional[bool]): Whether to crawl only within the same domain.
-            - Default: true (recommended for most use cases)
-            - Options:
-              * true: Only crawl pages within the same domain as starting URL
-                - Prevents following external links
-                - Keeps crawling focused on the target site
-                - Reduces risk of crawling unrelated content
-                - Example: Starting at docs.example.com only crawls docs.example.com pages
-              * false: Allow crawling external domains
-                - Follows links to other domains
-                - Can lead to very broad crawling scope
-                - May crawl unrelated or unwanted content
-                - Use with caution and appropriate max_pages limit
-            - Recommendations:
-              * Use true for focused site crawling
-              * Use false only when you specifically need cross-domain data
-              * Always set max_pages when using false to prevent runaway crawling
-
-    Returns:
-        Dictionary containing:
-        - request_id: Unique identifier for this crawl operation (use with smartcrawler_fetch_results)
-        - status: Initial status of the crawl request ("initiated" or "processing")
-        - estimated_cost: Estimated credit cost based on parameters (actual cost may vary)
-        - crawl_parameters: Summary of the crawling configuration
-        - estimated_time: Rough estimate of processing time
-        - next_steps: Instructions for retrieving results
-
-    Raises:
-        ValueError: If URL is malformed, prompt is missing for AI mode, or parameters are invalid
-        HTTPError: If the starting URL cannot be accessed
-        RateLimitError: If too many crawl requests are initiated too quickly
-
-    Note:
-        - This operation is asynchronous and may take several minutes to complete
-        - Use smartcrawler_fetch_results with the returned request_id to get results
-        - Keep polling smartcrawler_fetch_results until status is "completed"
-        - Actual pages crawled may be less than max_pages if fewer links are found
-        - Processing time increases with max_pages, depth, and extraction_mode complexity
+        url: Starting URL (http/https).
+        prompt: Ignored for markdown/html crawl (kept for signature compatibility).
+        extraction_mode: ``markdown`` (default) or ``html``. ``ai`` is not supported on v2 crawl.
+        depth: Crawl depth (1–10, default 2).
+        max_pages: Max pages (1–100, default 10).
+        same_domain_only: Not sent to v2 API; use ``include_patterns`` / ``exclude_patterns`` instead.
+        include_patterns: Optional URL patterns to include.
+        exclude_patterns: Optional URL patterns to exclude.
     """
     try:
         api_key = get_api_key(ctx)
@@ -2065,7 +1673,9 @@ def smartcrawler_initiate(
            extraction_mode=extraction_mode,
            depth=depth,
            max_pages=max_pages,
-            same_domain_only=same_domain_only
+            same_domain_only=same_domain_only,
+            include_patterns=include_patterns,
+            exclude_patterns=exclude_patterns,
         )
     except Exception as e:
         return {"error": str(e)}
@@ -2100,468 +1710,196 @@ def smartcrawler_fetch_results(request_id: str, ctx: Context) -> Dict[str, Any]:
         return {"error": str(e)}
 
 
-# Add tool for basic scrape
-@mcp.tool(annotations={"readOnlyHint": True, "destructiveHint": False, "idempotentHint": True})
-def scrape(
-    website_url: str,
-    ctx: Context,
-    render_heavy_js: Optional[bool] = None,
-    mock: Optional[bool] = None,
-    stealth: Optional[bool] = None,
-    stream: Optional[bool] = None
-) -> Dict[str, Any]:
-    """
-    Fetch raw page content from any URL with optional JavaScript rendering.
+@mcp.tool(annotations={"readOnlyHint": False, "destructiveHint": False, "idempotentHint": False})
+def crawl_stop(crawl_id: str, ctx: Context) -> Dict[str, Any]:
+    """Stop a running crawl job (API v2 POST /crawl/:id/stop)."""
+    try:
+        api_key = get_api_key(ctx)
+        client = ScapeGraphClient(api_key)
+        return client.crawl_stop(crawl_id)
+    except Exception as e:
+        return {"error": str(e)}
-
-    This tool performs basic web scraping to retrieve the raw HTML content of a webpage.
-    Optionally enable JavaScript rendering for Single Page Applications (SPAs) and sites with
-    heavy client-side rendering. Lower cost than AI extraction (1 credit/page).
-    Read-only operation with no side effects.
-
-    Args:
-        website_url (str): The complete URL of the webpage to scrape.
-            - Must include protocol (http:// or https://)
-            - Returns raw HTML content of the page
-            - Works with both static and dynamic websites
-            - Examples:
-              * https://example.com/page
-              * https://api.example.com/docs
-              * https://news.site.com/article/123
-              * https://app.example.com/dashboard (may need render_heavy_js=true)
-            - Supported protocols: HTTP, HTTPS
-            - Invalid examples:
-              * example.com (missing protocol)
-              * ftp://example.com (unsupported protocol)
-
-        render_heavy_js (Optional[bool]): Enable full JavaScript rendering for dynamic content.
-            - Default: false (faster, lower cost, works for most static sites)
-            - Set to true for sites that require JavaScript execution to display content
-            - When to use true:
-              * Single Page Applications (React, Angular, Vue.js)
-              * Sites with dynamic content loading via AJAX
-              * Content that appears only after JavaScript execution
-              * Interactive web applications
-              * Sites where initial HTML is mostly empty
-            - When to use false (default):
-              * Static websites and blogs
-              * Server-side rendered content
-              * Traditional HTML pages
-              * News articles and documentation
-              * When you need faster processing
-            - Performance impact:
-              * false: 2-5 seconds processing time
-              * true: 15-30 seconds processing time (waits for JS execution)
-            - Cost: Same (1 credit) regardless of render_heavy_js setting
+@mcp.tool(annotations={"readOnlyHint": False, "destructiveHint": False, "idempotentHint": False})
+def crawl_resume(crawl_id: str, ctx: Context) -> Dict[str, Any]:
+    """Resume a stopped crawl job (API v2 POST /crawl/:id/resume)."""
+    try:
+        api_key = get_api_key(ctx)
+        client = ScapeGraphClient(api_key)
+        return client.crawl_resume(crawl_id)
+    except Exception as e:
+        return {"error": str(e)}
-
-        mock (Optional[bool]): Return mock data for testing purposes.
-            - Default: false (real scraping)
-            - Set to true to receive mock/sample data instead of actually scraping the website
-            - Useful for testing and development without consuming credits or hitting rate limits
-            - When to use true:
-              * Testing your integration without making real requests
-              * Prototyping workflows before production use
-              * Development and debugging scenarios
-
-        stealth (Optional[bool]): Enable stealth mode to avoid bot detection.
-            - Default: false (standard scraping)
-            - Set to true to bypass basic anti-scraping measures
-            - Uses techniques to appear more like a human browser
-            - When to use true:
-              * Sites with bot detection systems
-              * E-commerce sites with protection
-              * Sites that block automated requests
-            - Note: May increase processing time and is not 100% guaranteed
+@mcp.tool(annotations={"readOnlyHint": True, "destructiveHint": False, "idempotentHint": True})
+def credits(ctx: Context) -> Dict[str, Any]:
+    """Return remaining API credits (API v2 GET /credits)."""
+    try:
+        api_key = get_api_key(ctx)
+        client = ScapeGraphClient(api_key)
+        return client.credits()
+    except Exception as e:
+        return {"error": str(e)}
-
-        stream (Optional[bool]): Enable streaming response for real-time updates.
-            - Default: false (standard response)
-            - Set to true for streaming mode to receive data as it's being processed
-            - Useful for monitoring progress on large or slow-loading pages
-            - Provides real-time feedback during the scraping operation
-
-    Returns:
-        Dictionary containing:
-        - html_content: The raw HTML content of the page as a string
-        - page_title: Extracted page title if available
-        - status_code: HTTP response status code (200 for success)
-        - final_url: Final URL after any redirects
-        - content_length: Size of the HTML content in bytes
-        - processing_time: Time taken to fetch and process the page
-        - javascript_rendered: Whether JavaScript rendering was used
-        - credits_used: Number of credits consumed (always 1)
+@mcp.tool(annotations={"readOnlyHint": True, "destructiveHint": False, "idempotentHint": False})
+def sgai_history(
+    ctx: Context,
+    endpoint: Optional[str] = None,
+    status: Optional[str] = None,
+    limit: Optional[int] = None,
+    offset: Optional[int] = None,
+) -> Dict[str, Any]:
+    """
+    List recent API requests (API v2 GET /history).
-
-    Raises:
-        ValueError: If website_url is malformed or missing protocol
-        HTTPError: If the webpage returns an error status (404, 500, etc.)
-        TimeoutError: If the page takes too long to load
-        ConnectionError: If the website cannot be reached
+
+    Args:
+        endpoint: Filter by service name (e.g. scrape, extract, search).
+        status: Filter by status string.
+        limit: Max rows (1–100).
+        offset: Skip offset for pagination.
+    """
+    try:
+        api_key = get_api_key(ctx)
+        client = ScapeGraphClient(api_key)
+        return client.history(endpoint=endpoint, status_filter=status, limit=limit, offset=offset)
+    except Exception as e:
+        return {"error": str(e)}
-
-    Use Cases:
-        - Getting raw HTML for custom parsing
-        - Checking page structure before using other tools
-        - Fetching content for offline processing
-        - Debugging website content issues
-        - Pre-processing before AI extraction
-
-    Note:
-        - This tool returns raw HTML without any AI processing
-        - Use smartscraper for structured data extraction
-        - Use markdownify for clean, readable content
-        - Consider render_heavy_js=true if initial results seem incomplete
-    """
+@mcp.tool(annotations={"readOnlyHint": False, "destructiveHint": False, "idempotentHint": False})
+def monitor_create(
+    name: str,
+    url: str,
+    prompt: str,
+    cron: str,
+    ctx: Context,
+    output_schema: Optional[
+        Annotated[
+            Union[str, Dict[str, Any]],
+            Field(
+                default=None,
+                description="JSON schema dict or JSON string for structured monitor output",
+                json_schema_extra={"oneOf": [{"type": "string"}, {"type": "object"}]},
+            ),
+        ]
+    ] = None,
+) -> Dict[str, Any]:
+    """Create a scheduled monitor job (API v2 POST /monitor). Cron uses a 5-field expression."""
     try:
         api_key = get_api_key(ctx)
         client = ScapeGraphClient(api_key)
-        return client.scrape(
-            website_url=website_url,
-            render_heavy_js=render_heavy_js,
-            mock=mock,
-            stealth=stealth,
-            stream=stream
+        normalized_schema: Optional[Dict[str, Any]] = None
+        if isinstance(output_schema, dict):
+            normalized_schema = output_schema
+        elif isinstance(output_schema, str):
+            try:
+                parsed = json.loads(output_schema)
+                if isinstance(parsed, dict):
+                    normalized_schema = parsed
+                else:
+                    return {"error": "output_schema must be a JSON object"}
+            except json.JSONDecodeError as e:
+                return {"error": f"Invalid JSON for output_schema: {e}"}
+        return client.monitor_create(
+            name=name, url=url, prompt=prompt, cron=cron, output_schema=normalized_schema
         )
-    except httpx.HTTPError as http_err:
-        return {"error": str(http_err)}
-    except ValueError as val_err:
-        return {"error": str(val_err)}
+    except Exception as e:
+        return {"error": str(e)}
 
 
-# Add tool for sitemap extraction
 @mcp.tool(annotations={"readOnlyHint": True, "destructiveHint": False, "idempotentHint": True})
-def sitemap(website_url: str, ctx: Context, stream: Optional[bool] = None) -> Dict[str, Any]:
-    """
-    Extract and discover the complete sitemap structure of any website.
-
-    This tool automatically discovers all accessible URLs and pages within a website, providing
-    a comprehensive map of the site's structure. Useful for understanding site architecture before
-    crawling or for discovering all available content. Very cost-effective at 1 credit per request.
-    Read-only operation with no side effects.
-
-    Args:
-        website_url (str): The base URL of the website to extract sitemap from.
-            - Must include protocol (http:// or https://)
-            - Should be the root domain or main section you want to map
-            - The tool will discover all accessible pages from this starting point
-            - Examples:
-              * https://example.com (discover entire website structure)
-              * https://docs.example.com (map documentation site)
-              * https://blog.company.com (discover all blog pages)
-              * https://shop.example.com (map e-commerce structure)
-            - Best practices:
-              * Use root domain (https://example.com) for complete site mapping
-              * Use subdomain (https://docs.example.com) for focused mapping
-              * Ensure the URL is accessible and doesn't require authentication
-            - Discovery methods:
-              * Checks for robots.txt and sitemap.xml files
-              * Crawls navigation links and menus
-              * Discovers pages through internal link analysis
-              * Identifies common URL patterns and structures
-
-        stream (Optional[bool]): Enable streaming response for real-time updates.
-            - Default: false (standard response)
-            - Set to true for streaming mode to receive updates as they are discovered
-            - Useful for large sites where discovery may take significant time
-            - Provides progress updates during the sitemap extraction process
-
-    Returns:
-        Dictionary containing:
-        - discovered_urls: List of all URLs found on the website
-        - site_structure: Hierarchical organization of pages and sections
-        - url_categories: URLs grouped by type (pages, images, documents, etc.)
-        - total_pages: Total number of pages discovered
-        - subdomains: List of subdomains found (if any)
-        - sitemap_sources: Sources used for discovery (sitemap.xml, robots.txt, crawling)
-        - page_types: Breakdown of different content types found
-        - depth_analysis: URL organization by depth from root
-        - external_links: Links pointing to external domains (if found)
-        - processing_time: Time taken to complete the discovery
-        - credits_used: Number of credits consumed (always 1)
+def monitor_list(ctx: Context) -> Dict[str, Any]:
+    """List monitors (API v2 GET /monitor)."""
+    try:
+        api_key = get_api_key(ctx)
+        client = ScapeGraphClient(api_key)
+        return client.monitor_list()
+    except Exception as e:
+        return {"error": str(e)}
-
-    Raises:
-        ValueError: If website_url is malformed or missing protocol
-        HTTPError: If the website cannot be accessed or returns errors
-        TimeoutError: If the discovery process takes too long
-        ConnectionError: If the website cannot be reached
-
-    Use Cases:
-        - Planning comprehensive crawling operations
-        - Understanding website architecture and organization
-        - Discovering all available content before targeted scraping
-        - SEO analysis and site structure optimization
-        - Content inventory and audit preparation
-        - Identifying pages for bulk processing operations
-
-    Best Practices:
-        - Run sitemap before using smartcrawler_initiate for better planning
-        - Use results to set appropriate max_pages and depth parameters
-        - Check discovered URLs to understand site organization
-        - Identify high-value pages for targeted extraction
-        - Use for cost estimation before large crawling operations
-
-    Note:
-        - Very cost-effective at only 1 credit per request
-        - Results may vary based on site structure and accessibility
-        - Some pages may require authentication and won't be discovered
-        - Large sites may have thousands of URLs - consider filtering results
-        - Use discovered URLs as input for other scraping tools
-    """
+@mcp.tool(annotations={"readOnlyHint": True, "destructiveHint": False, "idempotentHint": True})
+def monitor_get(monitor_id: str, ctx: Context) -> Dict[str, Any]:
+    """Get one monitor by id (API v2 GET /monitor/:id)."""
     try:
         api_key = get_api_key(ctx)
         client = ScapeGraphClient(api_key)
-        return client.sitemap(website_url=website_url, stream=stream)
-    except httpx.HTTPError as http_err:
-        return {"error": str(http_err)}
-    except ValueError as val_err:
-        return {"error": str(val_err)}
+        return client.monitor_get(monitor_id)
+    except Exception as e:
+        return {"error": str(e)}
 
 
-# Add tool for Agentic Scraper (no live session/browser interaction)
 @mcp.tool(annotations={"readOnlyHint": False, "destructiveHint": False, "idempotentHint": False})
-def agentic_scrapper(
-    url: str,
-    ctx: Context,
-    user_prompt: Optional[str] = None,
-    output_schema: Optional[Annotated[Union[str, Dict[str, Any]], Field(
-        default=None,
-        description="Desired output structure as a JSON schema dict or JSON string",
-        json_schema_extra={
-            "oneOf": [
-                {"type": "string"},
-                {"type": "object"}
-            ]
-        }
-    )]] = None,
-    steps: Optional[Annotated[Union[str, List[str]], Field(
-        default=None,
-        description="Step-by-step instructions for the agent as a list of strings or JSON array string",
-        json_schema_extra={
-            "oneOf": [
-                {"type": "string"},
-                {"type": "array", "items": {"type": "string"}}
-            ]
-        }
-    )]] = None,
-    ai_extraction: Optional[bool] = None,
-    persistent_session: Optional[bool] = None,
-    timeout_seconds: Optional[float] = None
-) -> Dict[str, Any]:
-    """
-    Execute complex multi-step web scraping workflows with AI-powered automation.
-
-    This tool runs an intelligent agent that can navigate websites, interact with forms and buttons,
-    follow multi-step workflows, and extract structured data. Ideal for complex scraping scenarios
-    requiring user interaction simulation, form submissions, or multi-page navigation flows.
-    Supports custom output schemas and step-by-step instructions. Variable credit cost based on
-    complexity. Can perform actions on the website (non-read-only, non-idempotent).
+def monitor_pause(monitor_id: str, ctx: Context) -> Dict[str, Any]:
+    """Pause a monitor (API v2 POST /monitor/:id/pause)."""
+    try:
+        api_key = get_api_key(ctx)
+        client = ScapeGraphClient(api_key)
+        return client.monitor_pause(monitor_id)
+    except Exception as e:
+        return {"error": str(e)}
-
-    The agent accepts flexible input formats for steps (list or JSON string) and output_schema
-    (dict or JSON string) to accommodate different client implementations.
-
-    Args:
-        url (str): The target website URL where the agentic scraping workflow should start.
-            - Must include protocol (http:// or https://)
-            - Should be the starting page for your automation workflow
-            - The agent will begin its actions from this URL
-            - Examples:
-              * https://example.com/search (start at search page)
-              * https://shop.example.com/login (begin with login flow)
-              * https://app.example.com/dashboard (start at main interface)
-              * https://forms.example.com/contact (begin at form page)
-            - Considerations:
-              * Choose a starting point that makes sense for your workflow
-              * Ensure the page is publicly accessible or handle authentication
-              * Consider the logical flow of actions from this starting point
-
-        user_prompt (Optional[str]): High-level instructions for what the agent should accomplish.
-            - Describes the overall goal and desired outcome of the automation
-            - Should be clear and specific about what you want to achieve
-            - Works in conjunction with the steps parameter for detailed guidance
-            - Examples:
-              * "Navigate to the search page, search for laptops, and extract the top 5 results with prices"
-              * "Fill out the contact form with sample data and submit it"
-              * "Login to the dashboard and extract all recent notifications"
-              * "Browse the product catalog and collect information about all items"
-              * "Navigate through the multi-step checkout process and capture each step"
-            - Tips for better results:
-              * Be specific about the end goal
-              * Mention what data you want extracted
-              * Include context about the expected workflow
-              * Specify any particular elements or sections to focus on
+@mcp.tool(annotations={"readOnlyHint": False, "destructiveHint": False, "idempotentHint": False})
+def monitor_resume(monitor_id: str, ctx: Context) -> Dict[str, Any]:
+    """Resume a paused monitor (API v2 POST /monitor/:id/resume)."""
+    try:
+        api_key = get_api_key(ctx)
+        client = ScapeGraphClient(api_key)
+        return client.monitor_resume(monitor_id)
+    except Exception as e:
+        return {"error": str(e)}
-
-        output_schema (Optional[Union[str, Dict]]): Desired output structure for extracted data.
-            - Can be provided as a dictionary or JSON string
-            - Defines the format and structure of the final extracted data
-            - Helps ensure consistent, predictable output format
-            - IMPORTANT: Must include a "required" field (can be empty array [] if no fields are required)
-            - Examples:
-              * Simple object: {'type': 'object', 'properties': {'title': {'type': 'string'}, 'price': {'type': 'number'}}, 'required': []}
-              * Array of objects: {'type': 'array', 'items': {'type': 'object', 'properties': {'name': {'type': 'string'}, 'value': {'type': 'string'}}, 'required': []}, 'required': []}
-              * Complex nested: {'type': 'object', 'properties': {'products': {'type': 'array', 'items': {...}}, 'total_count': {'type': 'number'}}, 'required': []}
-              * As JSON string: '{"type": "object", "properties": {"results": {"type": "array"}}, "required": []}'
-              * With required fields: {'type': 'object', 'properties': {'id': {'type': 'string'}, 'name': {'type': 'string'}}, 'required': ['id']}
-            - Note: If "required" field is missing, it will be automatically added as an empty array []
-            - Default: None (agent will infer structure from prompt and steps)
-
-        steps (Optional[Union[str, List[str]]]): Step-by-step instructions for the agent.
-            - Can be provided as a list of strings or JSON array string
-            - Provides detailed, sequential instructions for the automation workflow
-            - Each step should be a clear, actionable instruction
-            - Examples as list:
-              * ['Click the search button', 'Enter "laptops" in the search box', 'Press Enter', 'Wait for results to load', 'Extract product information']
-              * ['Fill in email field with test@example.com', 'Fill in password field', 'Click login button', 'Navigate to profile page']
-            - Examples as JSON string:
-              * '["Open navigation menu", "Click on Products", "Select category filters", "Extract all product data"]'
-            - Best practices:
-              * Break complex actions into simple steps
-              * Be specific about UI elements (button text, field names, etc.)
-              * Include waiting/loading steps when necessary
-              * Specify extraction points clearly
-              * Order steps logically for the workflow
-
-        ai_extraction (Optional[bool]): Enable AI-powered extraction mode for intelligent data parsing.
-            - Default: true (recommended for most use cases)
-            - Options:
-              * true: Uses advanced AI to intelligently extract and structure data
-                - Better at handling complex page layouts
-                - Can adapt to different content structures
-                - Provides more accurate data extraction
-                - Recommended for most scenarios
-              * false: Uses simpler extraction methods
-                - Faster processing but less intelligent
-                - May miss complex or nested data
-                - Use when speed is more important than accuracy
-            - Performance impact:
-              * true: Higher processing time but better results
-              * false: Faster execution but potentially less accurate extraction
-
-        persistent_session (Optional[bool]): Maintain session state between steps.
-            - Default: false (each step starts fresh)
-            - Options:
-              * true: Keeps cookies, login state, and session data between steps
-                - Essential for authenticated workflows
-                - Maintains shopping cart contents, user preferences, etc.
-                - Required for multi-step processes that depend on previous actions
-                - Use for: Login flows, shopping processes, form wizards
-              * false: Each step starts with a clean session
-                - Faster and simpler for independent actions
-                - No state carried between steps
-                - Use for: Simple data extraction, public content scraping
-            - Examples when to use true:
-              * Login → Navigate to protected area → Extract data
-              * Add items to cart → Proceed to checkout → Extract order details
-              * Multi-step form completion with session dependencies
-
-        timeout_seconds (Optional[float]): Maximum time to wait for the entire workflow.
-            - Default: 120 seconds (2 minutes)
-            - Recommended ranges:
-              * 60-120: Simple workflows (2-5 steps)
-              * 180-300: Medium complexity (5-10 steps)
-              * 300-600: Complex workflows (10+ steps or slow sites)
-              * 600+: Very complex or slow-loading workflows
-            - Considerations:
-              * Include time for page loads, form submissions, and processing
-              * Factor in network latency and site response times
-              * Allow extra time for AI processing and extraction
-              * Balance between thoroughness and efficiency
-            - Examples:
-              * 60.0: Quick single-page data extraction
-              * 180.0: Multi-step form filling and submission
-              * 300.0: Complex navigation and comprehensive data extraction
-              * 600.0: Extensive workflows with multiple page interactions
-
-    Returns:
-        Dictionary containing:
-        - extracted_data: The structured data matching your prompt and optional schema
-        - workflow_log: Detailed log of all actions performed by the agent
-        - pages_visited: List of URLs visited during the workflow
-        - actions_performed: Summary of interactions (clicks, form fills, navigations)
-        - execution_time: Total time taken for the workflow
-        - steps_completed: Number of steps successfully executed
-        - final_page_url: The URL where the workflow ended
-        - session_data: Session information if persistent_session was enabled
-        - credits_used: Number of credits consumed (varies by complexity)
-        - status: Success/failure status with any error details
+@mcp.tool(annotations={"readOnlyHint": False, "destructiveHint": True, "idempotentHint": False})
+def monitor_delete(monitor_id: str, ctx: Context) -> Dict[str, Any]:
+    """Delete a monitor (API v2 DELETE /monitor/:id)."""
+    try:
+        api_key = get_api_key(ctx)
+        client = ScapeGraphClient(api_key)
+        return client.monitor_delete(monitor_id)
+    except Exception as e:
+        return {"error": str(e)}
-
-    Raises:
-        ValueError: If URL is malformed or required parameters are missing
-        TimeoutError: If the workflow exceeds the specified timeout
-        NavigationError: If the agent cannot navigate to required pages
-        InteractionError: If the agent cannot interact with specified elements
-        ExtractionError: If data extraction fails or returns invalid results
-
-    Use Cases:
-        - Automated form filling and submission
-        - Multi-step checkout processes
-        - Login-protected content extraction
-        - Interactive search and filtering workflows
-        - Complex navigation scenarios requiring user simulation
-        - Data collection from dynamic, JavaScript-heavy applications
-
-    Best Practices:
-        - Start with simple workflows and gradually increase complexity
-        - Use specific element identifiers in steps (button text, field labels)
-        - Include appropriate wait times for page loads and dynamic content
-        - Test with persistent_session=true for authentication-dependent workflows
-        - Set realistic timeouts based on workflow complexity
-        - Provide clear, sequential steps that build on each other
-        - Use output_schema to ensure consistent data structure
-
-    Note:
-        - This tool can perform actions on websites (non-read-only)
-        - Results may vary between runs due to dynamic content (non-idempotent)
-        - Credit cost varies based on workflow complexity and execution time
-        - Some websites may have anti-automation measures that could affect success
-        - Consider using simpler tools (smartscraper, markdownify) for basic extraction needs
+# Add tool for basic scrape
+@mcp.tool(annotations={"readOnlyHint": True, "destructiveHint": False, "idempotentHint": True})
+def scrape(
+    website_url: str,
+    ctx: Context,
+    output_format: Literal["markdown", "html", "screenshot", "branding"] = "markdown",
+    render_heavy_js: Optional[bool] = None,
+    mock: Optional[bool] = None,
+    stealth: Optional[bool] = None,
+    stream: Optional[bool] = None,
+    screenshot_full_page: bool = False,
+) -> Dict[str, Any]:
     """
-    # Normalize inputs to handle flexible formats from different MCP clients
-    normalized_steps: Optional[List[str]] = None
-    if isinstance(steps, list):
-        normalized_steps = steps
-    elif isinstance(steps, str):
- parsed_steps: Optional[Any] = None - try: - parsed_steps = json.loads(steps) - except json.JSONDecodeError: - parsed_steps = None - if isinstance(parsed_steps, list): - normalized_steps = parsed_steps - else: - normalized_steps = [steps] - - normalized_schema: Optional[Dict[str, Any]] = None - if isinstance(output_schema, dict): - normalized_schema = output_schema - elif isinstance(output_schema, str): - try: - parsed_schema = json.loads(output_schema) - if isinstance(parsed_schema, dict): - normalized_schema = parsed_schema - else: - return {"error": "output_schema must be a JSON object"} - except json.JSONDecodeError as e: - return {"error": f"Invalid JSON for output_schema: {str(e)}"} - - # Ensure output_schema has a 'required' field if it exists - if normalized_schema is not None: - if "required" not in normalized_schema: - normalized_schema["required"] = [] + Fetch page content via API v2 POST /scrape (markdown, html, screenshot, or branding). + Maps fetch options into ``fetch_config`` (e.g. ``render_heavy_js`` β†’ ``render_js``). + ``stream`` is not supported on v2 and is ignored. 
+ """ try: api_key = get_api_key(ctx) client = ScapeGraphClient(api_key) - return client.agentic_scrapper( - url=url, - user_prompt=user_prompt, - output_schema=normalized_schema, - steps=normalized_steps, - ai_extraction=ai_extraction, - persistent_session=persistent_session, - timeout_seconds=timeout_seconds, + return client.scrape( + website_url=website_url, + render_heavy_js=render_heavy_js, + mock=mock, + stealth=stealth, + stream=stream, + output_format=output_format, + screenshot_full_page=screenshot_full_page, ) - except httpx.TimeoutException as timeout_err: - return {"error": f"Request timed out: {str(timeout_err)}"} except httpx.HTTPError as http_err: return {"error": str(http_err)} except ValueError as val_err: return {"error": str(val_err)} + except Exception as e: + return {"error": str(e)} # Smithery server creation function @@ -2569,7 +1907,7 @@ def agentic_scrapper( def create_server() -> FastMCP: """ Create and return the FastMCP server instance for Smithery deployment. - + Returns: Configured FastMCP server instance """ @@ -2608,4 +1946,4 @@ def main() -> None: if __name__ == "__main__": - main() \ No newline at end of file + main()