docs: document Spider Tools SSRF-smuggling URL hardening from PraisonAI PR #1578 #284

@MervinPraison

Context

PraisonAI PR MervinPraison/PraisonAI#1578 (fix(security): reject SSRF-smuggling URL characters in spider_tools._validate_url, merged 2026-04-28, head SHA 6cf2d11d5b788165d6384abf571653e06ea586aa) hardens SpiderTools._validate_url against an SSRF bypass class that exploits parser disagreement between urllib.parse.urlparse and the underlying HTTP client (requests / httpx).

The validator now early-rejects any URL that:

  1. is not a str,
  2. contains a backslash anywhere (\), or
  3. contains any ASCII control character (codepoint < 0x20 or == 0x7f — NUL, CR, LF, DEL, …).

These three checks run before urlparse, so the existing IP / private-range / metadata / internal-domain rejections in _validate_url continue to apply unchanged.
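The pre-parse checks can be sketched in isolation as follows. This is a minimal illustration built from the two lines quoted from the PR, not a copy of the source; the function name `has_smuggling_chars` is hypothetical and does not exist in `spider_tools.py`.

```python
def has_smuggling_chars(url) -> bool:
    """Return True when the input would be rejected before urlparse runs.

    Hypothetical helper for illustration; the real checks live inline
    in SpiderTools._validate_url.
    """
    # 1. Non-string input is rejected outright.
    if not isinstance(url, str):
        return True
    # 2. A backslash anywhere in the URL is rejected.
    if "\\" in url:
        return True
    # 3. Any ASCII control character (< 0x20 or == 0x7f) is rejected.
    return any(ord(c) < 0x20 or ord(c) == 0x7f for c in url)
```

Because these checks look only at the raw string, they do not depend on how any particular parser interprets the authority section.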

The threat (from the PR)

A URL such as

http://127.0.0.1:6666\@1.1.1.1

is parsed by urllib.parse.urlparse with hostname 1.1.1.1 (so any allow/deny check that consults parsed.hostname sees a public IP) but is actually dispatched to 127.0.0.1:6666 by requests / httpx, because those clients parse the authority section differently. Hostname-based SSRF guards are silently bypassed, and the agent ends up issuing requests to internal services on the local host. ASCII control characters (NUL, CR, LF, DEL, …) in the authority section produce a similar parser disagreement and have been used in HTTP request smuggling and CRLF-injection attacks.
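The `urlparse` half of the disagreement can be reproduced with the standard library alone. The snippet below only shows what `urlparse` reports; it makes no network call and does not demonstrate the HTTP client side, which is taken from the PR description.

```python
from urllib.parse import urlparse

# One literal backslash before '@' (escaped here as "\\").
smuggled = "http://127.0.0.1:6666\\@1.1.1.1"
parsed = urlparse(smuggled)

# urlparse treats everything before the last '@' as userinfo,
# so the hostname it reports is the public-looking address.
print(parsed.hostname)  # 1.1.1.1
print(parsed.username)  # 127.0.0.1
```

A guard that inspects only `parsed.hostname` therefore sees `1.1.1.1` and lets the URL through, which is why the backslash is rejected on the raw string instead.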

This is a user-facing security guarantee: any agent equipped with scrape_page, extract_links, crawl, or extract_text from praisonaiagents.tools.spider_tools now refuses these payloads before any network call is made.

Important

This is a content update only — no new pages are needed. Per AGENTS.md, do NOT create or modify anything under docs/concepts/. All edits land in docs/tools/spider_tools.mdx and docs/best-practices/security.mdx.


Triage: update vs. new content

| Question | Answer |
|---|---|
| Does PR #1578 add a new feature? | No. It tightens an existing built-in security check (`_validate_url`). |
| Is the underlying tool already documented? | Yes. `docs/tools/spider_tools.mdx` documents `scrape_page`, `extract_links`, `crawl`, `extract_text`. |
| Is the existing built-in URL validation already documented? | No. Neither `spider_tools.mdx` nor `best-practices/security.mdx` mentions that the spider tools auto-block private IPs, loopback, metadata endpoints, or now SSRF-smuggling characters. This is a real, user-visible safety guarantee that is currently undocumented. |
| Are new docs pages needed? | No. Add a new "Built-in Safety" / "URL Validation" section inside the existing `docs/tools/spider_tools.mdx`, and mirror a short note in `docs/best-practices/security.mdx`. |
| Does any existing docs copy need to be corrected? | Yes. `docs/best-practices/security.mdx` lists `scrape_page` and `crawl` in the Tool Approval Matrix as risk "—" with no qualifier. We should keep them as no-approval-required, but add a footnote / paragraph explaining why they are safe by default (built-in URL validation). |

SDK Truth (read these before writing docs)

Per AGENTS.md SDK-First cycle — read source, understand, then document.

| File (in MervinPraison/PraisonAI) | Why |
|---|---|
| `src/praisonai-agents/praisonaiagents/tools/spider_tools.py` | Source of truth. Read `SpiderTools._validate_url` (lines ~30–95 after this PR) for the complete rejection list. The new lines (the only ones added by #1578) are: `if not isinstance(url, str): return False` and `if "\\" in url or any(ord(c) < 0x20 or ord(c) == 0x7f for c in url): return False`. |
| `src/praisonai-agents/tests/unit/tools/test_spider_url_validation.py` | The 6 regression tests added by #1578 are the canonical examples. Use them verbatim in the docs. |

Full _validate_url rejection list (read from source — do not paraphrase from memory)

After PR #1578, _validate_url returns False for any of:

| Rejection | Source-of-truth check | Added by |
|---|---|---|
| Non-string input | `not isinstance(url, str)` | #1578 (new) |
| Backslash anywhere in URL | `"\\" in url` | #1578 (new) |
| ASCII control character (`< 0x20` or `== 0x7f`) | `any(ord(c) < 0x20 or ord(c) == 0x7f for c in url)` | #1578 (new) |
| Non-`http`/`https` scheme | `parsed.scheme not in ['http', 'https']` | pre-existing |
| Empty hostname | `not parsed.hostname` | pre-existing |
| `localhost`, `127.0.0.1`, `0.0.0.0`, `::1` | hostname check | pre-existing |
| Private IP / reserved / loopback / link-local | `ipaddress.ip_address(...).is_private/...` | pre-existing |
| `.local`, `.internal`, `.localdomain` suffixes | `hostname.endswith(...)` | pre-existing |
| Cloud metadata endpoints (`169.254.169.254`, `metadata.google.internal`) | hostname check | pre-existing |

The docs page must list all of these, not just the three new ones, so a reader of spider_tools.mdx understands the full safety contract. (#1578 is the trigger to finally document it; the other rules have always been there but were never written down.)
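To make the ordering of the rules concrete, here is a rough sketch that mirrors the rejection table row by row. It is an approximation written from this issue, not the actual `_validate_url` source: the name `sketch_validate_url`, the constants, and the exact set of `ipaddress` flags are assumptions, and the real implementation must be read from `spider_tools.py` before documenting.

```python
import ipaddress
from urllib.parse import urlparse

# Assumed constants, taken from the rejection table in this issue.
BLOCKED_HOSTS = {
    "localhost", "127.0.0.1", "0.0.0.0", "::1",
    "169.254.169.254", "metadata.google.internal",
}
BLOCKED_SUFFIXES = (".local", ".internal", ".localdomain")


def sketch_validate_url(url) -> bool:
    """Approximate the documented rejection order; True means 'safe to fetch'."""
    # Pre-parse checks added by #1578.
    if not isinstance(url, str):
        return False
    if "\\" in url or any(ord(c) < 0x20 or ord(c) == 0x7f for c in url):
        return False

    # Pre-existing checks, applied to the parsed URL.
    try:
        parsed = urlparse(url)
        hostname = parsed.hostname
    except ValueError:  # e.g. malformed IPv6 brackets
        return False
    if parsed.scheme not in ("http", "https"):
        return False
    if not hostname:
        return False
    if hostname in BLOCKED_HOSTS or hostname.endswith(BLOCKED_SUFFIXES):
        return False

    try:
        ip = ipaddress.ip_address(hostname)
    except ValueError:
        return True  # not an IP literal, so treat it as an ordinary domain name
    return not (ip.is_private or ip.is_loopback
                or ip.is_link_local or ip.is_reserved)
```

The point of the sketch is the ordering: the raw-string checks run first, so a smuggled URL never reaches the hostname-based rules that it was designed to fool.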


Files to update

1. docs/tools/spider_tools.mdx — add "Built-in URL Safety" section

Location: Insert a new top-level section between "Best Practices" (currently ends ~line 216) and "Common Patterns" (currently starts ~line 218). New section title: ## Built-in URL Safety.

Why insert here: The page currently jumps from agent-configuration best practices to scraping pipeline patterns without ever mentioning that the tool refuses dangerous URLs. After PR #1578, this is the right moment to give it a permanent home, because the validator is no longer trivially bypassable and we can confidently say "the spider tools will not hit your localhost from a prompt-injected URL".

Content to add (Mintlify components only — no raw HTML):

## Built-in URL Safety

Spider tools (`scrape_page`, `extract_links`, `crawl`, `extract_text`) refuse to fetch dangerous URLs **before any network request is made**. You don't need to wrap them in a custom validator.

```mermaid
graph LR
    URL([URL from agent]) --> V{Built-in<br/>_validate_url}
    V -->|Safe public URL| FETCH([HTTP request])
    V -->|Private / smuggled / control char| BLK([Refused — error returned])

    classDef input fill:#6366F1,stroke:#7C90A0,color:#fff
    classDef check fill:#F59E0B,stroke:#7C90A0,color:#fff
    classDef ok fill:#10B981,stroke:#7C90A0,color:#fff
    classDef bad fill:#8B0000,stroke:#7C90A0,color:#fff

    class URL input
    class V check
    class FETCH ok
    class BLK bad
```

### What gets refused

| URL pattern | Example | Why |
|---|---|---|
| Non-`http`/`https` schemes | `file:///etc/passwd`, `gopher://x` | Only web protocols are allowed |
| Loopback | `http://127.0.0.1/`, `http://localhost/` | Blocks access to the local machine |
| Private / reserved IPs | `http://10.0.0.5/`, `http://192.168.1.1/` | Blocks internal network access |
| Link-local | `http://169.254.169.254/` | Blocks cloud metadata services |
| Internal TLDs | `http://intranet.local/`, `http://svc.internal/` | Blocks corporate internal hosts |
| Backslash in URL | `http://127.0.0.1:6666\@1.1.1.1` | SSRF-smuggling: `urlparse` says `1.1.1.1`, `requests` actually hits `127.0.0.1` |
| ASCII control chars (`< 0x20` or `0x7f`) | `http://example.com\x00.evil.com`, `http://example.com\r\n.evil.com` | CRLF / NUL injection in the authority |
| Non-string input | `None`, `123` | Defensive — returns `False` instead of raising |

<Note>
The backslash, control-character, and non-string rejections (the last three rows above) were added in PraisonAI [#1578](https://github.com/MervinPraison/PraisonAI/pull/1578). The backslash and control-character checks close an SSRF bypass where `urllib.parse.urlparse` and the HTTP client (`requests` / `httpx`) disagreed on the destination host.
</Note>

### What it looks like to your agent

When the validator refuses a URL, the tool returns an error dict instead of fetching:

```python
from praisonaiagents.tools import scrape_page

# Smuggled URL — looks like 1.1.1.1, would actually hit 127.0.0.1
scrape_page("http://127.0.0.1:6666\\@1.1.1.1")
# {'error': 'Invalid or potentially dangerous URL: http://127.0.0.1:6666\\@1.1.1.1'}

# Loopback
scrape_page("http://localhost/admin")
# {'error': 'Invalid or potentially dangerous URL: http://localhost/admin'}

# Cloud metadata endpoint
scrape_page("http://169.254.169.254/latest/meta-data/")
# {'error': 'Invalid or potentially dangerous URL: http://169.254.169.254/latest/meta-data/'}

# Normal public URL — works as expected
scrape_page("https://example.com/")
# {'url': 'https://example.com/', 'status_code': 200, 'content': '...', ...}
```

<Tip>
This validation is **always on** for the bundled spider tools. It runs on every URL passed to `scrape_page`, `extract_links`, `crawl`, and `extract_text`. There is no flag to disable it, and it does not require `enable_security()`.
</Tip>

Style requirements (per AGENTS.md):

  • One-sentence section intro ✅
  • <Note>, <Tip> callouts (no <Warning> — the change is not a warning, it's a guarantee)
  • Mermaid diagram uses the standard color scheme (#6366F1 input, #F59E0B process, #10B981 success, #8B0000 blocked) — already encoded above
  • All examples are minimal (one line each) and runnable
  • No forbidden phrases ("In this section...", "As you can see...", etc.)

2. docs/best-practices/security.mdx — annotate Tool Approval Matrix

Location: The Tool Approval Matrix table (around lines 1241–1255) currently lists:

| **Spider** | `scrape_page` | — | No |
| **Spider** | `crawl`       | — | No |

Add immediately after that table (before the ### Configuring Approval heading at ~line 1257) a short paragraph:

<Note>
**Why spider tools don't require approval:** `scrape_page`, `extract_links`, `crawl`, and `extract_text` only read public web content and reject any URL that points to private networks, loopback, cloud metadata endpoints, or that contains SSRF-smuggling characters (backslashes, ASCII control characters). See [Spider Tools → Built-in URL Safety](/tools/spider_tools#built-in-url-safety) for the full rejection list.
</Note>

Why: Right now, a security-focused reader sees scrape_page listed with risk "—" and no approval required, and has no way to know why the project considers it safe. After PR #1578 there is now a defensible answer, and this is the natural place to surface it.

Do NOT touch the rest of security.mdx — the rest of that file is broad architectural guidance and is not affected by this PR.


Files NOT to change

| File | Reason |
|---|---|
| `docs/concepts/tools.mdx` | Concepts folder is HUMAN ONLY per AGENTS.md §1.8. The spider card there (line 358) is fine; it just links to `/tools/spider_tools`. |
| `docs/sdk/reference/praisonaiagents/modules/tools.mdx` | Auto-generated SDK reference. Do not hand-edit. |
| `docs/tools/external/firecrawl.mdx`, `docs/tools/external/crawl4ai.mdx` | These are third-party crawlers with their own validation. PR #1578 only touches the bundled `praisonaiagents.tools.spider_tools`. |
| `docs/firecrawl.mdx`, `docs/recipes/url-to-blog.mdx`, `docs/cli/tools.mdx`, `docs/tools/crawl4ai.mdx`, `docs/tools/yaml-tools.mdx` | Mention spider/firecrawl tangentially; no behavioural change for them. |
| `docs/js/`, `docs/rust/` | Auto-generated parity pages. The TS and Rust SDKs do not have an equivalent `_validate_url` change in this PR. |
| `docs.json` | No new pages → no nav changes. |

Quality checklist (the writing agent must verify before opening the docs PR)

  • Re-read src/praisonai-agents/praisonaiagents/tools/spider_tools.py _validate_url end-to-end and confirm the rejection list in the docs matches exactly (do not paraphrase from this issue — verify against source).
  • Re-read src/praisonai-agents/tests/unit/tools/test_spider_url_validation.py and confirm every example URL in the docs corresponds to a real, asserted test case.
  • No file under docs/concepts/ is created or modified.
  • No edits under docs/js/ or docs/rust/.
  • Mermaid diagram uses the standard color scheme from AGENTS.md §3.1 with white text on coloured backgrounds and classDef declarations.
  • All Mintlify components used: <Note>, <Tip>, fenced ```mermaid block, regular markdown table — no custom HTML.
  • No forbidden phrases ("In this section...", "As you can see...", "It's important to note...", "Please note...", "Let's take a look...", "The following example shows...").
  • Code examples are copy-paste runnable: imports included, no placeholder your-key-here values.
  • Cross-link from security.mdx to the new #built-in-url-safety anchor resolves.
  • PR description references PraisonAI #1578 and links to the new docs section.
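
The forbidden-phrase item in the checklist is easy to verify mechanically. The helper below is a hypothetical convenience script, not part of the repository or of AGENTS.md tooling; it simply scans a block of text for the phrases listed above and reports the line numbers where they occur.

```python
# Phrases copied from the checklist above.
FORBIDDEN_PHRASES = (
    "In this section",
    "As you can see",
    "It's important to note",
    "Please note",
    "Let's take a look",
    "The following example shows",
)


def find_forbidden_phrases(text: str) -> list[tuple[int, str]]:
    """Return (line_number, phrase) pairs for every forbidden phrase found."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        lowered = line.lower()
        for phrase in FORBIDDEN_PHRASES:
            # Case-insensitive substring match, one hit per phrase per line.
            if phrase.lower() in lowered:
                hits.append((lineno, phrase))
    return hits
```

Running it over the contents of the two edited `.mdx` files before opening the docs PR should return an empty list.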

Suggested PR title for the docs change

docs(spider): document built-in URL safety + SSRF-smuggling rejections (PraisonAI #1578)

Branch

Develop on claude/admiring-euler-Ay2QQ (per branching policy in this repo).
