docs: document Spider Tools SSRF-smuggling URL hardening from PraisonAI PR #1578 (#284)

## Context

PraisonAI PR [MervinPraison/PraisonAI#1578](https://github.com/MervinPraison/PraisonAI/pull/1578) (`fix(security): reject SSRF-smuggling URL characters in spider_tools._validate_url`, merged 2026-04-28, head SHA `6cf2d11d5b788165d6384abf571653e06ea586aa`) hardens `SpiderTools._validate_url` against an SSRF bypass class that exploits parser disagreement between `urllib.parse.urlparse` and the underlying HTTP client (`requests` / `httpx`).

The validator now **early-rejects** any URL that:

1. is not a `str`,
2. contains a backslash anywhere (`\`), or
3. contains any ASCII control character (codepoint `< 0x20` or `== 0x7f` — NUL, CR, LF, DEL, …).

These three checks run **before** `urlparse`, so the existing IP / private-range / metadata / internal-domain rejections in `_validate_url` continue to apply unchanged.
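The three pre-parse checks can be sketched in isolation as follows. The two conditions are the ones quoted from the PR; the wrapper name `passes_pre_parse_checks` is hypothetical and exists only for illustration. In the SDK these lines live inside `SpiderTools._validate_url`.

```python
def passes_pre_parse_checks(url: object) -> bool:
    """Return False for inputs that are rejected before any URL parsing."""
    # 1. Only str inputs are accepted (None, ints, bytes, ... are rejected).
    if not isinstance(url, str):
        return False
    # 2. A backslash anywhere, or 3. any ASCII control character
    #    (codepoint below 0x20, or 0x7f) causes rejection.
    if "\\" in url or any(ord(c) < 0x20 or ord(c) == 0x7F for c in url):
        return False
    return True


print(passes_pre_parse_checks("https://example.com/"))             # True
print(passes_pre_parse_checks("http://127.0.0.1:6666\\@1.1.1.1"))  # False
print(passes_pre_parse_checks("http://example.com\r\n.evil.com"))  # False
```

Note that the control-character range also covers tab (`0x09`), not only NUL, CR, LF, and DEL.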

### The threat (from the PR)

A URL such as

```
http://127.0.0.1:6666\@1.1.1.1
```

parses with hostname `1.1.1.1` via `urllib.parse.urlparse`, which treats everything before the last `@` in the authority as userinfo. Any allow/deny check that consults `parsed.hostname` therefore sees a public IP, **but per the PR the request is actually dispatched to `127.0.0.1:6666` by `requests` / `httpx`**, because those clients interpret the authority differently from `urlparse`. Hostname-based SSRF guards are silently bypassed, and the agent ends up issuing requests to internal services on the local host. ASCII control characters (NUL, CR, LF, DEL, …) in the authority section can produce a similar parser disagreement and have been used in HTTP request smuggling and CRLF-injection attacks.
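The `urlparse` half of this disagreement can be reproduced with the standard library alone. The snippet below makes no network request and does not exercise `requests` or `httpx`; it only shows what a hostname-based guard would see.

```python
from urllib.parse import urlparse

smuggled = "http://127.0.0.1:6666\\@1.1.1.1"  # one literal backslash before '@'
parsed = urlparse(smuggled)

# urlparse splits the authority at the last '@', so the loopback address
# lands in the userinfo part and the reported hostname is the public IP.
print(parsed.netloc)    # 127.0.0.1:6666\@1.1.1.1
print(parsed.hostname)  # 1.1.1.1
```

A guard that only inspects `parsed.hostname` would treat this URL as pointing at `1.1.1.1`, which is why the backslash is rejected before parsing.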

This is a **user-facing security guarantee**: any agent equipped with `scrape_page`, `extract_links`, `crawl`, or `extract_text` from `praisonaiagents.tools.spider_tools` now refuses these payloads before any network call is made.

> [!IMPORTANT]
> This is a **content update only** — no new pages are needed. Per `AGENTS.md`, **do NOT create or modify anything under `docs/concepts/`**. All edits land in `docs/tools/spider_tools.mdx` and `docs/best-practices/security.mdx`.

---

## Triage: update vs. new content

| Question | Answer |
|---|---|
| Does PR #1578 add a new feature? | No — it tightens an existing built-in security check (`_validate_url`). |
| Is the underlying tool already documented? | **Yes.** `docs/tools/spider_tools.mdx` documents `scrape_page`, `extract_links`, `crawl`, `extract_text`. |
| Is the existing built-in URL validation already documented? | **No.** Neither `spider_tools.mdx` nor `best-practices/security.mdx` mentions that the spider tools auto-block private IPs, loopback, metadata endpoints, or now SSRF-smuggling characters. This is a real, user-visible safety guarantee that is currently undocumented. |
| Are new docs pages needed? | **No.** Add a new "Built-in URL Safety" section inside the existing `docs/tools/spider_tools.mdx`, and mirror a short note in `docs/best-practices/security.mdx`. |
| Does any existing docs copy need to be corrected? | **Yes** — `docs/best-practices/security.mdx` lists `scrape_page` and `crawl` in the Tool Approval Matrix as risk "—" with no qualifier. We should keep them as no-approval-required, but add a footnote / paragraph explaining *why* they are safe by default (built-in URL validation). |

---

## SDK Truth (read these before writing docs)

Per `AGENTS.md` SDK-First cycle — read source, understand, then document.

| File (in `MervinPraison/PraisonAI`) | Why |
|---|---|
| `src/praisonai-agents/praisonaiagents/tools/spider_tools.py` | Source of truth. Read `SpiderTools._validate_url` (lines ~30–95 after this PR) for the **complete** rejection list. The new lines (the only ones added by #1578) are: `if not isinstance(url, str): return False` and `if "\\" in url or any(ord(c) < 0x20 or ord(c) == 0x7f for c in url): return False`. |
| `src/praisonai-agents/tests/unit/tools/test_spider_url_validation.py` | The 6 regression tests added by #1578 are the canonical examples. Use them verbatim in the docs. |

### Full `_validate_url` rejection list (read from source — do not paraphrase from memory)

After PR #1578, `_validate_url` returns `False` for **any** of:

| Rejection | Source-of-truth check | Added by |
|---|---|---|
| Non-string input | `not isinstance(url, str)` | **#1578 (new)** |
| Backslash anywhere in URL | `"\\" in url` | **#1578 (new)** |
| ASCII control character (`< 0x20` or `== 0x7f`) | `any(ord(c) < 0x20 or ord(c) == 0x7f for c in url)` | **#1578 (new)** |
| Non-`http`/`https` scheme | `parsed.scheme not in ['http', 'https']` | pre-existing |
| Empty hostname | `not parsed.hostname` | pre-existing |
| `localhost`, `127.0.0.1`, `0.0.0.0`, `::1` | hostname check | pre-existing |
| Private IP / reserved / loopback / link-local | `ipaddress.ip_address(...).is_private/...` | pre-existing |
| `.local`, `.internal`, `.localdomain` suffixes | `hostname.endswith(...)` | pre-existing |
| Cloud metadata endpoints (`169.254.169.254`, `metadata.google.internal`) | hostname check | pre-existing |

The docs page must list **all** of these, not just the three new ones, so a reader of `spider_tools.mdx` understands the full safety contract. (#1578 is the trigger to finally document it; the other rules have always been there but were never written down.)
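As a reading aid, the table above can be assembled into one function. This is a minimal sketch reconstructed from the table, not the SDK source: the name `validate_url_sketch`, the constant names, and the order of the pre-existing checks are assumptions, and the real `_validate_url` may differ in detail. Verify against `spider_tools.py` before relying on it.

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical constants, derived from the rejection table above.
BLOCKED_HOSTS = {
    "localhost", "127.0.0.1", "0.0.0.0", "::1",
    "169.254.169.254", "metadata.google.internal",
}
INTERNAL_SUFFIXES = (".local", ".internal", ".localdomain")


def validate_url_sketch(url: object) -> bool:
    """Return True only if none of the listed rejection rules match."""
    # Rules added by #1578: run before any parsing.
    if not isinstance(url, str):
        return False
    if "\\" in url or any(ord(c) < 0x20 or ord(c) == 0x7F for c in url):
        return False

    # Pre-existing rules (order assumed).
    try:
        parsed = urlparse(url)
        hostname = parsed.hostname
    except ValueError:
        return False  # e.g. malformed IPv6 literal
    if parsed.scheme not in ("http", "https"):
        return False
    if not hostname:
        return False
    if hostname in BLOCKED_HOSTS:
        return False
    try:
        ip = ipaddress.ip_address(hostname)
    except ValueError:
        ip = None  # hostname is a DNS name, not an IP literal
    if ip is not None and (
        ip.is_private or ip.is_reserved or ip.is_loopback or ip.is_link_local
    ):
        return False
    if hostname.endswith(INTERNAL_SUFFIXES):
        return False
    return True


print(validate_url_sketch("https://example.com/"))             # True
print(validate_url_sketch("http://10.0.0.5/"))                 # False
print(validate_url_sketch("http://127.0.0.1:6666\\@1.1.1.1"))  # False
```

The sketch checks only the literal hostname; it does not resolve DNS names, so it says nothing about hostnames that resolve to private addresses.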

---

## Files to update

### 1. `docs/tools/spider_tools.mdx` — add "Built-in URL Safety" section

**Location:** Insert a new top-level section **between** "Best Practices" (currently ends ~line 216) and "Common Patterns" (currently starts ~line 218). New section title: `## Built-in URL Safety`.

**Why insert here:** The page currently jumps from agent-configuration best practices to scraping pipeline patterns without ever mentioning that the tool refuses dangerous URLs. After PR #1578, this is the right moment to give it a permanent home, because the validator is no longer trivially bypassable and we can confidently say "the spider tools will not hit your localhost from a prompt-injected URL".

**Content to add (Mintlify components only — no raw HTML):**

````mdx
## Built-in URL Safety

Spider tools (`scrape_page`, `extract_links`, `crawl`, `extract_text`) refuse to fetch dangerous URLs **before any network request is made**. You don't need to wrap them in a custom validator.

```mermaid
graph LR
    URL([URL from agent]) --> V{Built-in<br/>_validate_url}
    V -->|Safe public URL| FETCH([HTTP request])
    V -->|Private / smuggled / control char| BLK([Refused — error returned])

    classDef input fill:#6366F1,stroke:#7C90A0,color:#fff
    classDef check fill:#F59E0B,stroke:#7C90A0,color:#fff
    classDef ok fill:#10B981,stroke:#7C90A0,color:#fff
    classDef bad fill:#8B0000,stroke:#7C90A0,color:#fff

    class URL input
    class V check
    class FETCH ok
    class BLK bad
```

### What gets refused

| URL pattern | Example | Why |
|---|---|---|
| Non-`http`/`https` schemes | `file:///etc/passwd`, `gopher://x` | Only web protocols are allowed |
| Empty hostname | `http:///path` | There is no host to validate |
| Loopback and unspecified addresses | `http://127.0.0.1/`, `http://localhost/`, `http://0.0.0.0/`, `http://[::1]/` | Blocks access to the local machine |
| Private / reserved IPs | `http://10.0.0.5/`, `http://192.168.1.1/` | Blocks internal network access |
| Link-local IPs and cloud metadata endpoints | `http://169.254.169.254/`, `http://metadata.google.internal/` | Blocks cloud metadata services |
| Internal domain suffixes (`.local`, `.internal`, `.localdomain`) | `http://intranet.local/`, `http://svc.internal/` | Blocks corporate internal hosts |
| Backslash in URL | `http://127.0.0.1:6666\@1.1.1.1` | SSRF-smuggling: `urlparse` reports `1.1.1.1`, but `requests` connects to `127.0.0.1` |
| ASCII control chars (`< 0x20` or `0x7f`) | `http://example.com\x00.evil.com`, `http://example.com\r\n.evil.com` | CRLF / NUL injection in the authority |
| Non-string input | `None`, `123` | Defensive: returns `False` instead of raising |

<Note>
The backslash, control-character, and non-string rejections (the last three rows above) were added in PraisonAI [#1578](https://github.com/MervinPraison/PraisonAI/pull/1578), which closes an SSRF bypass where `urllib.parse.urlparse` and the HTTP client (`requests` / `httpx`) disagreed on the destination host.
</Note>

### What it looks like to your agent

When the validator refuses a URL, the tool returns an error dict instead of fetching:

```python
from praisonaiagents.tools import scrape_page

# Smuggled URL — looks like 1.1.1.1, would actually hit 127.0.0.1
scrape_page("http://127.0.0.1:6666\\@1.1.1.1")
# {'error': 'Invalid or potentially dangerous URL: http://127.0.0.1:6666\\@1.1.1.1'}

# Loopback
scrape_page("http://localhost/admin")
# {'error': 'Invalid or potentially dangerous URL: http://localhost/admin'}

# Cloud metadata endpoint
scrape_page("http://169.254.169.254/latest/meta-data/")
# {'error': 'Invalid or potentially dangerous URL: http://169.254.169.254/latest/meta-data/'}

# Normal public URL — works as expected
scrape_page("https://example.com/")
# {'url': 'https://example.com/', 'status_code': 200, 'content': '...', ...}
```

<Tip>
This validation is **always on** for the bundled spider tools. It runs on every URL passed to `scrape_page`, `extract_links`, `crawl`, and `extract_text`. There is no flag to disable it, and it does not require `enable_security()`.
</Tip>
````
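The following is not part of the MDX content above. It is a minimal sketch of the "validate first, then fetch" behaviour that the proposed section describes, assuming the tool returns the error dict shown in the examples. The names `guarded_fetch`, `toy_validate`, and `fake_fetch` are hypothetical stand-ins; the real tools call `SpiderTools._validate_url` and an HTTP client internally.

```python
from typing import Any, Callable, Dict, List


def guarded_fetch(
    url: Any,
    validate: Callable[[Any], bool],
    fetch: Callable[[str], Dict[str, Any]],
) -> Dict[str, Any]:
    """Refuse the URL before any network call if validation fails."""
    if not validate(url):
        return {"error": f"Invalid or potentially dangerous URL: {url}"}
    return fetch(url)


calls: List[str] = []


def fake_fetch(url: str) -> Dict[str, Any]:
    # Stand-in for the HTTP request; records that it was invoked.
    calls.append(url)
    return {"url": url, "status_code": 200}


def toy_validate(url: Any) -> bool:
    # Deliberately incomplete validator, for demonstration only.
    return isinstance(url, str) and "\\" not in url


refused = guarded_fetch("http://127.0.0.1:6666\\@1.1.1.1", toy_validate, fake_fetch)
allowed = guarded_fetch("https://example.com/", toy_validate, fake_fetch)
print(refused)
print(calls)  # ['https://example.com/']
```

The point of the pattern is that `fake_fetch` is never reached for the refused URL, which matches the "before any network request is made" wording in the proposed docs.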

**Style requirements (per `AGENTS.md`):**
- One-sentence section intro ✅
- `<Note>`, `<Tip>` callouts (no `<Warning>` — the change is not a warning, it's a guarantee)
- Mermaid diagram uses the standard color scheme (`#6366F1` input, `#F59E0B` process, `#10B981` success, `#8B0000` blocked) — already encoded above
- All examples are minimal (one line each) and runnable
- No forbidden phrases ("In this section...", "As you can see...", etc.)

---

### 2. `docs/best-practices/security.mdx` — annotate Tool Approval Matrix

**Location:** The Tool Approval Matrix table (around lines 1241–1255) currently lists:

```
| **Spider** | `scrape_page` | — | No |
| **Spider** | `crawl`       | — | No |
```

**Add immediately after that table** (before the `### Configuring Approval` heading at ~line 1257) a short paragraph:

````mdx
<Note>
**Why spider tools don't require approval:** `scrape_page`, `extract_links`, `crawl`, and `extract_text` only read public web content and reject any URL that points to private networks, loopback, cloud metadata endpoints, or that contains SSRF-smuggling characters (backslashes, ASCII control characters). See [Spider Tools → Built-in URL Safety](/tools/spider_tools#built-in-url-safety) for the full rejection list.
</Note>
````

**Why:** Right now, a security-focused reader sees `scrape_page` listed with risk "—" and no approval required, and has no way to know *why* the project considers it safe. After PR #1578 there is now a defensible answer, and this is the natural place to surface it.

**Do NOT touch the rest of `security.mdx`** — the rest of that file is broad architectural guidance and is not affected by this PR.

---

## Files NOT to change

| File | Reason |
|---|---|
| `docs/concepts/tools.mdx` | Concepts folder is HUMAN ONLY per `AGENTS.md` §1.8. The spider card there (line 358) is fine — it just links to `/tools/spider_tools`. |
| `docs/sdk/reference/praisonaiagents/modules/tools.mdx` | Auto-generated SDK reference. Do not hand-edit. |
| `docs/tools/external/firecrawl.mdx`, `docs/tools/external/crawl4ai.mdx` | These are third-party crawlers with their own validation. PR #1578 only touches the bundled `praisonaiagents.tools.spider_tools`. |
| `docs/firecrawl.mdx`, `docs/recipes/url-to-blog.mdx`, `docs/cli/tools.mdx`, `docs/tools/crawl4ai.mdx`, `docs/tools/yaml-tools.mdx` | Mention spider/firecrawl tangentially; no behavioural change for them. |
| `docs/js/`, `docs/rust/` | Auto-generated parity pages. The TS and Rust SDKs do not have an equivalent `_validate_url` change in this PR. |
| `docs.json` | No new pages → no nav changes. |

---

## Quality checklist (the writing agent must verify before opening the docs PR)

- [ ] Re-read `src/praisonai-agents/praisonaiagents/tools/spider_tools.py` `_validate_url` end-to-end and confirm the rejection list in the docs matches **exactly** (do not paraphrase from this issue — verify against source).
- [ ] Re-read `src/praisonai-agents/tests/unit/tools/test_spider_url_validation.py` and confirm every example URL in the docs corresponds to a real, asserted test case.
- [ ] No file under `docs/concepts/` is created or modified.
- [ ] No edits under `docs/js/` or `docs/rust/`.
- [ ] Mermaid diagram uses the standard color scheme from `AGENTS.md` §3.1 with white text on coloured backgrounds and `classDef` declarations.
- [ ] Only these components are used: `<Note>`, `<Tip>`, a fenced `mermaid` code block, and a regular markdown table; no custom HTML.
- [ ] No forbidden phrases ("In this section...", "As you can see...", "It's important to note...", "Please note...", "Let's take a look...", "The following example shows...").
- [ ] Code examples are copy-paste runnable: imports included, no placeholder `your-key-here` values.
- [ ] Cross-link from `security.mdx` to the new `#built-in-url-safety` anchor resolves.
- [ ] PR description references PraisonAI #1578 and links to the new docs section.
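
The forbidden-phrase item in the checklist above can be checked mechanically. This is a small illustrative helper, not an existing repo script; the function name `find_forbidden_phrases` is hypothetical, and the phrase list is copied from the checklist.

```python
FORBIDDEN_PHRASES = (
    "In this section",
    "As you can see",
    "It's important to note",
    "Please note",
    "Let's take a look",
    "The following example shows",
)


def find_forbidden_phrases(text: str) -> list[tuple[int, str]]:
    """Return (line_number, phrase) pairs for each case-insensitive match."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        lowered = line.lower()
        for phrase in FORBIDDEN_PHRASES:
            if phrase.lower() in lowered:
                hits.append((lineno, phrase))
    return hits


sample = "## Built-in URL Safety\nAs you can see, the tool refuses the URL.\n"
print(find_forbidden_phrases(sample))  # [(2, 'As you can see')]
```

Running it over the edited `.mdx` text before opening the docs PR gives a quick pass/fail signal for that checklist item.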

---

## Suggested PR title for the docs change

`docs(spider): document built-in URL safety + SSRF-smuggling rejections (PraisonAI #1578)`

## Branch

Develop on `claude/admiring-euler-Ay2QQ` (per branching policy in this repo).

