PraisonAI PR MervinPraison/PraisonAI#1578 (fix(security): reject SSRF-smuggling URL characters in spider_tools._validate_url, merged 2026-04-28, head SHA 6cf2d11d5b788165d6384abf571653e06ea586aa) hardens SpiderTools._validate_url against an SSRF bypass class that exploits parser disagreement between urllib.parse.urlparse and the underlying HTTP client (requests / httpx).
The validator now early-rejects any URL that:
is not a str,
contains a backslash anywhere (\), or
contains any ASCII control character (codepoint < 0x20 or == 0x7f — NUL, CR, LF, DEL, …).
These three checks run before urlparse, so the existing IP / private-range / metadata / internal-domain rejections in _validate_url continue to apply unchanged.
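A minimal sketch of the three pre-parse checks listed above. This is not the project's code: `smuggling_reason` is a hypothetical helper that returns a reason string (the real validator simply returns False) so that each rule is visible on its own.

```python
from typing import Optional


def smuggling_reason(url: object) -> Optional[str]:
    """Return why `url` fails the pre-parse checks, or None if it passes.

    Hypothetical helper for illustration only.
    """
    if not isinstance(url, str):
        return "not a str"
    if "\\" in url:
        return "contains a backslash"
    for ch in url:
        # C0 control characters (< 0x20) and DEL (0x7f)
        if ord(ch) < 0x20 or ord(ch) == 0x7F:
            return f"contains control character U+{ord(ch):04X}"
    return None


print(smuggling_reason("http://127.0.0.1:6666\\@1.1.1.1"))
print(smuggling_reason("http://example.com\r\n.evil.com"))
print(smuggling_reason("https://example.com/"))
```

Because none of these checks needs a parsed URL, they can run before urlparse is ever called.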
The threat (from the PR)
A URL such as
http://127.0.0.1:6666\@1.1.1.1
parses with hostname 1.1.1.1 via urllib.parse.urlparse (so any allow/deny check that consults parsed.hostname sees a public IP) but is actually dispatched to 127.0.0.1:6666 by requests / httpx, because those clients re-resolve the authority differently. Hostname-based SSRF guards are silently bypassed and the agent ends up issuing requests to internal services on the local host. ASCII control characters (NUL, CR, LF, DEL, …) in the authority section produce a similar parser disagreement and have been used in HTTP request smuggling and CRLF-injection attacks.
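The urlparse half of the disagreement can be reproduced with the standard library alone. The sketch below only inspects the parse result; it makes no network request and does not demonstrate what requests / httpx do with the same string.

```python
from urllib.parse import urlparse

# One literal backslash sits before the "@" in the actual URL string.
url = "http://127.0.0.1:6666\\@1.1.1.1"
parsed = urlparse(url)

# urlparse treats everything before the last "@" as userinfo,
# so the host it reports is the public-looking address.
print(parsed.hostname)  # → 1.1.1.1
```

A guard that only consults `parsed.hostname` would therefore approve this URL.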
This is a user-facing security guarantee: any agent equipped with scrape_page, extract_links, crawl, or extract_text from praisonaiagents.tools.spider_tools now refuses these payloads before any network call is made.
Important
This is a content update only — no new pages are needed. Per AGENTS.md, do NOT create or modify anything under docs/concepts/. All edits land in docs/tools/spider_tools.mdx and docs/best-practices/security.mdx.
Triage: update vs. new content
| Question | Answer |
|---|---|
| Does PR #1578 add a new feature? | No. It tightens an existing built-in security check (_validate_url). |
| Is the existing built-in URL validation already documented? | No. docs/tools/spider_tools.mdx documents scrape_page, extract_links, crawl, and extract_text, but neither it nor best-practices/security.mdx mentions that the spider tools auto-block private IPs, loopback, metadata endpoints, or now SSRF-smuggling characters. This is a real, user-visible safety guarantee that is currently undocumented. |
| Are new docs pages needed? | No. Add a new "Built-in Safety" / "URL Validation" section inside the existing docs/tools/spider_tools.mdx, and mirror a short note in docs/best-practices/security.mdx. |
| Does any existing docs copy need to be corrected? | Yes. docs/best-practices/security.mdx lists scrape_page and crawl in the Tool Approval Matrix as risk "—" with no qualifier. We should keep them as no-approval-required, but add a footnote / paragraph explaining why they are safe by default (built-in URL validation). |
SDK Truth (read these before writing docs)
Per AGENTS.md SDK-First cycle — read source, understand, then document.
Source of truth (MervinPraison/PraisonAI): src/praisonai-agents/praisonaiagents/tools/spider_tools.py. Read SpiderTools._validate_url (lines ~30–95 after this PR) for the complete rejection list. The new lines (the only ones added by #1578) are: if not isinstance(url, str): return False and if "\\" in url or any(ord(c) < 0x20 or ord(c) == 0x7f for c in url): return False.
Tests: src/praisonai-agents/tests/unit/tools/test_spider_url_validation.py.
The docs page must list every rule that _validate_url enforces, not just the three new ones, so a reader of spider_tools.mdx understands the full safety contract. (#1578 is the trigger to finally document it; the other rules have always been there but were never written down.)
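To make the shape of that safety contract concrete, here is a sketch of a validator covering the rule families this issue names (non-string input, backslash, control characters, scheme, loopback, private ranges, internal suffixes, metadata hosts). It is an illustration under assumptions, not a copy of _validate_url: the function name, the constant names, and the exact set of ipaddress properties consulted are all hypothetical, and the real source remains the authority.

```python
import ipaddress
from urllib.parse import urlparse

# Assumed constants for illustration; verify the real lists against source.
BLOCKED_HOSTS = {
    "localhost", "127.0.0.1", "0.0.0.0", "::1",
    "169.254.169.254", "metadata.google.internal",
}
INTERNAL_SUFFIXES = (".local", ".internal", ".localdomain")


def validate_url_sketch(url: object) -> bool:
    """Return True only for URLs that pass every check in this sketch."""
    # Pre-parse checks (the ones added by #1578).
    if not isinstance(url, str):
        return False
    if "\\" in url or any(ord(c) < 0x20 or ord(c) == 0x7F for c in url):
        return False

    try:
        parsed = urlparse(url)
    except ValueError:  # e.g. malformed IPv6 brackets
        return False

    if parsed.scheme not in ("http", "https"):
        return False
    host = parsed.hostname
    if not host:
        return False
    if host in BLOCKED_HOSTS or host.endswith(INTERNAL_SUFFIXES):
        return False

    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return True  # not an IP literal; name-based checks already ran

    # Assumption: which ipaddress properties the real code consults.
    return not (ip.is_private or ip.is_loopback
                or ip.is_link_local or ip.is_reserved)


print(validate_url_sketch("https://example.com/"))
print(validate_url_sketch("http://10.0.0.5/"))
```

The ordering matters: the string-level checks run first, so a smuggled URL never reaches the hostname logic that it was built to fool.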
Files to update
1. docs/tools/spider_tools.mdx: add "Built-in URL Safety" section
Location: Insert a new top-level section between "Best Practices" (currently ends ~line 216) and "Common Patterns" (currently starts ~line 218). New section title: ## Built-in URL Safety.
Why insert here: The page currently jumps from agent-configuration best practices to scraping pipeline patterns without ever mentioning that the tool refuses dangerous URLs. After PR #1578, this is the right moment to give it a permanent home, because the validator is no longer trivially bypassable and we can confidently say "the spider tools will not hit your localhost from a prompt-injected URL".
Content to add (Mintlify components only — no raw HTML):
## Built-in URL Safety
Spider tools (`scrape_page`, `extract_links`, `crawl`, `extract_text`) refuse to fetch dangerous URLs **before any network request is made**. You don't need to wrap them in a custom validator.
```mermaid
graph LR
    URL([URL from agent]) --> V{Built-in<br/>_validate_url}
    V -->|Safe public URL| FETCH([HTTP request])
    V -->|Private / smuggled / control char| BLK([Refused, error returned])

    classDef input fill:#6366F1,stroke:#7C90A0,color:#fff
    classDef check fill:#F59E0B,stroke:#7C90A0,color:#fff
    classDef ok fill:#10B981,stroke:#7C90A0,color:#fff
    classDef bad fill:#8B0000,stroke:#7C90A0,color:#fff

    class URL input
    class V check
    class FETCH ok
    class BLK bad
```

### What gets refused

| URL pattern | Example | Why |
|---|---|---|
| Non-`http`/`https` schemes | `file:///etc/passwd`, `gopher://x` | Only web protocols are allowed |
| Loopback | `http://127.0.0.1/`, `http://localhost/` | Blocks access to the local machine |
| Private / reserved IPs | `http://10.0.0.5/`, `http://192.168.1.1/` | Blocks internal network access |
| Link-local | `http://169.254.169.254/` | Blocks cloud metadata services |
| Internal TLDs | `http://intranet.local/`, `http://svc.internal/` | Blocks corporate internal hosts |
| Backslash in URL | `http://127.0.0.1:6666\@1.1.1.1` | SSRF-smuggling: `urlparse` says `1.1.1.1`, `requests` actually hits `127.0.0.1` |
| ASCII control chars (`< 0x20` or `0x7f`) | `http://example.com\x00.evil.com`, `http://example.com\r\n.evil.com` | CRLF / NUL injection in the authority |
| Non-string input | `None`, `123` | Defensive: returns `False` instead of raising |

<Note>
The backslash and control-character rejections (the last two rows above) were added in PraisonAI [#1578](https://github.com/MervinPraison/PraisonAI/pull/1578) to close an SSRF bypass where `urllib.parse.urlparse` and the HTTP client (`requests` / `httpx`) disagreed on the destination host.
</Note>
### What it looks like to your agent
When the validator refuses a URL, the tool returns an error dict instead of fetching:
```python
from praisonaiagents.tools import scrape_page

# Smuggled URL: looks like 1.1.1.1, would actually hit 127.0.0.1
scrape_page("http://127.0.0.1:6666\\@1.1.1.1")
# {'error': 'Invalid or potentially dangerous URL: http://127.0.0.1:6666\\@1.1.1.1'}

# Loopback
scrape_page("http://localhost/admin")
# {'error': 'Invalid or potentially dangerous URL: http://localhost/admin'}

# Cloud metadata endpoint
scrape_page("http://169.254.169.254/latest/meta-data/")
# {'error': 'Invalid or potentially dangerous URL: http://169.254.169.254/latest/meta-data/'}

# Normal public URL: works as expected
scrape_page("https://example.com/")
# {'url': 'https://example.com/', 'status_code': 200, 'content': '...', ...}
```
<Tip>
This validation is **always on** for the bundled spider tools. It runs on every URL passed to `scrape_page`, `extract_links`, `crawl`, and `extract_text`. There is no flag to disable it, and it does not require `enable_security()`.
</Tip>
Style requirements (per AGENTS.md):
One-sentence section intro ✅
<Note>, <Tip> callouts (no <Warning> — the change is not a warning, it's a guarantee)
Mermaid diagram uses the standard color scheme (#6366F1 input, #F59E0B process, #10B981 success, #8B0000 blocked) — already encoded above
All examples are minimal (one line each) and runnable
No forbidden phrases ("In this section...", "As you can see...", etc.)
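The "refuse before any network request" contract in the proposed section can be illustrated without touching the network by injecting the fetch step. Everything here is hypothetical scaffolding: `looks_dangerous` stands in for the real validator (it covers only the pre-parse checks), and `scrape_page_sketch` and `fake_fetch` are not part of praisonaiagents. Only the error-dict shape is taken from the example above.

```python
from typing import Callable, Dict, List


def looks_dangerous(url: object) -> bool:
    """Stand-in for the real validator: only the pre-parse checks."""
    return (
        not isinstance(url, str)
        or "\\" in url
        or any(ord(c) < 0x20 or ord(c) == 0x7F for c in url)
    )


def scrape_page_sketch(url: object, fetch: Callable[[str], Dict]) -> Dict:
    """Validate first; call `fetch` only for URLs that pass."""
    if looks_dangerous(url):
        return {"error": f"Invalid or potentially dangerous URL: {url}"}
    return fetch(url)


calls: List[str] = []  # records every URL that reached the fetch step


def fake_fetch(url: str) -> Dict:
    calls.append(url)
    return {"url": url, "status_code": 200, "content": "..."}


blocked = scrape_page_sketch("http://127.0.0.1:6666\\@1.1.1.1", fake_fetch)
allowed = scrape_page_sketch("https://example.com/", fake_fetch)
print(calls)  # → ['https://example.com/']
```

The recorded call list shows that the smuggled URL never reached the fetch function.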
2. docs/best-practices/security.mdx: annotate Tool Approval Matrix
Location: The Tool Approval Matrix table (around lines 1241–1255) currently lists:
| **Spider** | `scrape_page` | — | No |
| **Spider** | `crawl` | — | No |
Add immediately after that table (before the ### Configuring Approval heading at ~line 1257) a short paragraph:
<Note>
**Why spider tools don't require approval:** `scrape_page`, `extract_links`, `crawl`, and `extract_text` only read public web content and reject any URL that points to private networks, loopback, cloud metadata endpoints, or that contains SSRF-smuggling characters (backslashes, ASCII control characters). See [Spider Tools → Built-in URL Safety](/tools/spider_tools#built-in-url-safety) for the full rejection list.
</Note>
Why: Right now, a security-focused reader sees scrape_page listed with risk "—" and no approval required, and has no way to know why the project considers it safe. After PR #1578 there is now a defensible answer, and this is the natural place to surface it.
Do NOT touch the rest of security.mdx — the rest of that file is broad architectural guidance and is not affected by this PR.
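Files NOT to change
| File | Reason |
|---|---|
| docs/concepts/tools.mdx | Concepts folder is HUMAN ONLY per AGENTS.md §1.8. The spider card there (line 358) is fine; it just links to /tools/spider_tools. |
| docs/sdk/reference/praisonaiagents/modules/tools.mdx | … |
| docs/tools/external/firecrawl.mdx, docs/tools/external/crawl4ai.mdx | … praisonaiagents.tools.spider_tools. |
| docs/firecrawl.mdx, docs/recipes/url-to-blog.mdx, docs/cli/tools.mdx, docs/tools/crawl4ai.mdx, docs/tools/yaml-tools.mdx | Mention spider/firecrawl tangentially; no behavioural change for them. |
| docs/js/, docs/rust/ | Auto-generated parity pages. The TS and Rust SDKs do not have an equivalent _validate_url change in this PR. |
| docs.json | No new pages → no nav changes. |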
Quality checklist (the writing agent must verify before opening the docs PR)
Re-read _validate_url in src/praisonai-agents/praisonaiagents/tools/spider_tools.py end-to-end and confirm the rejection list in the docs matches exactly (do not paraphrase from this issue; verify against source).
Re-read src/praisonai-agents/tests/unit/tools/test_spider_url_validation.py and confirm every example URL in the docs corresponds to a real, asserted test case.
No file under docs/concepts/ is created or modified.
No edits under docs/js/ or docs/rust/.
Mermaid diagram uses the standard color scheme from AGENTS.md §3.1 with white text on coloured backgrounds and classDef declarations.
All Mintlify components used: <Note>, <Tip>, fenced ```mermaid block, regular markdown table — no custom HTML.
No forbidden phrases ("In this section...", "As you can see...", "It's important to note...", "Please note...", "Let's take a look...", "The following example shows...").
Code examples are copy-paste runnable: imports included, no placeholder your-key-here values.
Cross-link from security.mdx to the new #built-in-url-safety anchor resolves.
PR description references PraisonAI #1578 and links to the new docs section.
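The checklist item about matching doc examples to asserted test cases could be partly automated. The sketch below is one possible approach, not an existing script in either repository: `docs_excerpt` and `tested_urls` are hypothetical stand-ins for the real docs table and the real test file, whose actual contents must be read from source.

```python
import re

# Hypothetical excerpt standing in for the docs table.
docs_excerpt = r"""
| Loopback | `http://127.0.0.1/`, `http://localhost/` | Blocks access to the local machine |
| Link-local | `http://169.254.169.254/` | Blocks cloud metadata services |
"""

# Hypothetical set standing in for URLs asserted in the test file.
tested_urls = {"http://127.0.0.1/", "http://169.254.169.254/"}


def example_urls(markdown: str) -> set:
    """Collect backticked http(s) URLs from a markdown fragment."""
    return set(re.findall(r"`(https?://[^`]+)`", markdown))


# Doc examples that have no matching test case.
missing = sorted(example_urls(docs_excerpt) - tested_urls)
print(missing)  # → ['http://localhost/']
```

The regex only matches http and https examples, so entries such as `file:///etc/passwd` would still need a manual check.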
Full _validate_url rejection list (read from source, do not paraphrase from memory)
After PR #1578, _validate_url returns False for any of:
Non-str input: not isinstance(url, str)
Backslash anywhere: "\\" in url
ASCII control character (< 0x20 or == 0x7f): any(ord(c) < 0x20 or ord(c) == 0x7f for c in url)
Non-http/https scheme: parsed.scheme not in ['http', 'https']
Missing hostname: not parsed.hostname
Blocked hosts: localhost, 127.0.0.1, 0.0.0.0, ::1
Private IP ranges: ipaddress.ip_address(...).is_private / ...
.local, .internal, .localdomain suffixes: hostname.endswith(...)
Cloud metadata endpoints (169.254.169.254, metadata.google.internal)
Suggested PR title for the docs change
docs(spider): document built-in URL safety + SSRF-smuggling rejections (PraisonAI #1578)
Branch
Develop on claude/admiring-euler-Ay2QQ (per branching policy in this repo).