Overview
- Vulnerability type: Blind SSRF
- Affected components: src/crawlee/_utils/sitemap.py, src/crawlee/_utils/robots.py, src/crawlee/request_loaders/_sitemap_request_loader.py, and all built-in HTTP clients.
- Trigger: an attacker-controlled sitemap or robots.txt containing a URL that points to an internal host (layer 1) or uses a non-http scheme (layer 2).
Two-layer SSRF via sitemap-derived URLs:
1) Cross-host HTTP SSRF
Base case, affects every HTTP client. Sitemap entries and robots.txt Sitemap: directives were accepted regardless of the host they pointed to. A sitemap on example.com could push http://internal.corp/admin into the crawler's queue, and the configured HTTP client would dispatch the request.
2) Non-HTTP scheme SSRF
Escalation, only CurlImpersonateHttpClient. Nested-sitemap fetching dispatched the URL straight to the HTTP client, bypassing the Request construction step where Pydantic enforces http(s). Combined with the libcurl-backed CurlImpersonateHttpClient, this let gopher://, file://, dict://, ftp://, etc., through.
Root cause
Crawlee already validates URL schemes through Pydantic's AnyHttpUrl (via validate_http_url in src/crawlee/_utils/urls.py) wherever a crawl target is materialised as a Request: the Request.url field is declared as Annotated[str, BeforeValidator(validate_http_url), Field(frozen=True)]. Anything that becomes a Request is therefore guaranteed to be http(s).
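The chokepoint property can be sketched with the standard library alone. The snippet below is an illustrative stand-in, not Crawlee's code: `SketchRequest` and `require_http_url` are hypothetical names, and `urllib.parse` takes the place of Pydantic's AnyHttpUrl. It only shows the idea that validating at construction time means no instance can ever hold a non-http(s) URL.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


def require_http_url(url: str) -> str:
    """Return url unchanged if it is an absolute http(s) URL, else raise ValueError."""
    parts = urlsplit(url)
    if parts.scheme not in {"http", "https"} or not parts.hostname:
        raise ValueError(f"not an http(s) URL: {url!r}")
    return url


@dataclass(frozen=True)
class SketchRequest:
    """Hypothetical stand-in for a request object that validates on construction."""

    url: str

    def __post_init__(self) -> None:
        # Runs for every instance, so an invalid URL can never be stored.
        require_http_url(self.url)
```

Under this sketch, `SketchRequest("gopher://internal.corp:6379/")` raises, while `SketchRequest("http://internal.corp/admin")` succeeds: a scheme check alone says nothing about which host the URL points to.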
Two parts of the sitemap pipeline sidestepped this property in different ways:
1) Sitemap-derived URLs were enqueued without any host policy
SitemapRequestLoader took every <urlset><url><loc> entry, wrapped it in Request.from_url (which accepts any valid http(s) URL), and pushed the result into the request queue. RobotsTxtFile.get_sitemaps() returned every Sitemap: directive verbatim. Neither imposed any host check against the parent sitemap or robots.txt URL, so an attacker controlling that content could push internal-network HTTP URLs into the queue and have them crawled by whichever HTTP client was configured.
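To make the missing host check concrete, here is a minimal sketch that parses a `<urlset>` document and flags entries whose hostname differs from the sitemap's own. It uses the standard-library XML parser for brevity; `cross_host_entries` and `EXAMPLE_XML` are hypothetical names, and this is not how Crawlee's own parser is implemented.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

# Namespace used by the sitemaps.org 0.9 schema.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def cross_host_entries(sitemap_url: str, sitemap_xml: str) -> list[str]:
    """Return <loc> values whose hostname differs from the sitemap's own hostname."""
    parent_host = urlsplit(sitemap_url).hostname
    root = ET.fromstring(sitemap_xml)
    flagged = []
    for loc in root.iter(f"{SITEMAP_NS}loc"):
        url = (loc.text or "").strip()
        if urlsplit(url).hostname != parent_host:
            flagged.append(url)
    return flagged


EXAMPLE_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/products</loc></url>
  <url><loc>http://internal.corp/admin</loc></url>
</urlset>"""
```

Calling `cross_host_entries("http://example.com/sitemap.xml", EXAMPLE_XML)` flags only the `internal.corp` entry, which is the kind of URL that was previously enqueued without question.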
2) Nested sitemap fetching bypassed the Request chokepoint entirely
When _XmlSitemapParser encountered <sitemapindex><sitemap><loc>…</loc></sitemap></sitemapindex>, or when RobotsTxtFile.parse_sitemaps forwarded Sitemap: directives into the same pipeline, _fetch_and_process_sitemap dispatched the URL directly to the HTTP client:
    async with http_client.stream(
        sitemap_url,
        method='GET',
        headers=SITEMAP_HEADERS,
        proxy_info=proxy_info,
        timeout=timeout,
    ) as response:
        ...
No Request was constructed, so the Pydantic validator never ran. Before the fix, the HTTP clients' own send_request() and stream() methods did not call validate_http_url either, so a non-http(s) scheme could pass straight through to the backend client.
The non-HTTP escalation in layer 2 is specific to CurlImpersonateHttpClient, which is backed by curl-cffi / libcurl and speaks gopher, file, dict, ftp, and other non-HTTP protocols. The other clients shipped with Crawlee (HttpxHttpClient, ImpitHttpClient, PlaywrightHttpClient) reject non-http(s) schemes at their own backend layer, regardless of what Crawlee passes in, so they were only affected by layer 1.
Vulnerable paths
Layer 1 — cross-host HTTP (all HTTP clients)
- Source: an attacker-controlled sitemap that lists internal URLs under <urlset><url><loc> or <sitemapindex><sitemap><loc>, or an attacker-controlled robots.txt that lists internal URLs under Sitemap:.
- Sink: the configured HTTP client issues GET requests against those URLs, either via client.request(url=request.url, …) inside crawl() for regular sitemap URLs, or via client.stream(url, …) inside the nested-sitemap fetch.
Layer 2 — non-HTTP schemes (CurlImpersonateHttpClient only)
- Source: a nested <sitemap><loc> entry or a robots.txt Sitemap: directive pointing to a non-http(s) URL.
- Sink: CurlImpersonateHttpClient.stream(...) hands the URL string verbatim to client.request(url=…, …), which dispatches via libcurl.
Hardening in 1.7.0 was added at both producer and consumer ends — see Remediation.
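The layer 2 source can be illustrated with Python's standard-library robots.txt parser, used here purely as a stand-in (the advisory does not describe Crawlee's own robots.txt parsing). The sketch shows that a parser which reports Sitemap: directives verbatim will happily return a non-http(s) URL; `ROBOTS_TXT`, `sitemap_directives`, and `non_http` are hypothetical names.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
Sitemap: gopher://internal.corp:6379/
"""


def sitemap_directives(robots_txt: str) -> list[str]:
    """Return every Sitemap: directive as the stdlib parser reports it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no directive is present.
    return parser.site_maps() or []


def non_http(urls: list[str]) -> list[str]:
    """Return the URLs whose scheme is neither http nor https."""
    return [u for u in urls if urlsplit(u).scheme not in {"http", "https"}]
```

Here `non_http(sitemap_directives(ROBOTS_TXT))` isolates the gopher:// entry; any consumer that forwards the unfiltered list to a multi-protocol backend inherits the problem.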
Exploitation preconditions
- The crawler uses sitemap loading: any of SitemapRequestLoader, Sitemap.load / parse_sitemap, discover_valid_sitemaps, or RobotsTxtFile.parse_sitemaps.
- The attacker controls the body of a sitemap or robots.txt that the crawler fetches, typically by being the target site, or by getting a target site to publish a malicious sitemap.
- The crawler's network egress can reach the attacker-chosen destination (e.g., internal services on the same network).
- The targeted endpoint accepts unauthenticated requests. Crawlee does not supply credentials to the forged destination, so authenticated services (IMDSv2 with token, password-protected Redis, protected admin panels) are not reachable through this path.
For layer 2 (non-HTTP), the configured HTTP client must additionally be CurlImpersonateHttpClient.
Impact
Layer 1 — cross-host HTTP (any client)
The crawler can be coerced into issuing GET requests against internal HTTP services on its own network: admin panels, unauthenticated internal APIs, cloud metadata endpoints, etc. Read-back is blind — Crawlee surfaces fetched content only through its local Dataset / KeyValueStore (push_data() etc.) and does not natively forward scraped bodies anywhere external — so direct impact is mostly existence/timing probing and occasional state changes via side-effecting GET endpoints. Read-side leakage of internal content is only exploitable end-to-end if the deployer's own application separately exposes scraped data (for example, a public summariser or aggregator built on top of Crawlee).
Layer 2 — non-HTTP escalation (only CurlImpersonateHttpClient)
Under the affected client, attackers gain the libcurl scheme set:
- gopher:// is the canonical RESP-injection vector: pipeline FLUSHALL, CONFIG SET dir, CONFIG SET dbfilename, SAVE to an unauthenticated Redis on the crawler's network, which is enough to write attacker-controlled bytes to disk and, in the standard escalation, achieve remote code execution on the Redis host.
- file:// allows the crawler to read local files (application secrets, configuration) on the crawler host.
- dict:// and ftp:// permit fingerprinting and limited interaction with text-protocol services.
In both layers, the SSRF is blind in the default configuration. Write-side impact (gopher:// → Redis) and timing-based internal probing do not depend on read-back and remain viable regardless of whether the deployer surfaces scraped content.
Remediation
Both layers are fixed in crawlee==1.7.0. The fix is split across two PRs, applied at the two complementary boundaries of the affected pipeline:
- Producer-side filtering, in the sitemap and robots.txt loaders (PR #1864). SitemapRequestLoader and RobotsTxtFile.get_sitemaps() now run every nested-sitemap entry, every regular sitemap URL, and every Sitemap: directive through crawlee._utils.urls.filter_url. This applies an EnqueueStrategy (default 'same-hostname') against the parent sitemap / robots.txt URL, so cross-host entries are dropped, and it rejects non-http(s) schemes. The strategy is stamped onto the emitted Requests, so BasicCrawler._check_url_after_redirects continues to enforce the policy across redirects.
- Consumer-side validation, at the HTTP-client boundary (PR #1862). validate_http_url(url) is now called at the top of send_request() and stream() in ImpitHttpClient, HttpxHttpClient, CurlImpersonateHttpClient, and PlaywrightHttpClient. Non-http(s) schemes raise pydantic.ValidationError before any backend call. crawl() was already covered, because Request.url is validated by Pydantic on construction.
After these changes, validation is enforced both where sitemap-derived HTTP requests are produced (sitemap and robots.txt loaders) and where they are consumed (HTTP clients). A regression at either layer is caught by the other.
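The consumer-side half of this design can be sketched as a client that checks the scheme before the backend is ever touched. Everything below is hypothetical (`RecordingBackend`, `GuardedClient`), uses only the standard library, and raises a plain ValueError where Crawlee raises pydantic.ValidationError; it is meant to show the ordering of the check, not the real client code.

```python
from urllib.parse import urlsplit


class RecordingBackend:
    """Hypothetical backend stub that records every URL it is asked to fetch."""

    def __init__(self) -> None:
        self.fetched: list[str] = []

    def get(self, url: str) -> str:
        self.fetched.append(url)
        return "ok"


class GuardedClient:
    """Hypothetical client that validates the scheme before calling the backend."""

    def __init__(self, backend: RecordingBackend) -> None:
        self._backend = backend

    def send_request(self, url: str) -> str:
        # The guard sits at the top of the method, so a rejected URL
        # never reaches the multi-protocol backend.
        if urlsplit(url).scheme not in {"http", "https"}:
            raise ValueError(f"refusing non-http(s) URL: {url!r}")
        return self._backend.get(url)
```

Because the guard lives at the client boundary, it holds even for callers that skip request construction entirely, which is the path the nested-sitemap fetch took.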
Behaviour change for upgraders
SitemapRequestLoader and RobotsTxtFile.get_sitemaps() now default to enqueue_strategy='same-hostname'. Deployers that legitimately relied on cross-host sitemap entries (e.g., a sitemap index on sitemaps.example.com that points to content on www.example.com) must opt in explicitly with enqueue_strategy='same-domain' or enqueue_strategy='all'.
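The difference between the three strategy values can be sketched as follows. This is an approximation under stated assumptions, not Crawlee's filter_url: `allowed` and `naive_registrable_domain` are hypothetical helpers, and the domain comparison simply takes the last two DNS labels, which ignores public suffixes such as co.uk.

```python
from urllib.parse import urlsplit


def naive_registrable_domain(host: str) -> str:
    """Last two DNS labels; a simplification that ignores public suffixes."""
    return ".".join(host.split(".")[-2:])


def allowed(parent_url: str, candidate_url: str, strategy: str) -> bool:
    """Decide whether candidate_url may be enqueued relative to parent_url."""
    candidate = urlsplit(candidate_url)
    if candidate.scheme not in {"http", "https"} or not candidate.hostname:
        # Assumption: non-http(s) schemes are rejected under every strategy.
        return False
    parent_host = urlsplit(parent_url).hostname or ""
    if strategy == "all":
        return True
    if strategy == "same-hostname":
        return candidate.hostname == parent_host
    if strategy == "same-domain":
        return naive_registrable_domain(candidate.hostname) == naive_registrable_domain(parent_host)
    raise ValueError(f"unknown strategy: {strategy!r}")
```

With a parent of `https://sitemaps.example.com/index.xml` and a candidate of `https://www.example.com/page`, this sketch rejects the entry under 'same-hostname' and accepts it under 'same-domain' or 'all', which mirrors the opt-in described above.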
Finder credits