fix(sitemap-scraper): enforce stateless no-cookie #236
…kie sitemap discovery
@nikitachapovskii-dev, why do we remove proxy support?
…s no-cookie sitemap discovery" This reverts commit aab35d0.
Speedrun moment: I went to kill cookies in a hurry at the end of the workday and eliminated proxy support too. Reverted now: proxy lives, cookies don’t.
Added a focused fix to make sitemap discovery fully stateless on Crawlee v4:
This commit fixes runtime failures when sitemap endpoints return non-default XML MIME types (e.g. application/rss+xml).
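For illustration only, here is a minimal sketch of how a sitemap fetcher might recognise XML-like content types such as application/rss+xml. The helper name `isXmlLikeContentType` is hypothetical. It is not part of Crawlee or of this PR, and the actual fix may work differently.

```typescript
// Hypothetical helper (an assumption for illustration, not Crawlee API):
// decides whether a Content-Type header value describes an XML document.
function isXmlLikeContentType(contentType: string | undefined): boolean {
    if (!contentType) return false;
    // Drop parameters such as "; charset=utf-8" and normalise case.
    const mime = (contentType.split(';')[0] ?? '').trim().toLowerCase();
    // Accept the two generic XML types plus any structured "+xml" suffix
    // (application/rss+xml, application/atom+xml, ...).
    return mime === 'application/xml' || mime === 'text/xml' || mime.endsWith('+xml');
}

const rssAccepted = isXmlLikeContentType('application/rss+xml; charset=utf-8');
const htmlAccepted = isXmlLikeContentType('text/html');
```

Matching on the `+xml` suffix avoids hard-coding every feed type a sitemap endpoint might return.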
const [request, options] = sendRequestArgs;
return originalSendRequest(request, {
    ...(options ?? {}),
    cookieJar: NOOP_COOKIE_JAR as any,
I would assume passing persistCookiesPerSession: false would be enough (Crawlee then shouldn't pass a cookie jar to impit in the first place).
Can you please share the reasoning behind this? Maybe there's a bug in Crawlee v4 👀
persistCookiesPerSession: false only disables per-session cookie persistence in the crawler flow.
When I tested commit ca0b7ef, a crash appeared in sitemap discovery: Crawlee v4 still creates a default cookie jar inside BaseHttpClient and tries to store Set-Cookie headers.
That’s why I forced a no-op cookie jar here.
I'm going to test ca0b7ef again on a new branch and send a run link here. If that confirms it, we could open a new issue in Crawlee.
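A self-contained sketch of the no-op cookie jar idea discussed in this thread. The `CookieJarLike` and `SendRequestOptions` shapes, the `withNoopCookieJar` wrapper, and the fake `sendRequest` are assumptions made for illustration. They are modelled loosely on a tough-cookie-style async jar and are not the actual Crawlee v4 or impit types.

```typescript
// Assumed jar shape for this sketch (not the real Crawlee/impit interface).
interface CookieJarLike {
    getCookieString(url: string): Promise<string>;
    setCookie(rawCookie: string, url: string): Promise<void>;
}

// Assumed options shape: any extra options (e.g. a proxy URL) pass through.
interface SendRequestOptions {
    cookieJar?: CookieJarLike;
    [key: string]: unknown;
}

type SendRequest<R> = (request: string, options?: SendRequestOptions) => R;

// A jar that never stores and never returns cookies.
const NOOP_COOKIE_JAR: CookieJarLike = {
    getCookieString: async () => '',
    setCookie: async () => undefined,
};

// Wraps a sendRequest-like function so every call uses the no-op jar,
// while leaving all other caller-supplied options untouched.
function withNoopCookieJar<R>(originalSendRequest: SendRequest<R>): SendRequest<R> {
    return (request, options) =>
        originalSendRequest(request, {
            ...(options ?? {}),
            cookieJar: NOOP_COOKIE_JAR,
        });
}

// Usage with a fake sendRequest that records the options it receives.
const seenOptions: SendRequestOptions[] = [];
const fakeSendRequest: SendRequest<string> = (request, options) => {
    seenOptions.push(options ?? {});
    return `fetched ${request}`;
};

const statelessSendRequest = withNoopCookieJar(fakeSendRequest);
const demoResult = statelessSendRequest('https://example.com/sitemap.xml', {
    proxyUrl: 'http://proxy.example:8000', // hypothetical option, kept as-is
});
```

Spreading the caller's options before setting `cookieJar` keeps settings such as a proxy intact while guaranteeing the jar cannot be overridden.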
Thank you @nikitachapovskii-dev, I see now 👀
tough-cookie is throwing uncaught exceptions because of an invalid cookie in the server response. We should imo be more defensive and soft-fail such operations with a warning message (this is what e.g. Chrome does).
I'll post a Crawlee issue in a minute 👍
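A rough sketch of the "soft-fail with a warning" approach suggested above. Everything here (`SyncCookieStore`, `trySetCookie`, `strictStore`, and the simplified domain check) is hypothetical and only illustrates the pattern. It is not how tough-cookie or Crawlee implements cookie validation.

```typescript
// Assumed minimal store shape for this sketch.
interface SyncCookieStore {
    setCookie(rawCookie: string, url: string): void;
}

// Tries to store a cookie; on failure it emits a warning instead of throwing.
function trySetCookie(
    store: SyncCookieStore,
    rawCookie: string,
    url: string,
    warn: (message: string) => void,
): boolean {
    try {
        store.setCookie(rawCookie, url);
        return true;
    } catch (error) {
        const reason = error instanceof Error ? error.message : String(error);
        warn(`Failed to set cookie for URL "${url}": ${reason}`);
        return false;
    }
}

// Toy store that rejects cookies whose Domain attribute does not cover the
// request host (a deliberately simplified stand-in for real validation).
const strictStore: SyncCookieStore = {
    setCookie(rawCookie, url) {
        const host = (/^https?:\/\/([^/:]+)/i.exec(url)?.[1] ?? '').toLowerCase();
        const domain = /;\s*domain=\.?([^;]+)/i.exec(rawCookie)?.[1]?.trim().toLowerCase();
        if (domain && host !== domain && !host.endsWith(`.${domain}`)) {
            throw new Error(`Cookie not in this host's domain. Cookie:${domain} Request:${host}`);
        }
    },
};

const warnings: string[] = [];
const collectWarning = (message: string): void => {
    warnings.push(message);
};

const foreignStored = trySetCookie(
    strictStore, 'sid=abc; Domain=cdn.webflow.com; Path=/', 'https://seomator.com/about', collectWarning,
);
const ownStored = trySetCookie(
    strictStore, 'sid=abc; Domain=seomator.com; Path=/', 'https://seomator.com/about', collectWarning,
);
```

With this pattern an invalid cookie in a server response is reported and skipped, and the request that carried it still completes.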
I forgot to request a review 😄 requesting rn
If everything looks good on your side, please approve
With the new npm version it works, but I get logs like:
2026-02-23T15:33:58.731Z WARN Failed to set cookie for URL "https://seomator.com/de/uber-uns": Cookie not in this host's domain. Cookie:cdn.webflow.com Request:seomator.com
2026-02-23T15:33:58.753Z WARN Failed to set cookie for URL "https://seomator.com/about": Cookie not in this host's domain. Cookie:cdn.webflow.com Request:seomator.com
https://console.apify.com/view/runs/fWWUictBRI1yXDUKa
I assume this is expected behaviour?
barjin left a comment
Cool, please check if everything works (doesn't throw the cookie error), if so, feel free to merge. Thanks!
Disabled cookie behavior for discovery/parsing requests
Closes #221