
fix(sitemap-scraper): enforce stateless no-cookie #236

Merged

nikitachapovskii-dev merged 12 commits into master from fix/discovery-sitemap-no-cookie on Feb 23, 2026
Conversation

@nikitachapovskii-dev
Contributor

@nikitachapovskii-dev nikitachapovskii-dev commented Feb 18, 2026

Disabled cookie behavior for discovery/parsing requests

Closes #221

@nicklamonov
Contributor

@nikitachapovskii-dev, why do we remove proxy support?
It might be useful for some users.

@nikitachapovskii-dev
Contributor Author

> why do we remove proxy support?
> It might be useful for some users.

Speedrun moment: went to kill cookies in a hurry at the end of WD and eliminated proxy too. Reverted now — proxy lives, cookies don’t.

@nikitachapovskii-dev nikitachapovskii-dev changed the title fix(sitemap-scraper): remove proxy input and enforce stateless no-coo… fix(sitemap-scraper): enforce stateless no-cookie Feb 19, 2026
Comment thread: packages/actor-scraper/sitemap-scraper/src/internals/crawler_setup.ts (Outdated)
@nikitachapovskii-dev
Contributor Author

Added a focused fix to make sitemap discovery fully stateless on Crawlee v4:
sitemap-scraper now uses an Impit client wrapper that injects a no-op cookie jar for requests.
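The no-op jar idea can be sketched roughly like this. The shape below is illustrative, assuming the HTTP client accepts a tough-cookie-compatible jar (`getCookieString`/`setCookie`); `NOOP_COOKIE_JAR` here is a hypothetical reconstruction, not the PR's exact code.

```typescript
// Hypothetical sketch of a no-op cookie jar: it satisfies the minimal
// tough-cookie-like interface but never stores or returns anything.
const NOOP_COOKIE_JAR = {
  // Never attach cookies to outgoing requests.
  getCookieString: async (_url: string): Promise<string> => '',
  // Silently drop any Set-Cookie header the server sends back.
  setCookie: async (_rawCookie: string, _url: string): Promise<void> => {},
};
```

Passing such a jar to the client keeps every discovery request stateless regardless of what the server tries to set.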

@nikitachapovskii-dev
Contributor Author

This commit fixes runtime failures when sitemap endpoints return non-default XML MIME types (e.g. application/rss+xml).
In v4, HttpCrawler enforces a stricter default MIME allowlist and rejects sitemap responses with content types like application/rss+xml.
We explicitly allow sitemap-related MIME types (application/rss+xml, application/atom+xml, text/plain) so discovery/parsing behavior remains compatible after the v4 upgrade.
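The allowlist extension can be sketched as below. `additionalMimeTypes` and `persistCookiesPerSession` are real Crawlee `HttpCrawler` options, but the exact option object in this PR may differ; treat this as a minimal illustration.

```typescript
// MIME types that sitemap endpoints commonly serve but that HttpCrawler's
// stricter v4 default allowlist would otherwise reject.
const SITEMAP_MIME_TYPES = [
  'application/rss+xml',
  'application/atom+xml',
  'text/plain',
];

const crawlerOptions = {
  // Extend the default MIME allowlist so sitemap/RSS responses are parsed
  // instead of rejected after the v4 upgrade.
  additionalMimeTypes: SITEMAP_MIME_TYPES,
  // Don't persist cookies between requests within a session.
  persistCookiesPerSession: false,
};
```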

const [request, options] = sendRequestArgs;
return originalSendRequest(request, {
    ...(options ?? {}),
    cookieJar: NOOP_COOKIE_JAR as any,
});
Member

I would assume passing persistCookiesPerSession: false would be enough (Crawlee then shouldn't pass a cookie jar to impit in the first place).

Can you please share the reasoning behind this? Maybe there's a bug in Crawlee v4 👀

Contributor Author

@nikitachapovskii-dev nikitachapovskii-dev Feb 20, 2026

persistCookiesPerSession: false only disables per-session cookie persistence in the crawler flow.

When I tested commit ca0b7ef, a crash appeared in sitemap discovery: Crawlee v4 still creates a default cookie jar inside BaseHttpClient and tries to store Set-Cookie headers.

That’s why I forced a no-op cookie jar here.

I'm going to test ca0b7ef again on a new branch and send a run link here. If it confirms the crash, we could open a new issue in Crawlee.


Member

Thank you @nikitachapovskii-dev , I see now 👀

tough-cookie is throwing uncaught exceptions because of an invalid cookie in the server response. We should imo be more defensive and soft-fail such operations with a warning message (this is what e.g. Chrome does).

I'll post a Crawlee issue in a minute 👍
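The proposed soft-fail behavior could look roughly like this. `setCookieSafely` is a hypothetical helper, not Crawlee's actual API: instead of letting an invalid or cross-domain Set-Cookie header throw and kill the crawl, the store operation is wrapped and the failure downgraded to a warning.

```typescript
// Sketch of the "soft-fail" idea: run a cookie-store operation, and on
// failure log a warning (as Chrome does) instead of crashing the crawl.
async function setCookieSafely(
  storeCookie: () => Promise<void>,
  url: string,
): Promise<boolean> {
  try {
    await storeCookie();
    return true;
  } catch (err) {
    console.warn(`Failed to set cookie for URL "${url}": ${(err as Error).message}`);
    return false;
  }
}
```

With this pattern, a rejected cross-domain cookie produces exactly the kind of WARN line seen later in this thread rather than an uncaught exception.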


Contributor Author

I forgot to request a review 😄 requesting rn
If everything looks good on your side, please approve

Contributor Author

@nikitachapovskii-dev nikitachapovskii-dev Feb 23, 2026

With the new npm version it works, but I get logs like:

2026-02-23T15:33:58.731Z WARN Failed to set cookie for URL "https://seomator.com/de/uber-uns": Cookie not in this host's domain. Cookie:cdn.webflow.com Request:seomator.com
2026-02-23T15:33:58.753Z WARN Failed to set cookie for URL "https://seomator.com/about": Cookie not in this host's domain. Cookie:cdn.webflow.com Request:seomator.com
https://console.apify.com/view/runs/fWWUictBRI1yXDUKa

I assume this is expected behaviour?

Member

@barjin barjin left a comment

Cool, please check if everything works (doesn't throw the cookie error), if so, feel free to merge. Thanks!

@nikitachapovskii-dev nikitachapovskii-dev merged commit 63402bd into master Feb 23, 2026
4 checks passed


Development

Successfully merging this pull request may close these issues.

discoverValidSitemaps fails on cross-domain Set-Cookie (Cookie not in this host's domain)

4 participants