SitemapEx: Handle gzipped sitemaps #241

@nicklamonov

Currently, gzipped sitemaps seem to be handled in some cases but not in others.

It looks like the content type application/gzip is missing from the list of allowed content types.

Example:
https://console.apify.com/admin/users/sbayOaHcZRfopIRjh/actors/rGeTNESChDZ65EbYh/runs/FCDWJn6flif5mwMW0#log

2026-02-24T15:27:47.164Z     at HttpCrawler._abortDownloadOfBody (file:///usr/src/app/node_modules/@crawlee/http/internals/http-crawler.js:456:19)
2026-02-24T15:27:47.165Z     at HttpCrawler.postNavigationHooks (file:///usr/src/app/node_modules/@crawlee/http/internals/http-crawler.js:152:45)
2026-02-24T15:27:47.166Z     at HttpCrawler._executeHooks (file:///usr/src/app/node_modules/@crawlee/basic/internals/basic-crawler.js:1214:23)
2026-02-24T15:27:47.166Z     at HttpCrawler.processHttpResponse (file:///usr/src/app/node_modules/@crawlee/http/internals/http-crawler.js:222:20)
2026-02-24T15:27:47.167Z     at ContextPipelineImpl.call (file:///usr/src/app/node_modules/@crawlee/core/crawlers/context_pipeline.js:58:52)
2026-02-24T15:27:47.168Z     at async HttpCrawler.runRequestHandler (file:///usr/src/app/node_modules/@crawlee/basic/internals/basic-crawler.js:761:9)
2026-02-24T15:27:47.168Z     at async HttpCrawler._runTaskFunction (file:///usr/src/app/node_modules/@crawlee/basic/internals/basic-crawler.js:962:13)
2026-02-24T15:27:47.169Z     at async AutoscaledPool._maybeRunTask (file:///usr/src/app/node_modules/@crawlee/core/autoscaling/autoscaled_pool.js:378:17) {"id":"dbg1hhTdKALzCNx","url":"https://www.producthunt.com/sitemaps_v3/stories_sitemap.xml.gz","method":"GET","uniqueKey":"GET():https://www.producthunt.com/sitemaps_v3/stories_sitemap.xml.gz"}
2026-02-24T15:27:47.170Z ERROR Request https://www.producthunt.com/sitemaps_v3/stories_sitemap.xml.gz failed and will not be retried anymore. Marking as failed.
2026-02-24T15:27:47.170Z Last Error Message: Error: Resource https://www.producthunt.com/sitemaps_v3/stories_sitemap.xml.gz served Content-Type application/gzip, but only text/html, text/xml, application/xhtml+xml, application/xml, application/json, application/rss+xml, application/atom+xml, text/plain are allowed. Skipping resource.
2026-02-24T15:27:47.608Z ERROR HttpCrawler: Request failed and reached maximum retries. Error: Resource https://www.producthunt.com/sitemaps_v3/root_sitemap.xml.gz served Content-Type application/gzip, but only text/html, text/xml, application/xhtml+xml, application/xml, application/json, application/rss+xml, application/atom+xml, text/plain are allowed. Skipping resource.
2026-02-24T15:27:47.609Z     at HttpCrawler._abortDownloadOfBody (file:///usr/src/app/node_modules/@crawlee/http/internals/http-crawler.js:456:19)
2026-02-24T15:27:47.609Z     at HttpCrawler.postNavigationHooks (file:///usr/src/app/node_modules/@crawlee/http/internals/http-crawler.js:152:45)
2026-02-24T15:27:47.610Z     at HttpCrawler._executeHooks (file:///usr/src/app/node_modules/@crawlee/basic/internals/basic-crawler.js:1214:23)
2026-02-24T15:27:47.610Z     at HttpCrawler.processHttpResponse (file:///usr/src/app/node_modules/@crawlee/http/internals/http-crawler.js:222:20)
2026-02-24T15:27:47.611Z     at ContextPipelineImpl.call (file:///usr/src/app/node_modules/@crawlee/core/crawlers/context_pipeline.js:58:52)
2026-02-24T15:27:47.612Z     at async HttpCrawler.runRequestHandler (file:///usr/src/app/node_modules/@crawlee/basic/internals/basic-crawler.js:761:9)
2026-02-24T15:27:47.613Z     at async HttpCrawler._runTaskFunction (file:///usr/src/app/node_modules/@crawlee/basic/internals/basic-crawler.js:962:13)
2026-02-24T15:27:47.613Z     at async AutoscaledPool._maybeRunTask (file:///usr/src/app/node_modules/@crawlee/core/autoscaling/autoscaled_pool.js:378:17) {"id":"bNCT74ZDN5aeF02","url":"https://www.producthunt.com/sitemaps_v3/root_sitemap.xml.gz","method":"GET","uniqueKey":"GET():https://www.producthunt.com/sitemaps_v3/root_sitemap.xml.gz"}
2026-02-24T15:27:47.614Z ERROR Request https://www.producthunt.com/sitemaps_v3/root_sitemap.xml.gz failed and will not be retried anymore. Marking as failed.
2026-02-24T15:27:47.614Z Last Error Message: Error: Resource https://www.producthunt.com/sitemaps_v3/root_sitemap.xml.gz served Content-Type application/gzip, but only text/html, text/xml, application/xhtml+xml, application/xml, application/json, application/rss+xml, application/atom+xml, text/plain are allowed. Skipping resource.
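
For illustration, here is a minimal sketch of what accepting gzipped sitemaps could look like, assuming the fix is to allow the gzip content types and decompress the body before parsing. The names `ALLOWED_SITEMAP_TYPES`, `isAllowedSitemapType`, `isGzipped` and `decodeSitemapBody` are hypothetical and not taken from SitemapEx or Crawlee; including `application/x-gzip` is also an assumption, since only `application/gzip` appears in the log.

```typescript
import { gunzipSync, gzipSync } from 'node:zlib';

// Hypothetical allow-list: the types from the error message above,
// plus the gzip types that appear to be missing (assumption).
const ALLOWED_SITEMAP_TYPES = new Set<string>([
    'text/html', 'text/xml', 'application/xhtml+xml', 'application/xml',
    'application/json', 'application/rss+xml', 'application/atom+xml', 'text/plain',
    'application/gzip', 'application/x-gzip',
]);

// Compare only the MIME type, ignoring parameters such as "; charset=utf-8".
function isAllowedSitemapType(contentType: string): boolean {
    const mime = (contentType.split(';')[0] ?? '').trim().toLowerCase();
    return ALLOWED_SITEMAP_TYPES.has(mime);
}

// A gzip stream starts with the magic bytes 0x1f 0x8b.
function isGzipped(body: Uint8Array): boolean {
    return body.length >= 2 && body[0] === 0x1f && body[1] === 0x8b;
}

// Decompress when the body is gzipped, otherwise decode it as-is.
// Sniffing the bytes avoids trusting the URL suffix or the Content-Type alone.
function decodeSitemapBody(body: Uint8Array): string {
    const raw = isGzipped(body) ? gunzipSync(body) : Buffer.from(body);
    return raw.toString('utf8');
}

// Usage: round-trip a tiny sitemap through gzip.
const sampleXml = '<urlset><url><loc>https://example.com/</loc></url></urlset>';
const decodedXml = decodeSitemapBody(gzipSync(sampleXml));
console.log(decodedXml === sampleXml);
```

Checking the magic bytes rather than the `.gz` suffix may also explain why only some gzipped sitemaps work today: a server can return `.xml.gz` under different content types, so the content type check and the decompression step probably need to be handled separately.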
