fix: filter sitemap pathnames from generated sitemap.xml#19
fix: filter sitemap pathnames from generated sitemap.xml#19rubenmarcus merged 4 commits intomultivmlabs:mainfrom
Conversation
When sitemap plugins (e.g. @astrojs/sitemap) generate sitemap-index.xml, sitemap-0.xml, etc., these pathnames could leak into the aeo.js generated sitemap.xml as regular page URLs if they reach config.pages. Adds isSitemapPathname() guard to exclude any pathname starting with /sitemap from the generated sitemap. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Someone is attempting to deploy a commit to the Cytonic Team on Vercel. A member of the Team first needs to authorize it. |
The original /^\/sitemap/i guard had two issues: 1. Too broad — matched legitimate user pages like /sitemap-guide, /sitemap-tutorial, /sitemaps-explained. Tightened to match only the actual @astrojs/sitemap output patterns: /sitemap, /sitemap-N, /sitemap-index, with optional .xml suffix. 2. Only filtered config.pages — sitemap-named .md/.html files in contentDir still leaked through collectUrls(). Now filtered there too. Adds 3 tests covering: (a) the original leak case, (b) false-positive prevention for legitimate sitemap-* paths, (c) contentDir filtering. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Greptile SummaryAdds a targeted Confidence Score: 5/5Safe to merge — only additive filtering, no breaking changes to existing behavior. All findings are P2 (style/test coverage). The implementation logic is correct and the regex is precise; only one promised test case is absent. src/core/sitemap.test.ts — missing the cross-host URL defense-in-depth test case mentioned in the PR description. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[generateSitemap called] --> B{config.pages?}
B -- yes --> C[For each page]
C --> D{isSitemapPathname pathname}
D -- true --> E[skip / continue]
D -- false --> F[push URL to list]
B -- no --> G{config.contentDir?}
F --> G
E --> C
G -- yes --> H[collectUrls]
H --> I[filter via pathnameFromUrl + isSitemapPathname]
I --> J[push remaining URLs]
G -- no --> K[Always push config.url root]
J --> K
K --> L[deduplicate and sort]
L --> M[render XML]
Prompt To Fix All With AIThis is a comment left during a code review.
Path: src/core/sitemap.test.ts
Line: 236-255
Comment:
**Missing promised 4th test case**
The PR description explicitly lists four new test cases, the last being: *"defense in depth: filter still applies when a URL has a different host than `config.url`"*. Only three `it()` blocks were added. The cross-host scenario — where `pathnameFromUrl` must extract the pathname via the WHATWG parser before testing it against the regex — is the one case that exercises the `new URL()` path in `pathnameFromUrl` non-trivially, yet it has no test. Consider adding something like:
```ts
it('filters sitemap pathnames from cross-host URLs', () => {
mockFs.readdirSync.mockReturnValue([]);
// pathnameFromUrl extracts /sitemap-0 via URL parsing even for a different host
// Exercise this branch by constructing a URL with a different origin
});
```
How can I resolve this? If you propose a fix, please make it concise.Reviews (3): Last reviewed commit: "test(sitemap): remove misleading "differ..." | Re-trigger Greptile |
… URLs Greptile flagged that pathnameFromUrl returned the raw URL string when url didn't start with baseUrl. The downstream isSitemapPathname regex ^\/sitemap... would never match a string starting with https://, so the filter silently no-op'd. Unreachable today (collectUrls always prepends config.url), but a latent gap if URL construction ever changes. Switching to new URL(url, baseUrl).pathname always yields a real pathname for absolute URLs, relative URLs, and cross-host URLs, with a safe fallback if parsing fails. Adds one test exercising the cross-host case. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@greptileai re-review please — addressed both P2 comments:
Latest commit: 0d8d7d3 |
Greptile correctly observed that collectUrls always prepends config.url, so the test never actually exercised a cross-host URL — it passed for the same reason the contentDir filter test already passes. Removing the duplicate-but-mislabeled case keeps test intent honest. The pathname parser change in pathnameFromUrl stands on its own as defense-in-depth. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Problem
When
@astrojs/sitemap(or similar plugins) is used alongside aeo.js, it generatessitemap-index.xml,sitemap-0.xml,sitemap-1.xml, etc. in the build output. If these pathnames reachconfig.pages— through manual configuration, a different framework plugin, or a future change in page discovery —generateSitemap()includes them as regular page URLs in the generatedsitemap.xml.Real-world scenario
Discovered on a production Astro site using
@astrojs/sitemap. The generatedsitemap-index.xmlwas being indexed as a regular page by external crawlers.What changed
File:
src/core/sitemap.tsAdded an
isSitemapPathname()guard that filters plugin-generated sitemap pathnames before they enter the generatedsitemap.xml. The filter applies to bothconfig.pagesand any URLs discovered viacollectUrls()fromcontentDir.The regex
/^\/sitemap(-\d+|-index)?(\.xml)?$/iis intentionally precise — it matches:/sitemap/sitemap-0,/sitemap-1, …/sitemap-N(output of@astrojs/sitemap)/sitemap-index/sitemap.xml,/sitemap-0.xml, etc.It does not match legitimate user pages like
/sitemap-guide,/sitemap-tutorial, or/sitemaps-explained.For cross-host URLs that may slip through
collectUrls(),pathnameFromUrl()uses the WHATWGURLparser to always extract a real pathname before testing it.Tests
src/core/sitemap.test.tsadds 4 new cases:/sitemap,/sitemap-N,/sitemap-index,/sitemap.xml,/sitemap-0.xmlfromconfig.pagessitemap-(e.g./sitemap-guide).html/.mdfiles discovered incontentDirconfig.urlImpact
sitemap-*contentconfig.pages,contentDirscan, and future cross-host edge cases