Skip to content

fix: filter sitemap pathnames from generated sitemap.xml#19

Merged
rubenmarcus merged 4 commits intomultivmlabs:mainfrom
VictorD19:fix/sitemap-pathname-filter
Apr 28, 2026
Merged

fix: filter sitemap pathnames from generated sitemap.xml#19
rubenmarcus merged 4 commits intomultivmlabs:mainfrom
VictorD19:fix/sitemap-pathname-filter

Conversation

@VictorD19
Copy link
Copy Markdown

@VictorD19 VictorD19 commented Mar 27, 2026

Problem

When @astrojs/sitemap (or similar plugins) is used alongside aeo.js, it generates sitemap-index.xml, sitemap-0.xml, sitemap-1.xml, etc. in the build output. If these pathnames reach config.pages — through manual configuration, a different framework plugin, or a future change in page discovery — generateSitemap() includes them as regular page URLs in the generated sitemap.xml.

Real-world scenario

Discovered on a production Astro site using @astrojs/sitemap. The generated sitemap-index.xml was being indexed as a regular page by external crawlers.

What changed

File: src/core/sitemap.ts

Added an isSitemapPathname() guard that filters plugin-generated sitemap pathnames before they enter the generated sitemap.xml. The filter applies to both config.pages and any URLs discovered via collectUrls() from contentDir.

The regex /^\/sitemap(-\d+|-index)?(\.xml)?$/i is intentionally precise — it matches:

  • /sitemap
  • /sitemap-0, /sitemap-1, … /sitemap-N (output of @astrojs/sitemap)
  • /sitemap-index
  • /sitemap.xml, /sitemap-0.xml, etc.

It does not match legitimate user pages like /sitemap-guide, /sitemap-tutorial, or /sitemaps-explained.

For cross-host URLs that may slip through collectUrls(), pathnameFromUrl() uses the WHATWG URL parser to always extract a real pathname before testing it.

Tests

  • src/core/sitemap.test.ts adds 4 new cases:
    1. filters /sitemap, /sitemap-N, /sitemap-index, /sitemap.xml, /sitemap-0.xml from config.pages
    2. preserves legitimate paths that start with sitemap- (e.g. /sitemap-guide)
    3. filters sitemap-named .html/.md files discovered in contentDir
    4. defense in depth: filter still applies when a URL has a different host than config.url

Impact

  • No breaking changes — only adds a filter, does not change any existing behavior for real pages
  • Targeted regex — won't false-positive on legitimate sitemap-* content
  • Defense in depth — covers config.pages, contentDir scan, and future cross-host edge cases

When sitemap plugins (e.g. @astrojs/sitemap) generate sitemap-index.xml,
sitemap-0.xml, etc., these pathnames could leak into the aeo.js generated
sitemap.xml as regular page URLs if they reach config.pages.

Adds isSitemapPathname() guard to exclude any pathname starting with
/sitemap from the generated sitemap.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@vercel
Copy link
Copy Markdown

vercel Bot commented Mar 27, 2026

Someone is attempting to deploy a commit to the Cytonic Team on Vercel.

A member of the Team first needs to authorize it.

Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@VictorD19 VictorD19 closed this Mar 27, 2026
@VictorD19 VictorD19 deleted the fix/sitemap-pathname-filter branch March 27, 2026 12:54
@VictorD19 VictorD19 restored the fix/sitemap-pathname-filter branch March 27, 2026 12:55
@VictorD19 VictorD19 reopened this Mar 27, 2026
The original /^\/sitemap/i guard had two issues:

1. Too broad — matched legitimate user pages like /sitemap-guide,
   /sitemap-tutorial, /sitemaps-explained. Tightened to match only
   the actual @astrojs/sitemap output patterns:
   /sitemap, /sitemap-N, /sitemap-index, with optional .xml suffix.

2. Only filtered config.pages — sitemap-named .md/.html files in
   contentDir still leaked through collectUrls(). Now filtered there too.

Adds 3 tests covering: (a) the original leak case, (b) false-positive
prevention for legitimate sitemap-* paths, (c) contentDir filtering.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented Apr 28, 2026

Greptile Summary

Adds a targeted isSitemapPathname filter and a pathnameFromUrl helper to generateSitemap, preventing plugin-generated sitemap pathnames (e.g. /sitemap-0.xml, /sitemap-index) from being emitted as regular page URLs. The regex is correctly scoped and three of the four promised tests are present; the cross-host defense-in-depth test case is missing.

Confidence Score: 5/5

Safe to merge — only additive filtering, no breaking changes to existing behavior.

All findings are P2 (style/test coverage). The implementation logic is correct and the regex is precise; only one promised test case is absent.

src/core/sitemap.test.ts — missing the cross-host URL defense-in-depth test case mentioned in the PR description.

Important Files Changed

Filename Overview
src/core/sitemap.ts Adds isSitemapPathname guard and pathnameFromUrl helper; filters plugin-generated sitemap pathnames from both config.pages and contentDir-discovered URLs. Logic is correct and regex is precise.
src/core/sitemap.test.ts Adds 3 of the 4 test cases promised in the PR description; the cross-host defense-in-depth test is missing.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[generateSitemap called] --> B{config.pages?}
    B -- yes --> C[For each page]
    C --> D{isSitemapPathname pathname}
    D -- true --> E[skip / continue]
    D -- false --> F[push URL to list]
    B -- no --> G{config.contentDir?}
    F --> G
    E --> C
    G -- yes --> H[collectUrls]
    H --> I[filter via pathnameFromUrl + isSitemapPathname]
    I --> J[push remaining URLs]
    G -- no --> K[Always push config.url root]
    J --> K
    K --> L[deduplicate and sort]
    L --> M[render XML]
Loading
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/core/sitemap.test.ts
Line: 236-255

Comment:
**Missing promised 4th test case**

The PR description explicitly lists four new test cases, the last being: *"defense in depth: filter still applies when a URL has a different host than `config.url`"*. Only three `it()` blocks were added. The cross-host scenario — where `pathnameFromUrl` must extract the pathname via the WHATWG parser before testing it against the regex — is the one case that exercises the `new URL()` path in `pathnameFromUrl` non-trivially, yet it has no test. Consider adding something like:

```ts
it('filters sitemap pathnames from cross-host URLs', () => {
  mockFs.readdirSync.mockReturnValue([]);
  // pathnameFromUrl extracts /sitemap-0 via URL parsing even for a different host
  // Exercise this branch by constructing a URL with a different origin
});
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (3): Last reviewed commit: "test(sitemap): remove misleading "differ..." | Re-trigger Greptile

Comment thread src/core/sitemap.ts
Comment thread src/core/sitemap.ts
… URLs

Greptile flagged that pathnameFromUrl returned the raw URL string when
url didn't start with baseUrl. The downstream isSitemapPathname regex
^\/sitemap... would never match a string starting with https://, so the
filter silently no-op'd. Unreachable today (collectUrls always prepends
config.url), but a latent gap if URL construction ever changes.

Switching to new URL(url, baseUrl).pathname always yields a real
pathname for absolute URLs, relative URLs, and cross-host URLs, with a
safe fallback if parsing fails.

Adds one test exercising the cross-host case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rubenmarcus
Copy link
Copy Markdown
Member

@greptileai re-review please — addressed both P2 comments:

  • 'PR description regex differs from implementation' → updated PR description to advertise the actual tighter regex
  • 'Fallback path skips filter when URL doesn't start with baseUrl' → pathnameFromUrl now uses new URL(url, baseUrl).pathname so the filter applies on cross-host URLs too. Regression test added in sitemap.test.ts.

Latest commit: 0d8d7d3

Greptile correctly observed that collectUrls always prepends config.url,
so the test never actually exercised a cross-host URL — it passed for
the same reason the contentDir filter test already passes. Removing the
duplicate-but-mislabeled case keeps test intent honest. The pathname
parser change in pathnameFromUrl stands on its own as defense-in-depth.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rubenmarcus rubenmarcus merged commit c23e978 into multivmlabs:main Apr 28, 2026
2 of 3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants