Commit ba53daa

BYK and claude authored
fix(md-exports): Stabilize cache keys by stripping build-specific elements (#16079)
## Summary

Fixes unstable cache keys in the markdown export script that caused cache misses on every build.

## Problem

Next.js build output contains non-deterministic elements that change between builds even when content is unchanged:

- `<script>` tags with RSC/Flight payloads and JS chunk references
- `<link>` tags referencing `/_next/static/` (CSS, fonts, JS preloads with content hashes)
- `<style>` tags with `href` attribute (inlined CSS with build hashes)

The previous approach hashed the raw HTML for cache keys, causing instability. An earlier fix stripped scripts via regex but was removed due to CodeQL warnings.

## Solution

- Add `stripUnstableElements()` function that uses regex to remove build-specific elements
- Use stripped HTML for both cache key calculation AND unified pipeline processing (faster parsing)
- Bump `CACHE_VERSION` to invalidate old cache entries

The regex approach is safe since:

- Input is trusted (Next.js build output, not user input)
- Worst case for any regex edge cases is a cache miss (current behavior anyway)
- Stripped content is irrelevant for markdown generation (we only use title, canonical link, and main content)

Co-authored-by: Claude <noreply@anthropic.com>
1 parent 9f98b14 commit ba53daa

1 file changed

Lines changed: 39 additions & 12 deletions

scripts/generate-md-exports.mjs

@@ -30,7 +30,7 @@ import {remove} from 'unist-util-remove';
 const DOCS_ORIGIN = process.env.NEXT_PUBLIC_DEVELOPER_DOCS
   ? 'https://develop.sentry.dev'
   : 'https://docs.sentry.io';
-const CACHE_VERSION = 3;
+const CACHE_VERSION = 4;
 const CACHE_COMPRESS_LEVEL = 4;
 const R2_BUCKET = process.env.NEXT_PUBLIC_DEVELOPER_DOCS
   ? 'sentry-develop-docs'
@@ -407,12 +407,42 @@ ${
 
 const md5 = data => createHash('md5').update(data).digest('hex');
 
+/**
+ * Strips build-specific elements from HTML for stable cache keys and faster processing.
+ *
+ * Next.js build output contains non-deterministic elements that change between builds
+ * even when content is unchanged:
+ * - <script> tags: RSC/Flight payloads, JS chunk references with content hashes
+ * - <link> tags referencing /_next/static/: CSS files, fonts, JS preloads with hashes
+ * - <style> tags with href: inlined CSS with build-specific hash in href attribute
+ *
+ * These elements are irrelevant for markdown generation (we only use title, canonical
+ * link, and div#main content), so stripping them:
+ * 1. Makes cache keys stable across builds
+ * 2. Speeds up HTML parsing by reducing input size significantly
+ *
+ * We use regex instead of proper HTML parsing for performance - this runs on every file
+ * and regex is much faster. The input is trusted (Next.js build output), and worst case
+ * for any regex edge cases is a cache miss, which is acceptable.
+ */
+function stripUnstableElements(html) {
+  return (
+    html
+      // Remove script tags (RSC payloads, JS chunk references)
+      .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '')
+      // Remove link tags referencing Next.js build assets (CSS, fonts, JS preloads)
+      .replace(/<link[^>]*\/_next\/[^>]*>/gi, '')
+      // Remove style tags with href attribute (inlined CSS with build hashes)
+      .replace(/<style[^>]*href="[^"]*"[^>]*>[\s\S]*?<\/style>/gi, '')
+  );
+}
+
 async function genMDFromHTML(source, target, {cacheDir, noCache, usedCacheFiles}) {
   const rawHTML = await readFile(source, {encoding: 'utf8'});
-  // Note: Scripts in the HTML may cause cache misses between builds since they
-  // contain build-specific hashes. We accept this trade-off to avoid regex-based
-  // script stripping which triggers CodeQL security warnings.
-  const cacheKey = `v${CACHE_VERSION}_${md5(rawHTML)}`;
+  // Strip build-specific elements for stable cache keys and faster parsing.
+  // See stripUnstableElements() for details on what's removed and why.
+  const strippedHTML = stripUnstableElements(rawHTML);
+  const cacheKey = `v${CACHE_VERSION}_${md5(strippedHTML)}`;
   const cacheFile = path.join(cacheDir, cacheKey);
   if (!noCache) {
     try {
@@ -437,12 +467,9 @@ async function genMDFromHTML(source, target, {cacheDir, noCache, usedCacheFiles}
   const data = String(
     await unified()
       .use(rehypeParse)
-      // Remove all script elements (they're not needed in markdown and aren't stable across builds)
-      .use(() => tree => {
-        remove(tree, {tagName: 'script'});
-        return tree;
-      })
-      // Need the `head > title` selector for the headers
+      // Select only the elements we need for markdown (title, canonical URL, main content).
+      // Build-specific elements (scripts, CSS links, etc.) are already stripped by
+      // stripUnstableElements() above, so we don't need to remove them here.
       .use(
         () => tree =>
           selectAll('head > title, head > link[rel="canonical"], div#main', tree)
@@ -500,7 +527,7 @@ async function genMDFromHTML(source, target, {cacheDir, noCache, usedCacheFiles}
       .use(() => tree => remove(tree, {type: 'inlineCode', value: ''}))
       .use(remarkGfm)
       .use(remarkStringify)
-      .process(rawHTML)
+      .process(strippedHTML)
   );
   const reader = Readable.from(data);
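
The effect of this change on cache keys can be checked in isolation. The sketch below is not part of the commit: it copies `stripUnstableElements()` and `md5` from the diff and runs them on a made-up page rendered with two different build hashes. The `renderPage` helper, the hash values, and the markup are assumptions for illustration only, not real build output.

```javascript
const {createHash} = require('node:crypto');

const md5 = data => createHash('md5').update(data).digest('hex');

// Same three regex passes as stripUnstableElements() in the diff.
function stripUnstableElements(html) {
  return html
    .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '')
    .replace(/<link[^>]*\/_next\/[^>]*>/gi, '')
    .replace(/<style[^>]*href="[^"]*"[^>]*>[\s\S]*?<\/style>/gi, '');
}

// Hypothetical page template: `hash` stands in for a per-build content hash.
const renderPage = hash => `<html><head>
<title>Example</title>
<link rel="canonical" href="https://docs.sentry.io/example/">
<link rel="stylesheet" href="/_next/static/css/${hash}.css">
<style data-precedence="next" href="${hash}">body{margin:0}</style>
<script src="/_next/static/chunks/main-${hash}.js"></script>
</head><body><div id="main"><h1>Example</h1></div>
<script>self.__next_f.push([1,"${hash}"])</script>
</body></html>`;

// Two builds of the same content that differ only in the build hash.
const buildA = renderPage('aaa111');
const buildB = renderPage('bbb222');

// Hashing the raw HTML gives a different key for each build.
const rawKeysMatch = md5(buildA) === md5(buildB);

// Hashing the stripped HTML gives the same key for both builds.
const strippedA = stripUnstableElements(buildA);
const strippedB = stripUnstableElements(buildB);
const strippedKeysMatch = md5(strippedA) === md5(strippedB);

console.log({rawKeysMatch, strippedKeysMatch});
```

In this sample the title, the canonical link, and the `div#main` content all survive stripping, which is all the selector in the unified pipeline reads. The canonical `<link>` is kept because the link pattern only matches tags that contain `/_next/`.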
506533

0 commit comments

Comments
 (0)