Switch llms.txt plugin so generated .md files mirror live URLs #3281
Open
hkalodner wants to merge 10 commits into
Replace docusaurus-plugin-llms with @signalwire/docusaurus-plugin-llms-txt.
The previous plugin derived output paths from source filenames, leaving files at /docs/launch-arbitrum-chain/01-a-gentle-introduction.md while the actual page lives at /launch-arbitrum-chain/a-gentle-introduction. The signalwire plugin processes the rendered HTML and writes .md files that mirror Docusaurus's route structure exactly.
Add a small rehype plugin that drops heading-anchor links and unwraps quicklook anchors (which render with no href) before HTML-to-markdown conversion.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
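The anchor cleanup described in this commit could look roughly like the sketch below. It is not the code from the PR: it works on plain hast-shaped objects without any library, it assumes heading anchors carry Docusaurus's `hash-link` class, and `isHreflessAnchor` and `cleanChildren` are hypothetical names chosen for illustration.

```javascript
// Minimal hast-shaped nodes are assumed:
//   { type: 'element', tagName, properties, children } or { type: 'text', value }

// True for heading-anchor links (assumed to be <a class="hash-link" ...>).
function isHashLink(node) {
  if (node.type !== 'element' || node.tagName !== 'a') return false;
  const classes = (node.properties && node.properties.className) || [];
  return classes.includes('hash-link');
}

// True for anchors rendered without an href, which is how the commit
// message describes quicklook anchors.
function isHreflessAnchor(node) {
  return (
    node.type === 'element' &&
    node.tagName === 'a' &&
    !(node.properties && node.properties.href)
  );
}

// Return a cleaned copy of `children`: hash-links are dropped, and
// href-less anchors are replaced by their own (already cleaned) children.
function cleanChildren(children) {
  return children.flatMap((child) => {
    if (isHashLink(child)) return [];
    const cleaned = child.children
      ? { ...child, children: cleanChildren(child.children) }
      : child;
    return isHreflessAnchor(cleaned) ? cleaned.children : [cleaned];
  });
}
```

Because the recursion cleans a node's children before deciding what to do with the node itself, a hash-link nested inside an unwrapped anchor is still removed.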
Docusaurus puts language-* classes on <pre> but hast-util-to-mdast reads them from <code>; propagate the class so fences come out as ```shell instead of bare ```.
Admonitions (theme-admonition-info/caution/...) were flattening to a plain title line followed by paragraph body. Convert them to blockquotes with a bold type prefix: "> **CAUTION**" or "> **INFO** — Custom title". Skip the title when it's just the type word (Docusaurus's default).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
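A minimal sketch of the two fixes in this commit, assuming plain hast-shaped objects. The function names `propagateLanguageClass` and `admonitionPrefix` are hypothetical, and the real plugin builds hast blockquote nodes rather than the Markdown string shown here.

```javascript
// Copy any language-* class from a <pre> onto its <code> child so that a
// converter which reads the language from <code> can see it.
function propagateLanguageClass(pre) {
  const preClasses = (pre.properties && pre.properties.className) || [];
  const langClasses = preClasses.filter((c) => c.startsWith('language-'));
  const code = (pre.children || []).find(
    (child) => child.type === 'element' && child.tagName === 'code'
  );
  if (!code || langClasses.length === 0) return pre;
  code.properties = code.properties || {};
  const existing = code.properties.className || [];
  // A Set avoids duplicating a class that <code> already carries.
  code.properties.className = [...new Set([...existing, ...langClasses])];
  return pre;
}

// Build the first line of the blockquote for an admonition. The title is
// skipped when it only repeats the type word (the Docusaurus default).
function admonitionPrefix(type, title) {
  const label = `**${type.toUpperCase()}**`;
  const trimmed = (title || '').trim();
  const isCustom = trimmed !== '' && trimmed.toLowerCase() !== type.toLowerCase();
  return isCustom ? `> ${label} — ${trimmed}` : `> ${label}`;
}
```

With these definitions, `admonitionPrefix('caution', 'caution')` yields `> **CAUTION**`, and `admonitionPrefix('info', 'Custom title')` yields `> **INFO** — Custom title`.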
Add a two-stage cleanup: the rehype plugin replaces tab containers and katex spans with marker elements (<code> for inline, <pre><code> for block, chosen because comments are treated as skippable by rehype-minify-whitespace and would swallow adjacent spaces). A new remark plugin then rewrites those mdast inlineCode/code nodes into html nodes containing the final output.
- Docusaurus tabs -> <details><summary>label</summary>…</details>
- KaTeX inline -> $latex$ with surrounding whitespace preserved
- KaTeX block -> $$\nlatex\n$$
- Empty <!-- --> serializer artifacts stripped
Also fix a walk bug where replacement nodes were skipped instead of revisited, so nested cleanup (hash-links inside tab panels, etc.) now fires correctly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
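The remark half of the two-stage cleanup might be sketched as follows. The marker format `LLMS_<KIND>:<payload>` and the kind names are assumptions made for illustration (the commit only says the markers share an `LLMS_` prefix), and tab markers are left out to keep the sketch short.

```javascript
// Hypothetical marker format: "LLMS_<KIND>:<payload>".
const MARKER_RE = /^LLMS_([A-Z_]+):([\s\S]*)$/;

// Rewrite an mdast inlineCode/code node that carries a marker into an
// html node holding the final Markdown text. Ordinary code is untouched.
function rewriteMarker(node) {
  if (node.type !== 'inlineCode' && node.type !== 'code') return node;
  const match = MARKER_RE.exec(node.value);
  if (!match) return node; // a real code span or block, not a marker
  const [, kind, payload] = match;
  switch (kind) {
    case 'MATH_INLINE':
      return { type: 'html', value: `$${payload}$` };
    case 'MATH_BLOCK':
      return { type: 'html', value: `$$\n${payload}\n$$` };
    default:
      return node; // unknown kinds are left as-is in this sketch
  }
}
```

Emitting `html` nodes is what keeps the `$` delimiters from being escaped by the Markdown serializer, since serializers write raw HTML nodes verbatim.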
Extract the LLMS_ prefix and marker regex into src/plugins/llms-markers.js
so the rehype and remark passes import a single source of truth. Drift
between the construction site and the recognition site would silently
leave markers in the output.
Add why-comments for two non-obvious decisions that are easy to "simplify"
in the wrong direction:
- <code>/<pre><code> over hast comments as the marker carrier — comments
are skippable to rehype-minify-whitespace and collapse adjacent text.
- splice + i-- in the walk loop relies on visitor rules not producing
output that re-matches their own check.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add buildMarkerValue to llms-markers.js that throws on an unknown kind, and route the rehype side through it. Closes the silent-failure gap where a typo'd kind in rehype would emit markers that remark wouldn't recognize, leaving raw LLMS_ text in the generated markdown.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
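One way a shared marker module with a throwing builder could be shaped is sketched below. Only the `LLMS_` prefix and the name `buildMarkerValue` come from the commit messages; the kind list, the `:` separator, and `parseMarkerValue` are assumptions.

```javascript
// Single source of truth for both the rehype (construction) and the
// remark (recognition) side. Kind names here are illustrative only.
const MARKER_PREFIX = 'LLMS_';
const MARKER_KINDS = ['MATH_INLINE', 'MATH_BLOCK', 'TABS'];

// The regex is derived from the same constants, so the builder and the
// recognizer cannot drift apart.
const MARKER_RE = new RegExp(
  `^${MARKER_PREFIX}(${MARKER_KINDS.join('|')}):([\\s\\S]*)$`
);

// Build a marker string, failing loudly on an unknown kind instead of
// emitting text that the remark pass would not recognize.
function buildMarkerValue(kind, payload) {
  if (!MARKER_KINDS.includes(kind)) {
    throw new Error(`Unknown LLMS marker kind: ${kind}`);
  }
  return `${MARKER_PREFIX}${kind}:${payload}`;
}

// Hypothetical counterpart used on the remark side.
function parseMarkerValue(value) {
  const match = MARKER_RE.exec(value);
  return match ? { kind: match[1], payload: match[2] } : null;
}
```

A typo such as `buildMarkerValue('MATH_INLNE', 'x')` then fails the build instead of leaving raw `LLMS_` text in the generated Markdown.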
Both plugin walks were carrying three return modes (string sentinel, array, undefined) plus a splice/i-- dance that documented a no-self-rematch invariant. Post-order walks let the visitor use plain return values (null to drop, array to splice many, a node to replace, or undefined to keep) because nested cleanup naturally happens before the parent's visit.
The rehype visitor takes a parent argument so the inline-math rule can defer when wrapped in <span class="katex-display"> (post-order would otherwise see the inner <span class="katex"> first and demote block math to inline).
Verified byte-identical output across the corpus.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
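The return-value convention described here can be sketched with a small post-order walker. This is an illustration under assumptions, not the PR's code: `walkPostOrder` and `cleanupVisitor` are hypothetical names, and the visitor only covers two simple rules.

```javascript
// Post-order walk: children are processed before their parent is visited.
// The visitor returns null (drop), an array (splice many), a node
// (replace), or undefined (keep the node as-is).
function walkPostOrder(node, visit, parent = null) {
  if (Array.isArray(node.children)) {
    const next = [];
    for (const child of node.children) {
      const result = walkPostOrder(child, visit, node);
      if (result === null) continue; // drop
      if (Array.isArray(result)) next.push(...result); // splice many
      else next.push(result); // replaced or kept
    }
    node.children = next;
  }
  const out = visit(node, parent);
  return out === undefined ? node : out;
}

// Example visitor: drop hash-links, unwrap href-less anchors. The parent
// argument is unused here; a rule such as inline math could inspect it
// and return undefined to defer to the parent's own visit.
function cleanupVisitor(node, _parent) {
  if (node.type !== 'element' || node.tagName !== 'a') return undefined;
  const props = node.properties || {};
  if ((props.className || []).includes('hash-link')) return null;
  if (!props.href) return node.children;
  return undefined;
}
```

Since a node's children are final by the time the node is visited, there is no need to revisit replacement nodes or adjust loop indices after a splice.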
Replace literal object construction (~30 lines) with h() calls. Pulls
hastscript@6 up from a transitive dep to a direct dev dep so the
version isn't at the mercy of upstream changes.
I also tried hast-util-select for predicates and tree-search; neither
the v6 (ESM) nor the v4 (CJS) line works:
- v6 is ESM-only, which forces await import() in our transformer.
The signalwire plugin calls processor.runSync() internally, which
fails when any transformer is async — the build produced 0
documents.
- v4 is CJS but ships with an internal "Cannot collect multiple
nodes" guard that fires for our descendant queries on the 9 pages
containing Docusaurus tabs. Substituting selectAll(...)[0] for
select(...) only worked around part of it.
Keeping the custom predicates and helpers — they're 50 lines of
straightforward code with no third-party-bug surface.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace ~90 lines of custom helpers (isHashLink, isQuicklook, isTabContainer, isKatex, isKatexDisplay, findChild, findDescendant, findAllDescendants, getClassList, hasClass, hasClassPrefix) with matches(), select(), and selectAll() from hast-util-select.
The package is ESM-only at v6; Node 22.12+ supports require() of ESM modules that don't use top-level await, so we can load it synchronously without making the rehype transformer async (which would break the signalwire plugin's processor.runSync() call). Our engines field already pins node@22.x.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- getAdmonitionType: startsWith + slice instead of regex
- admonitionToBlockquote: rely on hastscript's string-to-text coercion for the " — " separator
- tabsToMarkers: flatMap over labels instead of imperative push loop
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
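The first and third simplifications might look like the sketch below. The function bodies are guesses at the shape, not the PR's code: the `theme-admonition-` class prefix comes from the earlier commit message, while `pairTabs` and its `{ label, panel }` output are hypothetical.

```javascript
const ADMONITION_PREFIX = 'theme-admonition-';

// startsWith + slice instead of a regex: find the class that carries the
// type suffix and cut the prefix off. Returns null when none is present.
function getAdmonitionType(classNames) {
  const match = classNames.find((name) => name.startsWith(ADMONITION_PREFIX));
  return match ? match.slice(ADMONITION_PREFIX.length) : null;
}

// flatMap instead of an imperative push loop: each label contributes one
// entry when a matching panel exists, and nothing otherwise.
function pairTabs(labels, panels) {
  return labels.flatMap((label, i) =>
    panels[i] === undefined ? [] : [{ label, panel: panels[i] }]
  );
}
```

Note that the bare `theme-admonition` class does not match, because the prefix being tested includes the trailing hyphen.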
The signalwire plugin treats every Docusaurus route as standalone content,
including 186 underscored partial files (_*.mdx building blocks imported
into other pages), 10 non-underscored partials in partials/ dirs, and
auto-generated category index pages. docusaurus-plugin-llms had a built-in
underscore convention; signalwire doesn't.
Two patterns:
- **/_* covers the Docusaurus partial convention (also catches any
future underscored file outside of partials/ dirs)
- **/partials/** covers the 10 non-underscored files in this repo's
partials/ dirs (e.g. config-account-abstraction.mdx loaded as a
FloatingHoverModal body)
Drops llms.txt from 474 to 275 entries — close to the old plugin's 284.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
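To make the reach of the two patterns concrete, the sketch below approximates them with plain string checks. It is not a glob implementation and not the plugin's configuration: `isExcludedPath` is a hypothetical helper, and whether the plugin matches these patterns against routes or file paths is not stated in the commit message.

```javascript
// Approximates the two exclusion patterns without a glob library:
//   **/_*          -> the last path segment starts with an underscore
//   **/partials/** -> some directory segment is exactly "partials"
function isExcludedPath(path) {
  const segments = path.split('/').filter(Boolean);
  if (segments.length === 0) return false;
  const last = segments[segments.length - 1];
  const matchesUnderscore = last.startsWith('_');
  const matchesPartialsDir = segments.slice(0, -1).includes('partials');
  return matchesUnderscore || matchesPartialsDir;
}
```

Under this approximation, an underscored file anywhere in the tree is excluded, as is any file below a `partials/` directory regardless of its name, which matches the intent the commit message describes.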
Summary
- Swap `docusaurus-plugin-llms` for `@signalwire/docusaurus-plugin-llms-txt`. The previous plugin derived output paths from source filenames, so `01-a-gentle-introduction.mdx` ended up at `/docs/launch-arbitrum-chain/01-a-gentle-introduction.md` while the live page is `/launch-arbitrum-chain/a-gentle-introduction`. The signalwire plugin processes rendered HTML and writes `.md` files that mirror Docusaurus's route structure exactly.
- Add a rehype and a remark cleanup plugin (`src/plugins/rehype-llms-cleanup.js`, `src/plugins/remark-llms-cleanup.js`) to preserve fidelity through HTML→Markdown conversion:
  - Admonitions become blockquotes with a bold type prefix (`> **CAUTION**` or `> **INFO** — Custom title`)
  - Docusaurus tabs become `<details><summary>Label</summary>…</details>` blocks
  - KaTeX math is emitted as LaTeX (`$x$` inline, `$$…$$` block) instead of rendering to Unicode text
  - Code-block languages (`shell`, `rust`, etc.) survive on fences
  - Empty `<!-- -->` separators emitted by the markdown serializer are dropped

Uses `hastscript` for hast node construction and `hast-util-select` for CSS-selector predicates and tree search. The latter is ESM-only; Node 22.12+ (`engines` already pins 22.x) supports `require()` of ESM modules that don't use top-level await, so the rehype transformer stays synchronous, which is required because the signalwire plugin calls `processor.runSync()` internally.

Test plan
- `yarn build` succeeds
- `build/launch-arbitrum-chain/a-gentle-introduction.md` exists (path alignment)
- `build/run-arbitrum-node/start-here.md` contains `<details>` blocks for the parameters tabs, with both tab contents preserved
- `build/how-arbitrum-works/deep-dives/gas-and-fees.md` contains LaTeX (`$$\nU_{\\text{upd}} = …\n$$`) instead of Unicode-rendered math
- `build/run-arbitrum-node/nitro/migrate-state-and-history-from-classic.md` shows admonitions as `> **INFO**` / `> **CAUTION**` blockquotes
- `build/llms.txt` and `build/llms-full.txt` are generated

🤖 Generated with Claude Code