
Switch llms.txt plugin so generated .md files mirror live URLs#3281

Open
hkalodner wants to merge 10 commits into master from llms-markdown-output


@hkalodner hkalodner commented May 14, 2026

Summary

  • Swaps docusaurus-plugin-llms for @signalwire/docusaurus-plugin-llms-txt. The previous plugin derived output paths from source filenames, so 01-a-gentle-introduction.mdx ended up at /docs/launch-arbitrum-chain/01-a-gentle-introduction.md while the live page is /launch-arbitrum-chain/a-gentle-introduction. The signalwire plugin processes rendered HTML and writes .md files that mirror Docusaurus's route structure exactly.
  • Adds two local plugins (src/plugins/rehype-llms-cleanup.js, src/plugins/remark-llms-cleanup.js) to preserve fidelity through HTML→Markdown conversion:
    • Heading anchors and Docusaurus quicklook links are stripped/unwrapped
    • Admonitions become typed blockquotes (e.g. > **CAUTION** or > **INFO** — Custom title)
    • Tabs become <details><summary>Label</summary>…</details> blocks
    • KaTeX math keeps its LaTeX source ($x$ inline, $$…$$ block) instead of rendering to Unicode text
    • Code-block language tags (shell, rust, etc.) survive on fences
    • Stray empty <!-- --> separators emitted by the markdown serializer are dropped

Uses hastscript for hast node construction and hast-util-select for CSS-selector predicates and tree search. The latter is ESM-only at v6; Node 22.12+ supports require() of ESM modules that don't use top-level await (engines already pins node@22.x), so the rehype transformer stays synchronous. That is required because the signalwire plugin calls processor.runSync() internally.

Test plan

  • yarn build succeeds
  • build/launch-arbitrum-chain/a-gentle-introduction.md exists (path alignment)
  • build/run-arbitrum-node/start-here.md contains <details> blocks for the parameters tabs, with both tab contents preserved
  • build/how-arbitrum-works/deep-dives/gas-and-fees.md contains LaTeX ($$\nU_{\\text{upd}} = …\n$$) instead of Unicode-rendered math
  • build/run-arbitrum-node/nitro/migrate-state-and-history-from-classic.md shows admonitions as > **INFO** / > **CAUTION** blockquotes
  • build/llms.txt and build/llms-full.txt are generated

🤖 Generated with Claude Code

hkalodner and others added 9 commits May 13, 2026 16:46
Replace docusaurus-plugin-llms with @signalwire/docusaurus-plugin-llms-txt.
The previous plugin derived output paths from source filenames, leaving
files at /docs/launch-arbitrum-chain/01-a-gentle-introduction.md while
the actual page lives at /launch-arbitrum-chain/a-gentle-introduction.
The signalwire plugin processes the rendered HTML and writes .md files
that mirror Docusaurus's route structure exactly.
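
To make the path mismatch concrete, here is a minimal sketch of the two derivation strategies. Both helper names are hypothetical and neither is taken from either plugin's source; they only reproduce the example paths quoted above, and the handling of the site root is an assumed convention.

```javascript
// Hypothetical helpers for illustration only; neither plugin exposes these.

// Filename-derived output: keeps the source directory and the numeric prefix.
function pathFromSourceFile(sourceFile) {
  // e.g. 'docs/launch-arbitrum-chain/01-a-gentle-introduction.mdx'
  return '/' + sourceFile.replace(/\.mdx?$/, '.md');
}

// Route-derived output: appends .md to the route Docusaurus actually serves.
function pathFromRoute(route) {
  const trimmed = route.replace(/\/+$/, ''); // drop any trailing slash
  // Assumption: the site root maps to /index.md.
  return (trimmed === '' ? '/index' : trimmed) + '.md';
}
```

Only the second form lets a reader append `.md` to a live URL and land on the generated file.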

Add a small rehype plugin that drops heading-anchor links and unwraps
quicklook anchors (which render with no href) before HTML-to-markdown
conversion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Docusaurus puts language-* classes on <pre> but hast-util-to-mdast reads
them from <code>; propagate the class so fences come out as ```shell
instead of bare ```.
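
The class propagation described above could look roughly like the following. This is a sketch on plain hast-shaped objects with no unified dependency; the function name is hypothetical and the PR's actual implementation may differ.

```javascript
// In hast, the HTML class attribute is stored as properties.className (an array).
function propagateLanguageClass(pre) {
  if (pre.type !== 'element' || pre.tagName !== 'pre') return pre;

  const preClasses = (pre.properties && pre.properties.className) || [];
  const langClasses = preClasses.filter((c) => String(c).startsWith('language-'));
  if (langClasses.length === 0) return pre;

  const code = (pre.children || []).find(
    (child) => child.type === 'element' && child.tagName === 'code'
  );
  if (!code) return pre;

  code.properties = code.properties || {};
  const existing = code.properties.className || [];
  // Copy only the language-* classes, skipping any the <code> already carries.
  code.properties.className = [
    ...existing,
    ...langClasses.filter((c) => !existing.includes(c)),
  ];
  return pre;
}
```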

Admonitions (theme-admonition-info/caution/...) were flattening to a
plain title line followed by paragraph body. Convert them to blockquotes
with a bold type prefix: "> **CAUTION**" or "> **INFO** — Custom title".
Skip the title when it's just the type word (Docusaurus's default).
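
A minimal sketch of the labelling rule, assuming the admonition type is read from a `theme-admonition-*` class. `getAdmonitionType` is named in a later commit on this PR; `admonitionLabel` is a hypothetical helper, and both bodies are illustrative rather than the PR's code.

```javascript
const ADMONITION_PREFIX = 'theme-admonition-';

// Returns e.g. 'info' for a class list containing 'theme-admonition-info'.
function getAdmonitionType(classNames) {
  const match = classNames.find((c) => c.startsWith(ADMONITION_PREFIX));
  return match ? match.slice(ADMONITION_PREFIX.length) : null;
}

// Builds the bold prefix that opens the blockquote.
function admonitionLabel(type, title) {
  const label = `**${type.toUpperCase()}**`;
  const custom = (title || '').trim();
  // Docusaurus's default title is just the type word, so it adds nothing.
  if (custom === '' || custom.toLowerCase() === type.toLowerCase()) return label;
  // \u2014 is the dash separator shown in the commit message's example output.
  return `${label} \u2014 ${custom}`;
}
```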

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add a two-stage cleanup: the rehype plugin replaces tab containers and
katex spans with marker elements (<code> for inline, <pre><code> for
block — chosen because comments are treated as skippable by
rehype-minify-whitespace and would swallow adjacent spaces). A new
remark plugin then rewrites those mdast inlineCode/code nodes into
html nodes containing the final output.

  - Docusaurus tabs -> <details><summary>label</summary>…</details>
  - KaTeX inline -> $latex$ with surrounding whitespace preserved
  - KaTeX block -> $$\nlatex\n$$
  - Empty <!-- --> serializer artifacts stripped
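
The four output shapes listed above can be sketched as plain string builders. All function names and the `{ label, body }` payload shape are assumptions for illustration, and the exact whitespace inside `<details>` is a guess (the blank lines are there so Markdown inside the HTML block is still parsed as Markdown).

```javascript
function escapeHtml(text) {
  return text.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

// KaTeX inline -> $latex$
function renderInlineMath(latex) {
  return '$' + latex + '$';
}

// KaTeX block -> $$\nlatex\n$$
function renderBlockMath(latex) {
  return '$$\n' + latex.trim() + '\n$$';
}

// Docusaurus tabs -> one <details> block per tab.
// tabs: [{ label, body }] where body is already-converted Markdown.
function renderTabs(tabs) {
  return tabs
    .map(
      ({ label, body }) =>
        `<details>\n<summary>${escapeHtml(label)}</summary>\n\n${body.trim()}\n\n</details>`
    )
    .join('\n\n');
}

// Drops empty <!-- --> separators left behind by the Markdown serializer.
function stripEmptyComments(markdown) {
  return markdown.replace(/<!--\s*-->\n?/g, '');
}
```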

Also fix a walk bug where replacement nodes were skipped instead of
revisited, so nested cleanup (hash-links inside tab panels, etc.) now
fires correctly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Extract the LLMS_ prefix and marker regex into src/plugins/llms-markers.js
so the rehype and remark passes import a single source of truth. Drift
between the construction site and the recognition site would silently
leave markers in the output.

Add why-comments for two non-obvious decisions that are easy to "simplify"
in the wrong direction:

  - <code>/<pre><code> over hast comments as the marker carrier — comments
    are skippable to rehype-minify-whitespace and collapse adjacent text.
  - splice + i-- in the walk loop relies on visitor rules not producing
    output that re-matches their own check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add buildMarkerValue to llms-markers.js that throws on an unknown kind,
and route the rehype side through it. Closes the silent-failure gap
where a typo'd kind in rehype would emit markers that remark wouldn't
recognize, leaving raw LLMS_ text in the generated markdown.
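
As an illustration of the fail-fast idea, a shared marker module might look like this. Only the `LLMS_` prefix and the name `buildMarkerValue` come from the commit messages; the kind names, the `LLMS_<KIND>:<payload>` layout, and `parseMarkerValue` are assumptions.

```javascript
const MARKER_PREFIX = 'LLMS_';
// Hypothetical kind names; the real module may use different ones.
const MARKER_KINDS = ['TABS', 'MATH_INLINE', 'MATH_BLOCK'];
// Built from the same constants so construction and recognition cannot drift.
const MARKER_RE = new RegExp(
  `^${MARKER_PREFIX}(${MARKER_KINDS.join('|')}):([\\s\\S]*)$`
);

// Used by the rehype pass: throws instead of emitting an unrecognizable marker.
function buildMarkerValue(kind, payload) {
  if (!MARKER_KINDS.includes(kind)) {
    throw new Error(`Unknown LLMS marker kind: ${kind}`);
  }
  return `${MARKER_PREFIX}${kind}:${payload}`;
}

// Used by the remark pass: returns null for ordinary code content.
function parseMarkerValue(value) {
  const match = MARKER_RE.exec(value);
  return match ? { kind: match[1], payload: match[2] } : null;
}
```

With this shape, a typo such as `MATH_INLNE` fails the build at the construction site rather than surfacing later as raw `LLMS_` text in a generated file.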

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Both plugin walks were carrying three return modes (string sentinel,
array, undefined) plus a splice/i-- dance that documented a no-self-
rematch invariant. Post-order walks let the visitor use plain return
values — null to drop, array to splice many, a node to replace, or
undefined to keep — because nested cleanup naturally happens before
the parent's visit.

The rehype visitor takes a parent argument so the inline-math rule can
defer when wrapped in <span class="katex-display"> (post-order would
otherwise see the inner <span class="katex"> first and demote block
math to inline). Verified byte-identical output across the corpus.
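
The return-value contract described above can be sketched as a small recursive walker. This is an illustrative version, not the PR's code; the function name is hypothetical.

```javascript
// Post-order walk: children are visited (and possibly replaced) before their
// parent. Visitor return values: null drops the node, an array splices in
// several nodes, a node object replaces it, undefined keeps it unchanged.
function walkPostOrder(node, visitor, parent = null) {
  if (Array.isArray(node.children)) {
    const next = [];
    for (const child of node.children) {
      const result = walkPostOrder(child, visitor, node);
      if (result === null) continue; // dropped
      if (Array.isArray(result)) next.push(...result); // spliced
      else next.push(result); // kept or replaced
    }
    node.children = next;
  }
  // The parent is passed so a rule can defer based on its container.
  const replaced = visitor(node, parent);
  return replaced === undefined ? node : replaced;
}
```

Because a node's children are finalized before the visitor sees the node itself, a rule never needs to revisit its own output, which is what removes the splice and index bookkeeping.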

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace literal object construction (~30 lines) with h() calls. Pulls
hastscript@6 up from a transitive dep to a direct dev dep so the
version isn't at the mercy of upstream changes.

I also tried hast-util-select for predicates and tree-search; neither
the v6 (ESM) nor the v4 (CJS) line works:

  - v6 is ESM-only, which forces await import() in our transformer.
    The signalwire plugin calls processor.runSync() internally, which
    fails when any transformer is async — the build produced 0
    documents.
  - v4 is CJS but ships with an internal "Cannot collect multiple
    nodes" guard that fires for our descendant queries on the 9 pages
    containing Docusaurus tabs. Substituting selectAll(...)[0] for
    select(...) only worked around part of it.

Keeping the custom predicates and helpers — they're 50 lines of
straightforward code with no third-party-bug surface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace ~90 lines of custom helpers (isHashLink, isQuicklook,
isTabContainer, isKatex, isKatexDisplay, findChild, findDescendant,
findAllDescendants, getClassList, hasClass, hasClassPrefix) with
matches(), select(), and selectAll() from hast-util-select.

The package is ESM-only at v6; Node 22.12+ supports require() of ESM
modules that don't use top-level await, so we can load it synchronously
without making the rehype transformer async (which would break the
signalwire plugin's processor.runSync() call). Our engines field
already pins node@22.x.
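
Since a `22.x` pin also admits 22.0 through 22.11, a guard like the following could turn an unsupported runtime into a clear error. Both functions are hypothetical and not part of this PR; the version threshold follows the statement above, and treating later major versions as supported is an assumption.

```javascript
// Assumes unflagged require() of ESM is available from Node 22.12 onward.
function supportsRequireEsm(version = process.versions.node) {
  const [major, minor] = version.split('.').map(Number);
  return major > 22 || (major === 22 && minor >= 12);
}

function loadHastUtilSelect() {
  if (!supportsRequireEsm()) {
    throw new Error(
      `hast-util-select is ESM-only; require() needs Node 22.12+, found ${process.versions.node}`
    );
  }
  // Synchronous load keeps the rehype transformer compatible with runSync().
  return require('hast-util-select');
}
```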

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- getAdmonitionType: startsWith + slice instead of regex
- admonitionToBlockquote: rely on hastscript's string-to-text coercion
  for the " — " separator
- tabsToMarkers: flatMap over labels instead of imperative push loop

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vercel Bot commented May 14, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| arbitrum-docs | Ready | Preview | May 14, 2026 0:53am |

The signalwire plugin treats every Docusaurus route as standalone content,
including 186 underscored partial files (_*.mdx building blocks imported
into other pages), 10 non-underscored partials in partials/ dirs, and
auto-generated category index pages. docusaurus-plugin-llms had a built-in
underscore convention; signalwire doesn't.

Two patterns:
  - **/_* covers the Docusaurus partial convention (also catches any
    future underscored file outside of partials/ dirs)
  - **/partials/** covers the 10 non-underscored files in this repo's
    partials/ dirs (e.g. config-account-abstraction.mdx loaded as a
    FloatingHoverModal body)

Drops llms.txt from 474 to 275 entries — close to the old plugin's 284.
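
To show what the two patterns do and do not catch, here is a deliberately small glob matcher. It is an illustrative stand-in that handles only `**`, `*`, and literals; it is not the plugin's matcher, and whether the plugin applies the patterns to routes or to file paths is not stated here, so the sample paths are assumptions.

```javascript
// Minimal glob-to-RegExp converter covering only '**', '*' and literals.
function globToRegExp(glob) {
  let source = '';
  for (let i = 0; i < glob.length; i++) {
    const ch = glob[i];
    if (ch === '*' && glob[i + 1] === '*') {
      if (glob[i + 2] === '/') {
        source += '(?:.*/)?'; // '**/' spans zero or more whole path segments
        i += 2;
      } else {
        source += '.*'; // trailing '**' matches anything, including '/'
        i += 1;
      }
    } else if (ch === '*') {
      source += '[^/]*'; // '*' stays within a single path segment
    } else {
      source += ch.replace(/[.+?^${}()|[\]\\]/g, '\\$&'); // escape literals
    }
  }
  return new RegExp('^' + source + '$');
}

const EXCLUDES = ['**/_*', '**/partials/**'].map(globToRegExp);
const isExcluded = (path) => EXCLUDES.some((re) => re.test(path));
```

Note that `**/_*` only matches when the final segment starts with an underscore, so a name like `a_b.mdx` is left alone.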

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>