This directory contains scripts to automatically generate llms.txt and llms-full.txt files for LLM consumption.
The LLM files provide structured documentation references that help AI assistants:
- Find the correct documentation pages
- Understand the documentation structure
- Reduce hallucinations by providing accurate URLs
- Discover all available integration options
- generate-llm-files.js - Node.js script that generates the LLM files
- generate-llm-files.sh - Shell wrapper script for easier execution
- Build the documentation site:
yarn antora ./antora-playbook.yml
- Generate LLM files from the local sitemap:
yarn generate-llm-files
# or
./-scripts/generate-llm-files.sh
Generate directly from the published sitemap (useful for syncing with production):
yarn generate-llm-files-from-url
# or
node ./-scripts/generate-llm-files.js https://www.tiny.cloud/docs/antora-sitemap.xml
To generate from a custom sitemap path or URL:
node ./-scripts/generate-llm-files.js /path/to/sitemap.xml
# or
node ./-scripts/generate-llm-files.js https://example.com/sitemap.xml
After major/minor/patch releases:
- Run the script to regenerate files from the production sitemap:
yarn generate-llm-files-from-url
This ensures the LLM files match what's actually published on the live site.
Alternatively, if you need to generate from a local build:
yarn generate-llm-files
- Review the generated files in a PR
- Commit and merge
Why not automated in CI/CD?
- The script makes 400+ HTTP requests to fetch H1 titles (~4-5 minutes)
- Resource-intensive and slow for every build
- Manual review ensures quality before committing
- Validates no 404s are listed and titles match actual page content
The files are generated in modules/ROOT/attachments/:
- llms.txt - Simplified, curated documentation index (~105 lines)
- llms-full.txt - Complete documentation index with all pages (~700 lines)
Post-build: Files are moved to the root directory (handled in a separate PR) and are accessible at:
- https://www.tiny.cloud/docs/tinymce/latest/llms.txt
- https://www.tiny.cloud/docs/llms-full.txt
- Reads sitemap.xml - Extracts all documentation URLs from the sitemap (only /latest/ URLs)
- Fetches H1 titles - Makes HTTP requests to each page to get the actual H1 title (validates no 404s)
- Generates titles - Uses fetched H1 titles, falls back to URL-based titles if fetch fails
- Categorizes pages - Groups by topic (integrations, plugins, API, etc.) based on URL patterns
- Deduplicates - Removes duplicate URLs and makes titles unique within categories
- Generates structured markdown - Creates both simplified (llms.txt) and complete (llms-full.txt) indexes
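The first and fifth steps (reading the sitemap and dropping duplicate URLs) can be sketched roughly as follows. This is a minimal illustration, not the code in generate-llm-files.js: the helper name extractLatestUrls, the regular expression, and the sample URLs are all assumptions made for the example.

```javascript
// Hypothetical helper (not taken from generate-llm-files.js):
// collect <loc> entries from sitemap XML, keeping only /latest/ URLs.
function extractLatestUrls(sitemapXml) {
  const urls = new Set(); // a Set drops duplicates and keeps first-seen order
  const locPattern = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  for (const match of sitemapXml.matchAll(locPattern)) {
    const url = match[1];
    if (url.includes('/latest/')) {
      urls.add(url);
    }
  }
  return [...urls];
}

// Illustrative sitemap fragment; the page paths are made up for this sketch.
const sampleSitemap = `
<urlset>
  <url><loc>https://www.tiny.cloud/docs/tinymce/latest/installation/</loc></url>
  <url><loc>https://www.tiny.cloud/docs/tinymce/6/installation/</loc></url>
  <url><loc>https://www.tiny.cloud/docs/tinymce/latest/installation/</loc></url>
</urlset>`;

console.log(extractLatestUrls(sampleSitemap));
```

A regular expression is enough here because a sitemap only nests plain text inside each loc element; a full XML parser would be the safer choice for anything less regular.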
The script uses hardcoded categorization logic. To customize:
- Edit generate-llm-files.js
- Modify the categorizeUrl() function to adjust categorization
- Update generateLLMsTxt() and generateLLMsFullTxt() to change output format
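As a rough picture of what URL-pattern categorization can look like, the sketch below uses a small pattern table. Only the function name categorizeUrl() comes from this README; the category names, the patterns, and the 'Other' fallback are assumptions for illustration, and the real function may be structured differently.

```javascript
// Hypothetical pattern table; the actual rules in generate-llm-files.js may differ.
// The first matching entry wins, so more specific patterns should come first.
const CATEGORY_PATTERNS = [
  { category: 'API Reference', pattern: /\/apis\// },
  { category: 'Integrations', pattern: /\/integrations\// },
  { category: 'Plugins', pattern: /\/plugins\// },
];

function categorizeUrl(url) {
  const { pathname } = new URL(url); // match on the path only, not the host
  const found = CATEGORY_PATTERNS.find(({ pattern }) => pattern.test(pathname));
  return found ? found.category : 'Other'; // assumed fallback category
}

console.log(categorizeUrl('https://www.tiny.cloud/docs/tinymce/latest/plugins/example/'));
```

With a table like this, supporting a new page type means adding one entry rather than another branch of conditional logic.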
- The script requires Node.js and the sanitize-html package (installed via yarn install)
- Generated files are written to modules/ROOT/attachments/
- Uses only the sitemap (no dependency on nav.adoc)
- Fetches actual H1 titles from pages (validates no 404s)
- Rate-limited fetching: 10 concurrent requests with 100ms delay between batches
- Request timeout: 10 seconds per page
- Security: Validates URLs to prevent SSRF attacks (only allows tiny.cloud domains)
- Handles HTML entity decoding (’ → ')
- Filters out error pages and duplicate URLs
- Makes titles unique within categories (e.g., "ES6 and npm (Webpack)", "ES6 and npm (Rollup)")
- Falls back to URL-based title generation if H1 fetch fails
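The domain check and the batched fetching noted above could look something like this. It is a sketch under stated assumptions: isAllowedUrl, processInBatches, and sleep are hypothetical names, "tiny.cloud domains" is read as tiny.cloud plus its subdomains over HTTPS only, and the worker function stands in for whatever per-page fetch the real script performs.

```javascript
// Assumption: allow only https URLs on tiny.cloud or one of its subdomains.
function isAllowedUrl(rawUrl) {
  try {
    const { protocol, hostname } = new URL(rawUrl);
    return protocol === 'https:' &&
      (hostname === 'tiny.cloud' || hostname.endsWith('.tiny.cloud'));
  } catch {
    return false; // not a parseable absolute URL
  }
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical batching helper: run worker on batchSize items at a time,
// waiting delayMs between batches (defaults mirror the 10 / 100ms noted above).
async function processInBatches(items, worker, batchSize = 10, delayMs = 100) {
  const results = [];
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    results.push(...(await Promise.all(batch.map((item) => worker(item)))));
    if (i + batchSize < items.length) {
      await sleep(delayMs); // no delay after the final batch
    }
  }
  return results;
}
```

Comparing the hostname against '.tiny.cloud' with a leading dot matters: a plain suffix check on 'tiny.cloud' would also accept a lookalike host such as evil-tiny.cloud.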
Error: "Source not found"
- Make sure the sitemap path is correct
- For remote URLs, check your internet connection
- For local files, ensure Antora has generated the site first
Missing page titles
- If H1 fetch fails, the script uses URL-based title generation as fallback
- Check that pages return valid HTML with H1 tags
- 404 pages are automatically filtered out
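To make the fallback behaviour concrete, here is a minimal sketch of title resolution: take the H1 text if the page has one, otherwise derive a title from the last URL segment. The names extractH1, titleFromUrl, and resolveTitle are hypothetical, the entity table is deliberately tiny, and regex-based tag stripping is swapped in for the sanitize-html package the real script depends on.

```javascript
// Small entity table for this sketch only; mapping &rsquo; to a straight
// apostrophe is an assumption based on the decoding example in the notes.
const ENTITIES = {
  '&amp;': '&', '&lt;': '<', '&gt;': '>', '&quot;': '"', '&#39;': "'", '&rsquo;': "'",
};

function extractH1(html) {
  const match = html.match(/<h1\b[^>]*>([\s\S]*?)<\/h1>/i);
  if (!match) return null;
  const text = match[1]
    .replace(/<[^>]+>/g, '') // strip nested tags such as <em> or <code>
    .replace(/&(?:amp|lt|gt|quot|#39|rsquo);/g, (entity) => ENTITIES[entity])
    .replace(/\s+/g, ' ')
    .trim();
  return text || null; // treat an empty H1 as missing
}

function titleFromUrl(url) {
  const segments = new URL(url).pathname.split('/').filter(Boolean);
  const slug = segments[segments.length - 1] || '';
  return slug
    .split('-')
    .filter(Boolean)
    .map((word) => word[0].toUpperCase() + word.slice(1))
    .join(' ');
}

// Prefer the page's H1; fall back to the URL when html is missing or has no H1.
function resolveTitle(html, url) {
  return (html && extractH1(html)) || titleFromUrl(url);
}

console.log(resolveTitle(null, 'https://www.tiny.cloud/docs/tinymce/latest/react-cloud/'));
```

Running a page's HTML through a check like extractH1 is a quick way to see whether a missing title comes from the page itself or from the fetch.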
Incorrect categorization
- Review the categorizeUrl() function (note: function name is singular, not plural)
- Add custom patterns for new page types