Commit c48711d
dstrodtman and claude authored and committed
[doc][DOC-908] exclude Jupyter notebooks from llms-full.txt (ray-project#63228)
## Why

After ray-project#63130 shipped, the generated `llms-full.txt` corpus is polluted by raw Jupyter notebook source. `sphinx-llms-txt` reads each docname's source from `_sources/` verbatim, so for the 127 `.ipynb` pages under `doc/source/` it appends the full notebook JSON (cells, outputs, embedded base64 images, metadata) into the file. That's the largest source of low-signal bytes in the corpus targeted at agents.

## What

Append computed notebook docnames to the existing `llms_txt_exclude` list in `doc/source/conf.py`.

`llms_txt_exclude` matches docnames (extension stripped) via `fnmatch.fnmatch` (see [`sphinx_llms_txt/collector.py`](https://github.com/jdillard/sphinx-llms-txt/blob/main/sphinx_llms_txt/collector.py)). A pattern such as `**/*.ipynb` can't match because the docname carries no extension. The change enumerates `*.ipynb` files under the source directory at conf-load time and converts each path to its docname (relative to the source dir, suffix stripped, posix separators).

Scope:

- Affects only `llms.txt` / `llms-full.txt`. The Sphinx HTML build is governed by the separate `exclude_patterns` list (line 351 of `conf.py`), which is untouched. All 127 notebooks remain fully rendered on `docs.ray.io`.
- Notebooks already in `exclude_patterns` (e.g. `serve/tutorials/video-analysis/*.ipynb`) aren't built, so adding their docnames to `llms_txt_exclude` is a harmless no-op.

## Verification

After RtD builds the PR preview:

- Fetch `llms-full.txt` from the PR build and confirm no `"cell_type": "code"` / `"output_type": "stream"` / base64 image data appears in the corpus.
- Confirm `llms.txt` (the summary index) still resolves and looks sane.
- Spot-check that a couple of notebook pages still render normally in the HTML preview.

## Context

Tracked under [DOC-908]. Follow-up to ray-project#63130 (DOC-875). Part of [DOC-844] (Agent Ray docs umbrella). Tuning this list further (other low-signal page types) is deferred until the rebuilt corpus can be inspected directly.

Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 8f71351 commit c48711d

1 file changed

Lines changed: 13 additions & 0 deletions

doc/source/conf.py

@@ -111,6 +111,19 @@
     "rllib/package_ref/*",
 ]
 
+# Exclude Jupyter notebooks from llms-full.txt. sphinx-llms-txt reads each
+# docname's source verbatim from `_sources/`, so for `.ipynb` pages it
+# appends raw notebook JSON (cells, outputs, embedded base64 images) into
+# the corpus. `llms_txt_exclude` matches docnames (extension stripped) via
+# fnmatch, so a `**/*.ipynb` pattern can't work — we enumerate each
+# notebook's docname instead. Notebooks remain fully rendered in the HTML
+# build; only the agent corpus drops them.
+_conf_dir = pathlib.Path(__file__).parent
+llms_txt_exclude += sorted(
+    p.relative_to(_conf_dir).with_suffix("").as_posix()
+    for p in _conf_dir.rglob("*.ipynb")
+)
+
 # -- sphinx-collections: pull external template files at build time -----------
 
 _TEMPLATES_CI_BASE = "https://templates.ci.ray.io"
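
For readers who want to see the docname conversion in isolation, the following is a minimal sketch rather than the code in `conf.py`. It builds a throwaway source tree, derives docnames the same way the diff does, and checks the claim that a `**/*.ipynb` pattern cannot match a docname. The helper name `notebook_docnames` and the sample paths are hypothetical and exist only for illustration.

```python
import fnmatch
import pathlib
import tempfile


def notebook_docnames(source_dir: pathlib.Path) -> list[str]:
    """Return sorted docnames (hypothetical helper) for notebooks under source_dir.

    A docname here is the path relative to the source dir, with the
    suffix stripped and posix separators, as described in the commit.
    """
    return sorted(
        p.relative_to(source_dir).with_suffix("").as_posix()
        for p in source_dir.rglob("*.ipynb")
    )


# Build a tiny fake source tree: one notebook and one non-notebook page.
with tempfile.TemporaryDirectory() as tmp:
    src = pathlib.Path(tmp)
    (src / "train" / "examples").mkdir(parents=True)
    (src / "train" / "examples" / "demo.ipynb").write_text("{}")
    (src / "index.rst").write_text("")
    docnames = notebook_docnames(src)

print(docnames)

# The docname has no extension, so an extension-based glob never matches it.
glob_matches = fnmatch.fnmatch(docnames[0], "**/*.ipynb")
# Listing the docname itself as the pattern does match.
exact_matches = fnmatch.fnmatch(docnames[0], docnames[0])
print(glob_matches, exact_matches)
```

Note that a docname containing `fnmatch` metacharacters such as `[` would not match itself as a pattern; the sample path avoids that case.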

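The first verification step in the commit message (confirming that no notebook JSON remains in `llms-full.txt`) could be scripted along these lines. This is a rough sketch, not part of the commit: the function name `find_notebook_residue` is hypothetical, the first two markers are the strings quoted in the commit message, and the `"image/png":` marker is an assumption about how embedded base64 images usually appear in notebook JSON.

```python
# Substrings that suggest raw notebook JSON leaked into the corpus.
NOTEBOOK_MARKERS = (
    '"cell_type": "code"',
    '"output_type": "stream"',
    '"image/png":',  # assumption: typical key for embedded base64 PNG output
)


def find_notebook_residue(corpus: str) -> list[str]:
    """Return the markers (hypothetical check) that occur in the corpus text."""
    return [marker for marker in NOTEBOOK_MARKERS if marker in corpus]


# Two small in-memory samples stand in for a downloaded llms-full.txt.
clean = "# Ray Train\n\nRay Train scales model training.\n"
polluted = clean + (
    '{"cells": [{"cell_type": "code", '
    '"outputs": [{"output_type": "stream"}]}]}'
)

clean_hits = find_notebook_residue(clean)
polluted_hits = find_notebook_residue(polluted)
print(clean_hits)
print(polluted_hits)
```

To check a real build, the text of the downloaded `llms-full.txt` would be passed in place of the samples; an empty result is consistent with the exclusion working, though a substring check like this is only a heuristic.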