Commit c48711d
[doc][DOC-908] exclude Jupyter notebooks from llms-full.txt (ray-project#63228)
## Why
After ray-project#63130 shipped, the generated `llms-full.txt` corpus is polluted
by raw Jupyter notebook source. `sphinx-llms-txt` reads each docname's
source from `_sources/` verbatim, so for the 127 `.ipynb` pages under
`doc/source/` it appends the full notebook JSON (cells, outputs,
embedded base64 images, metadata) to the file. This notebook JSON is the
largest source of low-signal bytes in a corpus intended for agents.
## What
Append computed notebook docnames to the existing `llms_txt_exclude`
list in `doc/source/conf.py`.
`llms_txt_exclude` matches docnames (extension stripped) via
`fnmatch.fnmatch` — see
[`sphinx_llms_txt/collector.py`](https://github.com/jdillard/sphinx-llms-txt/blob/main/sphinx_llms_txt/collector.py).
A pattern such as `**/*.ipynb` can't match because the docname carries
no extension. The change enumerates `*.ipynb` files under the source
directory at conf-load time and converts each path to its docname
(relative to the source dir, suffix stripped, posix separators).
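The path-to-docname conversion described above can be sketched as follows. This is a minimal illustration, not the code from this commit: the helper name `notebook_docnames` and the sample docname are hypothetical, and the real change may filter paths differently.

```python
import fnmatch
from pathlib import Path


def notebook_docnames(source_dir):
    """Return sorted docnames for every .ipynb file under source_dir.

    Each docname is the path relative to source_dir, with the suffix
    stripped and posix ("/") separators. Hypothetical helper.
    """
    root = Path(source_dir)
    # Note: rglob also descends into build output and checkpoint
    # directories; a real conf.py may want to skip those.
    return sorted(
        path.relative_to(root).with_suffix("").as_posix()
        for path in root.rglob("*.ipynb")
    )


# Why an extension glob cannot work: the docname carries no ".ipynb".
docname = "data/examples/batch_inference"  # hypothetical docname
assert not fnmatch.fnmatch(docname, "**/*.ipynb")
# A pattern equal to the docname itself does match.
assert fnmatch.fnmatch(docname, docname)
```

In `conf.py` the result would then be appended to the existing list, for example `llms_txt_exclude.extend(notebook_docnames(Path(__file__).parent))`, assuming `conf.py` sits in the source directory.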
Scope:
- Affects only `llms.txt` / `llms-full.txt`. The Sphinx HTML build is
governed by the separate `exclude_patterns` list (line 351 of
`conf.py`), which is untouched. All 127 notebooks remain fully rendered
on `docs.ray.io`.
- Notebooks already in `exclude_patterns` (e.g.
`serve/tutorials/video-analysis/*.ipynb`) aren't built, so adding their
docnames to `llms_txt_exclude` is a harmless no-op.
## Verification
After RtD builds the PR preview:
- Fetch `llms-full.txt` from the PR build and confirm no `"cell_type":
"code"` / `"output_type": "stream"` / base64 image data appears in the
corpus.
- Confirm `llms.txt` (the summary index) still resolves and looks sane.
- Spot-check that a couple of notebook pages still render normally in
the HTML preview.
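The first check above can be scripted once `llms-full.txt` has been downloaded from the preview build. The sketch below is one possible approach, not part of this commit: the function name `find_notebook_residue` is hypothetical, and the base64 pattern is a loose heuristic rather than an exhaustive detector.

```python
import re

# Markers named in the verification steps, plus a base64 heuristic.
NOTEBOOK_MARKERS = {
    "code cell": re.compile(r'"cell_type":\s*"code"'),
    "stream output": re.compile(r'"output_type":\s*"stream"'),
    # Assumption: an image MIME key followed by a long base64-looking run.
    "base64 image": re.compile(
        r'"image/(?:png|jpeg)":\s*"[A-Za-z0-9+/=\\]{200,}'
    ),
}


def find_notebook_residue(text):
    """Return {marker name: hit count} for each marker found in text."""
    hits = {}
    for name, pattern in NOTEBOOK_MARKERS.items():
        count = len(pattern.findall(text))
        if count:
            hits[name] = count
    return hits
```

An empty result from `find_notebook_residue(Path("llms-full.txt").read_text(encoding="utf-8"))` suggests the notebook JSON is gone. The HTML spot-check still needs to be done by hand.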
## Context
Tracked under [DOC-908]. Follow-up to ray-project#63130 (DOC-875). Part of
[DOC-844] (Agent Ray docs umbrella).
Tuning this list further (other low-signal page types) is deferred until
the rebuilt corpus can be inspected directly.
Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

1 parent 8f71351 commit c48711d
1 file changed
Lines changed: 13 additions & 0 deletions
Added: new lines 114-126, inserted after original line 113 (original lines 114-116 become 127-129).